DESIGN.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116

====== Disambiguation Issues ========
Consider two possible parses of 0 + 1 * 5 + 4
    ( ( 0 + ( 1 *   5 ) ) + 4     )
    (   0 + ( 1 * ( 5     + 4 ) ) )

Problem: no two subparses apply to the same range except at the very end, but
at that point it's too hard for us to disambiguate.

I think this is getting too messed up. The semantics we want is just: every
range has exactly one parsing as a certain nonterm.

The easiest way to handle this is to just keep a table:
    (start_idx, symbol) -> [state 1, state 2, ...]
that we've completed at this iteration. When a new completion happens, look up
in that table to see if we've already completed that (start_idx, nonterm) in
this iteration. If so, compare them and overwrite one if desired.

BUT, if the comparison is not total, might have a scenario where we have two
incomparable parses before one perfect parse...

====== Handling disambiguation ========
Basically, the issue occurs when we have the same nonterminal matched multiple
ways on the same range.

Easy-to-express ways of disambiguating:
*** Precedence: given two different top-level productions used, the highest one
    in the grammar is preferred. Problems with precedence:

    1. Not clear what to do when the same top-level production is used.

    2. In the case of, e.g., needing a symbol table, precedence is not good
       enough and can be actively harmful.

Associativity is easy: given two things of the form A + B, just pick the one
where the location of the '+' token is leftmost!

---------------------------------------------------
Issue with queue ordering:
What if we:
    - Add (S -> Y . X, k)
        - Queues (X -> ., k)
    - Add (X -> ., k)
        - Completes (S -> Y . X, k)
    ...
    - Add (C -> Y . X, k)
        - Queues (X -> ., k)
    - Ignore (X -> ., k) because already in set!

But we also can't just not deduplicate; consider the grammar
    S -> S S | ...
it would cause nontermination. it even causes nontermination if we only
deduplicate the queue. When is it safe to deduplicate?

Can we avoid all of this by just disallowing EPSILON in the grammars? Does that
suffice to ensure that completions alway complete states with start_idx <k? I
believe so. And lifting epsilons out can be performed automatically if we
eventually we want to do that.
https://stackoverflow.com/questions/35524183/earley-cannot-handle-epsilon-states-already-contained-in-chart
says we can do it during prediction

With that change, I think it's OK to do full deduplication: any time there's an
equivalent state either already in the set or the queue for that set, drop it.

-------------------------------------------------
Earley parser

Read character by character

Each state is going to be a single production rule + a location pointer into
that production rule's pattern + a starting/origin position

(From the Wiki page:)
1. Add new states matching possible nonterminals at this idx into the string
2. Eat this character where applicable, and add the corresponding states into
   the next-char state set
3. If a production is matched here, bring forward older productions that are
   made possible by them

Triggers work like:
1. Whenever a state is added to S(k), it triggers the addition of states for
   all productions from a certain nonterminal
2. Quickly find all states in S(k) where the location pointer is pointing to
   the given terminal
3. Quickly find all states in S(j) that are waiting on a particular nonterminal

2 and 3 can be solved together: just have a map
    (i, x) -> [states in S(i) waiting on nonterminal x]
Each state is waiting on exactly one thing, so can just do a DLL for the list

(1) is solved by just storing the productions in a table, or precompute an 'add
all' function for each table, etc.

Algo:
    k = 0
    QUEUE[0] = [(START -> ..., 0, 0)]
    for k:
        while QUEUE[k]:
            prod, loc, start_i = QUEUE[k].pop()
            if prod[loc] == END:
                # COMPLETION
                for state in WAITERS[prod.nonterm]:
                    enqueue(increment_loc(state), k)
            elif prod[loc] is NONTERM:
                # PREDICTION
                add_all_rules_for(prod[loc])
            elif prod[loc] is TERM and prod[loc] == CURRENT:
                # SCANNING
                enqueue(increment_loc(prod, loc, start_i), k+1)

To extract the parse tree, we need to add to each state a pointer for each loc
in the production to another state. That pointer is set when we perform the
'completion' action.

Disambiguation:
    Ambiguity pops up when we perform COMPLETION for the same nonterminal
    with the same starting idx and current idx multiple times.