diff options
author | Matthew Sotoudeh <matthew@masot.net> | 2024-02-19 02:25:28 -0800 |
---|---|---|
committer | Matthew Sotoudeh <matthew@masot.net> | 2024-02-19 02:25:28 -0800 |
commit | ffc6388571004b17e3a3dee2511ec99076ee803a (patch) | |
tree | ba14dfc0fc0b1beae949e12b8853fa690d588d2a /DESIGN.txt |
initial dump
Diffstat (limited to 'DESIGN.txt')
-rw-r--r-- | DESIGN.txt | 116 |
1 files changed, 116 insertions, 0 deletions
diff --git a/DESIGN.txt b/DESIGN.txt new file mode 100644 index 0000000..0762ca6 --- /dev/null +++ b/DESIGN.txt @@ -0,0 +1,116 @@ +====== Disambiguation Issues ======== +Consider two possible parses of 0 + 1 * 5 + 4 + ( ( 0 + ( 1 * 5 ) ) + 4 ) + ( 0 + ( 1 * ( 5 + 4 ) ) ) + +Problem: no two subparses apply to the same range except at the very end, but +at that point it's too hard for us to disambiguate. + +I think this is getting too messed up. The semantics we want is just: every +range has exactly one parsing as a certain nonterm. + +The easiest way to handle this is to just keep a table: + (start_idx, symbol) -> [state 1, state 2, ...] +that we've completed at this iteration. When a new completion happens, look up +in that table to see if we've already completed that (start_idx, nonterm) in +this iteration. If so, compare them and overwrite one if desired. + +BUT, if the comparison is not total, might have a scenario where we have two +incomparable parses before one perfect parse... + +====== Handling disambiguation ======== +Basically, the issue occurs when we have the same nonterminal matched multiple +ways on the same range. + +Easy-to-express ways of disambiguating: +*** Precedence: given two different top-level productions used, the highest one + in the grammar is preferred. Problems with precedence: + + 1. Not clear what to do when the same top-level production is used. + + 2. In the case of, e.g., needing a symbol table, precedence is not good + enough and can be actively harmful. + +Associativity is easy: given two things of the form A + B, just pick the one +where the location of the '+' token is leftmost! + +--------------------------------------------------- +Issue with queue ordering: +What if we: + - Add (S -> Y . X, k) + - Queues (X -> ., k) + - Add (X -> ., k) + - Completes (S -> Y . X, k) + ... + - Add (C -> Y . X, k) + - Queues (X -> ., k) + - Ignore (X -> ., k) because already in set! + +But we also can't just not deduplicate; consider the grammar + S -> S S | ... +it would cause nontermination. it even causes nontermination if we only +deduplicate the queue. When is it safe to deduplicate? + +Can we avoid all of this by just disallowing EPSILON in the grammars? Does that +suffice to ensure that completions alway complete states with start_idx <k? I +believe so. And lifting epsilons out can be performed automatically if we +eventually we want to do that. +https://stackoverflow.com/questions/35524183/earley-cannot-handle-epsilon-states-already-contained-in-chart +says we can do it during prediction + +With that change, I think it's OK to do full deduplication: any time there's an +equivalent state either already in the set or the queue for that set, drop it. + +------------------------------------------------- +Earley parser + +Read character by character + +Each state is going to be a single production rule + a location pointer into +that production rule's pattern + a starting/origin position + +(From the Wiki page:) +1. Add new states matching possible nonterminals at this idx into the string +2. Eat this character where applicable, and add the corresponding states into + the next-char state set +3. If a production is matched here, bring forward older productions that are + made possible by them + +Triggers work like: +1. Whenever a state is added to S(k), it triggers the addition of states for + all productions from a certain nonterminal +2. Quickly find all states in S(k) where the location pointer is pointing to + the given terminal +3. Quickly find all states in S(j) that are waiting on a particular nonterminal + +2 and 3 can be solved together: just have a map + (i, x) -> [states in S(i) waiting on nonterminal x] +Each state is waiting on exactly one thing, so can just do a DLL for the list + +(1) is solved by just storing the productions in a table, or precompute an 'add +all' function for each table, etc. + +Algo: + k = 0 + QUEUE[0] = [(START -> ..., 0, 0)] + for k: + while QUEUE[k]: + prod, loc, start_i = QUEUE[k].pop() + if prod[loc] == END: + # COMPLETION + for state in WAITERS[prod.nonterm]: + enqueue(increment_loc(state), k) + elif prod[loc] is NONTERM: + # PREDICTION + add_all_rules_for(prod[loc]) + elif prod[loc] is TERM and prod[loc] == CURRENT: + # SCANNING + enqueue(increment_loc(prod, loc, start_i), k+1) + +To extract the parse tree, we need to add to each state a pointer for each loc +in the production to another state. That pointer is set when we perform the +'completion' action. + +Disambiguation: + Ambiguity pops up when we perform COMPLETION for the same nonterminal + with the same starting idx and current idx multiple times. |