From ffc6388571004b17e3a3dee2511ec99076ee803a Mon Sep 17 00:00:00 2001 From: Matthew Sotoudeh Date: Mon, 19 Feb 2024 02:25:28 -0800 Subject: initial dump --- DESIGN.txt | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 116 insertions(+) create mode 100644 DESIGN.txt (limited to 'DESIGN.txt') diff --git a/DESIGN.txt b/DESIGN.txt new file mode 100644 index 0000000..0762ca6 --- /dev/null +++ b/DESIGN.txt @@ -0,0 +1,116 @@ +====== Disambiguation Issues ======== +Consider two possible parses of 0 + 1 * 5 + 4 + ( ( 0 + ( 1 * 5 ) ) + 4 ) + ( 0 + ( 1 * ( 5 + 4 ) ) ) + +Problem: no two subparses apply to the same range except at the very end, but +at that point it's too hard for us to disambiguate. + +I think this is getting too messed up. The semantics we want is just: every +range has exactly one parsing as a certain nonterm. + +The easiest way to handle this is to just keep a table: + (start_idx, symbol) -> [state 1, state 2, ...] +that we've completed at this iteration. When a new completion happens, look up +in that table to see if we've already completed that (start_idx, nonterm) in +this iteration. If so, compare them and overwrite one if desired. + +BUT, if the comparison is not total, might have a scenario where we have two +incomparable parses before one perfect parse... + +====== Handling disambiguation ======== +Basically, the issue occurs when we have the same nonterminal matched multiple +ways on the same range. + +Easy-to-express ways of disambiguating: +*** Precedence: given two different top-level productions used, the highest one + in the grammar is preferred. Problems with precedence: + + 1. Not clear what to do when the same top-level production is used. + + 2. In the case of, e.g., needing a symbol table, precedence is not good + enough and can be actively harmful. + +Associativity is easy: given two things of the form A + B, just pick the one +where the location of the '+' token is leftmost! + +--------------------------------------------------- +Issue with queue ordering: +What if we: + - Add (S -> Y . X, k) + - Queues (X -> ., k) + - Add (X -> ., k) + - Completes (S -> Y . X, k) + ... + - Add (C -> Y . X, k) + - Queues (X -> ., k) + - Ignore (X -> ., k) because already in set! + +But we also can't just not deduplicate; consider the grammar + S -> S S | ... +it would cause nontermination. it even causes nontermination if we only +deduplicate the queue. When is it safe to deduplicate? + +Can we avoid all of this by just disallowing EPSILON in the grammars? Does that +suffice to ensure that completions alway complete states with start_idx [states in S(i) waiting on nonterminal x] +Each state is waiting on exactly one thing, so can just do a DLL for the list + +(1) is solved by just storing the productions in a table, or precompute an 'add +all' function for each table, etc. + +Algo: + k = 0 + QUEUE[0] = [(START -> ..., 0, 0)] + for k: + while QUEUE[k]: + prod, loc, start_i = QUEUE[k].pop() + if prod[loc] == END: + # COMPLETION + for state in WAITERS[prod.nonterm]: + enqueue(increment_loc(state), k) + elif prod[loc] is NONTERM: + # PREDICTION + add_all_rules_for(prod[loc]) + elif prod[loc] is TERM and prod[loc] == CURRENT: + # SCANNING + enqueue(increment_loc(prod, loc, start_i), k+1) + +To extract the parse tree, we need to add to each state a pointer for each loc +in the production to another state. That pointer is set when we perform the +'completion' action. + +Disambiguation: + Ambiguity pops up when we perform COMPLETION for the same nonterminal + with the same starting idx and current idx multiple times. -- cgit v1.2.3