====== Disambiguation Issues ======== Consider two possible parses of 0 + 1 * 5 + 4 ( ( 0 + ( 1 * 5 ) ) + 4 ) ( 0 + ( 1 * ( 5 + 4 ) ) ) Problem: no two subparses apply to the same range except at the very end, but at that point it's too hard for us to disambiguate. I think this is getting too messed up. The semantics we want is just: every range has exactly one parsing as a certain nonterm. The easiest way to handle this is to just keep a table: (start_idx, symbol) -> [state 1, state 2, ...] that we've completed at this iteration. When a new completion happens, look up in that table to see if we've already completed that (start_idx, nonterm) in this iteration. If so, compare them and overwrite one if desired. BUT, if the comparison is not total, might have a scenario where we have two incomparable parses before one perfect parse... ====== Handling disambiguation ======== Basically, the issue occurs when we have the same nonterminal matched multiple ways on the same range. Easy-to-express ways of disambiguating: *** Precedence: given two different top-level productions used, the highest one in the grammar is preferred. Problems with precedence: 1. Not clear what to do when the same top-level production is used. 2. In the case of, e.g., needing a symbol table, precedence is not good enough and can be actively harmful. Associativity is easy: given two things of the form A + B, just pick the one where the location of the '+' token is leftmost! --------------------------------------------------- Issue with queue ordering: What if we: - Add (S -> Y . X, k) - Queues (X -> ., k) - Add (X -> ., k) - Completes (S -> Y . X, k) ... - Add (C -> Y . X, k) - Queues (X -> ., k) - Ignore (X -> ., k) because already in set! But we also can't just not deduplicate; consider the grammar S -> S S | ... it would cause nontermination. it even causes nontermination if we only deduplicate the queue. When is it safe to deduplicate? Can we avoid all of this by just disallowing EPSILON in the grammars? Does that suffice to ensure that completions alway complete states with start_idx [states in S(i) waiting on nonterminal x] Each state is waiting on exactly one thing, so can just do a DLL for the list (1) is solved by just storing the productions in a table, or precompute an 'add all' function for each table, etc. Algo: k = 0 QUEUE[0] = [(START -> ..., 0, 0)] for k: while QUEUE[k]: prod, loc, start_i = QUEUE[k].pop() if prod[loc] == END: # COMPLETION for state in WAITERS[prod.nonterm]: enqueue(increment_loc(state), k) elif prod[loc] is NONTERM: # PREDICTION add_all_rules_for(prod[loc]) elif prod[loc] is TERM and prod[loc] == CURRENT: # SCANNING enqueue(increment_loc(prod, loc, start_i), k+1) To extract the parse tree, we need to add to each state a pointer for each loc in the production to another state. That pointer is set when we perform the 'completion' action. Disambiguation: Ambiguity pops up when we perform COMPLETION for the same nonterminal with the same starting idx and current idx multiple times.