====== Performance Issues ======== The biggest performance issues involve parsing grammar LIST nonterm LIST , EXPR EXPR With a huge input of A , A , A , A , ... Naively, the parser will try to match an EXPR starting at each of the As. I guess this is because even if you are associating so far like (((A, A), A), A), ... It doesn't know if, later on in the parse, you might want to reassociate 'across the boundary': (((A, A), A), (A, B)), ... At this point, I don't think it's worth trying to do much more general-purpose optimization for this case. Instead, I think we should have callouts to user code to give hints. Two possible hint types: 1. "Parse this region as this parse tree." Easy to use: when you get there, just skip all of those indices and complete that tree with any watchers. (Might need to do prediction at the first index.) Actually, that's probably by far the easiest. ====== More Disambiguation Issues ======== What if we poison X, use it in completion Y, but then overwrite X with a nonpoisoned Z? Then Y will be incorrectly poisoned... ====== More Disambiguation Issues ======== Consider STMTS nonterm .start STMTS STMT STMT vs. STMTS nonterm .start STMT STMTS STMT Swapping this can actually impact what matches happen in STMT :O ====== More Disambiguation Issues ======== Associativity and precedence can have weird interplay E.g., maybe you can get Stmts(Error(1), Stmt(2, 3)) which has better associativity than Stmts(Stmt(1, 2), Stmt(3)) ====== Disambiguation Issues ======== Consider two possible parses of 0 + 1 * 5 + 4 ( ( 0 + ( 1 * 5 ) ) + 4 ) ( 0 + ( 1 * ( 5 + 4 ) ) ) Problem: no two subparses apply to the same range except at the very end, but at that point it's too hard for us to disambiguate. I think this is getting too messed up. The semantics we want is just: every range has exactly one parsing as a certain nonterm. The easiest way to handle this is to just keep a table: (start_idx, symbol) -> [state 1, state 2, ...] that we've completed at this iteration. When a new completion happens, look up in that table to see if we've already completed that (start_idx, nonterm) in this iteration. If so, compare them and overwrite one if desired. BUT, if the comparison is not total, might have a scenario where we have two incomparable parses before one perfect parse... ====== Handling disambiguation ======== Basically, the issue occurs when we have the same nonterminal matched multiple ways on the same range. Easy-to-express ways of disambiguating: *** Precedence: given two different top-level productions used, the highest one in the grammar is preferred. Problems with precedence: 1. Not clear what to do when the same top-level production is used. 2. In the case of, e.g., needing a symbol table, precedence is not good enough and can be actively harmful. Associativity is easy: given two things of the form A + B, just pick the one where the location of the '+' token is leftmost! --------------------------------------------------- Issue with queue ordering: What if we: - Add (S -> Y . X, k) - Queues (X -> ., k) - Add (X -> ., k) - Completes (S -> Y . X, k) ... - Add (C -> Y . X, k) - Queues (X -> ., k) - Ignore (X -> ., k) because already in set! But we also can't just not deduplicate; consider the grammar S -> S S | ... it would cause nontermination. it even causes nontermination if we only deduplicate the queue. When is it safe to deduplicate? Can we avoid all of this by just disallowing EPSILON in the grammars? Does that suffice to ensure that completions alway complete states with start_idx [states in S(i) waiting on nonterminal x] Each state is waiting on exactly one thing, so can just do a DLL for the list (1) is solved by just storing the productions in a table, or precompute an 'add all' function for each table, etc. Algo: k = 0 QUEUE[0] = [(START -> ..., 0, 0)] for k: while QUEUE[k]: prod, loc, start_i = QUEUE[k].pop() if prod[loc] == END: # COMPLETION for state in WAITERS[prod.nonterm]: enqueue(increment_loc(state), k) elif prod[loc] is NONTERM: # PREDICTION add_all_rules_for(prod[loc]) elif prod[loc] is TERM and prod[loc] == CURRENT: # SCANNING enqueue(increment_loc(prod, loc, start_i), k+1) To extract the parse tree, we need to add to each state a pointer for each loc in the production to another state. That pointer is set when we perform the 'completion' action. Disambiguation: Ambiguity pops up when we perform COMPLETION for the same nonterminal with the same starting idx and current idx multiple times.