initial dump

author: Matthew Sotoudeh <matthew@masot.net> 2024-02-19 02:25:28 -0800
committer: Matthew Sotoudeh <matthew@masot.net> 2024-02-19 02:25:28 -0800
commit: ffc6388571004b17e3a3dee2511ec99076ee803a (patch)
tree: ba14dfc0fc0b1beae949e12b8853fa690d588d2a /DESIGN.txt
1 files changed, 116 insertions, 0 deletions
diff --git a/DESIGN.txt b/DESIGN.txt
new file mode 100644
index 0000000..0762ca6
--- /dev/null
+++ b/DESIGN.txt
@@ -0,0 +1,116 @@
+====== Disambiguation Issues ========
+Consider two possible parses of 0 + 1 * 5 + 4
+    ( ( 0 + ( 1 *   5 ) ) + 4     )
+    (   0 + ( 1 * ( 5     + 4 ) ) )
+
+Problem: no two subparses apply to the same range except at the very end, but
+at that point it's too hard for us to disambiguate.
+
+I think this is getting too messed up. The semantics we want is just: every
+range has exactly one parsing as a certain nonterm.
+
+The easiest way to handle this is to just keep a table:
+    (start_idx, symbol) -> [state 1, state 2, ...]
+that we've completed at this iteration. When a new completion happens, look up
+in that table to see if we've already completed that (start_idx, nonterm) in
+this iteration. If so, compare them and overwrite one if desired.
+
+BUT, if the comparison is not total, might have a scenario where we have two
+incomparable parses before one perfect parse...
+
+====== Handling disambiguation ========
+Basically, the issue occurs when we have the same nonterminal matched multiple
+ways on the same range.
+
+Easy-to-express ways of disambiguating:
+*** Precedence: given two different top-level productions used, the highest one
+    in the grammar is preferred. Problems with precedence:
+
+    1. Not clear what to do when the same top-level production is used.
+
+    2. In the case of, e.g., needing a symbol table, precedence is not good
+       enough and can be actively harmful.
+
+Associativity is easy: given two things of the form A + B, just pick the one
+where the location of the '+' token is leftmost!
+
+---------------------------------------------------
+Issue with queue ordering:
+What if we:
+    - Add (S -> Y . X, k)
+        - Queues (X -> ., k)
+    - Add (X -> ., k)
+        - Completes (S -> Y . X, k)
+    ...
+    - Add (C -> Y . X, k)
+        - Queues (X -> ., k)
+    - Ignore (X -> ., k) because already in set!
+
+But we also can't just not deduplicate; consider the grammar
+    S -> S S | ...
+it would cause nontermination. it even causes nontermination if we only
+deduplicate the queue. When is it safe to deduplicate?
+
+Can we avoid all of this by just disallowing EPSILON in the grammars? Does that
+suffice to ensure that completions alway complete states with start_idx <k? I
+believe so. And lifting epsilons out can be performed automatically if we
+eventually we want to do that.
+https://stackoverflow.com/questions/35524183/earley-cannot-handle-epsilon-states-already-contained-in-chart
+says we can do it during prediction
+
+With that change, I think it's OK to do full deduplication: any time there's an
+equivalent state either already in the set or the queue for that set, drop it.
+
+-------------------------------------------------
+Earley parser
+
+Read character by character
+
+Each state is going to be a single production rule + a location pointer into
+that production rule's pattern + a starting/origin position
+
+(From the Wiki page:)
+1. Add new states matching possible nonterminals at this idx into the string
+2. Eat this character where applicable, and add the corresponding states into
+   the next-char state set
+3. If a production is matched here, bring forward older productions that are
+   made possible by them
+
+Triggers work like:
+1. Whenever a state is added to S(k), it triggers the addition of states for
+   all productions from a certain nonterminal
+2. Quickly find all states in S(k) where the location pointer is pointing to
+   the given terminal
+3. Quickly find all states in S(j) that are waiting on a particular nonterminal
+
+2 and 3 can be solved together: just have a map
+    (i, x) -> [states in S(i) waiting on nonterminal x]
+Each state is waiting on exactly one thing, so can just do a DLL for the list
+
+(1) is solved by just storing the productions in a table, or precompute an 'add
+all' function for each table, etc.
+
+Algo:
+    k = 0
+    QUEUE[0] = [(START -> ..., 0, 0)]
+    for k:
+        while QUEUE[k]:
+            prod, loc, start_i = QUEUE[k].pop()
+            if prod[loc] == END:
+                # COMPLETION
+                for state in WAITERS[prod.nonterm]:
+                    enqueue(increment_loc(state), k)
+            elif prod[loc] is NONTERM:
+                # PREDICTION
+                add_all_rules_for(prod[loc])
+            elif prod[loc] is TERM and prod[loc] == CURRENT:
+                # SCANNING
+                enqueue(increment_loc(prod, loc, start_i), k+1)
+
+To extract the parse tree, we need to add to each state a pointer for each loc
+in the production to another state. That pointer is set when we perform the
+'completion' action.
+
+Disambiguation:
+    Ambiguity pops up when we perform COMPLETION for the same nonterminal
+    with the same starting idx and current idx multiple times.
author	Matthew Sotoudeh <matthew@masot.net>	2024-02-19 02:25:28 -0800
committer	Matthew Sotoudeh <matthew@masot.net>	2024-02-19 02:25:28 -0800
commit	ffc6388571004b17e3a3dee2511ec99076ee803a (patch)
tree	ba14dfc0fc0b1beae949e12b8853fa690d588d2a /DESIGN.txt