summaryrefslogtreecommitdiff
path: root/DESIGN.txt
diff options
context:
space:
mode:
authorMatthew Sotoudeh <matthew@masot.net>2024-02-19 02:25:28 -0800
committerMatthew Sotoudeh <matthew@masot.net>2024-02-19 02:25:28 -0800
commitffc6388571004b17e3a3dee2511ec99076ee803a (patch)
treeba14dfc0fc0b1beae949e12b8853fa690d588d2a /DESIGN.txt
initial dump
Diffstat (limited to 'DESIGN.txt')
-rw-r--r--DESIGN.txt116
1 files changed, 116 insertions, 0 deletions
diff --git a/DESIGN.txt b/DESIGN.txt
new file mode 100644
index 0000000..0762ca6
--- /dev/null
+++ b/DESIGN.txt
@@ -0,0 +1,116 @@
+====== Disambiguation Issues ========
+Consider two possible parses of 0 + 1 * 5 + 4
+ ( ( 0 + ( 1 * 5 ) ) + 4 )
+ ( 0 + ( 1 * ( 5 + 4 ) ) )
+
+Problem: no two subparses apply to the same range except at the very end, but
+at that point it's too hard for us to disambiguate.
+
+I think this is getting too messed up. The semantics we want is just: every
+range has exactly one parsing as a certain nonterm.
+
+The easiest way to handle this is to just keep a table:
+ (start_idx, symbol) -> [state 1, state 2, ...]
+that we've completed at this iteration. When a new completion happens, look up
+in that table to see if we've already completed that (start_idx, nonterm) in
+this iteration. If so, compare them and overwrite one if desired.
+
+BUT, if the comparison is not total, might have a scenario where we have two
+incomparable parses before one perfect parse...
+
+====== Handling disambiguation ========
+Basically, the issue occurs when we have the same nonterminal matched multiple
+ways on the same range.
+
+Easy-to-express ways of disambiguating:
+*** Precedence: given two different top-level productions used, the highest one
+ in the grammar is preferred. Problems with precedence:
+
+ 1. Not clear what to do when the same top-level production is used.
+
+ 2. In the case of, e.g., needing a symbol table, precedence is not good
+ enough and can be actively harmful.
+
+Associativity is easy: given two things of the form A + B, just pick the one
+where the location of the '+' token is leftmost!
+
+---------------------------------------------------
+Issue with queue ordering:
+What if we:
+ - Add (S -> Y . X, k)
+ - Queues (X -> ., k)
+ - Add (X -> ., k)
+ - Completes (S -> Y . X, k)
+ ...
+ - Add (C -> Y . X, k)
+ - Queues (X -> ., k)
+ - Ignore (X -> ., k) because already in set!
+
+But we also can't just not deduplicate; consider the grammar
+ S -> S S | ...
+it would cause nontermination. it even causes nontermination if we only
+deduplicate the queue. When is it safe to deduplicate?
+
+Can we avoid all of this by just disallowing EPSILON in the grammars? Does that
+suffice to ensure that completions alway complete states with start_idx <k? I
+believe so. And lifting epsilons out can be performed automatically if we
+eventually we want to do that.
+https://stackoverflow.com/questions/35524183/earley-cannot-handle-epsilon-states-already-contained-in-chart
+says we can do it during prediction
+
+With that change, I think it's OK to do full deduplication: any time there's an
+equivalent state either already in the set or the queue for that set, drop it.
+
+-------------------------------------------------
+Earley parser
+
+Read character by character
+
+Each state is going to be a single production rule + a location pointer into
+that production rule's pattern + a starting/origin position
+
+(From the Wiki page:)
+1. Add new states matching possible nonterminals at this idx into the string
+2. Eat this character where applicable, and add the corresponding states into
+ the next-char state set
+3. If a production is matched here, bring forward older productions that are
+ made possible by them
+
+Triggers work like:
+1. Whenever a state is added to S(k), it triggers the addition of states for
+ all productions from a certain nonterminal
+2. Quickly find all states in S(k) where the location pointer is pointing to
+ the given terminal
+3. Quickly find all states in S(j) that are waiting on a particular nonterminal
+
+2 and 3 can be solved together: just have a map
+ (i, x) -> [states in S(i) waiting on nonterminal x]
+Each state is waiting on exactly one thing, so can just do a DLL for the list
+
+(1) is solved by just storing the productions in a table, or precompute an 'add
+all' function for each table, etc.
+
+Algo:
+ k = 0
+ QUEUE[0] = [(START -> ..., 0, 0)]
+ for k:
+ while QUEUE[k]:
+ prod, loc, start_i = QUEUE[k].pop()
+ if prod[loc] == END:
+ # COMPLETION
+ for state in WAITERS[prod.nonterm]:
+ enqueue(increment_loc(state), k)
+ elif prod[loc] is NONTERM:
+ # PREDICTION
+ add_all_rules_for(prod[loc])
+ elif prod[loc] is TERM and prod[loc] == CURRENT:
+ # SCANNING
+ enqueue(increment_loc(prod, loc, start_i), k+1)
+
+To extract the parse tree, we need to add to each state a pointer for each loc
+in the production to another state. That pointer is set when we perform the
+'completion' action.
+
+Disambiguation:
+ Ambiguity pops up when we perform COMPLETION for the same nonterminal
+ with the same starting idx and current idx multiple times.
generated by cgit on debian on lair
contact matthew@masot.net with questions or feedback