|
Lime Parser Generator 0.1.0
Runtime-extensible LALR(1) parser with SIMD tokenization and LLVM JIT
|
This document explains the LALR(1) parsing algorithm as implemented by the Lime parser generator. It covers the theoretical foundations of LR parsing, the specific variant Lime uses, and the complete pipeline from grammar specification to executable parser tables.
LR parsing is a bottom-up parsing technique for deterministic context-free grammars. The name "LR" stands for:
An LR parser works by maintaining a stack of states and grammar symbols. At each step it consults a table indexed by the current state and the next input token (the "lookahead") to decide whether to:
The parser never backtracks. Every decision is made in constant time by a table lookup. This makes LR parsers extremely efficient: parsing time is linear in the length of the input.
Consider the grammar for simple addition:
Parsing the input 3 + 5:
The parser reads tokens left-to-right and builds the parse tree from the leaves up to the root – hence "bottom-up."
There are several variants of LR parsing, each representing a different trade-off between parser table size and grammar recognition power.
The simplest variant. The parser makes shift/reduce decisions based solely on the current state, without examining any lookahead token. Very few practical grammars are LR(0) because most require at least one token of lookahead to decide between shifting and reducing.
An LR(0) item (also called a configuration in Lime's terminology) is a production rule with a dot indicating how much of the rule has been recognized:
Uses one token of lookahead, but computes the lookahead from the grammar's FOLLOW sets rather than from the specific parsing context. SLR(1) uses the same state machine as LR(0) but adds a lookahead check for reduce actions: a reduce by rule A ::= alpha is valid only when the lookahead is in FOLLOW(A).
SLR(1) is strictly more powerful than LR(0) but still fails on some grammars that are unambiguous and deterministic.
The variant implemented by Lime (and also by yacc and bison). LALR(1) uses the same number of states as SLR(1) – that is, the LR(0) state machine – but computes more precise lookahead sets for each configuration. Instead of using the global FOLLOW set, LALR(1) computes a specific follow set for each configuration in each state, based on the actual parsing contexts that can reach that configuration.
This means LALR(1) can handle strictly more grammars than SLR(1). In practice, LALR(1) handles nearly all programming language grammars.
The key insight is that two configurations that appear identical (same rule, same dot position) but in different states may have different valid lookahead tokens. LALR(1) tracks this distinction.
The most powerful single-token-lookahead variant. Each state tracks the full lookahead set for every item, so two items that differ only in their lookahead sets are placed in separate states. This produces the maximum number of states and the largest tables, but can parse any deterministic context-free language with one token of lookahead.
Canonical LR(1) parsers are rarely used in practice because they produce dramatically more states than LALR(1) for most grammars, with little practical benefit.
LALR(1) occupies the practical sweet spot: it has the same compact state machine as LR(0)/SLR(1) but resolves almost all conflicts that canonical LR(1) can resolve.
LL and LR are the two major families of deterministic parsing. They differ fundamentally in direction: LL parsers work top-down (predicting which production to expand), while LR parsers work bottom-up (recognizing which production was completed).
| Property | LALR(1) | LL(1) / LL(k) |
|---|---|---|
| Parse direction | Bottom-up | Top-down |
| Derivation | Rightmost (reversed) | Leftmost |
| Left recursion | Handled naturally | Must be eliminated |
| Right recursion | Handled naturally | Handled naturally |
| Grammar transformations | Rarely needed | Often needed (left-factor, eliminate left recursion) |
| Lookahead | 1 token (practical) | 1 or k tokens |
| Empty productions | Handled via FIRST/FOLLOW | Require care (FIRST/FOLLOW conflicts) |
| Error recovery | Harder (stack may be deep) | Easier (stack mirrors parse tree) |
| Operator precedence | Declarative (left, right) | Encoded in grammar levels |
| Tool examples | yacc, bison, Lime | ANTLR, JavaCC, recursive descent |
| Grammar class | Strictly larger than LL(1) | Subset of LALR(1) |
Every LL(1) grammar is also LALR(1), but not vice versa. LALR(1) can handle:
expr ::= expr PLUS term. LL parsers cannot handle left recursion at all and require it to be rewritten.The main advantages of LL parsing are:
For generated parsers (as opposed to hand-written ones), LALR(1) is the standard choice. Lime generates LALR(1) parsers.
Lime follows the classical LALR(1) construction algorithm, closely related to the method described in the "Dragon Book" (Aho, Sethi, Ullman). It differs from yacc and bison in several respects:
::= instead of :. This is a syntactic convention borrowed from BNF notation.expr(A) ::= expr(B) PLUS term(C).), which names the semantic value of that symbol for use in the action code. Yacc uses positional $1, $2, etc.destructor directives that specify code to run when a symbol is popped from the stack during error recovery. This helps prevent memory leaks..c file and a single .h file. The template-based output system (via limpar.c) means the generated code is self-contained and easy to integrate.Lime uses some terminology that differs from textbook conventions:
| Lime Term | Textbook Term |
|---|---|
| Configuration | LR item |
| Basis configuration | Kernel item |
| Configuration closure | Item set closure |
| Follow set (on config) | Lookahead set (on item) |
| Propagation link | Lookahead propagation |
The main() function in lime.c (line 2048) orchestrates the parser generation through a sequence of well-defined phases. Each phase builds on the results of the previous one.
Function: Parse() (lime.c) Purpose: Read the grammar file and build the symbol and rule tables.
The parser reads the .y file and processes:
lhs ::= rhs1 rhs2 ... .)token_type, left, right, nonassoc, name, include, etc.)ifdef, ifndef, endif){ ... })Output: A populated struct lime with linked lists of struct rule and a symbol table of struct symbol entries. Each rule records its left-hand side, right-hand side symbols, action code, and line number.
Function: FindRulePrecedences() (lime.c:950) Purpose: Assign a precedence symbol to each production rule.
For rules that do not have an explicit prec directive, the precedence is inherited from the rightmost terminal symbol in the rule's right-hand side that has a defined precedence. This is the same convention used by yacc.
If no terminal in the rule has a defined precedence, the rule has no precedence and cannot participate in precedence-based conflict resolution.
Function: FindFirstSets() (lime.c:979) Purpose: Compute FIRST sets and lambda (nullable) flags for all nonterminals.
The FIRST set of a nonterminal A is the set of terminal symbols that can appear as the first symbol of any string derived from A. A nonterminal is "lambda" (nullable) if it can derive the empty string.
The algorithm uses a fixed-point iteration:
Phase 1 – Compute lambda flags:
Phase 2 – Compute FIRST sets:
Both phases iterate until a fixed point is reached. Since each iteration can only add elements (never remove them) and the sets are bounded, the algorithm always terminates.
FIRST sets are represented as bit vectors (one bit per terminal symbol), using the SetNew(), SetAdd(), and SetUnion() functions in the set.c module. The bit-vector representation makes set union a fast bitwise-OR operation.
Function: FindStates() (lime.c:1042) Purpose: Construct all LR(0) states and establish follow-set propagation links.
This is the core of the LALR(1) construction. It builds the LR(0) state machine (also known as the LR(0) automaton or the "collection of sets of items").
The algorithm works as follows:
for every rule with the start symbol on the left-hand side. The$` end-of-input marker is seeded into the follow set.[A ::= alpha . B beta, ...] and B is a nonterminal, add [B ::= . gamma, FIRST(beta)] for every rule B ::= gamma.beta can derive the empty string (all symbols in beta are lambda), establish a propagation link from the parent configuration to the new configuration.buildshifts().The process continues until no new states are created. The result is a complete LR(0) automaton with propagation links threading through it.
Function: FindLinks() (lime.c:1217) Purpose: Convert backward propagation links into forward links.
During state construction (Step 4), propagation links are recorded as backward links (from new configuration back to the configuration that generated it). This step:
bplp) into forward links (fplp).Forward links are what the follow-set computation needs: when a configuration's follow set changes, the change propagates forward along the links to all dependent configurations.
Function: FindFollowSets() (lime.c:1252) Purpose: Propagate follow sets to a fixed point.
This step implements the LALR(1) follow-set computation. It iterates over all configurations in all states, propagating follow-set information along the forward links established in Step 5:
The fixed-point iteration terminates because follow sets can only grow (elements are added, never removed) and are bounded by the number of terminal symbols.
After this step, every configuration that has its dot at the end of a rule (i.e., is ready to reduce) has a precise LALR(1) follow set – the set of terminals that can legally appear after the rule is reduced, given the specific parsing contexts that can reach this configuration.
Function: FindActions() (lime.c:1290) Purpose: Build the action table and resolve conflicts.
This step examines every configuration in every state to determine the parser actions:
buildshifts() creates a transition from state S to state T on symbol X, it records a SHIFT action for state S on lookahead X.resolve_conflict() function attempts to resolve each conflict using precedence and associativity rules (see Conflict Resolution below).Function: CompressTables() (lime.c:6616) Purpose: Reduce table size through default actions and action merging.
This step applies three optimizations:
Default reduce actions: For each state, find the most common reduce action. Make it the default action for that state (stored in yy_default[]). All other entries for that reduce action are marked NOT_USED, shrinking the per-state action list. If a state would use the wildcard token, no default is applied to preserve correct error detection.
Auto-reduce states: If a state's only possible actions are reduces by a single rule (after defaults are applied), the state is marked as an "auto-reduce" state.
SHIFTREDUCE optimization: Any SHIFT action that transitions to an auto-reduce state is converted to a SHIFTREDUCE action. This combines two operations (shift then immediately reduce) into one, avoiding the overhead of entering and immediately leaving a state:
Single-RHS optimization: If a SHIFTREDUCE targets a rule with a single right-hand-side symbol and no associated C action code, the action is further simplified. The shift-reduce-goto chain collapses to a direct transition to the ultimate target state.
Function: ResortStates() (lime.c:6756) Purpose: Reorder states to minimize table size.
States are sorted so that states with more actions appear first (lower state numbers). Since the action table packing algorithm places entries for lower-numbered states first, putting busy states early gives them the best chance of finding overlapping entries in the packed table.
State 0 is always kept as state 0 (it is the start state).
States that are pure auto-reduce states (with no explicit actions) are placed at the end. These "degenerate" states are counted separately as nxstate (the number of non-degenerate states). The generated arrays yy_shift_ofst[] and yy_reduce_ofst[] only need entries for the first nxstate states, saving space.
Function: ReportTable() (lime.c) Purpose: Generate the final C source code and header file.
This step uses the template file (limpar.c) to produce the output:
acttab module. For each non-degenerate state, the terminal actions and nonterminal actions are inserted into a shared yy_action[] array, with the acttab_insert() function finding the optimal placement (see Table Compression).yy_action[], yy_lookahead[], yy_shift_ofst[], yy_reduce_ofst[], yy_default[], and yy_reduce_rule[].switch statement with a case for each rule that has associated C code.Parse, ParseAlloc, ParseFree, etc.).The state machine (LR(0) automaton) is the core data structure produced by Lime. Understanding its construction is key to understanding LALR(1) parsing.
A configuration is a production rule with a dot. In Lime, this is represented by struct config:
A configuration [A ::= X . Y Z] means "we have seen X and expect to see
Y Z to complete a reduction by rule A ::= X Y Z."
Given a set of basis configurations (also called kernel items), the closure adds all configurations reachable by expanding nonterminals.
If the set contains [A ::= alpha . B beta] where B is a nonterminal, then for every rule B ::= gamma, add [B ::= . gamma] to the set.
Lime implements closure in Configlist_closure() (lime.c:1540). During closure, it also computes partial follow-set information:
The GOTO function maps a (state, symbol) pair to a new state. Given state S and symbol X, GOTO(S, X) is the state whose basis consists of all configurations from S where the dot is advanced past X:
Lime computes GOTO implicitly in buildshifts() (lime.c:1164). For each symbol X that appears after the dot in some configuration of state S, it collects all such configurations, advances their dots, and looks up (or creates) the resulting state.
Two states are considered identical if they have the same basis (kernel) configurations. Lime uses a hash table (State_find()) keyed on the sorted basis to detect duplicate states efficiently.
When a duplicate is found, the propagation links from the new (duplicate) state are merged into the existing state. This is how LALR(1) information flows: the same LR(0) state may be reached from different parsing contexts, and the follow sets are the union of all contexts.
Consider this grammar:
The LR(0) state machine has the following states (showing basis configurations only):
Transitions:
When two or more actions apply to the same (state, lookahead) pair, the parser has a conflict. Lime detects and resolves conflicts in the resolve_conflict() function (lime.c:1379).
Shift/Reduce (SR) conflict: The parser can either shift the lookahead token or reduce by some rule. This is the most common kind of conflict.
Reduce/Reduce (RR) conflict: The parser can reduce by two different rules. This usually indicates a grammar design problem.
Shift/Shift (SS) conflict: Two shift actions for the same lookahead in the same state. This is rare and typically arises only with multi-terminal symbols.
Lime resolves shift/reduce conflicts using the precedence and associativity declared by left, right, and nonassoc directives:
prec directive).Resolution rules:
| Condition | Resolution |
|---|---|
| Shift precedence > reduce precedence | Shift wins |
| Shift precedence < reduce precedence | Reduce wins |
| Equal precedence, RIGHT associativity | Shift wins |
| Equal precedence, LEFT associativity | Reduce wins |
| Equal precedence, NONASSOC | Error action (parse fails) |
| No precedence defined for either | Conflict (reported as warning) |
Example from the classic expression grammar:
The rule expr ::= expr PLUS expr. with lookahead TIMES is resolved as a shift because TIMES has higher precedence than PLUS.
The rule expr ::= expr PLUS expr. with lookahead PLUS is resolved as a reduce because PLUS has left associativity (equal precedence, left-associative means reduce).
Lime resolves reduce/reduce conflicts using precedence when possible. If both rules have precedence symbols, the rule with higher precedence wins. If precedences are equal or undefined, the conflict is reported.
Unlike yacc (which silently resolves RR conflicts by preferring the earlier rule), Lime reports all reduce/reduce conflicts as errors. This is a deliberate design choice to avoid hiding grammar ambiguities.
After resolution, each action is tagged with one of these types:
Resolved conflicts (SH_RESOLVED, RD_RESOLVED) are kept in the action list for reporting purposes but are not emitted into the parser tables. They appear in the .out report file so the grammar author can verify that precedence resolved conflicts as intended.
The generated parser uses a compact table representation that balances lookup speed against memory usage. Understanding this representation is important for understanding the generated code and for performance tuning.
The parser runtime uses five parallel arrays:
| Array | Type | Indexed by | Purpose |
|---|---|---|---|
yy_action[] | int | Computed offset | Action to take (shift/reduce/error) |
yy_lookahead[] | int | Same as yy_action | Expected token for verification |
yy_shift_ofst[] | int | State number | Offset for terminal lookups |
yy_reduce_ofst[] | int | State number | Offset for nonterminal lookups |
yy_default[] | int | State number | Default action when no match |
To find the action for state S and terminal token T:
For nonterminal GOTO lookups (after a reduce), the same scheme is used with yy_reduce_ofst[] instead of yy_shift_ofst[].
The acttab module (lime.c:729) implements the action table packing algorithm. The yy_action[] and yy_lookahead[] arrays are shared among all states. Each state's actions are a contiguous range within the array, but ranges from different states can overlap as long as they do not conflict.
The packing algorithm (in acttab_insert(), lime.c:820) works as follows:
The verification field yy_lookahead[] makes this safe: even if two states share an offset, a false match is detected because the expected lookahead will not match.
For terminal lookups (makeItSafe = true), the offset is chosen so that any lookahead value – even an invalid one from erroneous input – will index within the array bounds. This prevents out-of-bounds reads. For nonterminal lookups (makeItSafe = false), offsets can be more aggressive because nonterminals are generated internally and are always valid.
Before table packing, states are reordered (Step 9) so that states with more actions come first. Since the packing algorithm processes states in order, this gives the busiest states the best chance of finding overlapping positions, leading to a smaller packed table.
Auto-reduce states (states that always reduce by the same rule) are eliminated from the action table entirely. Their behavior is encoded as SHIFTREDUCE actions in the states that transition to them, so they never need to be looked up.
The combination of these techniques typically reduces table size by 50-80% compared to a naive two-dimensional array. For a grammar with N states and T terminal symbols, the naive table would require N*T entries. The packed table is typically O(N + T) entries.
The parser generated by Lime is a push-down automaton implemented as a C function. The generated code follows a standard structure.
The core loop of the generated parser:
Lime's error recovery follows the standard yacc approach:
error token.error token.The destructor directives ensure that semantic values popped during error recovery are properly cleaned up.
| Phase | Complexity | Notes |
|---|---|---|
| Parse grammar | O(G) | G = grammar size |
| FindFirstSets | O(R * T) | R = rules, T = terminals; fixed-point |
| FindStates | O(S * C) | S = states, C = avg configurations/state |
| FindFollowSets | O(S * C * T) | Fixed-point over propagation links |
| FindActions | O(S * C * T) | Adding reduce actions + conflict resolution |
| CompressTables | O(S * A) | A = avg actions/state |
| Table packing | O(S * T * N) | N = table size; finding overlaps |
| Total generation | O(S * C * T) | Dominated by follow-set computation |
| Structure | Size | Notes |
|---|---|---|
| States | O(S) | S typically O(R) to O(R^2) |
| Configurations | O(S * C) | C is rule length dependent |
| FIRST sets | O(N * T) bits | N nonterminals, T terminals |
| Follow sets | O(S * C * T) bits | One set per configuration |
| Packed tables | O(S + T) | After compression |
The generated parser runs in:
The table-driven approach means the generated parser has a very small code footprint and excellent cache behavior. The action table lookup involves reading three array entries (offset, lookahead, action) which are likely to be in the L1 cache for active states.
acttab module.