Lime Parser Generator 0.1.0
Runtime-extensible LALR(1) parser with SIMD tokenization and LLVM JIT
This guide explains how to develop extensions for the Lime extensible parser. Extensions add new tokens, grammar rules, and precedence modifications to an existing parser at runtime, without recompiling the base grammar.
An extension is a set of grammar modifications bundled with lifecycle callbacks. The extension system manages registration, loading (activation), conflict resolution, and unloading.
Registration declares the extension to the system without activating it. The extension system copies the name and version strings and assigns a unique ExtensionID.
Loading activates the extension. The system calls your get_modifications callback, passing the current base snapshot so you can inspect the existing grammar. Your callback fills in the modifications array.
Unloading deactivates the extension and removes its contributions. If you provided an on_unload callback, it is called so you can free resources.
When the registry is destroyed, all extensions are unloaded automatically.
Each modification is a GrammarModification struct with a type field and a type-specific payload in the u union.
Adds a new terminal symbol to the parser's token table.
Fields:
- name – The symbolic name used in grammar rules (e.g. "ARROW").
- lexeme – The literal text the tokenizer should recognize. May be NULL if tokenization is handled externally.
- token_code – Set to -1 to let the system assign one automatically. Set to a specific value to force a particular code (use with caution – may conflict with existing tokens).

Adds a new production rule.
Fields:
- lhs – The non-terminal being defined. Must already exist in the base grammar or be added by a preceding MOD_ADD_TYPE modification.
- rhs – NULL-terminated array of symbol names. Can reference terminals (uppercase) or non-terminals (lowercase) from the base grammar or from other modifications in the same extension.
- nrhs – Count of RHS symbols (not counting the NULL terminator).
- code – Generator-time C code string. Used when the modification is consumed by a static grammar compile (fed through lime as text). Inside the code block, A refers to the LHS's value slot and B, C, ... refer to the RHS symbols in order.
- reduce – Optional runtime reduce callback of type LimeReduceFn (see below). When set, the parser invokes this function at reduction time instead of dispatching through the generated reduce() switch. This is the mechanism for extensions loaded into an already-compiled parser (where the switch is fixed).
- reduce_user – Opaque pointer passed through to reduce as its first argument. Useful for carrying extension-private context.
- precedence – Override precedence for the rule. Set to -1 to use the default behavior (inherit from the rightmost terminal).

Action-source precedence: if reduce is non-NULL, the parser uses it (and code is treated as documentation / generator-time fallback). If both are NULL, the rule reduces with no action.
The callback must write the LHS value into *lhs_out before returning. The extension owns any memory referenced from that value; Lime treats slot payloads as opaque. See docs/API.md for the full contract.
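As a rough illustration of this contract, a runtime reduce callback for a binary operator rule might look like the sketch below. The LimeReduceFn signature shown here, the LimeSlot layout, and the JsonbCtx and BinaryNode types are assumptions made so the example compiles on its own; docs/API.md remains the reference for the real definitions.

```c
#include <stdlib.h>

/* Hypothetical stand-ins: the real slot type and LimeReduceFn signature
 * are defined by Lime (see docs/API.md). */
typedef struct { void *ptr; } LimeSlot;
typedef int (*LimeReduceFn)(void *user, const LimeSlot *rhs, int nrhs,
                            LimeSlot *lhs_out);

/* Extension-private context, carried through reduce_user. */
typedef struct { int nodes_built; } JsonbCtx;

/* Extension-owned node for "left OP right". */
typedef struct { void *left; void *right; } BinaryNode;

/* Reduce callback for a rule shaped like: a_expr ::= a_expr ARROW a_expr */
static int jsonb_arrow_reduce(void *user, const LimeSlot *rhs, int nrhs,
                              LimeSlot *lhs_out)
{
    JsonbCtx *ctx = user;
    if (nrhs != 3) return -1;          /* unexpected RHS shape */

    BinaryNode *node = malloc(sizeof *node);
    if (node == NULL) return -1;
    node->left  = rhs[0].ptr;          /* first RHS symbol (B) */
    node->right = rhs[2].ptr;          /* third RHS symbol (D); rhs[1] is the operator */

    lhs_out->ptr = node;               /* write the LHS value before returning */
    if (ctx != NULL) ctx->nodes_built++;
    return 0;
}
```

The node is allocated by the extension and stays owned by it, which matches the rule that Lime treats slot payloads as opaque.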
Current status: the reduce dispatch path is type-checked and accepted at registration time, but the parser does not yet invoke these callbacks at parse time – the wiring lands with the runtime LALR rebuild (Task #3). Extension code written against reduce today still compiles and will start working transparently when dispatch lands. In the meantime, use the subprocess fallback pattern (see lime_modifications_to_grammar_text() in docs/API.md) to validate extension-grammar designs end to end.
Changes the precedence or associativity of an existing or newly-added symbol.
Declares a new non-terminal symbol with its C data type.
Removes an existing production rule.
Use this sparingly – removing rules from the base grammar can break existing SQL syntax.
The base_snapshot parameter lets you inspect the current grammar state. For example, you could check if a token already exists before adding it, or adapt your rules based on the number of existing states.
The modifications array must remain valid until on_unload is called. If you allocated it dynamically, free it in on_unload.
Called when a modification conflicts with one from another extension.
Resolution options:
- CONFLICT_KEEP_EXISTING – Discard our modification, keep what's there.
- CONFLICT_USE_NEW – Replace the existing modification with ours.
- CONFLICT_MERGE – Provide a merged result (advanced use case).
- CONFLICT_UNRESOLVED – Signal that the conflict cannot be resolved (may cause load failure depending on policy).

If on_conflict is NULL, the system defaults to CONFLICT_KEEP_EXISTING.
Called when the extension is removed. Free any resources allocated during get_modifications.
The file examples/jsonb_extension.c is a complete, compilable extension that adds PostgreSQL-style JSONB operators to the parser. This section walks through every part of the implementation.
| Operator | Token name | Meaning |
|---|---|---|
| -> | ARROW | Extract JSON object field by key |
| ->> | DARROW | Extract JSON object field as text |
| @> | CONTAINS | Does left JSON value contain right? |
| <@ | WITHIN | Is left JSON value contained by right? |
| ? | QMARK | Does key exist in JSON object? |
The extension declares its modifications as a static array of GrammarModification structs. This is the simplest approach when the set of modifications is known at compile time.
Tokens – Each operator gets a MOD_ADD_TOKEN entry. All use token_code = -1 for automatic assignment:
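A minimal sketch of what the five token entries could look like is shown here. The type declarations are stand-ins so the fragment compiles in isolation, and the union member name `token` is an assumption; the field names follow the MOD_ADD_TOKEN field list.

```c
#include <stddef.h>

/* Stand-in declarations (assumed layout); the real ones come from Lime's headers. */
typedef enum { MOD_ADD_TOKEN } ModType;

typedef struct {
    ModType type;
    union {
        struct {
            const char *name;    /* symbolic name used in grammar rules */
            const char *lexeme;  /* literal text for the tokenizer */
            int token_code;      /* -1 = let the system assign a code */
        } token;                 /* member name is an assumption */
    } u;
} GrammarModification;

/* One entry per JSONB operator, all with automatic token codes. */
static GrammarModification jsonb_tokens[] = {
    { .type = MOD_ADD_TOKEN, .u.token = { "ARROW",    "->",  -1 } },
    { .type = MOD_ADD_TOKEN, .u.token = { "DARROW",   "->>", -1 } },
    { .type = MOD_ADD_TOKEN, .u.token = { "CONTAINS", "@>",  -1 } },
    { .type = MOD_ADD_TOKEN, .u.token = { "WITHIN",   "<@",  -1 } },
    { .type = MOD_ADD_TOKEN, .u.token = { "QMARK",    "?",   -1 } },
};
```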
Rules – Each operator gets a MOD_ADD_RULE entry. The RHS symbols are declared as separate NULL-terminated arrays:
Note: The ? (exists) operator uses SCONST on the right-hand side instead of a_expr, because it checks for a string key:
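The two rule shapes described above can be sketched as follows. The declarations are stand-ins, the union member name `rule` is an assumption, and the choice of a_expr as the LHS and left operand is assumed for illustration; reduce and reduce_user are left out to keep the fragment short.

```c
#include <stddef.h>

/* Stand-in declarations (assumed layout); other modification types omitted. */
typedef enum { MOD_ADD_RULE } ModType;

typedef struct {
    ModType type;
    union {
        struct {
            const char *lhs;     /* non-terminal being defined */
            const char **rhs;    /* NULL-terminated symbol names */
            int nrhs;            /* symbol count, excluding the NULL */
            const char *code;    /* generator-time action text, may be NULL */
            int precedence;      /* -1 = inherit from rightmost terminal */
        } rule;                  /* member name is an assumption */
    } u;
} GrammarModification;

/* RHS arrays live in separate NULL-terminated arrays. */
static const char *rhs_arrow[] = { "a_expr", "ARROW", "a_expr", NULL };
static const char *rhs_qmark[] = { "a_expr", "QMARK", "SCONST", NULL };

static GrammarModification jsonb_rules[] = {
    { .type = MOD_ADD_RULE,
      .u.rule = { .lhs = "a_expr", .rhs = rhs_arrow, .nrhs = 3,
                  .code = NULL, .precedence = -1 } },
    /* The exists operator takes a string constant as its right operand. */
    { .type = MOD_ADD_RULE,
      .u.rule = { .lhs = "a_expr", .rhs = rhs_qmark, .nrhs = 3,
                  .code = NULL, .precedence = -1 } },
};
```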
Precedence – ARROW and DARROW are set to left-associative at precedence level 3 (matching PostgreSQL's comparison operators):
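The precedence payload is not spelled out in this guide, so the following is only a sketch of the idea. The enumerator MOD_SET_PRECEDENCE, the Assoc enum, the union member `prec`, and its field names are all hypothetical; only the token names, the level (3), and left associativity come from the text.

```c
/* Hypothetical declarations: names and layout are assumptions for this sketch. */
typedef enum { MOD_SET_PRECEDENCE } ModType;
typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;

typedef struct {
    ModType type;
    union {
        struct {
            const char *symbol;  /* symbol whose precedence is set */
            int level;           /* precedence level */
            Assoc assoc;         /* associativity */
        } prec;
    } u;
} GrammarModification;

/* ARROW and DARROW: left-associative, level 3. */
static GrammarModification jsonb_precedence[] = {
    { .type = MOD_SET_PRECEDENCE, .u.prec = { "ARROW",  3, ASSOC_LEFT } },
    { .type = MOD_SET_PRECEDENCE, .u.prec = { "DARROW", 3, ASSOC_LEFT } },
};
```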
The total modifications array contains 12 entries: 5 tokens + 5 rules + 2 precedence settings. The count is computed with a macro:
get_modifications – Returns the static array. The base_snapshot parameter is ignored here but could be used to conditionally add modifications:
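A sketch of such a callback is shown below, together with the usual sizeof-based count macro. The parameter list is an assumption (this guide only establishes that the callback receives the base snapshot, returns true, and sets two output parameters), and the array is shortened to two entries to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in declarations (assumed layout and signature). */
typedef struct ParserSnapshot ParserSnapshot;   /* opaque here */
typedef enum { MOD_ADD_TOKEN } ModType;
typedef struct {
    ModType type;
    union {
        struct { const char *name; const char *lexeme; int token_code; } token;
    } u;
} GrammarModification;

/* Shortened static array; the real extension lists 12 entries. */
static GrammarModification jsonb_mods[] = {
    { .type = MOD_ADD_TOKEN, .u.token = { "ARROW",  "->",  -1 } },
    { .type = MOD_ADD_TOKEN, .u.token = { "DARROW", "->>", -1 } },
};

/* Element count computed at compile time, so it tracks the array. */
#define JSONB_MOD_COUNT (sizeof(jsonb_mods) / sizeof(jsonb_mods[0]))

static bool jsonb_get_modifications(const ParserSnapshot *base_snapshot,
                                    GrammarModification **mods_out,
                                    size_t *count_out,
                                    void *user_data)
{
    (void)base_snapshot;   /* not inspected in this simple case */
    (void)user_data;
    if (mods_out == NULL || count_out == NULL) return false;

    *mods_out  = jsonb_mods;        /* static storage stays valid until unload */
    *count_out = JSONB_MOD_COUNT;
    return true;
}
```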
on_conflict – Takes the conservative approach of keeping the existing modification when conflicts arise:
A more sophisticated implementation might inspect info->existing_mod and info->new_mod to make per-conflict decisions.
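Both variants can be sketched side by side. The callback signature, the ExtensionID typedef, the enum ordering, and the policy in the second function (treat a same-name token with a different lexeme as unresolvable) are assumptions for illustration; the ConflictInfo field names and resolution constants are the ones this guide uses.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in declarations (assumed layout and signature). */
typedef unsigned int ExtensionID;
typedef enum { MOD_ADD_TOKEN, MOD_ADD_RULE } ModType;
typedef struct {
    ModType type;
    union {
        struct { const char *name; const char *lexeme; int token_code; } token;
    } u;
} GrammarModification;

typedef enum {
    CONFLICT_KEEP_EXISTING, CONFLICT_USE_NEW,
    CONFLICT_MERGE, CONFLICT_UNRESOLVED
} ConflictResolution;

typedef struct {
    ExtensionID existing_ext;
    ExtensionID new_ext;
    const GrammarModification *existing_mod;
    const GrammarModification *new_mod;
} ConflictInfo;

/* Conservative variant: always keep what is already there. */
static ConflictResolution keep_existing_on_conflict(const ConflictInfo *info,
                                                    void *user_data)
{
    (void)info; (void)user_data;
    return CONFLICT_KEEP_EXISTING;
}

/* Per-conflict variant: inspect both modifications before deciding. */
static ConflictResolution inspecting_on_conflict(const ConflictInfo *info,
                                                 void *user_data)
{
    const GrammarModification *a = info->existing_mod;
    const GrammarModification *b = info->new_mod;
    (void)user_data;

    if (a->type == MOD_ADD_TOKEN && b->type == MOD_ADD_TOKEN) {
        const char *la = a->u.token.lexeme, *lb = b->u.token.lexeme;
        /* Same lexeme: the existing token already does what we need. */
        if (la != NULL && lb != NULL && strcmp(la, lb) == 0)
            return CONFLICT_KEEP_EXISTING;
        /* Same name but different text: refuse rather than guess. */
        return CONFLICT_UNRESOLVED;
    }
    return CONFLICT_KEEP_EXISTING;
}
```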
on_unload – A no-op since the modifications are statically allocated. A dynamic extension would free its modifications array here:
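For the dynamic case, a sketch of the allocate-then-free pairing might look like this. The callback signatures and the DynExt state struct are assumptions; the point is only that the array allocated in get_modifications is released in on_unload, with the pointer carried through user_data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in declarations (assumed layout and signatures). */
typedef struct ParserSnapshot ParserSnapshot;
typedef enum { MOD_ADD_TOKEN } ModType;
typedef struct {
    ModType type;
    union {
        struct { const char *name; const char *lexeme; int token_code; } token;
    } u;
} GrammarModification;

/* Hypothetical per-instance state, reached through user_data. */
typedef struct {
    GrammarModification *mods;
    size_t count;
} DynExt;

static bool dyn_get_modifications(const ParserSnapshot *base_snapshot,
                                  GrammarModification **mods_out,
                                  size_t *count_out, void *user_data)
{
    DynExt *ext = user_data;
    (void)base_snapshot;

    ext->mods = calloc(1, sizeof *ext->mods);   /* freed in dyn_on_unload */
    if (ext->mods == NULL) return false;

    ext->mods[0].type = MOD_ADD_TOKEN;
    ext->mods[0].u.token.name = "ARROW";
    ext->mods[0].u.token.lexeme = "->";
    ext->mods[0].u.token.token_code = -1;
    ext->count = 1;

    *mods_out = ext->mods;
    *count_out = ext->count;
    return true;
}

static void dyn_on_unload(void *user_data)
{
    DynExt *ext = user_data;
    free(ext->mods);        /* the array must stay valid until this point */
    ext->mods = NULL;
    ext->count = 0;
}
```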
This struct bundles everything together for register_extension():
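A hedged sketch of such a descriptor follows. The ExtensionInfo layout, the callback signatures, and the "jsonb" / "1.0.0" strings are assumptions chosen for illustration; the members correspond to the pieces this guide names (name and version strings, the three callbacks, and user_data).

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in declarations (assumed layout and signatures). */
typedef struct ParserSnapshot ParserSnapshot;
typedef struct GrammarModification GrammarModification;
typedef struct ConflictInfo ConflictInfo;
typedef enum {
    CONFLICT_KEEP_EXISTING, CONFLICT_USE_NEW,
    CONFLICT_MERGE, CONFLICT_UNRESOLVED
} ConflictResolution;

typedef struct {
    const char *name;
    const char *version;
    bool (*get_modifications)(const ParserSnapshot *base_snapshot,
                              GrammarModification **mods_out,
                              size_t *count_out, void *user_data);
    ConflictResolution (*on_conflict)(const ConflictInfo *info, void *user_data);
    void (*on_unload)(void *user_data);
    void *user_data;
} ExtensionInfo;

/* Placeholder callbacks so the descriptor has something to point at. */
static bool jsonb_get_modifications(const ParserSnapshot *base_snapshot,
                                    GrammarModification **mods_out,
                                    size_t *count_out, void *user_data)
{
    (void)base_snapshot; (void)user_data;
    *mods_out = NULL;       /* a real extension returns its array here */
    *count_out = 0;
    return true;
}

static void jsonb_on_unload(void *user_data) { (void)user_data; }

static const ExtensionInfo jsonb_extension = {
    .name = "jsonb",                /* placeholder name */
    .version = "1.0.0",             /* placeholder version */
    .get_modifications = jsonb_get_modifications,
    .on_conflict = NULL,            /* NULL falls back to CONFLICT_KEEP_EXISTING */
    .on_unload = jsonb_on_unload,
    .user_data = NULL,
};
```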
The example provides a single-call function that handles both registration and loading, with error reporting:
When an extension is loaded, its modifications go through a pipeline managed by the snapshot modification system:
The base snapshot is never mutated. All modifications produce a new snapshot that can be atomically swapped in. This copy-on-write design means readers using the old snapshot are not affected until they acquire the new one.
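The copy-on-write idea can be illustrated in isolation with C11 atomics. This is not Lime's implementation: the Snapshot struct, the global pointer, and publish_with_extra_tokens() are hypothetical, and the sketch assumes a single writer at a time (as a registry write lock would provide) and an already-initialized current snapshot.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, drastically simplified snapshot. */
typedef struct {
    int ntokens;
    int nrules;
} Snapshot;

/* Readers load this pointer; the writer swaps it. */
static _Atomic(Snapshot *) current_snapshot;

/* Clone the current snapshot, modify only the clone, then publish it.
 * Returns the previous snapshot so the caller can retire it once no
 * reader still holds it, or NULL if the clone could not be allocated. */
static Snapshot *publish_with_extra_tokens(int extra_tokens)
{
    Snapshot *old = atomic_load(&current_snapshot);
    Snapshot *next = malloc(sizeof *next);
    if (next == NULL) return NULL;

    *next = *old;                       /* copy; the original is never mutated */
    next->ntokens += extra_tokens;      /* apply the modification to the copy */

    atomic_store(&current_snapshot, next);   /* atomic swap-in */
    return old;
}
```

A reader that loaded the pointer before the store keeps seeing the old token count until it loads the pointer again.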
Result codes from create_modified_snapshot():
| Code | Meaning |
|---|---|
| MODIFY_OK | Snapshot created successfully |
| MODIFY_ERR_ALLOC | Memory allocation failure |
| MODIFY_ERR_INVALID_MOD | Invalid modification (bad type or fields) |
| MODIFY_ERR_CONFLICT | Unresolved conflicts remain |
| MODIFY_ERR_BUILD | LALR(1) automaton rebuild failed |
The conflict detection system checks for several categories of conflicts when multiple extensions contribute modifications:
| Conflict Type | Trigger |
|---|---|
| CONFLICT_TOKEN_COLLISION | Two extensions add a token with the same name |
| CONFLICT_DUPLICATE_RULE | Two extensions add an identical production |
| CONFLICT_PRECEDENCE_CLASH | Conflicting precedence for the same symbol |
| CONFLICT_SHIFT_REDUCE | Shift/reduce conflict in rebuilt automaton |
| CONFLICT_REDUCE_REDUCE | Reduce/reduce conflict in rebuilt automaton |
The first three types are detected during the pre-application scan (detect_conflicts()). The last two are detected during rebuild_automaton() after modifications have been applied to the cloned snapshot.
When a conflict is detected, the system calls the on_conflict callback of the extension that proposed the newer modification. The callback receives a ConflictInfo struct with:
- existing_ext – ExtensionID that owns the existing modification
- new_ext – ExtensionID proposing the conflicting modification
- existing_mod – Pointer to the existing GrammarModification
- new_mod – Pointer to the conflicting GrammarModification

If no on_conflict callback is provided (NULL), the system defaults to CONFLICT_KEEP_EXISTING. Any CONFLICT_UNRESOLVED results cause the load to fail with MODIFY_ERR_CONFLICT.
Only add what your extension needs. Every new token and rule increases the parser's table size and can create conflicts with other extensions.
Set token_code = -1 to avoid collisions with other extensions. Manual token codes should only be used when interoperating with an external tokenizer that requires specific values.
Use the base_snapshot parameter in get_modifications to check for existing symbols before adding them. This prevents duplicate-token conflicts:
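One way this check could look is sketched below. The snapshot query API is not described in this guide, so the ParserSnapshot layout and the snapshot_has_token() helper are hypothetical stand-ins; substitute whatever lookup the real snapshot interface provides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in: a snapshot reduced to a list of token names. */
typedef struct {
    const char **token_names;
    size_t ntokens;
} ParserSnapshot;

/* Hypothetical lookup helper, not part of the documented API. */
static bool snapshot_has_token(const ParserSnapshot *snap, const char *name)
{
    for (size_t i = 0; i < snap->ntokens; i++)
        if (strcmp(snap->token_names[i], name) == 0) return true;
    return false;
}

typedef enum { MOD_ADD_TOKEN } ModType;
typedef struct {
    ModType type;
    union {
        struct { const char *name; const char *lexeme; int token_code; } token;
    } u;
} GrammarModification;

/* Static storage keeps the array valid until the extension is unloaded. */
static GrammarModification mods_buf[2];

static bool guarded_get_modifications(const ParserSnapshot *base_snapshot,
                                      GrammarModification **mods_out,
                                      size_t *count_out, void *user_data)
{
    static const struct { const char *name; const char *lexeme; } wanted[] = {
        { "ARROW", "->" }, { "DARROW", "->>" },
    };
    size_t n = 0;
    (void)user_data;

    for (size_t i = 0; i < sizeof(wanted) / sizeof(wanted[0]); i++) {
        /* Skip tokens the base grammar (or an earlier extension) already has. */
        if (snapshot_has_token(base_snapshot, wanted[i].name)) continue;
        mods_buf[n].type = MOD_ADD_TOKEN;
        mods_buf[n].u.token.name = wanted[i].name;
        mods_buf[n].u.token.lexeme = wanted[i].lexeme;
        mods_buf[n].u.token.token_code = -1;
        n++;
    }
    *mods_out = mods_buf;
    *count_out = n;
    return true;
}
```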
Always implement on_conflict if your extension might overlap with others. The default behavior (keep existing) may silently drop your modifications.
If you build your modifications array in get_modifications, always provide an on_unload callback that frees it. Otherwise the leaked arrays accumulate as extensions are loaded and unloaded.
Pass extension-specific context through the user_data field in ExtensionInfo. This avoids global variables and allows multiple instances of the same extension type with different configurations.
The extension registry is protected by a pthread_rwlock. Multiple threads can call find_extension() and get_loaded_extension_count() concurrently. Registration, loading, and unloading serialize through a write lock.
Your callbacks (get_modifications, on_conflict, on_unload) are called while the registry write lock is held. Do not attempt to call registry functions from within a callback – this will deadlock.
If your extension allocates shared state in get_modifications that is accessed from parser reduction actions, you are responsible for synchronizing access to that state (e.g. with your own mutex).
Extensions are loaded in the order you call load_extension(). When conflicts arise, the "existing" modification is from the earlier-loaded extension and the "new" modification is from the later-loaded one.
If your extension depends on tokens or non-terminals from another extension, load the dependency first. There is no automatic dependency resolution – the loading order is your responsibility.
- … load_extension().
- Check that get_modifications returns true and sets both output parameters.

If two extensions add the same token name, the second one to load will trigger a conflict. Solutions:

- Implement on_conflict to handle it explicitly.
- Use a distinctive token name (e.g. JSONB_ARROW instead of ARROW).

- Make sure user_data is still valid at unload time.

Each extension adds entries to the action tables. If the combined table size becomes a concern:

- … MOD_REMOVE_RULE.
- Pass the -s flag to lime to see table statistics.