|
Lime Parser Generator 0.1.0
Runtime-extensible LALR(1) parser with SIMD tokenization and LLVM JIT
|
This document describes the architecture of the extensible SQL parser system built on top of the Lemon LALR(1) parser generator.
The system extends a traditional Lemon-generated parser with runtime grammar modification, SIMD-accelerated tokenization, optional LLVM JIT compilation, and copy-on-write snapshots for safe concurrent access. Extensions can add tokens, rules, and precedence changes without restarting the parser.
| Component | Files | Purpose |
|---|---|---|
| Lemon Generator | lime.c, limpar.c | Reads grammar, produces LALR(1) parser tables |
| Snapshot System | src/snapshot.{h,c} | Immutable, refcounted captures of parser state |
| Snapshot Modification | src/snapshot_modify.{h,c} | Clone-and-modify pipeline for grammar changes |
| Extension System | src/extension.{h,c} | Registration, loading, and lifecycle of extensions |
| Conflict Detection | src/conflict.{h,c} | Detects and resolves grammar modification conflicts |
| Token Table | include/token_table.h, src/token_table.c | Thread-safe, RCU-versioned keyword lookup |
| SIMD Tokenizer | src/tokenize_simd.{h,c}, src/tokenize.c | SIMD character classification for lexing |
| Parse Context | include/parse_context.h, src/parse_context.c | Snapshot-indirected parser runtime |
| JIT Context | include/jit_context.h, src/jit_context.c | LLVM-based compilation of action tables |
| JIT Policy | include/jit_policy.h, src/jit_policy.c | Metrics-driven decision for when to JIT |
| Public API | include/parser.h | Stable external interface to all subsystems |
The snapshot system is the central abstraction that enables safe concurrent access and runtime grammar modification.
A ParserSnapshot (defined in src/snapshot.h) captures the complete state of a parser at a point in time:
yy_action, yy_lookahead, yy_shift_ofst, yy_reduce_ofst, yy_default (the compact arrays that drive parsing)uint_fast32_t for safe concurrent sharingReference counting uses release-acquire semantics:
snapshot_acquire() uses memory_order_relaxed (only needs atomicity)snapshot_release() uses memory_order_release on the decrementatomic_thread_fence(memory_order_acquire) before reading snapshot data, ensuring all prior writes from other threads that released their references are visibleWhen extensions modify the grammar, the base snapshot is never mutated. Instead, create_modified_snapshot() in src/snapshot_modify.c follows this pipeline:
on_conflict callbacksrefcount = 1Active parsers continue using the old snapshot via their pinned references. New parse sessions pick up the new snapshot. This enables zero-downtime grammar updates.
The Lemon parser generator implements a standard LALR(1) construction:
The generated parser uses five parallel arrays:
| Array | Type | Purpose |
|---|---|---|
yy_action | uint16_t[] | Combined shift and reduce actions |
yy_lookahead | uint16_t[] | Expected lookahead for each action entry |
yy_shift_ofst | int16_t[] | Per-state offset into yy_action for shifts |
yy_reduce_ofst | int16_t[] | Per-state offset into yy_action for reduces |
yy_default | uint16_t[] | Default action when no lookahead matches |
Action lookup: for state S and lookahead T, compute i = yy_shift_ofst[S] + T. If yy_lookahead[i] == T, the action is yy_action[i]. Otherwise, fall back to yy_default[S].
Extensions are managed through a thread-safe registry (ExtensionRegistry) protected by a pthread_rwlock. The registry maps ExtensionID values (1-based uint32_t, with 0 reserved for "base grammar") to Extension structs.
Each extension provides up to three callbacks:
| Callback | Required | Purpose |
|---|---|---|
get_modifications | Yes | Returns array of GrammarModification structs |
on_conflict | No | Resolves conflicts with other extensions |
on_unload | No | Frees extension-owned resources |
Extensions express grammar changes through GrammarModification structs with a tagged union:
| Type | Fields | Effect |
|---|---|---|
MOD_ADD_RULE | lhs, rhs[], code, precedence | Add a production rule |
MOD_ADD_TOKEN | name, lexeme, token_code | Add a terminal symbol |
MOD_REMOVE_RULE | lhs, rule_index | Remove a production |
MOD_MODIFY_PRECEDENCE | symbol, precedence, assoc | Change precedence |
MOD_ADD_TYPE | name, datatype | Set non-terminal C type |
detect_conflicts() in src/conflict.c performs pairwise comparison of modifications to find:
Conflicts are collected in a ConflictSet and offered to each extension's on_conflict callback for resolution. Unresolved conflicts cause the modification to fail with MODIFY_ERR_CONFLICT.
The SIMD tokenizer (src/tokenize_simd.{h,c}) classifies characters in parallel using vector instructions. The CharClassVector struct holds bitmasks for alphabetic, digit, and whitespace characters:
| Tier | Width | Platform | Selection |
|---|---|---|---|
| AVX2 | 32 chars | x86_64 | __attribute__((target("avx2"))) + runtime CPUID |
| NEON | 16 chars | ARM | Compile-time __ARM_NEON |
| Scalar | 32 chars | Any | Always available fallback |
get_classify_func() selects the best implementation at runtime. The AVX2 path uses the target attribute so it compiles on any x86_64 toolchain without requiring -mavx2 globally.
The TokenTable (include/token_table.h) provides thread-safe keyword lookup using:
atomic_uint_fast32_t version counter that readers check before and after lookup. If the version changed, a write occurred and the reader retries.pthread_rwlock_t protects mutations (add/remove tokens)TokenDefinition records which ExtensionID added it, enabling remove_tokens_by_extension() for clean unloadingThe JIT subsystem compiles parser action table lookups into native machine code using LLVM's C API. It has three layers:
src/jit_context.c): Manages LLVM module, execution engine, and compiled function pointers. Creates specialized find_shift_action functions for each parser state where the action table logic is baked into the instruction stream rather than driven by table lookups.src/jit_codegen.c): Generates LLVM IR for each parser state. Translates the yy_action/yy_lookahead table structure into a series of comparisons and direct returns, which LLVM then optimizes into efficient branch sequences.src/jit_policy.c): Decides when compilation is worthwhile based on runtime metrics (parse count, cumulative parse time, action lookup frequency).When LLVM is not available (LIME_NO_JIT defined at compile time), all JIT functions compile as stubs:
jit_is_available() returns falsejit_create() returns JIT_ERR_NO_LLVMjit_find_shift_action() falls through to the table-driven pathThe policy uses three thresholds (configurable via JITPolicyConfig):
| Threshold | Default | Purpose |
|---|---|---|
min_parse_count | 50 | Avoid compiling rarely-used grammars |
min_total_parse_time_ns | 10 ms | Ensure enough time spent parsing |
min_avg_lookups_per_parse | 100 | Ensure complex enough grammars |
When all thresholds are met, jit_maybe_compile() spawns a detached background thread that compiles and atomically publishes the JIT context to snap->jit_ctx. Active parsers transparently pick up the JIT path on their next action lookup.
jit_find_shift_action() checks if a compiled function exists for the given state. If yes, it calls the compiled function directly. If no, it falls back to the standard table-driven snap_find_shift_action() from the parse context.
| Component | Mechanism | Readers | Writers |
|---|---|---|---|
| Snapshots | Atomic refcount | Lock-free acquire/release | N/A (immutable) |
| Extension Registry | pthread_rwlock | Concurrent read lock | Exclusive write lock |
| Token Table | RCU versioning + pthread_rwlock | Lock-free with retry | Write lock |
| JIT Metrics | atomic_uint_fast64_t counters | Lock-free | Lock-free |
| JIT Compilation | atomic_int flag + detached thread | Lock-free | Single background thread |
ParseContext holds a reference to one snapshot for its entire lifetime, ensuring table pointers remain valid throughout a parse session.get_modifications and on_unload callbacks execute while the registry write lock is held, so they must not attempt to acquire the registry lock themselves (deadlock).symbols, rules, states) and action table arrays. All are freed when the refcount reaches zero.modifications array. The array is populated by get_modifications() and freed by on_unload(). The registry owns the name and version strings (copies made during registration).TokenDefinition entries and the hash table array. Lexeme strings are copied during add_token().jit_destroy() or automatically when the owning snapshot is released via jit_detach_from_snapshot().Conflict structs and their description strings. Freed by conflict_set_destroy().The project uses standard malloc/calloc/realloc/free throughout the library code. The Lemon parser generator itself (lime.c) uses custom wrappers (lime_malloc, lime_calloc, lime_realloc) that track allocations in a linked list for bulk cleanup via lime_free_all().
The project uses Meson as its build system with a Nix flake for reproducible development environments.
| Target | Type | Contents |
|---|---|---|
lime | Executable | Parser generator (lime.c) |
liblime_parser.a | Static library | Core library (snapshot, extension, tokenizer, etc.) |
libtokenize_simd.a | Static library | SIMD character classification |
liblime_jit.a | Static library | JIT compilation (LLVM or stubs) |
_GNU_SOURCE**: Defined project-wide for POSIX extensions (pthread_rwlock_t, clock_gettime, etc.)LIME_NO_JIT**: Defined when LLVM is not available; activates stub implementations in the JIT library__attribute__((target("avx2"))) per-function, not global compiler flags