Lime Parser Generator 0.1.0
Runtime-extensible LALR(1) parser with SIMD tokenization and LLVM JIT
This document covers the public C API for the extensible SQL parser library. All public symbols are declared in headers under include/.
Header: include/parser.h
Returns the library version as a NUL-terminated string (e.g. "0.1.0"). The returned pointer is to static storage and must not be freed.
Header: include/parser.h (internal details: src/snapshot.h)
A ParserSnapshot captures the complete state of a parser's tables at a point in time. Snapshots are reference-counted and immutable after creation. Multiple threads can share a snapshot safely by acquiring references.
Create a base snapshot by parsing a Lemon grammar file. On success, returns a snapshot with reference count 1 and sets *error to NULL. On failure, returns NULL and sets *error to a malloc'd error message that the caller must free().
Parameters:
- grammar_file – Path to a .y grammar file.
- error – Output pointer for error message on failure.

Increment the reference count on a snapshot. Returns the same pointer for convenience. The caller must eventually call lime_snapshot_release(). Passing NULL is safe and returns NULL.
Decrement the reference count. When it reaches zero, the snapshot and all memory it owns (grammar data, action tables, JIT context) are freed. Passing NULL is safe.
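The create/acquire/release lifecycle described above can be sketched with C11 atomics. This is a minimal illustration, not the library's implementation: `DemoSnapshot`, `demo_snapshot_create`, `demo_snapshot_acquire`, `demo_snapshot_release` and the `demo_live_snapshots` counter are hypothetical names, and no grammar is actually parsed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for ParserSnapshot: only the refcount is modelled. */
typedef struct DemoSnapshot {
    atomic_uint_fast32_t refcount;
} DemoSnapshot;

/* Test hook (hypothetical): number of snapshots not yet freed. */
static long demo_live_snapshots = 0;

/* Duplicate a message into malloc'd storage; NULL if allocation fails. */
static char *demo_message(const char *msg) {
    size_t n = strlen(msg) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, msg, n);
    return copy;
}

/* Success: refcount 1 and *error == NULL.
 * Failure: NULL and *error set to a malloc'd message the caller must free(). */
DemoSnapshot *demo_snapshot_create(const char *grammar_file, char **error) {
    assert(error != NULL);
    *error = NULL;
    if (grammar_file == NULL || grammar_file[0] == '\0') {
        *error = demo_message("grammar file path is empty");
        return NULL;
    }
    DemoSnapshot *s = malloc(sizeof *s);
    if (s == NULL) {
        *error = demo_message("out of memory");
        return NULL;
    }
    atomic_init(&s->refcount, 1);
    demo_live_snapshots++;
    return s;
}

/* NULL-safe; returns the same pointer so calls can be chained. */
DemoSnapshot *demo_snapshot_acquire(DemoSnapshot *s) {
    if (s == NULL) return NULL;
    atomic_fetch_add_explicit(&s->refcount, 1, memory_order_relaxed);
    return s;
}

/* NULL-safe; frees the snapshot when the last reference is dropped. */
void demo_snapshot_release(DemoSnapshot *s) {
    if (s == NULL) return;
    /* fetch_sub returns the previous value: 1 means this was the last owner. */
    if (atomic_fetch_sub_explicit(&s->refcount, 1, memory_order_acq_rel) == 1) {
        demo_live_snapshots--;
        free(s);
    }
}
```

Each acquire must be matched by exactly one release; the object is freed only by whichever caller drops the count to zero.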
Header: include/parse_context.h
A ParseContext wraps a Lemon-generated parser with a pinned snapshot reference. Table lookups are indirected through the snapshot rather than compiled-in static arrays, enabling hot-swapping of parser tables when extensions modify the grammar.
Begin a new parse session pinned to snap. Acquires a reference to the snapshot that is held until parse_end() is called. Returns NULL on allocation failure. snap must not be NULL.
Feed one token to the parser.
Parameters:
- ctx – Active parse context.
- token_code – Integer token type (terminal symbol code). Pass 0 to signal end-of-input.
- token_value – Opaque pointer to the semantic value. The layout is determined by the parser template's TOKENTYPE.
- location – Byte offset of the token in the original source, or LIME_LOC_UNKNOWN (-1) if the grammar does not declare locations or the caller does not track positions (e.g. the synthetic end-of-input token). Currently accepted and stored; full propagation into reduce actions lands with the push-parser implementation that replaces the current parse_token() stub. Callers should thread real locations anyway so they are ready.

Returns: 0 on success, non-zero on error (syntax error or OOM).
See also: LIME_LOC_UNKNOWN.
End the parse session. Releases the pinned snapshot reference and frees all internal state. Passing NULL is safe.
Return the snapshot pinned by this context. Valid as long as the context is alive.
Sentinel value for the location argument of parse_token(). Pass this when the grammar does not declare locations, or when no meaningful byte offset can be attributed to the token (the synthetic end-of-input marker, runtime-injected tokens, etc.). Guaranteed to be -1 so that integer byte offsets (always >= 0) never collide with it.
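A driver loop that threads byte offsets and finishes with the sentinel might look like the sketch below. It does not call the real parse_token(): `demo_feed` is a hypothetical recording stub with the same parameter order, `DEMO_LOC_UNKNOWN` mirrors the documented value of -1, and using `int` for the location is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_LOC_UNKNOWN (-1)   /* mirrors the documented LIME_LOC_UNKNOWN value */

/* Hypothetical token record: terminal code plus byte offset in the source. */
typedef struct DemoToken { int code; int location; } DemoToken;

/* Hypothetical sink standing in for a parse context; it only records calls. */
typedef struct DemoSink { int fed; int last_code; int last_location; } DemoSink;

/* Stub with the documented parameter order: 0 on success, non-zero on error. */
static int demo_feed(DemoSink *sink, int code, void *value, int location) {
    (void)value;
    if (location < DEMO_LOC_UNKNOWN) return 1;   /* offsets are >= 0 or the sentinel */
    sink->fed++;
    sink->last_code = code;
    sink->last_location = location;
    return 0;
}

/* Feed every token with its real offset, then signal end-of-input with
 * token code 0 and the sentinel location. Stops at the first error. */
int demo_drive(DemoSink *sink, const DemoToken *tokens, size_t count) {
    assert(sink != NULL);
    for (size_t i = 0; i < count; i++) {
        int rc = demo_feed(sink, tokens[i].code, NULL, tokens[i].location);
        if (rc != 0) return rc;
    }
    return demo_feed(sink, 0, NULL, DEMO_LOC_UNKNOWN);
}
```

Because valid offsets are never negative, the final call is distinguishable from every real token by its location alone.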
These lower-level functions replace direct static array access in the generated parser. They are primarily used internally by the parse engine.
Header: include/tokenize.h
The tokenizer converts SQL text into a stream of tokens using SIMD-accelerated character classification. It automatically selects the fastest available implementation (AVX2 on x86_64, NEON on ARM, or scalar fallback) at runtime.
Create a new tokenizer for the given input buffer.
Parameters:
- table – Keyword lookup table for recognizing SQL keywords. Pass NULL for identifier-only mode (all identifiers return TK_IDENTIFIER).
- input – NUL-terminated SQL input string. Must remain valid for the lifetime of the tokenizer.
- length – Length of input in bytes, not including the NUL terminator.

Important: The buffer must have at least 32 bytes of readable memory past the end (e.g., zero-padded) for SIMD safety.

Returns: A new tokenizer, or NULL on allocation failure.
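One way to meet the 32-byte readability requirement is to copy the SQL text into a zero-filled buffer before creating the tokenizer. The helper below is a sketch under that assumption; `demo_pad_sql` and `DEMO_SIMD_PAD` are hypothetical names, not part of the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_SIMD_PAD 32   /* readable slack requested by the tokenizer docs */

/* Copy `sql` into a fresh buffer laid out as: text, NUL terminator, then
 * DEMO_SIMD_PAD zero bytes. Stores the text length (excluding the NUL) in
 * *length_out when it is non-NULL. Returns NULL on allocation failure;
 * the caller free()s the buffer after destroying the tokenizer. */
char *demo_pad_sql(const char *sql, size_t *length_out) {
    assert(sql != NULL);
    size_t len = strlen(sql);
    char *buf = calloc(len + 1 + DEMO_SIMD_PAD, 1);   /* calloc zero-fills */
    if (buf == NULL) return NULL;
    memcpy(buf, sql, len);
    if (length_out != NULL) *length_out = len;
    return buf;
}
```

The returned pointer and length are the values that would be passed as input and length; the buffer must outlive the tokenizer.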
Destroy the tokenizer and free its memory. Passing NULL is safe.
Extract the next token from the input. Returns true if a token was produced, false at end-of-input. On false return, out->type is TK_EOF.
Comments (both -- single-line and /* */ block) are skipped automatically and never returned as tokens.
Peek at the next token without consuming it. Subsequent calls to tokenizer_peek() return the same token. The next call to tokenizer_next() consumes the peeked token.
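The peek/next contract amounts to a one-token lookahead buffer. The sketch below shows the idea on a pre-scanned array of token types; `DemoLexer`, `demo_peek` and `demo_next` are hypothetical, and type 0 plays the role of TK_EOF (so real token types in the array are assumed to be non-zero).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoTok { int type; } DemoTok;

typedef struct DemoLexer {
    const int *types;     /* hypothetical pre-scanned token types, all non-zero */
    size_t     count;
    size_t     pos;
    bool       has_peeked;
    DemoTok    peeked;    /* valid only while has_peeked is true */
} DemoLexer;

/* Produce the next token from the underlying input; type 0 at end-of-input. */
static bool demo_scan(DemoLexer *lx, DemoTok *out) {
    if (lx->pos >= lx->count) { out->type = 0; return false; }
    out->type = lx->types[lx->pos++];
    return true;
}

/* Look at the next token without consuming it; repeated calls agree. */
bool demo_peek(DemoLexer *lx, DemoTok *out) {
    assert(lx != NULL && out != NULL);
    if (!lx->has_peeked) {
        demo_scan(lx, &lx->peeked);
        lx->has_peeked = true;
    }
    *out = lx->peeked;
    return out->type != 0;
}

/* Consume the next token, draining the lookahead slot first. */
bool demo_next(DemoLexer *lx, DemoTok *out) {
    assert(lx != NULL && out != NULL);
    if (lx->has_peeked) {
        lx->has_peeked = false;
        *out = lx->peeked;
        return out->type != 0;
    }
    return demo_scan(lx, out);
}
```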
Return the current byte offset in the input.
Return the current 1-based line number.
Return the current 1-based column number.
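The relationship between the byte offset and the 1-based line and column can be illustrated by recomputing them from the input. This is a sketch only: `demo_position_at` is a hypothetical helper, and it assumes columns count bytes and that only '\n' ends a line, which the tokenizer may or may not do.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoPos { size_t line; size_t column; } DemoPos;

/* Walk the input up to `offset` and report the 1-based line and column of
 * the byte at that offset. Stops early at the NUL terminator. */
DemoPos demo_position_at(const char *input, size_t offset) {
    assert(input != NULL);
    DemoPos p = {1, 1};
    for (size_t i = 0; i < offset && input[i] != '\0'; i++) {
        if (input[i] == '\n') {
            p.line++;        /* newline: next byte starts a new line */
            p.column = 1;
        } else {
            p.column++;
        }
    }
    return p;
}
```

A tokenizer would normally maintain these counters incrementally as it scans rather than rescanning from the start.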
The tokenizer uses SIMD instructions to accelerate three hot paths:
The SIMD implementation is selected automatically at runtime via get_classify_func() and requires no user configuration.
Header: include/token_table.h
The token table provides thread-safe keyword lookup using a hash table with RCU-style versioning. Readers are lock-free; writers acquire an internal write lock.
Create a new token table. initial_capacity is the initial number of slots in the hash table. Returns NULL on allocation failure.
Destroy the token table and free all memory.
Look up a token by its string value. This is lock-free for concurrent readers. Returns the token_code if found, or -1 if not found.
Add a token to the table. Acquires the write lock internally. Returns true on success, false on failure (allocation error or duplicate).
Remove all tokens belonging to a given extension. Acquires the write lock and rebuilds hash chains. Returns true on success.
Header: src/tokenize_simd.h
Low-level parallel character classification. Most users should use the Tokenizer API instead; this interface is for advanced users building custom scanners.
Return the best available classification function for the current CPU. On x86 the choice is made by runtime feature detection (CPUID); on ARM it is fixed at compile time.
| Platform | CPU Feature | Function returned |
|---|---|---|
| x86_64 | AVX2 present | classify_simd_avx2 (32 chars) |
| ARM | NEON (baseline on AArch64) | classify_simd_neon (16 chars) |
| Any | Fallback | classify_scalar (32 chars) |
Classify 32 characters starting at input + offset. Always available on every platform. The caller must ensure 32 bytes are readable from input + offset.
AVX2 implementation. Classifies 32 characters in parallel using 256-bit SIMD registers. Only callable on CPUs with AVX2 support – use get_classify_func() for safe dispatch.
NEON implementation. Classifies 16 characters in parallel. Only the lower 16 bits of each mask field are meaningful.
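To make the mask-based interface concrete, here is a scalar sketch that classifies 32 bytes into per-class bitmasks and is reached through a function pointer, as a dispatcher would do. The `DemoMasks` layout, the character classes chosen, and all `demo_` names are assumptions; the real result struct in src/tokenize_simd.h may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result: bit i of each mask describes byte i of the window. */
typedef struct DemoMasks {
    uint32_t alpha;   /* letters and '_' */
    uint32_t digit;   /* '0'..'9' */
    uint32_t space;   /* space, tab, CR, LF */
} DemoMasks;

typedef DemoMasks (*demo_classify_fn)(const char *input, size_t offset);

/* Scalar classification of 32 bytes starting at input + offset.
 * The caller must guarantee that all 32 bytes are readable. */
DemoMasks demo_classify_scalar(const char *input, size_t offset) {
    assert(input != NULL);
    DemoMasks m = {0, 0, 0};
    const unsigned char *p = (const unsigned char *)input + offset;
    for (unsigned i = 0; i < 32; i++) {
        unsigned char c = p[i];
        uint32_t bit = (uint32_t)1 << i;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
            m.alpha |= bit;
        else if (c >= '0' && c <= '9')
            m.digit |= bit;
        else if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
            m.space |= bit;
    }
    return m;
}

/* Dispatch point. A real selector would probe CPU features here; this
 * sketch always returns the portable fallback. */
demo_classify_fn demo_get_classify(void) {
    return demo_classify_scalar;
}

/* Length of the identifier run that starts at byte 0 of the window. */
unsigned demo_ident_run(DemoMasks m) {
    uint32_t ident = m.alpha | m.digit;
    unsigned n = 0;
    while (n < 32 && (ident & ((uint32_t)1 << n)) != 0) n++;
    return n;
}
```

A scanner built on such masks can find token boundaries with bit operations instead of a per-byte branch, which is where the SIMD variants gain their speed.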
Headers: include/parser.h (public entry points), src/extension.h (internal)
Extensions add grammar modifications (new tokens, rules, precedence changes) to the parser at runtime. Each extension is managed through a thread-safe registry.
Initialize and destroy the global extension registry. Must be called before and after any extension operations, respectively.
ExtensionInfo – Input to register_extension():
Each modification is a GrammarModification struct with a type field and a tagged union u containing the type-specific payload.
| Type | Union Field | Purpose |
|---|---|---|
| MOD_ADD_RULE | u.add_rule | Add a new production rule |
| MOD_ADD_TOKEN | u.add_token | Add a new terminal token |
| MOD_MODIFY_PRECEDENCE | u.modify_prec | Change symbol precedence |
| MOD_ADD_TYPE | u.add_type | Add a non-terminal type |
| MOD_REMOVE_RULE | u.remove_rule | Remove an existing rule |
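The type-plus-union layout in the table is the usual C tagged-union pattern: read `type` first, then touch only the matching member of `u`. The sketch below mirrors the documented member names, but every payload field (`lhs`, `rhs`, `name`, and so on) is a placeholder invented for illustration, as are the `Demo`-prefixed types.

```c
#include <assert.h>
#include <stddef.h>

typedef enum DemoModType {
    DEMO_MOD_ADD_RULE,
    DEMO_MOD_ADD_TOKEN,
    DEMO_MOD_MODIFY_PRECEDENCE,
    DEMO_MOD_ADD_TYPE,
    DEMO_MOD_REMOVE_RULE
} DemoModType;

/* Hypothetical mirror of GrammarModification; payload fields are placeholders. */
typedef struct DemoMod {
    DemoModType type;   /* tag: selects which member of `u` is valid */
    union {
        struct { const char *lhs; const char *rhs; }         add_rule;
        struct { const char *name; }                         add_token;
        struct { const char *symbol; int new_prec; }         modify_prec;
        struct { const char *nonterminal; const char *ctype; } add_type;
        struct { const char *lhs; }                          remove_rule;
    } u;
} DemoMod;

/* Return the symbol a modification primarily touches, reading only the
 * union member that the tag says is active. */
const char *demo_mod_target(const DemoMod *mod) {
    assert(mod != NULL);
    switch (mod->type) {
    case DEMO_MOD_ADD_RULE:          return mod->u.add_rule.lhs;
    case DEMO_MOD_ADD_TOKEN:         return mod->u.add_token.name;
    case DEMO_MOD_MODIFY_PRECEDENCE: return mod->u.modify_prec.symbol;
    case DEMO_MOD_ADD_TYPE:          return mod->u.add_type.nonterminal;
    case DEMO_MOD_REMOVE_RULE:       return mod->u.remove_rule.lhs;
    }
    return NULL;   /* unknown tag */
}
```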
u.add_rule carries two fields for the rule's reduction action:
Precedence of the two action-source fields:
| reduce | code | Behaviour |
|---|---|---|
| non-NULL | any | Parser invokes reduce(reduce_user, ...) at reduce time. |
| NULL | non-NULL | code is compiled into the parser's generated reduce() switch at generator time. Applicable to grammars fed through lime; not usable from extensions loaded into a pre-compiled parser. |
| NULL | NULL | Rule reduces with no action. |
Current implementation status: reduce-based dispatch is not yet wired through to the push-parser stack (blocks on the runtime rebuild work). The types are stable; extension code written against the contract today will not need changes when dispatch lights up.
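The precedence between the two action-source fields reduces to a two-step check. The sketch below encodes it; `DemoAddRule`, the `demo_reduce_fn` signature and the `DemoActionSource` enum are hypothetical stand-ins, since the real callback signature is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback shape; the real reduce signature takes more arguments. */
typedef void (*demo_reduce_fn)(void *reduce_user);

typedef struct DemoAddRule {
    demo_reduce_fn reduce;       /* runtime callback, may be NULL */
    void          *reduce_user;  /* passed back to the callback */
    const char    *code;         /* action source text, may be NULL */
} DemoAddRule;

typedef enum DemoActionSource {
    DEMO_ACTION_NONE,       /* both NULL: rule reduces with no action */
    DEMO_ACTION_CODE,       /* reduce NULL, code set: compiled at generator time */
    DEMO_ACTION_CALLBACK    /* reduce set: wins regardless of code */
} DemoActionSource;

/* Placeholder callback so callers have something non-NULL to install. */
void demo_noop_reduce(void *reduce_user) { (void)reduce_user; }

/* Decide which action source applies, following the documented precedence. */
DemoActionSource demo_action_source(const DemoAddRule *rule) {
    assert(rule != NULL);
    if (rule->reduce != NULL) return DEMO_ACTION_CALLBACK;
    if (rule->code != NULL)   return DEMO_ACTION_CODE;
    return DEMO_ACTION_NONE;
}
```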
When two extensions modify the same grammar element, the on_conflict callback is invoked:
Header: src/mod_serialize.h
Render an array of GrammarModifications as .lime-syntax text that, when concatenated after a base grammar and re-parsed by the lime generator, produces a parser equivalent to applying the modifications. This is the intended mechanism for the "subprocess fallback" pattern that unblocks runtime extension validation while real in-process apply_add_rule() (Task #3) is pending.
Returns: malloc'd NUL-terminated buffer; NULL on allocation failure or bad arguments. Caller owns the buffer.
Round-trip fidelity – not every modification serializes cleanly:
| Case | Behaviour |
|---|---|
| MOD_ADD_RULE with .reduce != NULL and .code == NULL | Skipped; counted in *skipped_out. A function pointer has no text form. |
| MOD_REMOVE_RULE | Always skipped; concat cannot express removal. Filter the base grammar text if removals must take effect. |
| MOD_MODIFY_PRECEDENCE with new_assoc == 0 | Emitted as a comment (no single .lime directive expresses "no associativity"). |
| Integer .precedence on MOD_ADD_RULE | Emitted as a /* NOTE */ comment; .lime uses [SYMBOL] markers, not numbers. |
Typical subprocess-fallback usage:
Header: include/jit_context.h
Optional LLVM-based JIT compilation of parser action tables. When LLVM is available, the JIT compiles specialized lookup functions for each parser state, replacing table-driven lookups with direct branch sequences.
When compiled without LLVM (LIME_NO_JIT), all JIT functions degrade to no-ops.
lime_jit_available() returns true if LLVM was linked and initialization succeeds.
lime_jit_compile() compiles and attaches JIT code to a snapshot. Returns 0 on success, non-zero on failure. No-op if already compiled or LLVM is unavailable.
jit_find_shift_action() is the primary runtime dispatch function. If the snapshot has JIT code for the given state, it uses the compiled path; otherwise it falls back to the table-driven lookup.
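The compiled-path-or-table fallback can be sketched as follows. The table lookup is a simplified Lemon-style scheme (offset plus lookahead, verified against yy_lookahead, else the state default) and omits details a real generated parser handles, such as fallback and wildcard tokens. `DemoTables`, `demo_jit_fn`, `jit_by_state` and `demo_state1_compiled` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shape of a per-state compiled lookup function. */
typedef uint16_t (*demo_jit_fn)(uint16_t lookahead);

typedef struct DemoTables {
    const uint16_t    *yy_action;
    const uint16_t    *yy_lookahead;
    const int16_t     *yy_shift_ofst;
    const uint16_t    *yy_default;
    size_t             action_count;   /* entries in yy_action / yy_lookahead */
    const demo_jit_fn *jit_by_state;   /* NULL, or per-state entries (NULL = not compiled) */
} DemoTables;

/* Stand-in for JIT output: a direct branch sequence for one state. */
uint16_t demo_state1_compiled(uint16_t lookahead) {
    return lookahead == 2 ? 40 : 99;
}

/* Table-driven path: index = shift offset + lookahead; the slot is valid
 * only if yy_lookahead confirms it belongs to this lookahead. */
static uint16_t demo_table_lookup(const DemoTables *t, uint32_t state,
                                  uint16_t lookahead) {
    long i = (long)t->yy_shift_ofst[state] + (long)lookahead;
    if (i >= 0 && (size_t)i < t->action_count && t->yy_lookahead[i] == lookahead)
        return t->yy_action[i];
    return t->yy_default[state];
}

/* Prefer compiled code for the state when present, else use the tables. */
uint16_t demo_find_shift_action(const DemoTables *t, uint32_t state,
                                uint16_t lookahead) {
    assert(t != NULL);
    if (t->jit_by_state != NULL && t->jit_by_state[state] != NULL)
        return t->jit_by_state[state](lookahead);
    return demo_table_lookup(t, state, lookahead);
}
```

Keeping the table path intact means a snapshot without JIT code, or a build with JIT disabled, behaves identically apart from speed.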
Header: include/jit_policy.h
Adaptive JIT compilation policy that decides when to compile based on runtime metrics. Tracks per-snapshot usage and triggers compilation when the expected benefit exceeds the cost.
jit_maybe_compile() returns 0 if compilation was triggered, 1 if metrics do not yet warrant compilation, or -1 on error. When background_compile is true, compilation happens on a detached thread.
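A cost/benefit trigger with the same 0 / 1 / -1 return convention could be sketched like this. The policy fields, the metric, and the benefit estimate (parses so far times an assumed per-parse saving) are all hypothetical, as is treating an already compiled snapshot as "not warranted"; the library's actual heuristics are not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy knobs. */
typedef struct DemoPolicy {
    uint64_t min_parses;            /* never compile before this many parses */
    double   compile_cost_ns;       /* estimated one-off compilation cost */
    double   saving_per_parse_ns;   /* estimated gain per parse once compiled */
} DemoPolicy;

/* Hypothetical per-snapshot metrics. */
typedef struct DemoMetrics {
    uint64_t parse_count;
    bool     compiled;
} DemoMetrics;

/* Returns 0 if compilation is triggered, 1 if metrics do not yet warrant
 * it (or it already happened), -1 on bad arguments. */
int demo_maybe_compile(const DemoPolicy *policy, DemoMetrics *metrics) {
    if (policy == NULL || metrics == NULL) return -1;
    if (policy->compile_cost_ns < 0.0 || policy->saving_per_parse_ns < 0.0)
        return -1;
    if (metrics->compiled) return 1;
    if (metrics->parse_count < policy->min_parses) return 1;

    /* Use past traffic as the estimate of future traffic. */
    double benefit = (double)metrics->parse_count * policy->saving_per_parse_ns;
    if (benefit <= policy->compile_cost_ns) return 1;

    assert(!metrics->compiled);     /* invariant: compile at most once */
    metrics->compiled = true;       /* a real policy would start the compile here */
    return 0;
}
```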
| Field | Type | Description |
|---|---|---|
| version | uint64_t | Monotonically increasing version number |
| refcount | atomic_uint_fast32_t | Reference count (starts at 1) |
| symbols | struct symbol ** | Array of symbol structs |
| nsymbol | uint32_t | Total symbol count |
| nterminal | uint32_t | Terminal symbol count |
| rules | struct rule * | Linked list of production rules |
| nrule | uint32_t | Rule count |
| states | struct state ** | Array of parser states |
| nstate | uint32_t | State count |
| yy_action | uint16_t * | Combined shift+reduce action array |
| yy_lookahead | uint16_t * | Parallel lookahead values |
| yy_shift_ofst | int16_t * | Per-state shift offset |
| yy_reduce_ofst | int16_t * | Per-state reduce offset |
| yy_default | uint16_t * | Default action per state |
| create_time_ns | uint64_t | Creation timestamp (nanoseconds) |
| jit_ctx | void * | Attached JIT context (or NULL) |
Defined in include/tokenize.h. Keyword tokens use positive codes assigned via the TokenTable. Built-in token types use non-positive values:
| Constant | Value | Description |
|---|---|---|
| TK_EOF | 0 | End of input |
| TK_IDENTIFIER | -1 | Unrecognized identifier |
| TK_INTEGER | -2 | Integer literal (decimal or hex) |
| TK_FLOAT | -3 | Floating point literal |
| TK_STRING | -4 | Single-quoted string literal |
| TK_BLOB | -5 | Blob literal (X'...') |
| TK_LPAREN | -6 | ( |
| TK_RPAREN | -7 | ) |
| TK_SEMICOLON | -8 | ; |
| TK_COMMA | -9 | , |
| TK_DOT | -10 | . |
| TK_STAR | -11 | * |
| TK_PLUS | -12 | + |
| TK_MINUS | -13 | - |
| TK_SLASH | -14 | / |
| TK_PERCENT | -15 | % |
| TK_EQ | -16 | = or == |
| TK_NE | -17 | != or <> |
| TK_LT | -18 | < |
| TK_GT | -19 | > |
| TK_LE | -20 | <= |
| TK_GE | -21 | >= |
| TK_BITAND | -22 | & |
| TK_BITOR | -23 | \| |
| TK_BITNOT | -24 | ~ |
| TK_LSHIFT | -25 | << |
| TK_RSHIFT | -26 | >> |
| TK_CONCAT | -27 | \|\| |
| TK_DQUOTE_ID | -28 | "quoted identifier" |
| TK_BACKTICK_ID | -29 | backtick identifier |
| TK_BRACKET_ID | -30 | [bracket identifier] |
| TK_ILLEGAL | -31 | Unrecognized character |
The library uses the following conventions for error reporting:
- Functions that return pointers (tokenizer_create, parse_begin, create_token_table, lime_snapshot_create) return NULL on failure.
- Functions that return bool (add_token, register_extension, load_extension) return false on failure.
- Functions with a char **error parameter set it to a malloc'd string on failure. The caller must free() this string.
- JIT functions return JITStatus enum values. Use jit_status_string() to convert to a human-readable message.
- Cleanup functions (tokenizer_destroy, parse_end, lime_snapshot_release) accept NULL safely.

Lime's generated parsers accept a caller-supplied allocator via XxxAlloc(void *(*mallocProc)(size_t)) (where Xxx is the parser-name prefix set by name or -P). The matching XxxFree(void *, void (*freeProc)(void*)) uses the caller's free. Unlike Bison's YYMALLOC/YYFREE macros, the allocator is passed as a first-class argument rather than baked in at compile time.
The contract the generator relies on:
- mallocProc may return NULL on failure, or it may never return (longjmp / throw). If mallocProc returns NULL, the parser enters a failure path and subsequent Parse() calls are no-ops. If it longjmps out, the parser's internal state is left in whatever condition the jump leaves it; the caller must not reuse that parser instance without calling XxxFree first.
- freeProc is called exactly once per successful mallocProc allocation, and always on a pointer that mallocProc returned.
- Alignment is max_align_t. Pointers returned by mallocProc must satisfy the alignment requirements of any C type up to max_align_t (the guarantee malloc(3) gives). Lime never allocates over-aligned objects.
- Allocation happens at XxxAlloc time and then occasionally as the shift stack grows past stack_size. Callers embedding Lime in memory-constrained contexts can set stack_size to a static upper bound to avoid runtime growth.

This contract lets a Lime-driven parser hosted inside a language runtime (e.g. one with a memory-context-aware allocator and longjmp-based error handling) delegate allocation to that runtime without macro gymnastics.
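The pairing rule in the contract above can be checked with a counting allocator. In this sketch `DemoAlloc` and `DemoFree` are hypothetical stand-ins that only copy the documented XxxAlloc/XxxFree shapes and make a single allocation; a generated parser may allocate more than once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts allocations that have not been freed yet (single-threaded sketch). */
static long demo_live_allocs = 0;

void *demo_counting_malloc(size_t size) {
    void *p = malloc(size);
    if (p != NULL) demo_live_allocs++;   /* count successes only */
    return p;
}

void demo_counting_free(void *p) {
    if (p == NULL) return;
    demo_live_allocs--;
    free(p);
}

long demo_live(void) { return demo_live_allocs; }

/* Hypothetical parser object with a fixed-size stack, so one allocation suffices. */
typedef struct DemoParser {
    int    stack[16];
    size_t depth;
} DemoParser;

/* Same shape as the documented XxxAlloc: every byte comes from mallocProc. */
void *DemoAlloc(void *(*mallocProc)(size_t)) {
    assert(mallocProc != NULL);
    DemoParser *parser = mallocProc(sizeof *parser);
    if (parser == NULL) return NULL;     /* allocator reported failure */
    parser->depth = 0;
    return parser;
}

/* Same shape as the documented XxxFree: one freeProc call per allocation. */
void DemoFree(void *parser, void (*freeProc)(void *)) {
    assert(freeProc != NULL);
    if (parser == NULL) return;
    freeProc(parser);
}
```

After a balanced DemoAlloc/DemoFree pair, `demo_live()` returns to its starting value; a non-zero value would indicate a missed or duplicated free.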
| Component | Read | Write |
|---|---|---|
| ParserSnapshot | Thread-safe (immutable after creation) | N/A (immutable) |
| snapshot_acquire/release | Thread-safe (atomic refcount) | N/A |
| ParseContext | Single-thread only | Single-thread only |
| Tokenizer | Single-thread only | Single-thread only |
| TokenTable lookup | Lock-free (concurrent readers) | Write-locked |
| TokenTable add/remove | N/A | Acquires internal lock |
| ExtensionRegistry | Read-locked (concurrent) | Write-locked |
| JITMetrics | Atomic reads | Atomic updates |
Key points:
- Each ParseContext and Tokenizer is single-threaded. Create one per thread/task.
- TokenTable supports concurrent readers with lock-free lookups. Writes (adding/removing tokens) serialize internally.