Lime Parser Generator 0.1.0
Runtime-extensible LALR(1) parser with SIMD tokenization and LLVM JIT
If you only need to generate parsers from grammars — no runtime extension, no SIMD, no JIT — embed two files:
Build the generator:
Generate a parser from your grammar:
That's it. The generated parser.c has no external dependencies beyond the C standard library.
To use runtime extensions, SIMD tokenization, or JIT, link against the Lime extension library.
Create subprojects/lime.wrap:
In your meson.build:
This gives you the full extension framework: include/parser.h, include/extension_registry.h, include/conflict.h, etc.
Build and install Lime:
Then use pkg-config or find the headers in /usr/local/include and the library in /usr/local/lib.
Compile the extension library sources directly:
(Lime does not ship a CMakeLists.txt, but the above pattern works if you add one or compile the sources directly.)
The extension framework requires:
| Library | Required | Purpose |
|---|---|---|
| pthreads | Yes | Thread-safe registry, snapshot refcounting |
| libdl | Yes (Linux) | dlopen/dlsym for shared library extensions |
| libm | Optional | Math functions in benchmarks |
| libLLVM | Optional | JIT compilation (link with llvm-config --libs, or --link-static for a self-contained binary; see -Dllvm-static in GETTING_STARTED.md) |
On macOS, dlopen is in libSystem (no -ldl needed).
All public headers are in include/. The main entry points:
Lime ships two tokenizer options and supports a third:
1. **tokenize.c at the repository root** – a SQLite-flavoured lexer that recognises SQL-ish tokens. Provided as a runnable example and as the default driver for Lime's own test and example grammars.
2. **src/tokenize.c + src/tokenize_simd.c** – the SIMD-accelerated tokenizer used by the extension framework. Same SQLite-flavoured dialect; the SIMD classifier speeds up whitespace / identifier / number runs on x86_64 (AVX2) and falls back to scalar on other ISAs.
3. **Your own tokenizer** – parse_token(ctx, token_code, token_value) is a pure push interface. No part of Lime assumes the tokens came from Lime's built-in lexers. Host languages and embedded runtimes routinely ship a hand-written tokenizer that understands their exact lexical rules (dollar-quoting, language-specific escape syntax, multi-token lookahead, locale-aware case folding, ...) and feed its output into Lime's push parser. This is a fully supported and expected integration pattern.

If you use option 3 and need only the parser generator, you can drop src/tokenize*.c from the link line entirely – the generator produces Parse() / XxxAlloc() symbols that have no dependency on any particular tokenizer.
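As an illustration of the push pattern in option 3, here is a minimal sketch of a hand-written tokenizer that feeds tokens into a callback. The callback type PushFn only stands in for parse_token(ctx, token_code, token_value); its exact signature, the TK_* codes, TokenValue, and the Recorder sink are assumptions made for this example, not part of Lime's API. In a real build the token codes come from the header the generator emits.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; real ones come from the generated header. */
enum { TK_NUMBER = 1, TK_IDENT = 2, TK_PUNCT = 3 };

/* Token value handed to the parser: a slice of the input, not a copy. */
typedef struct {
    const char *start;
    size_t len;
} TokenValue;

/* Stand-in for the push entry point (assumed shape); returns 0 on success. */
typedef int (*PushFn)(void *ctx, int token_code, TokenValue value);

/* Scan `src` and push each token into the sink.
 * Returns the number of tokens pushed, or -1 if the sink rejects one. */
int tokenize_and_push(const char *src, PushFn push, void *ctx)
{
    int count = 0;
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }
        const char *start = p;
        int code;
        if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            code = TK_NUMBER;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            code = TK_IDENT;
        } else {
            p++;                      /* any other single character */
            code = TK_PUNCT;
        }
        TokenValue v = { start, (size_t)(p - start) };
        if (push(ctx, code, v) != 0) return -1;
        count++;
    }
    return count;
}

/* Example sink that records token codes instead of parsing them. */
typedef struct { int codes[16]; int n; } Recorder;

int record_token(void *ctx, int token_code, TokenValue value)
{
    Recorder *r = ctx;
    (void)value;
    if (r->n >= 16) return 1;         /* sink is full: reject the token */
    r->codes[r->n++] = token_code;
    return 0;
}
```

Swapping record_token for a thin wrapper around the real push call is the only change the tokenizer itself would need, which is the point of keeping the interface push-only.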
Extensions can be compiled as shared libraries and loaded at runtime:
See examples/calc/ for a complete working example of dlopen-based extension loading, and examples/plugin_template/ for a reusable template.
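For readers unfamiliar with the underlying mechanism, the following sketch shows the generic POSIX dlopen/dlsym sequence. The entry-point type ExtensionInitFn, the function load_extension, and whatever symbol name you pass are hypothetical; the real entry point and its signature are defined by Lime's headers and the examples named above.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical entry-point signature, used only for this sketch. */
typedef int (*ExtensionInitFn)(void *registry);

/* Load the shared library at `path`, resolve `symbol`, and call it with
 * `registry`. Returns 0 on success and -1 on any failure. On success the
 * library handle is stored in *handle_out; the caller dlclose()s it later. */
int load_extension(const char *path, const char *symbol,
                   void *registry, void **handle_out)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    dlerror();                        /* clear any stale error state */
    void *sym = dlsym(handle, symbol);
    const char *err = dlerror();
    if (err != NULL || sym == NULL) {
        fprintf(stderr, "dlsym(%s) failed: %s\n", symbol,
                err ? err : "symbol is NULL");
        dlclose(handle);
        return -1;
    }

    /* POSIX requires that a dlsym result can be converted like this. */
    ExtensionInitFn init = (ExtensionInitFn)sym;
    if (init(registry) != 0) {
        dlclose(handle);
        return -1;
    }

    *handle_out = handle;
    return 0;
}
```

RTLD_LOCAL keeps the extension's symbols out of the global namespace, which fits the collision concerns discussed later in this document. On Linux with older C libraries this needs -ldl on the link line.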
Add a build step that runs lime on your grammar:
The generated parser includes the template code from limpar.c. If you want to customize the template (e.g., to add tracing or change memory allocation), use -T:
A generated parser exports functions like ParseAlloc, ParseFree, Parse, ParseTrace, plus macros like ParseARG_STORE and ParseCTX_FETCH. If you link multiple generated parsers into one binary, or your project already defines a symbol called Parse, you need to rename these.
Two options:
Per-grammar, via the %name directive:
This renames every exported symbol to the prefix SqlParser: SqlParserAlloc, SqlParserFree, SqlParser, etc.
Per-invocation, via the -P command-line flag:
The -P flag overrides %name without editing the grammar, which is useful when you don't control the grammar file or want to generate the same grammar under several names.
The two mechanisms are equivalent; pick whichever fits your build better. For example, in a Makefile:
The runtime extension library exports symbols in three naming schemes:
| Prefix | Scope | Examples |
|---|---|---|
| lime_*, Lime*, LIME_* | Preferred modern API | lime_arena_create, LimePluginHandle, LIME_PLUGIN_ABI_VERSION |
| lemon_* | Legacy (snapshot/registry) | lemon_snapshot_create, lemon_parser_version |
| Unprefixed | Internal/runtime API | parse_begin, Token, Tokenizer, ExtensionRegistry, snapshot_acquire |
If you embed the library directly in your project (Option 3), the unprefixed symbols may collide with existing identifiers in your codebase. Mitigations:
- Link Lime as a shared library (liblime_parser.so / .dylib) so symbol resolution happens at load time rather than link time. Conflicts become local to the library.
- Keep the lower-level headers out of most translation units: include/parse_context.h, include/token_table.h, include/conflict.h, etc. You typically only need #include "parser.h" in most of your code and can isolate the lower-level headers to one file.
- Namespace via preprocessor (advanced). Before including any Lime header, add:
This is awkward, but works if you have a hard collision you cannot resolve otherwise.
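A minimal sketch of the preprocessor approach, under stated assumptions: Token and Tokenizer are taken from the unprefixed row of the table above, the LimeNs prefix is arbitrary, and the two struct declarations are illustrative stand-ins for what #include "parser.h" would bring in (the real definitions will differ).

```c
#include <stddef.h>

/* 1. Rename the colliding identifiers for the duration of the include. */
#define Token     LimeNsToken
#define Tokenizer LimeNsTokenizer

/* 2. Stand-in for `#include "parser.h"`. These declarations are
 *    illustrative only; they are not Lime's real definitions. */
typedef struct Token { int code; const char *start; size_t len; } Token;
typedef struct Tokenizer { const char *cursor; Token current; } Tokenizer;

/* 3. Drop the renames so host code keeps its own meaning for the names. */
#undef Token
#undef Tokenizer

/* The host project's unrelated Token type now compiles without a clash. */
typedef struct Token { const char *text; } Token;

/* Host code refers to the library types by their renamed spellings. */
size_t lime_token_length(const LimeNsToken *t) { return t->len; }
const char *host_token_text(const Token *t) { return t->text; }
```

Renaming types this way is purely a compile-time matter. For functions and other symbols with external linkage, the library sources have to be compiled with the same defines, otherwise the names seen by the linker will not match; that constraint is why this approach only fits the embed-the-sources case.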
The STRAT_* and EXEC_* enum values in include/disambiguation.h and include/execution_policy.h are distinct from the DISAMBIG_* and EXEC_SEQUENTIAL/EXEC_PARALLEL values in include/extension_registry.h. The runtime enums are typed LimeStrategy and LimeExecMode; the metadata enums retain the names DisambiguationStrategy and ExecutionPolicy. Both sets coexist without conflict in the same translation unit.
By default, the generated parser is heap-allocated via ParseAlloc. For applications that parse repeatedly (database query engines, interactive REPLs, language servers), the malloc/free cycle per parse becomes a noticeable fraction of total time. Two patterns eliminate it:
ParseInit(void *rawParser) resets parser state without freeing memory. The allocation covers the whole parser including its stack (YYSTACKDEPTH entries, default 100), so one malloc is enough unless the grammar needs deeper nesting for pathological input.
For a 55-token SQL parse this saves ~25 ns per parse (about 4% on Apple Silicon; more on systems with slower mallocs).
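The allocate-once, reset-per-input loop can be sketched as follows. Everything prefixed Demo is a stand-in written for this example so that it compiles without generated code; the real ParseAlloc, ParseInit, Parse, and ParseFree come from the generated parser and take additional arguments depending on the grammar. Treating token code 0 as end of input follows the Lemon convention and is assumed here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Stand-in parser state. A generated parser keeps its own private struct. */
typedef struct {
    int tokens_seen;
    int accepted;
} DemoParser;

static int demo_alloc_calls = 0;      /* counts heap allocations for the demo */

void *DemoParseAlloc(void *(*mallocProc)(size_t))
{
    DemoParser *p = mallocProc(sizeof *p);
    demo_alloc_calls++;
    if (p) { p->tokens_seen = 0; p->accepted = 0; }
    return p;
}

/* Reset state in place, as the text describes for ParseInit. */
void DemoParseInit(void *raw)
{
    DemoParser *p = raw;
    p->tokens_seen = 0;
    p->accepted = 0;
}

/* Token code 0 marks end of input (assumed, Lemon-style). */
void DemoParse(void *raw, int code)
{
    DemoParser *p = raw;
    if (code == 0) p->accepted = 1;
    else p->tokens_seen++;
}

void DemoParseFree(void *raw, void (*freeProc)(void *)) { freeProc(raw); }

/* Parse several zero-terminated token streams with a single allocation.
 * Returns the number of streams accepted, or -1 if allocation fails. */
int parse_all(const int *const streams[], size_t n)
{
    void *parser = DemoParseAlloc(malloc);
    int accepted = 0;
    if (!parser) return -1;
    for (size_t i = 0; i < n; i++) {
        DemoParseInit(parser);        /* reuse the same memory */
        const int *tok = streams[i];
        for (;;) {
            DemoParse(parser, *tok);
            if (*tok == 0) break;
            tok++;
        }
        accepted += ((DemoParser *)parser)->accepted;
    }
    DemoParseFree(parser, free);
    return accepted;
}
```

The allocation and the free sit outside the loop, so the per-parse cost is only the reset.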
If your grammar has a known maximum stack depth, set %stack_size accordingly and compile with -DParse_ENGINEALWAYSONSTACK. This removes ParseAlloc and ParseFree from the generated code, so you provide the buffer:
Zero allocations during parsing. Suitable for embedded use, real-time systems, or any situation where you want to avoid malloc entirely.
The destructor directive may still allocate semantic values; control that separately by arena-allocating them in your reduction actions.
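One way to arena-allocate semantic values is a simple bump allocator that is reset after each parse. The sketch below is self-contained and deliberately does not use Lime's own arena functions, whose signatures are not shown in this document; DemoArena, ExprNode, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* A fixed-capacity bump arena: one malloc up front, none per node. */
typedef struct {
    unsigned char *base;
    size_t cap;
    size_t used;
} DemoArena;

int demo_arena_init(DemoArena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = a->base ? cap : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Returns suitably aligned memory, or NULL when the arena is exhausted. */
void *demo_arena_alloc(DemoArena *a, size_t size)
{
    size_t align = _Alignof(max_align_t);          /* power of two (C11) */
    size_t start = (a->used + align - 1) & ~(align - 1);
    if (start > a->cap || size > a->cap - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

/* Release every value at once; call this between parses. */
void demo_arena_reset(DemoArena *a) { a->used = 0; }

void demo_arena_free(DemoArena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/* Example semantic value that a reduction action might build. */
typedef struct ExprNode {
    int op;                           /* 0 for a leaf */
    struct ExprNode *lhs, *rhs;
    int value;
} ExprNode;

ExprNode *make_leaf(DemoArena *a, int value)
{
    ExprNode *n = demo_arena_alloc(a, sizeof *n);
    if (n) { n->op = 0; n->lhs = n->rhs = NULL; n->value = value; }
    return n;
}

ExprNode *make_binary(DemoArena *a, int op, ExprNode *lhs, ExprNode *rhs)
{
    ExprNode *n = demo_arena_alloc(a, sizeof *n);
    if (n) { n->op = op; n->lhs = lhs; n->rhs = rhs; n->value = 0; }
    return n;
}
```

With this layout, values are never freed individually, so any per-symbol cleanup code in the grammar must not call free on arena-owned nodes.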
Lime's generated parser contains assert() calls in its hot path, as Lemon's does (inherited from its SQLite lineage). Strip them in production builds:
Without -DNDEBUG, asserts add ~20-35 ns per parse on Apple Silicon (roughly 5-10% of parse time). With -DNDEBUG they vanish entirely.
The project's meson.build sets b_ndebug=if-release, so NDEBUG is automatically defined for --buildtype=release and plain, but not for the debug or debugoptimized defaults. If you use Lime's meson build directly, pass --buildtype=release for benchmark-quality builds.
- Parser contexts (ParseContext) are per-thread, not shared
- The lime generator itself is single-threaded (run once at build time)