Lime Parser Generator 0.1.0
Runtime-extensible LALR(1) parser with SIMD tokenization and LLVM JIT
This guide covers how to produce rich error messages from a Lime-generated parser — precise source locations, expected-token hints, and recovery to continue parsing after an error.
Lime's built-in tokenizer (include/tokenize.h) already provides everything needed for precise caret positioning:
If you write your own tokenizer instead of using Lime's, you should store the same information in whatever token struct you pass to the parser — the length is what lets you underline the exact run of characters that caused the error, rather than placing a single caret at the start offset.
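To make the role of the length concrete, here is a minimal sketch in C of turning a token's column and length into an underline. DemoToken and demo_underline are hypothetical names used only for illustration; they are not Lime's token type, which is defined in include/tokenize.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token record; Lime's real struct lives in include/tokenize.h. */
typedef struct DemoToken {
    int    type;    /* token code passed to the parser */
    size_t offset;  /* byte offset of the first character in the source */
    size_t length;  /* number of bytes the token spans */
    int    line;    /* 1-based line number */
    int    column;  /* 1-based column of the first character */
} DemoToken;

/*
 * Write an underline for tok into buf: (column - 1) spaces, one '^',
 * then '~' for the remaining bytes of the token. Returns the number of
 * characters written, or 0 if the arguments are invalid or buf is too
 * small.
 */
size_t demo_underline(const DemoToken *tok, char *buf, size_t bufsize)
{
    if (tok == NULL || buf == NULL || tok->column < 1) return 0;
    size_t pad = (size_t)(tok->column - 1);
    size_t len = tok->length > 0 ? tok->length : 1;  /* always show a caret */
    if (pad + len + 1 > bufsize) return 0;
    memset(buf, ' ', pad);
    buf[pad] = '^';
    memset(buf + pad + 1, '~', len - 1);
    buf[pad + len] = '\0';
    return pad + len;
}
```

Printing the source line followed by this buffer gives an underline that covers the whole token. With only the start offset stored, the same code could draw nothing but the single caret.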
Every generated parser exports three introspection functions that let your syntax_error handler build context-rich messages:
(The Parse prefix is replaceable via the name directive or the -P command-line flag; the examples below assume the default.)
Returns the string name of a terminal token code, or NULL if the code is out of range. Names come from the yyTokenName[] table that the generator already emits for trace output.
Returns the numeric state the parser is in, or -1 for an invalid handle. A freshly-initialized parser is in state 0.
The state number is meaningful only as an input to ParseExpectedTokens(). Treat it as opaque — the actual number depends on the grammar and changes every time the parser is regenerated.
Fills out with the token codes that are valid at stateno and returns the count written (up to max). If out is NULL or max is 0, returns the total count without writing anything — so callers can size a buffer and call again.
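The size-then-fill pattern this contract allows can be sketched as follows. Because the exact prototype is not shown here, the sketch calls demo_expected_tokens, a hypothetical stand-in that follows the contract described above over a made-up table; the real function's name and parameter order may differ.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in that follows the contract described above:
 * with out == NULL or max == 0 it returns the total count; otherwise it
 * writes up to max codes and returns the number written.
 */
static int demo_expected_tokens(int stateno, int *out, int max)
{
    static const int codes[] = { 4, 7, 9 };    /* made-up token codes */
    const int total = (stateno == 0) ? 3 : 0;  /* only state 0 expects tokens */
    if (out == NULL || max <= 0) return total;
    int n = total < max ? total : max;
    for (int i = 0; i < n; i++) out[i] = codes[i];
    return n;
}

/*
 * Size-then-fill: ask for the count, allocate, then fetch the codes.
 * Returns a malloc'd array (caller frees) and stores its length in
 * *count. Returns NULL with *count == 0 when nothing is expected or
 * when malloc fails.
 */
int *demo_collect_expected(int stateno, int *count)
{
    *count = 0;
    int total = demo_expected_tokens(stateno, NULL, 0);
    if (total <= 0) return NULL;
    int *codes = malloc((size_t)total * sizeof *codes);
    if (codes == NULL) return NULL;
    *count = demo_expected_tokens(stateno, codes, total);
    return codes;
}
```

A fixed-size stack buffer works just as well when truncating a long list is acceptable, since the return value is capped at max.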
Returns a heap-allocated comma-separated list of expected token names at the current state:
This is equivalent to calling ParseState + ParseExpectedTokens + ParseTokenName yourself, then joining with commas. Use it for quick messages; use the three lower-level calls when you want more control (e.g., filtering, sorting, localization).
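As an illustration of the manual route, the sketch below joins token names with commas and skips codes that have no name, which is one simple form of filtering. demo_token_name is a hypothetical stand-in for the name lookup, and both the ", " separator and the skip-on-NULL policy are assumptions of this sketch, not documented behaviour of the built-in helper.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a token-name lookup: NULL when out of range. */
static const char *demo_token_name(int code)
{
    static const char *const names[] = { "$", "SELECT", "FROM", "SEMI" };
    if (code < 0 || code >= (int)(sizeof names / sizeof names[0])) return NULL;
    return names[code];
}

/*
 * Join the names of n token codes with ", ". Codes without a name are
 * skipped. Returns a malloc'd string (caller frees), or NULL on
 * allocation failure.
 */
char *demo_join_expected(const int *codes, int n)
{
    size_t size = 1;  /* terminating NUL */
    for (int i = 0; i < n; i++) {
        const char *name = demo_token_name(codes[i]);
        if (name != NULL) size += strlen(name) + 2;  /* name plus ", " */
    }
    char *buf = malloc(size);
    if (buf == NULL) return NULL;
    buf[0] = '\0';
    for (int i = 0; i < n; i++) {
        const char *name = demo_token_name(codes[i]);
        if (name == NULL) continue;
        if (buf[0] != '\0') strcat(buf, ", ");
        strcat(buf, name);
    }
    return buf;
}
```

Sorting or localizing the names fits in the same loop: transform each name before appending it.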
Inside a syntax_error { ... } block, these identifiers are available:
| Identifier | Type | Meaning |
|---|---|---|
| yymajor | int | Token code of the offending lookahead. 0 means end-of-input. |
| yyminor | ParseTOKENTYPE | Semantic value of the offending lookahead. |
| TOKEN | ParseTOKENTYPE | Alias for yyminor. |
| yyloc | YYLOCATIONTYPE | Source location of the offending lookahead. See "Location semantics" below. |
| TOKEN_LOC | YYLOCATIONTYPE | Alias for yyloc. |
| yypParser | void * | Parser handle, for passing to ParseState etc. |
When the grammar declares locations and parsing is driven by ParseLoc() (or parse_token() with a non-LIME_LOC_UNKNOWN location), yyloc holds the source location of the offending lookahead – the token that the parser could not accept – not the location of the previously-shifted symbol on top of the stack.
This matches Bison's *yylloc semantics in yyerror(): the location Bison passes is the location of the token that triggered the error. Concretely, this distinguishes two cases that previous Lime versions could not:

- End of input (yymajor == 0): the parser failed on the EOF marker. yyloc is whatever location the caller passed for the EOF marker (typically LIME_LOC_UNKNOWN or a sentinel like the byte offset just past the input's end).
- Real token (yymajor != 0): the parser failed on a real token. yyloc is that token's location.

Without this distinction, an error message that said "at or near X" would print the wrong token name in one of the two cases. PG-compatible callers can write:
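One way to sketch that distinction in C, assuming PG refers to PostgreSQL and its two message forms ("at end of input" and "at or near"): demo_format_syntax_error and its text/len parameters are hypothetical; a real handler would take the token text from its own token value type.

```c
#include <stdio.h>

/*
 * Format a PostgreSQL-style message into buf. yymajor == 0 means the
 * parser failed on the end-of-input marker, so there is no token text
 * to quote. text/len describe the offending token's bytes in the source
 * (hypothetical parameters). Returns the value of snprintf.
 */
int demo_format_syntax_error(char *buf, size_t bufsize,
                             int yymajor, const char *text, int len)
{
    if (yymajor == 0) {
        return snprintf(buf, bufsize, "syntax error at end of input");
    }
    if (text == NULL || len <= 0) {
        /* A real token, but its text is unavailable. */
        return snprintf(buf, bufsize, "syntax error");
    }
    return snprintf(buf, bufsize, "syntax error at or near \"%.*s\"",
                    len, text);
}
```

The branch is taken on yymajor rather than on the location, so the message stays correct even when the caller passes no usable location for the EOF marker.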
When the grammar does not declare locations, or parsing is driven by the location-less Parse() entry point, yyloc is zero-initialised at ParseInit() time and remains so. Treat zero as "location unknown" in that case.
A complete handler looks like this:
include/lime_error.h provides an optional linked-list error type for accumulating multiple errors during a parse. Hosts that want a different structure can ignore it and roll their own.
The message and expected fields are duplicated on append; the filename is borrowed (not copied) so make sure it outlives the error chain, or pass NULL.
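The ownership rule can be illustrated with a small self-contained list. The DemoError type and the demo_error_* functions below are hypothetical and are not the declarations in include/lime_error.h; they only show what "duplicated on append" and "borrowed" mean in practice.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical error node; field and function names are illustrative only. */
typedef struct DemoError {
    char *message;         /* owned copy */
    char *expected;        /* owned copy, may be NULL */
    const char *filename;  /* borrowed: must outlive the chain, or be NULL */
    int line, column;
    struct DemoError *next;
} DemoError;

/* Duplicate a string with malloc; returns NULL for NULL input. */
static char *demo_strdup(const char *s)
{
    if (s == NULL) return NULL;
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Append to the end of *head. Returns 0 on success, -1 if allocation fails. */
int demo_error_append(DemoError **head, const char *filename, int line,
                      int column, const char *message, const char *expected)
{
    DemoError *e = calloc(1, sizeof *e);
    if (e == NULL) return -1;
    e->message = demo_strdup(message);
    e->expected = demo_strdup(expected);
    if ((message && !e->message) || (expected && !e->expected)) {
        free(e->message);
        free(e->expected);
        free(e);
        return -1;
    }
    e->filename = filename;  /* pointer stored as-is, not copied */
    e->line = line;
    e->column = column;
    while (*head != NULL) head = &(*head)->next;
    *head = e;
    return 0;
}

/* Free every node and its owned strings; the filename is left alone. */
void demo_error_free(DemoError *head)
{
    while (head != NULL) {
        DemoError *next = head->next;
        free(head->message);
        free(head->expected);
        free(head);
        head = next;
    }
}
```

Because the message is copied, a handler can format it into a stack buffer and append it safely. The filename gets no such protection, which is why a pointer to a short-lived buffer must not be passed there.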
Lime inherits Lemon's error-recovery mechanism, which lets a parser continue after a syntax error by resynchronizing at a known grammar point. The host program then sees a list of all errors, not just the first.
error is a special nonterminal that matches "any stretch of input we couldn't parse." You add productions that use error at points where recovery makes sense, typically statement boundaries or block delimiters.
For an SQL-like grammar:
When a syntax error occurs inside a statement, the parser:

1. Calls syntax_error so you can record the error.
2. Pops the stack until it reaches a state where error can be shifted.
3. Shifts error and continues parsing, skipping input tokens until it finds something that follows error in the recovery production (here, SEMI at the statement level).

After shifting error, the parser refuses to report another syntax error until it has successfully shifted at least three real tokens. This prevents a cascade of spurious errors from a single mistake.
If you want a different threshold, use error_sync N to set it (e.g., error_sync 1 makes every error report immediately, at the cost of lower-quality output on cascading errors).
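The suppression rule can be pictured as a countdown. The sketch below is a simplified model written for this guide, not Lime's generated code: DemoRecovery and its functions are hypothetical, and restarting the countdown on a suppressed error is an assumption of the model.

```c
/* Simplified model of error suppression; not Lime's generated code. */
typedef struct DemoRecovery {
    int sync;       /* tokens that must shift before errors are reported again */
    int remaining;  /* shifts still needed; 0 means errors are reported */
} DemoRecovery;

/* Start with reporting enabled and the given threshold. */
void demo_recovery_init(DemoRecovery *r, int sync)
{
    r->sync = sync;
    r->remaining = 0;
}

/* Call after each real token is shifted successfully. */
void demo_recovery_on_shift(DemoRecovery *r)
{
    if (r->remaining > 0) r->remaining--;
}

/*
 * Call when a syntax error is detected. Returns 1 if the error should be
 * reported, 0 if it is suppressed as a likely cascade. Either way the
 * countdown restarts.
 */
int demo_recovery_on_error(DemoRecovery *r)
{
    int report = (r->remaining == 0);
    r->remaining = r->sync;
    return report;
}
```

With a threshold of 3, an error that arrives one token after the previous one is swallowed. With a threshold of 1, that same error would be reported, which is the trade-off described above.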
If the parser runs out of tokens (or hits EOF) before recovering, it calls parse_failure:
This is your signal to stop trying and return whatever errors you've accumulated.
With this grammar, parsing SELECT * FROM ?garbage; INSERT INTO t VALUES (1); produces both errors (the ?garbage and any follow-on), instead of stopping at the first.
A minimal example of the full pipeline — tokenize, parse, report errors with caret underlines:
Output looks like:
man/lime_grammar.5 — grammar directive reference (including syntax_error, parse_failure, error_sync, locations)