This document provides a comprehensive analysis of the performance impact of the extensible grammar framework in the Lime parser generator. It covers overhead breakdowns by component, memory costs, CPU costs per token, scaling behavior with increasing extension counts, JIT interaction, and tuning recommendations.

All measurements referenced in this document were collected on a Linux x86_64 system (GCC 14.3, C11, -O2) using the benchmark suites in bench/. Numbers are representative; your hardware will vary.

Architecture Overview

The extension framework adds four major subsystems to the parsing pipeline:

Grammar File (.y)
      |
      v
+------------------+       +---------------------+
| Extension        | ----> | Conflict            |
| Registry         |       | Detector            |
+------------------+       +---------------------+
      |                           |
      v                           v
+------------------+       +---------------------+
| Snapshot         | <---- | Disambiguation      |
| (Action Tables)  |       | Framework           |
+------------------+       +---------------------+
      |                           |
      v                           v
+------------------+       +---------------------+
| Parser Engine    |       | Parser Forking      |
| (Table-Driven    |       | (State Cloning)     |
|  or JIT)         |       +---------------------+
+------------------+

Each subsystem introduces overhead that is only incurred when extensions are actually loaded and when conflicts are actually encountered. The key design principle is zero overhead when no extensions are loaded: the parser operates identically to a standard Lemon/Lime-generated parser.

Component Overhead Breakdown

Extension Registry

The ExtensionRegistry (src/extension_registry.c) manages extension metadata, dependency resolution, and conflict declarations.

Operation	Complexity	Typical Cost	When Incurred
`extension_registry_create`	O(1)	~200 ns	Once at startup
`extension_registry_register`	O(1) amortized	~500 ns	Once per extension load
`extension_registry_find`	O(1)	~50 ns	Per conflict detection query
`extension_registry_check_dependencies`	O(V + E)	~2-50 us	Once after all extensions registered
`extension_registry_get_order`	O(V + E)	~5-60 us	Once to determine load order
`extension_registry_unregister`	O(1) amortized	~300 ns	On extension unload
`extension_registry_destroy`	O(N)	~100 ns per ext	Once at shutdown

Implementation details:

Hash table: FNV-1a hashing with open addressing and linear probing. Load factor capped at 75%, rehash doubles the table. Lookups are O(1) expected, O(N) worst case (pathological hash collisions).
Dependency resolution: Kahn's algorithm (topological sort) with adjacency-list representation. Cost is O(V + E) where V = number of extensions and E = total number of dependency edges. For typical workloads (5-20 extensions, 1-3 deps each), this completes in microseconds.
String copies: All metadata strings are deep-copied on register. This means registration has a higher constant cost than a simple pointer store, but avoids lifetime management issues.

Key insight: Registry operations are almost entirely startup costs. During parsing, the registry is read-only (protected by pthread_rwlock read lock), so multiple parser threads proceed concurrently with no contention.

Conflict Detection

The conflict_detector module (src/conflict_detector.c) identifies ambiguities at three levels: token, rule, and semantic.

Detection Phase	Complexity	Typical Cost	Notes
Token conflicts	O(N * M)	~5-50 us	N=extensions, M=tokens/ext
Rule conflicts	O(N * M * T)	~10-100 us	T=distinct token codes
Semantic conflicts	O(N^2 * R^2)	~20-500 us	R=rules/ext; pairwise compare
Full scan	Sum of above	~50-650 us	Called once after load

Token-level detection uses a TokenCollector with linear scan for grouping. For typical extension counts (< 20), this is fast despite O(N*M) complexity.

Rule-level detection requires scanning all loaded extensions for each distinct token code, checking whether any extension defines rules that reference the token. Cost grows linearly with the number of distinct tokens across all extensions.

Semantic-level detection is the most expensive: it performs pairwise comparison of all MOD_ADD_RULE modifications across loaded extensions, checking for identical LHS+RHS with different action code. This is O(N^2 * R^2) in the worst case but is bounded by the FORK_RESOLVE_MAX_FORKS limit (16) in practice.

Key insight: Conflict detection is a one-time cost incurred at extension load time. It does not run on the per-token parsing hot path.

Disambiguation Framework

The DisambiguationContext (src/disambiguation.c) dispatches conflict resolution to a pluggable strategy via vtable.

Operation	Cost	Notes
`disambiguation_create`	~1-5 us	Creates strategy context
`disambiguation_resolve`	~100-5000 ns	Depends on strategy (see below)
`disambiguation_update`	~10-50 ns	No-op for priority/fork-resolve
`disambiguation_destroy`	~50-200 ns	Frees strategy context

The framework overhead is minimal: it is essentially a vtable dispatch plus a malloc for the StrategyResult.

Parser Forking

The ParseFork system (src/parser_fork.c) provides deep-copy of parser state for the fork-resolve disambiguation strategy.

Operation	Typical Cost	Notes
`fork_parser` (full clone)	~1-10 us	Depends on stack depth
`clone_parser_state` (core)	~0.5-5 us	malloc + memcpy of parser+stack
`feed_token_to_fork`	~20-50 ns	Per token per fork
`parse_fork_set_create`	~100 ns	Allocates fork pointer array
`parse_fork_set_prune`	~50-200 ns	Linear scan + free
`parse_fork_set_best`	~50-100 ns	Linear scan
`free_parse_fork`	~200-500 ns	Frees parser state + stack
`parse_fork_set_destroy`	~1-5 us	Frees all forks
`parser_fork_next_id`	~5-10 ns	Atomic fetch-add (relaxed)

Cloning cost breakdown for a typical parser (2 KB parser struct, 4 KB stack):

Step	Bytes Copied	Estimated Cost
malloc parser struct	-	~50 ns
memcpy parser struct	~2 KB	~100 ns
malloc stack buffer	-	~50 ns
memcpy stack (used portion)	~1-4 KB	~100-400 ns
Pointer fixup (3 writes)	24 B	~10 ns
snapshot_acquire (atomic)	-	~35 ns
Total		~350-650 ns

The clone always produces a heap-allocated stack regardless of whether the source parser uses the inline yystk0[] buffer. This simplifies lifecycle management at the cost of an extra malloc for parsers that are still using the inline stack.

Memory Overhead

Per-Extension Memory

Each registered extension consumes memory in the registry:

Component	Size	Notes
`RegistryEntry` struct	~200 B	Includes deep-copied metadata
Name string (strdup)	~10-50 B	Extension name
Version string (strdup)	~6-20 B	Semver string
`requires` array (strdup each)	~50-200 B	NULL-terminated string array
`conflicts_with` array (strdup)	~50-200 B	NULL-terminated string array
Hash table entry	~24 B	Key pointer + index + bool
Typical total per extension	~350-700 B

Per-Fork Memory

Each parser fork during disambiguation consumes:

Component	Size	Notes
`ParseFork` struct	~128 B	Status, priority, counters
Cloned parser state (`state_data`)	~2-8 KB	Grammar-dependent
Cloned stack (`stack_data`)	~2-32 KB	Depth-dependent
Snapshot reference	0 B (shared)	Only refcount increment
Typical total per fork	~4-40 KB

The fork-resolve strategy creates up to FORK_RESOLVE_MAX_FORKS (16) forks per conflict point, so peak memory for a single conflict resolution is approximately:

16 forks * 40 KB/fork = ~640 KB peak

This memory is freed immediately after the conflict is resolved.

Aggregate Memory Budget

For a deployment with N extensions and F concurrent fork sets:

Component	Formula	Example (10 ext, 1 fork set)
Extension registry	~700 B * N + ~4 KB base	~11 KB
Conflict detection result	~200 B * conflicts	~2 KB
Disambiguation context	~100 B per strategy	~100 B
Fork set (peak)	~40 KB * 16 * F	~640 KB
Snapshot (action tables)	~150-300 KB per snapshot	~300 KB
JIT context (if compiled)	~50-500 KB per snapshot	~200 KB
Total extension overhead		~1.15 MB

Compare this to the base parser memory (no extensions): ~500 KB - 1 MB for a typical deployment with 1 snapshot, 1 JIT context, and 4 parser threads (see docs/PERFORMANCE.md). Extensions roughly double the memory footprint at peak, but the fork memory is transient.

CPU Cost Per Token

Baseline (No Extensions)

When no extensions are loaded, the parser follows the standard table-driven (or JIT-compiled) path with zero extension overhead:

Grammar Size	Interpreted (ns/token)	JIT (ns/token)
Small (64 states, 32 terminals)	~8-12	~3-5
Medium (256 states, 100 terminals)	~12-18	~4-6
Large (512 states, 150 terminals)	~15-25	~4-7

These numbers are derived from the jit_comparison benchmark:

Small grammar: 424 ns / 50 tokens = ~8.5 ns/token (interpreted)
Medium grammar: 1,244 ns / 100 tokens = ~12.4 ns/token (interpreted)
Large grammar: 2,890 ns / 200 tokens = ~14.5 ns/token (interpreted)

With Extensions (No Conflicts)

When extensions are loaded but no conflicts arise during parsing, the overhead comes from:

Snapshot check: The parser uses the current snapshot's action tables. If an extension was loaded, the parser uses a new snapshot with the merged tables. Action table lookups are identical in cost because the table format is unchanged.
Token table lookup: Extension tokens are added to the hash table. Cost remains O(1) per lookup. Adding 10 extension tokens to a 36-keyword table increases average lookup time by < 5%.

Estimated per-token overhead: < 1 ns (effectively zero).

With Extensions (Active Conflicts)

When a token triggers a conflict between multiple extensions, the disambiguation framework is invoked. Per-token overhead depends on the strategy:

Strategy	Per-conflict Cost	Amortized Per-token	Notes
Priority	~100-300 ns	~2-6 ns	Linear scan of contexts
Fork-resolve	~2-50 us	~40-1000 ns	Creates and evaluates forks

The "amortized per-token" assumes conflicts occur on ~5% of tokens in a typical extension-heavy workload.

Conflict Type Comparison

The performance impact varies significantly by conflict type:

Token-Level Conflicts

Cause: Multiple extensions define the same lexeme as different tokens.

Metric	Value	Notes
Detection cost	~5-50 us	Linear scan of all extension tokens
Resolution cost	~100-300 ns	Priority comparison
Memory cost	~200 B	ConflictPoint + LimeContexts
Frequency	Low	Caught at load time

Token conflicts are detected and resolved at extension load time, not during parsing. They have no per-token runtime cost.

Rule-Level Conflicts

Cause: Multiple grammars can parse the same token sequence.

Metric	Value	Notes
Detection cost	~10-100 us	Per distinct token code
Resolution cost	~200 ns - 50 us	Depends on strategy
Memory cost	~500 B	ConflictPoint + contexts
Frequency	Medium	May occur at parse time

Rule conflicts may be detected statically (at load time) or dynamically (when the parser encounters an ambiguous state). Dynamic detection adds per-token overhead only for the specific states where conflicts exist.

Semantic-Level Conflicts

Cause: Identical rule structure but different reduction actions.

Metric	Value	Notes
Detection cost	~20-500 us	Pairwise O(N^2 * R^2) comparison
Resolution cost	~5-50 us	Usually requires fork-resolve
Memory cost	~1 KB	ConflictPoint + fork set
Frequency	Low	Detected at load time

Semantic conflicts are the most expensive to detect but also the rarest. The pairwise comparison algorithm (detect_semantic_conflicts) examines every rule pair across every extension pair. For N=10 extensions with R=20 rules each, this is 10*9/2 * 20*20 = 18,000 comparisons, which completes in ~200-500 us.

Disambiguation Strategy Performance

Priority Strategy

The priority strategy (src/strategy_priority.c) resolves conflicts by comparing numeric priority values.

Metric	Value	Notes
Init cost	~500 ns	Builds priority table
Resolve cost	~100-300 ns	Linear scan of contexts
Update cost	~10 ns	No-op (static strategy)
Destroy cost	~50 ns	Frees priority table
Memory per resolution	~100 B	StrategyResult + explanation
Confidence	Always 1.0	Deterministic

The priority strategy's resolve function performs a linear scan of the ConflictPoint.contexts array (typically 2-8 entries) comparing priority values. The inner loop has no allocations and no hash lookups.

Complexity: O(C) where C is the number of conflicting contexts.

Fork-Resolve Strategy

The fork-resolve strategy (src/strategy_fork_resolve.c) creates parser forks for each candidate interpretation and evaluates them.

Metric	Value	Notes
Init cost	~1-2 us	Builds extension priority map
Resolve cost (static)	~2-10 us	No parser state to clone
Resolve cost (runtime)	~5-50 us	Clones parser state per fork
Update cost	~10 ns	No-op (currently)
Destroy cost	~100 ns	Frees context + entries
Max forks per conflict	16	`FORK_RESOLVE_MAX_FORKS`
Default lookahead	3 tokens	`FORK_RESOLVE_DEFAULT_LOOKAHEAD`
Max lookahead	128 tokens	`FORK_RESOLVE_MAX_LOOKAHEAD`

Static mode (no live parser): The strategy creates lightweight ParseFork structs (just calloc + metadata) and simulates token feeding. Each fork is ~128 bytes. This mode is used during pre-parse conflict analysis.

Runtime mode (live parser available): The strategy calls fork_parser() to deep-clone the parser state for each candidate. Cloning cost is dominated by the memcpy of the parser struct and stack. For a typical parser with a 4 KB stack at depth 20:

Per-fork clone cost:  ~650 ns
Token feeding (3 tokens): ~60-150 ns
Evaluation:           ~100 ns
Total per-fork:       ~810-900 ns
 
For 4 competing forks: ~3.2-3.6 us
For 16 forks (max):   ~13-14.5 us

Tiebreaker overhead is negligible. All three rules (priority, longest-match, first-complete) are O(F) scans where F is the number of completed forks:

Tiebreaker Rule	Cost (16 forks)	Notes
TIEBREAK_PRIORITY	~30-50 ns	Compare priority + error count
TIEBREAK_LONGEST_MATCH	~30-50 ns	Compare tokens_consumed
TIEBREAK_FIRST_COMPLETE	~20-40 ns	Compare fork_id

Strategy Comparison

Aspect	Priority	Fork-Resolve
Resolution cost	~100-300 ns	~2-50 us
Accuracy	Low (static)	High (dynamic)
Memory per resolve	~100 B	~4-640 KB
Deterministic	Yes	Yes
Needs parser state	No	Optional
Learns over time	No	No (future)
Best for	Known priorities	Unknown/dynamic conflicts

JIT Interaction with Extensions

JIT Invalidation on Extension Load

Loading an extension creates a new ParserSnapshot with modified action tables. The JIT context is bound to a specific snapshot, so:

The old snapshot's JIT code remains valid for parsers still using it.
The new snapshot starts with no JIT code.
The JIT policy system begins accumulating metrics for the new snapshot.
After the policy threshold is met, JIT compilation triggers for the new snapshot.

This means there is a warmup period after each extension load during which parsing runs on the interpreted path.

Break-Even Analysis with Extensions

The JIT break-even point depends on the compilation cost and the per-parse savings. Loading extensions affects both:

Scenario	Compile Cost	Per-Parse Savings	Break-Even
Baseline (no extensions)	~10-50 ms	~1-5 us	~5K-50K parses
After 1 extension load	~10-50 ms	~1-5 us	~5K-50K parses
After 5 extension loads	~15-60 ms	~1-5 us	~6K-60K parses
Frequent extension loads	Repeated	Reduced window	May not break even

Key insight: If extensions are loaded/unloaded frequently, the JIT never reaches break-even because each load resets the warmup period. For best JIT utilization:

Load all extensions at startup.
Use synchronous JIT compilation after the final extension load.
Avoid dynamic extension loading in latency-sensitive paths.

JIT Compilation Cost by Grammar Size

Extension loading increases the effective grammar size (more states, more terminals). JIT compilation cost scales with grammar complexity:

Grammar Config	States	Terminals	Compile Time	Code Size
Base grammar	64	32	~5-10 ms	~6-12 KB
+ 1 small extension	80	40	~7-14 ms	~8-16 KB
+ 5 extensions	150	80	~15-30 ms	~15-30 KB
+ 10 extensions	256	100	~25-50 ms	~25-50 KB
Full SQL + 10 ext	512	150	~40-80 ms	~50-100 KB

Compilation cost is approximately linear in nstate * nterminal because the JIT generates one function per state, each with a switch over terminal values.

Scaling Behavior

1 Extension

Metric	Value	Delta from Baseline
Registry overhead	~700 B	+700 B
Snapshot size	~300 KB	~0% (tables merged)
Conflict detection	~5 us	One-time
Per-token overhead (no conflict)	< 1 ns	~0%
JIT recompilation	~10-50 ms	One-time

5 Extensions

Metric	Value	Delta from Baseline
Registry overhead	~3.5 KB	+3.5 KB
Snapshot size	~310-350 KB	+3-15%
Conflict detection (full)	~50-200 us	One-time
Dependency resolution	~5-15 us	One-time
Potential token conflicts	0-3	Depends on overlap
Per-token overhead (no conflict)	< 1 ns	~0%
Per-conflict resolution	~200 ns - 10 us	Strategy-dependent

10 Extensions

Metric	Value	Delta from Baseline
Registry overhead	~7 KB	+7 KB
Snapshot size	~320-400 KB	+7-30%
Conflict detection (full)	~100-500 us	One-time
Dependency resolution	~10-40 us	One-time
Potential token conflicts	0-10	Higher overlap
Semantic conflict detection	~100-300 us	O(N^2) pairwise
Per-token overhead (no conflict)	< 1 ns	~0%
Per-conflict resolution	~200 ns - 50 us	Strategy-dependent

50 Extensions

Metric	Value	Delta from Baseline
Registry overhead	~35 KB	+35 KB
Snapshot size	~400-600 KB	+30-100%
Conflict detection (full)	~1-5 ms	One-time
Dependency resolution	~50-200 us	One-time
Semantic conflict detection	~2-10 ms	O(N^2) dominates
Hash table (registry)	~128 buckets	Auto-resized
Per-token overhead (no conflict)	< 2 ns	~0%
Per-conflict resolution	~500 ns - 100 us	More contexts

At 50 extensions, the dominant costs are:

Semantic conflict detection: O(N^2) pairwise comparison becomes significant. With 50 extensions, there are 1,225 pairs to compare. If each extension has 20 rules, this is up to 490,000 rule-pair comparisons.
Fork-resolve with many contexts: A conflict involving 16+ extensions requires the maximum 16 forks, each cloning the parser state.
Hash table resizing: The registry hash table auto-resizes to maintain < 75% load factor. At 50 extensions, the table has at least 128 buckets (1.5 KB).

Scaling Summary

Extensions	Registry	Conflict Detect	Semantic Detect	Snapshot	Per-token
0	0	0	0	~296 KB	0
1	~700 B	~5 us	N/A	~300 KB	< 1 ns
5	~3.5 KB	~50-200 us	~20-50 us	~330 KB	< 1 ns
10	~7 KB	~100-500 us	~100-300 us	~370 KB	< 1 ns
50	~35 KB	~1-5 ms	~2-10 ms	~500 KB	< 2 ns

Observation: Extension overhead scales sub-linearly for most operations. The exception is semantic conflict detection, which is O(N^2) and should be profiled when using > 20 extensions.

Tiebreaker Rule Overhead

The fork-resolve strategy supports three tiebreaker rules. Their performance is nearly identical because all are simple linear scans:

Rule	Algorithm	Cost (16 forks)
`TIEBREAK_PRIORITY`	Scan for min priority, sub-break by error count then tokens	~40 ns
`TIEBREAK_LONGEST_MATCH`	Scan for max tokens_consumed, sub-break by priority	~40 ns
`TIEBREAK_FIRST_COMPLETE`	Scan for min fork_id	~30 ns

The tiebreaker is invoked once per conflict resolution after fork evaluation completes. It operates on the completed-forks subset (typically 1-4 forks survive out of the initial set).

Performance Tuning Recommendations

Extension Loading

Load all extensions at startup. This amortizes conflict detection and JIT compilation over the entire process lifetime rather than disrupting the hot parsing path.
Order extensions by dependency. Use extension_registry_get_order() to determine the correct load order. Loading in dependency order avoids redundant conflict re-evaluation.
Avoid frequent load/unload cycles. Each load creates a new snapshot, invalidates JIT, and triggers conflict detection.

Disambiguation Strategy Selection

Use priority strategy for known conflicts. If you know which extension should win (e.g., your custom extension overrides a default), set explicit priorities and use STRAT_PRIORITY. This is 100-1000x faster than fork-resolve.
Reserve fork-resolve for truly ambiguous cases. Fork-resolve is the right choice when you cannot determine the correct interpretation without actually trying to parse with each grammar.
Limit fork count for large extension sets. The default FORK_RESOLVE_MAX_FORKS of 16 is reasonable for most workloads. Reducing it to 4-8 significantly reduces peak memory and resolution time with minimal accuracy loss.

JIT Integration

Use synchronous JIT after final extension load. Call the JIT compilation function directly on the final snapshot to ensure JIT is active before the first query.
Lower JIT thresholds for extension-heavy grammars. Larger grammars (more states from extensions) benefit more from JIT because the table-lookup overhead is proportionally higher.
Monitor JIT metrics per snapshot. Each new snapshot starts with fresh metrics. If extensions are loaded infrequently, the JIT will eventually trigger. If loaded frequently, consider forced compilation.

Conflict Reduction

Use conflicts_with declarations. Declaring known conflicts in extension metadata allows the registry to reject incompatible combinations at registration time, before expensive conflict detection runs.
Set conflict_threshold appropriately. Extensions with conflict_threshold = 0.0 reject any conflict; 1.0 tolerates all. Tuning this per-extension avoids unnecessary fork-resolve invocations.
Minimize overlapping token definitions. Token-level conflicts are the most common and easiest to avoid. Use unique token names across extensions, or define shared tokens in a base extension.

Memory Management

Release old snapshots promptly. After an extension load, parsers that finish with the old snapshot should release it to free the old action tables.
Pin snapshots during batch processing. Acquire a snapshot reference at the start of a batch and release it at the end, rather than per-query, to reduce atomic refcount traffic.

Benchmark Reproduction

Running the Parser Benchmark Suite

# Build
meson setup builddir
ninja -C builddir
 
# General parser benchmarks (CSV output)
./builddir/bench/parser_bench 10000 200 > results.csv 2> diag.txt
 
# JIT vs Interpreted comparison (with statistical analysis)
./builddir/bench/jit_comparison

Measuring Extension Overhead

Extension overhead can be measured by comparing baseline benchmark results (no extensions) against results with extensions loaded:

# Baseline (no extensions)
./builddir/bench/parser_bench 10000 200 > baseline.csv
 
# With extensions (modify the benchmark to load extensions before parsing)
# See bench/parser_bench.c for the benchmark framework

Key Metrics to Compare

Metric	How to Measure
Per-token parse time	`ns_per_op / tokens_per_parse`
Conflict detection time	Instrument `detect_all_multi_grammar_conflicts()`
Fork-resolve time	Instrument `fork_resolve_resolve()`
Snapshot clone time	Instrument `clone_parser_state()`
JIT break-even	Compare total interpreted vs JIT time

Memory Profiling

# Build with AddressSanitizer
meson setup builddir-asan -Db_sanitize=address
ninja -C builddir-asan
 
# Run tests (checks for leaks)
meson test -C builddir-asan
 
# Valgrind for detailed heap profiling
valgrind --tool=massif ./builddir/bench/parser_bench 1000
ms_print massif.out.* | head -100

Appendix: Data Structure Sizes

Approximate sizes of key structures (x86_64, 64-bit pointers):

Structure	Size	Notes
`ExtensionRegistry`	~32 B	Plus dynamic entries + hash
`RegistryEntry`	~200 B	Includes `GrammarExtensionMetadata`
`GrammarExtensionMetadata`	~120 B	Pointers to strings + arrays
`HashEntry`	~24 B	Key pointer + index + bool
`ConflictPoint`	~48 B	Plus dynamic contexts array
`LimeContext`	~32 B	Token, state, ext_id, priority
`MultiGrammarConflictResult`	~32 B	Plus dynamic points array
`DisambiguationContext`	~200 B	vtable + strategy context
`StrategyResult`	~32 B	Plus malloc'd arrays
`ForkResolveContext`	~80 B	Plus priority entries
`PriorityContext`	~24 B	Plus priority entries
`ParseFork`	~128 B	Plus cloned state + stack
`ClonedParserState`	~48 B	Plus malloc'd state + stack
`ParseForkSet`	~24 B	Plus fork pointer array
`ParserSnapshot`	~128 B	Plus action tables

Last updated: 2026-04-29 Reference implementation: src/extension_registry.c, src/strategy_fork_resolve.c, src/parser_fork.c, src/conflict_detector.c, src/disambiguation.c, src/strategy_priority.c

Table of Contents