# Shrimp Parser Architecture
This document explains the special cases, tricks, and design decisions in the Shrimp parser and tokenizer.
## Table of Contents
- Token Types and Their Purpose
- External Tokenizer Tricks
- Grammar Special Cases
- Scope Tracking Architecture
- Common Pitfalls
## Token Types and Their Purpose

### Four Token Types from External Tokenizer

The external tokenizer (`src/parser/tokenizer.ts`) emits four different token types based on context:
| Token | Purpose | Example |
|---|---|---|
| `Identifier` | Regular identifiers in expressions, function calls | `echo`; `x` in `x + 1` |
| `AssignableIdentifier` | Identifiers on the LHS of `=` or in function params | `x` in `x = 5`; params in `fn x y:` |
| `Word` | Anything else: paths, URLs, @mentions, #hashtags | `./file.txt`, `@user`, `#tag` |
| `IdentifierBeforeDot` | Identifier that's in scope, followed by `.` | `obj` in `obj.prop` |
### Why We Need Both Identifier Types

**The Problem:** At the start of a statement like `x ...`, the parser doesn't know if it's:

- An assignment: `x = 5` (needs `AssignableIdentifier`)
- A function call: `x hello world` (needs `Identifier`)

**The Solution:** The external tokenizer uses a three-way decision:

- Only `AssignableIdentifier` can shift (e.g., in the `Params` rule) → emit `AssignableIdentifier`
- Only `Identifier` can shift (e.g., in function arguments) → emit `Identifier`
- Both can shift (ambiguous statement start) → peek ahead for `=` to disambiguate

See Identifier vs AssignableIdentifier Disambiguation below for implementation details.
## External Tokenizer Tricks

### 1. Identifier vs AssignableIdentifier Disambiguation

**Location:** `src/parser/tokenizer.ts`, lines 88-118

**The Challenge:** When both `Identifier` and `AssignableIdentifier` are valid (at statement start), how do we choose?

**The Solution:** Three-way branching with lookahead:
```ts
const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)

if (canAssignable && !canRegular) {
  // Only AssignableIdentifier valid (e.g., in Params)
  input.acceptToken(AssignableIdentifier)
} else if (canRegular && !canAssignable) {
  // Only Identifier valid (e.g., in function args)
  input.acceptToken(Identifier)
} else {
  // BOTH possible - peek ahead for '='
  // Skip whitespace, check if next char is '='
  const nextCh = getFullCodePoint(input, peekPos)
  if (nextCh === 61 /* = */) {
    input.acceptToken(AssignableIdentifier) // It's an assignment
  } else {
    input.acceptToken(Identifier) // It's a function call
  }
}
```
**Key Insight:** `stack.canShift()` returns true for BOTH token types when the grammar has multiple valid paths. We can't just use `canShift()` alone - we need lookahead.

**Why This Works:**

- `fn x y: ...` → in the `Params` rule, only `AssignableIdentifier` can shift → no lookahead needed
- `echo hello` → both can shift, but no `=` ahead → emits `Identifier` → parses as `FunctionCall`
- `x = 5` → both can shift, finds `=` ahead → emits `AssignableIdentifier` → parses as `Assign`
### 2. Surrogate Pair Handling for Emoji

**Location:** `src/parser/tokenizer.ts`, lines 71-84, the `getFullCodePoint()` function

**The Problem:** JavaScript strings use UTF-16, but emoji like 🍤 have code points outside the BMP (Basic Multilingual Plane), so they are encoded as surrogate pairs.

**The Solution:** When reading characters, check for high surrogates (0xD800-0xDBFF) and combine them with low surrogates (0xDC00-0xDFFF):
```ts
const getFullCodePoint = (input: InputStream, pos: number): number => {
  const ch = input.peek(pos)
  // Check if this is a high surrogate (0xD800-0xDBFF)
  if (ch >= 0xd800 && ch <= 0xdbff) {
    const low = input.peek(pos + 1)
    // Check if the next unit is a low surrogate (0xDC00-0xDFFF)
    if (low >= 0xdc00 && low <= 0xdfff) {
      // Combine the surrogate pair into a full code point
      return 0x10000 + ((ch & 0x3ff) << 10) + (low & 0x3ff)
    }
  }
  return ch
}
```
**Why This Matters:** Without this, the 🍤 in `shrimp-🍤` would be read as two unrelated UTF-16 code units (a high and a low surrogate) rather than a single code point, so character classification would break in the middle of the emoji.
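The mismatch is easy to demonstrate with plain string methods (standard JavaScript behavior, independent of the parser):

```ts
const s = 'shrimp-🍤'

console.log(s.length)         // 9 - UTF-16 code units: 🍤 counts as two
console.log([...s].length)    // 8 - code points: iteration joins surrogate pairs
console.log(s.charCodeAt(7))  // 55358 (0xD83E) - just the high surrogate
console.log(s.codePointAt(7)) // 129424 (0x1F990) - the full 🍤 code point
```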
### 3. Context-Aware Termination for Semicolon and Colon

**Location:** `src/parser/tokenizer.ts`, lines 51-57

**The Problem:** How do we parse `basename ./cool;` vs `basename ./cool;2`?

**The Solution:** Only treat `;` and `:` as terminators when they're followed by whitespace (or EOF):
```ts
if (canBeWord && (ch === 59 /* ; */ || ch === 58 /* : */)) {
  const nextCh = getFullCodePoint(input, pos + 1)
  if (!isWordChar(nextCh)) break // It's a terminator
  // Otherwise, continue consuming it as part of the Word
}
```
**Examples:**

- `basename ./cool;` → `;` is followed by EOF → terminates the word at `./cool`
- `basename ./cool;2` → `;` is followed by `2` → included in the word as `./cool;2`
- `basename ./cool; 2` → `;` is followed by a space → terminates at `./cool`; `2` is the next arg
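The snippet above leans on `isWordChar`, whose definition lives in `src/parser/tokenizer.ts` and isn't reproduced in this document. A minimal sketch of the behavior the terminator logic relies on might look like:

```ts
// Hypothetical sketch - the real predicate may admit a different character set.
// Lezer's InputStream.peek() returns -1 at EOF, so EOF fails this check just
// like whitespace does - exactly the "whitespace or EOF" rule described above.
const isWordChar = (ch: number): boolean => ch > 32 /* not space/control/EOF */
```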
### 4. Scope-Aware Property Access (DotGet)

**Location:** `src/parser/tokenizer.ts`, lines 19-48

**The Problem:** How do we distinguish `obj.prop` (property access) from `readme.txt` (filename)?

**The Solution:** When we see a `.` after an identifier, check whether that identifier is in scope:
```ts
if (ch === 46 /* . */ && isValidIdentifier) {
  // Build the identifier text (surrogate-pair aware)
  let identifierText = '...'

  const scopeContext = stack.context as ScopeContext | undefined
  const scope = scopeContext?.scope
  if (scope?.has(identifierText)) {
    // In scope - stop here and emit IdentifierBeforeDot;
    // the grammar will parse this as DotGet
    input.acceptToken(IdentifierBeforeDot)
    return
  }
  // Not in scope - continue consuming as a Word,
  // which will parse as Word("readme.txt")
}
```
**Examples:**

- `config = {path: "..."}; config.path` → `config` is in scope → parses as `DotGet(IdentifierBeforeDot, Identifier)`
- `cat readme.txt` → `readme` is not in scope → parses as `Word("readme.txt")`
## Grammar Special Cases

### 1. expressionWithoutIdentifier Pattern

**Location:** `src/parser/shrimp.grammar`, lines 200-210

**The Problem:** A GLR conflict in the `consumeToTerminator` rule:
```
consumeToTerminator {
  ambiguousFunctionCall | // → FunctionCallOrIdentifier → Identifier
  expression              // → Identifier
}
```
When parsing `my-var` at statement level, both paths want the same `Identifier` token, causing a conflict.

**The Solution:** Remove `Identifier` from the expression path by creating `expressionWithoutIdentifier`:
```
expression {
  expressionWithoutIdentifier | DotGet | Identifier
}

expressionWithoutIdentifier {
  ParenExpr | Word | String | Number | Boolean | Regex | Null
}
```
Then use `expressionWithoutIdentifier` in places where we don't want bare identifiers:
```
consumeToTerminator {
  PipeExpr |
  ambiguousFunctionCall |     // ← Handles standalone identifiers
  DotGet |
  IfExpr |
  FunctionDef |
  Assign |
  BinOp |
  expressionWithoutIdentifier // ← No bare Identifier here
}
```
**Why This Works:** Now standalone identifiers MUST go through `ambiguousFunctionCall`, which is semantically what we want (they're either function calls or variable references).
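As a concrete illustration, a bare identifier statement should now have a single derivation. A hedged sketch in the style of the tests described later (node names are taken from the rule comments above and may differ from the real tree):

```ts
test('standalone identifier takes the ambiguousFunctionCall path', () => {
  expect('my-var').toMatchTree(`
    FunctionCallOrIdentifier
      Identifier my-var
  `)
})
```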
### 2. @skip {} Wrapper for DotGet

**Location:** `src/parser/shrimp.grammar`, lines 176-183

**The Problem:** `DotGet` needs to be whitespace-sensitive (no spaces allowed around `.`), but the global `@skip { space }` would remove them.

**The Solution:** Use an empty `@skip {}` wrapper to disable automatic whitespace skipping:
```
@skip {} {
  DotGet {
    IdentifierBeforeDot "." Identifier
  }
  String { "'" stringContent* "'" }
}
```
**Why This Matters:**

- `obj.prop` → parses as `DotGet` ✓
- `obj. prop` → not a `DotGet` (error); with global skipping the space would be discarded and this would wrongly match
- `obj .prop` → not a `DotGet` (error), for the same reason
### 3. EOF Handling in the item Rule

**Location:** `src/parser/shrimp.grammar`, lines 54-58

**The Problem:** How do we handle empty lines and end-of-file without infinite loops?

**The Solution:** Use alternatives instead of repetition for EOF:
```
item {
  consumeToTerminator newlineOrSemicolon | // Statement with newline/semicolon
  consumeToTerminator eof |                // Statement at end of file
  newlineOrSemicolon                       // Allow blank lines
}
```
**Why not just `item { (statement | newlineOrSemicolon)+ eof? }`?**

That would match EOF multiple times (once after each statement), causing parser errors. By making EOF part of an alternative, it's only matched once per item.
### 4. Params Uses AssignableIdentifier

**Location:** `src/parser/shrimp.grammar`, lines 153-155
```
Params {
  AssignableIdentifier*
}
```
**Why This Matters:** Function parameters are in "assignable" positions - they're being bound to values when the function is called. Using `AssignableIdentifier` here:

- Makes the grammar explicit about which identifiers create bindings
- Enables the tokenizer to use `canShift(AssignableIdentifier)` to detect param context
- Allows the scope tracker to capture only `AssignableIdentifier` tokens
### 5. String Interpolation Inside @skip

**Location:** `src/parser/shrimp.grammar`, lines 181-198

**The Problem:** String contents need to preserve whitespace, but string interpolation (`$identifier`) needs to use the external tokenizer.

**The Solution:** Put `String` inside `@skip {}` and use the external tokenizer for `Identifier` within interpolation:
```
@skip {} {
  String { "'" stringContent* "'" }
}

stringContent {
  StringFragment | // Matches literal text (preserves spaces)
  Interpolation |  // $identifier or $(expr)
  EscapeSeq        // \$, \n, etc.
}

Interpolation {
  "$" Identifier | // Uses the external tokenizer!
  "$" ParenExpr
}
```
**Key Insight:** External tokenizers work inside `@skip {}` blocks! The tokenizer gets called even when skipping is disabled.
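Putting the pieces together, an interpolated string could parse roughly like this - a sketch using the `toMatchTree` helper from the testing section (fragment boundaries and exact node nesting may differ in the real tree):

```ts
test('interpolation inside a single-quoted string', () => {
  expect("'hi $name'").toMatchTree(`
    String
      StringFragment hi
      Interpolation
        Identifier name
  `)
})
```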
## Scope Tracking Architecture

### Overview

Scope tracking uses Lezer's `@context` feature to maintain a scope chain during parsing. This enables:

- Distinguishing `obj.prop` (property access) from `readme.txt` (filename)
- Tracking which variables are in scope at each position in the parse tree
### Architecture: Scope vs ScopeContext

**Two-Class Design:**
```ts
// Pure, hashable scope - only variable tracking
class Scope {
  constructor(
    public parent: Scope | null,
    public vars: Set<string>
  ) {}

  has(name: string): boolean
  add(...names: string[]): Scope
  push(): Scope   // Create child scope
  pop(): Scope    // Return to parent
  hash(): number  // For incremental parsing
}

// Wrapper with temporary state
export class ScopeContext {
  constructor(
    public scope: Scope,
    public pendingIds: string[] = []
  ) {}
}
```
**Why This Separation?**

- **Scope is pure and hashable** - it only contains committed variable bindings, no temporary state
- **ScopeContext holds the temporary state** - the `pendingIds` array captures identifiers during parsing but isn't part of the hash
- **The hash function only hashes Scope** - incremental parsing only cares about the actual scope, not pending identifiers
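The method bodies are elided in the skeleton above. A minimal sketch of how they could be implemented, assuming a persistent (copy-on-write) design and a simple order-independent string hash - the real implementation may differ:

```ts
class Scope {
  constructor(
    public parent: Scope | null,
    public vars: Set<string>
  ) {}

  // Walk the scope chain from innermost to outermost
  has(name: string): boolean {
    return this.vars.has(name) || (this.parent?.has(name) ?? false)
  }

  // Persistent update: return a new Scope rather than mutating,
  // so previously computed hashes stay valid
  add(...names: string[]): Scope {
    return new Scope(this.parent, new Set([...this.vars, ...names]))
  }

  push(): Scope {
    return new Scope(this, new Set())
  }

  pop(): Scope {
    return this.parent ?? this
  }

  // XOR-combine per-name hashes with the parent's hash so the result
  // doesn't depend on the Set's iteration order
  hash(): number {
    let h = this.parent ? this.parent.hash() : 0
    for (const name of this.vars) {
      let nh = 0
      for (let i = 0; i < name.length; i++) {
        nh = ((nh << 5) - nh + name.charCodeAt(i)) | 0
      }
      h = (h ^ nh) | 0
    }
    return h
  }
}
```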
### How Scope Tracking Works

**1. Capture Phase (shift):**

When the parser shifts an `AssignableIdentifier` token, the scope tracker captures its text:
```ts
shift(context, term, stack, input) {
  if (term === terms.AssignableIdentifier) {
    // Build the text by peeking at the input
    let text = '...' // (read from input.pos to stack.pos)
    return new ScopeContext(
      context.scope,
      [...context.pendingIds, text] // Append to pending
    )
  }
  return context
}
```
**2. Commit Phase (reduce):**

When the parser reduces to `Assign` or `Params`, the scope tracker commits pending identifiers:
```ts
reduce(context, term, stack, input) {
  // Assignment: pop the last identifier and add it to the scope
  if (term === terms.Assign && context.pendingIds.length > 0) {
    const varName = context.pendingIds[context.pendingIds.length - 1]!
    return new ScopeContext(
      context.scope.add(varName),      // Add to scope
      context.pendingIds.slice(0, -1)  // Remove from pending
    )
  }

  // Function params: add all identifiers, push a new scope
  if (term === terms.Params) {
    const newScope = context.scope.push()
    return new ScopeContext(
      context.pendingIds.length > 0
        ? newScope.add(...context.pendingIds)
        : newScope,
      [] // Clear pending
    )
  }

  // Function exit: pop the scope
  if (term === terms.FunctionDef) {
    return new ScopeContext(context.scope.pop(), [])
  }

  return context
}
```
**3. Usage in the Tokenizer:**

The tokenizer reads the scope to check whether identifiers are bound:
```ts
const scopeContext = stack.context as ScopeContext | undefined
const scope = scopeContext?.scope
if (scope?.has(identifierText)) {
  // Identifier is in scope - can be used in a DotGet
  input.acceptToken(IdentifierBeforeDot)
}
```
### Why Only Track AssignableIdentifier?

**Before (complex):**

- Tracked ALL identifiers with `term === terms.Identifier`
- Used an `isInParams` flag to know which ones to keep
- Had to manually clear "stale" identifiers after `DotGet`, `FunctionCall`, etc.

**After (simple):**

- Only track `AssignableIdentifier` tokens
- These only appear in `Params` and `Assign` (by grammar design)
- No stale identifiers - they're consumed immediately
**Example:**

```
fn x y: echo x end
```

Scope tracking:

- Shift `AssignableIdentifier("x")` → pending = `["x"]`
- Shift `AssignableIdentifier("y")` → pending = `["x", "y"]`
- Reduce `Params` → scope = `{x, y}`, pending = `[]`
- Shift `Identifier("echo")` → not captured (not an `AssignableIdentifier`)
- Shift `Identifier("x")` → not captured
- Reduce `FunctionDef` → pop the scope

No stale-identifier clearing needed!
## Common Pitfalls

### 1. Forgetting Surrogate Pairs

**Problem:** Using `input.peek(i)` directly gives UTF-16 code units, not Unicode code points.

**Solution:** Always use `getFullCodePoint(input, pos)` when working with emoji.

**Example:**
```ts
// ❌ Wrong - breaks on emoji
const ch = input.peek(pos)
if (isEmoji(ch)) { ... }

// ✓ Right - handles surrogate pairs
const ch = getFullCodePoint(input, pos)
if (isEmoji(ch)) { ... }
pos += getCharSize(ch) // Advance by 1 or 2 code units
```
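`getCharSize` isn't shown in this document; given its call site, it presumably maps a code point to the number of UTF-16 code units it occupies. A sketch under that assumption:

```ts
// Hypothetical sketch: astral code points (above 0xFFFF) are stored as a
// surrogate pair (two code units); everything in the BMP takes one.
const getCharSize = (codePoint: number): number => (codePoint > 0xffff ? 2 : 1)
```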
### 2. Adding Pending State to the Hash

**Problem:** Including `pendingIds` or `isInParams` in the hash function breaks incremental parsing.

**Why?** The hash is used to decide whether a cached parse-tree node can be reused. If the hash includes temporary state that doesn't affect parsing decisions, nodes will be invalidated unnecessarily.

**Solution:** Only hash the `Scope` (vars + parent chain), not the `ScopeContext` wrapper.
```ts
// ✓ Right
const hashScope = (context: ScopeContext): number => {
  return context.scope.hash() // Only hash the committed scope
}

// ❌ Wrong
const hashScope = (context: ScopeContext): number => {
  let h = context.scope.hash()
  h = (h << 5) - h + context.pendingIds.length // Don't do this!
  return h
}
```
### 3. Using canShift() Alone for Disambiguation

**Problem:** `stack.canShift(AssignableIdentifier)` returns true when BOTH paths are possible (e.g., at statement start).

**Why?** The GLR parser maintains multiple parse states. If any state can shift the token, `canShift()` returns true.

**Solution:** Check BOTH token types and use lookahead when both are possible:
```ts
const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)

if (canAssignable && canRegular) {
  // Both possible - need lookahead
  const hasEquals = peekForEquals(input, pos)
  input.acceptToken(hasEquals ? AssignableIdentifier : Identifier)
}
```
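`peekForEquals` here is shorthand for the lookahead described in trick #1. A sketch of what it stands for (treating only spaces and tabs as skippable is an assumption):

```ts
import type { InputStream } from '@lezer/lr'

// Hypothetical helper: skip horizontal whitespace without consuming it,
// then report whether the next character is '='.
const peekForEquals = (input: InputStream, pos: number): boolean => {
  let p = pos
  let ch = input.peek(p)
  while (ch === 32 /* space */ || ch === 9 /* tab */) {
    ch = input.peek(++p)
  }
  return ch === 61 /* = */
}
```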
### 4. Clearing Pending Identifiers Too Eagerly

**Problem:** In the old code, we had to clear pending identifiers after `DotGet`, `FunctionCall`, etc. to prevent state leakage. This was fragile and easy to forget.

**Why This Happened:** We were tracking ALL identifiers, not just assignable ones.

**Solution:** Only track `AssignableIdentifier` tokens. They only appear in contexts where they'll be consumed (`Params`, `Assign`), so no clearing is needed.
### 5. Line Number Confusion in the Edit Tool

**Problem:** The Edit tool displays line numbers with a prefix (like `5→`), and it's easy to mistake the prefix for file content.

**How to Read:**

- The number before the `→` is the actual line number; the `N→` prefix itself is display-only and not part of the file
- Use that number when referencing code in comments or documentation
- Example: `5→export const foo` means `export const foo` is on line 5
## Testing Strategy

### Parser Tests

Use the `toMatchTree` helper to verify parse-tree structure:
```ts
test('assignment with AssignableIdentifier', () => {
  expect('x = 5').toMatchTree(`
    Assign
      AssignableIdentifier x
      operator =
      Number 5
  `)
})
```
**Key Testing Patterns:**

- Test both token-type expectations (`Identifier` vs `AssignableIdentifier`)
- Test scope-aware features (`DotGet` for in-scope vs `Word` for out-of-scope identifiers) - see the sketch below
- Test edge cases (empty lines, EOF, surrogate pairs)
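A hedged sketch of such a scope-aware test pair, reusing the `toMatchTree` style from above (node names follow earlier sections; the real tree shapes may differ):

```ts
test('DotGet only for in-scope identifiers', () => {
  // `obj` was assigned, so `obj.prop` is property access
  expect('obj = 1; obj.prop').toMatchTree(`
    Assign
      AssignableIdentifier obj
      operator =
      Number 1
    DotGet
      IdentifierBeforeDot obj
      Identifier prop
  `)

  // `readme` was never bound, so `readme.txt` stays a Word
  expect('cat readme.txt').toMatchTree(`
    FunctionCall
      Identifier cat
      Word readme.txt
  `)
})
```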
### Debugging Parser Issues

- **Check token types:** Run the parser on the input and examine the tree structure
- **Test canShift():** Add logging to the tokenizer to see what `canShift()` returns
- **Verify scope state:** Log scope contents during parsing
- **Use GLR visualization:** Lezer has tools for visualizing parse states
## Further Reading

- [Lezer System Guide](https://lezer.codemirror.net/docs/guide/)
- [Lezer API Reference](https://lezer.codemirror.net/docs/ref/)
- CLAUDE.md - general project guidance
- Scope tracker source
- Tokenizer source (`src/parser/tokenizer.ts`)