Shrimp Parser Architecture

This document explains the special cases, tricks, and design decisions in the Shrimp parser and tokenizer.

Table of Contents

  1. Token Types and Their Purpose
  2. External Tokenizer Tricks
  3. Grammar Special Cases
  4. Scope Tracking Architecture
  5. Common Pitfalls
  6. Testing Strategy
  7. Further Reading

Token Types and Their Purpose

Four Token Types from External Tokenizer

The external tokenizer (src/parser/tokenizer.ts) emits four different token types based on context:

Token                  Purpose                                             Example
Identifier             Regular identifiers in expressions, function calls  echo, x in x + 1
AssignableIdentifier   Identifiers on LHS of = or in function params       x in x = 5, params in fn x y:
Word                   Anything else: paths, URLs, @mentions, #hashtags    ./file.txt, @user, #tag
IdentifierBeforeDot    Identifier that's in scope, followed by .           obj in obj.prop
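
For orientation, all four token types are matched by a single contextual external tokenizer. A minimal wiring sketch (the terms import path is an assumption; the real classification logic is covered in the sections below):

import { ExternalTokenizer } from '@lezer/lr'
// Term IDs are generated from the grammar; this file name is an assumption
import { Identifier, AssignableIdentifier, Word, IdentifierBeforeDot } from './shrimp.grammar.terms'

// One contextual tokenizer decides among the four token types.
// `contextual: true` lets the decision depend on the parse stack (canShift, scope).
export const tokens = new ExternalTokenizer((input, stack) => {
  // ...classification logic described in the sections below...
}, { contextual: true })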

Why We Need Both Identifier Types

The Problem: At the start of a statement like x ..., the parser doesn't know if it's:

  • An assignment: x = 5 (needs AssignableIdentifier)
  • A function call: x hello world (needs Identifier)

The Solution: The external tokenizer uses a three-way decision:

  1. Only AssignableIdentifier can shift (e.g., in Params rule) → emit AssignableIdentifier
  2. Only Identifier can shift (e.g., in function arguments) → emit Identifier
  3. Both can shift (ambiguous statement start) → peek ahead for = to disambiguate

See Identifier vs AssignableIdentifier Disambiguation below for implementation details.


External Tokenizer Tricks

1. Identifier vs AssignableIdentifier Disambiguation

Location: src/parser/tokenizer.ts lines 88-118

The Challenge: When both Identifier and AssignableIdentifier are valid (at statement start), how do we choose?

The Solution: Three-way branching with lookahead:

const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)

if (canAssignable && !canRegular) {
  // Only AssignableIdentifier valid (e.g., in Params)
  input.acceptToken(AssignableIdentifier)
} else if (canRegular && !canAssignable) {
  // Only Identifier valid (e.g., in function args)
  input.acceptToken(Identifier)
} else {
  // BOTH possible - peek ahead for '='
  // Skip whitespace after the identifier (pos = offset just past it)
  let peekPos = pos
  while (input.peek(peekPos) === 32 /* space */) peekPos++
  const nextCh = getFullCodePoint(input, peekPos)
  if (nextCh === 61 /* = */) {
    input.acceptToken(AssignableIdentifier)  // It's an assignment
  } else {
    input.acceptToken(Identifier)  // It's a function call
  }
}

Key Insight: stack.canShift() returns true for BOTH token types when the grammar has multiple valid paths. We can't just use canShift() alone - we need lookahead.

Why This Works:

  • fn x y: ... → In Params rule, only AssignableIdentifier can shift → no lookahead needed
  • echo hello → Both can shift, but no = ahead → emits Identifier → parses as FunctionCall
  • x = 5 → Both can shift, finds = ahead → emits AssignableIdentifier → parses as Assign

2. Surrogate Pair Handling for Emoji

Location: src/parser/tokenizer.ts lines 71-84, getFullCodePoint() function

The Problem: JavaScript strings use UTF-16, but emoji like 🍤 use code points outside the BMP (Basic Multilingual Plane), requiring surrogate pairs.

The Solution: When reading characters, check for high surrogates (0xD800-0xDBFF) and combine them with low surrogates (0xDC00-0xDFFF):

const getFullCodePoint = (input: InputStream, pos: number): number => {
  const ch = input.peek(pos)

  // Check if this is a high surrogate (0xD800-0xDBFF)
  if (ch >= 0xd800 && ch <= 0xdbff) {
    const low = input.peek(pos + 1)
    // Check if next is low surrogate (0xDC00-0xDFFF)
    if (low >= 0xdc00 && low <= 0xdfff) {
      // Combine surrogate pair into full code point
      return 0x10000 + ((ch & 0x3ff) << 10) + (low & 0x3ff)
    }
  }

  return ch
}

Why This Matters: Without this, the 🍤 in shrimp-🍤 would be read as two separate UTF-16 code units (a high and a low surrogate) instead of a single code point, breaking character classification and position arithmetic.
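
A companion helper, getCharSize (used later under Common Pitfalls), reports how many UTF-16 code units to advance past a code point. A minimal sketch:

// Advance width for a code point: supplementary-plane characters
// (above 0xFFFF) occupy two UTF-16 code units.
const getCharSize = (codePoint: number): number =>
  codePoint > 0xffff ? 2 : 1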

3. Context-Aware Termination for Semicolon and Colon

Location: src/parser/tokenizer.ts lines 51-57

The Problem: How do we parse basename ./cool;2 (semicolon inside the word) vs basename ./cool; 2 (semicolon as terminator)?

The Solution: Only treat ; and : as terminators if they're followed by whitespace (or EOF):

if (canBeWord && (ch === 59 /* ; */ || ch === 58 /* : */)) {
  const nextCh = getFullCodePoint(input, pos + 1)
  if (!isWordChar(nextCh)) break  // It's a terminator
  // Otherwise, continue consuming as part of the Word
}

Examples:

  • basename ./cool; → the ; is followed by EOF, so the word terminates at ./cool
  • basename ./cool;2 → the ; is followed by 2, so it's included in the word as ./cool;2
  • basename ./cool; 2 → the ; is followed by a space, so the word terminates at ./cool and 2 is the next argument
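
isWordChar is the tokenizer's character classifier, whose exact rules aren't shown here. A minimal sketch, assuming only whitespace and EOF end a word:

// Hypothetical sketch - the real classifier likely excludes more characters
const isWordChar = (ch: number): boolean =>
  ch !== -1 &&               // EOF (input.peek returns -1 past the end)
  ch !== 32 && ch !== 9 &&   // space, tab
  ch !== 10 && ch !== 13     // newline, carriage return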

4. Scope-Aware Property Access (DotGet)

Location: src/parser/tokenizer.ts lines 19-48

The Problem: How do we distinguish obj.prop (property access) from readme.txt (filename)?

The Solution: When we see a . after an identifier, check if that identifier is in scope:

if (ch === 46 /* . */ && isValidIdentifier) {
  // Build identifier text
  let identifierText = '...'  // (surrogate-pair aware)

  const scopeContext = stack.context as ScopeContext | undefined
  const scope = scopeContext?.scope

  if (scope?.has(identifierText)) {
    // In scope - stop here, emit IdentifierBeforeDot
    // Grammar will parse as DotGet
    input.acceptToken(IdentifierBeforeDot)
    return
  }
  // Not in scope - continue consuming as Word
  // Will parse as Word("readme.txt")
}

Examples:

  • config = {path: "..."}; config.path → config is in scope → parses as DotGet(IdentifierBeforeDot, Identifier)
  • cat readme.txt → readme is not in scope → parses as Word("readme.txt")

Grammar Special Cases

1. expressionWithoutIdentifier Pattern

Location: src/parser/shrimp.grammar lines 200-210

The Problem: GLR conflict in consumeToTerminator rule:

consumeToTerminator {
  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
  expression              // → Identifier
}

When parsing my-var at statement level, both paths want the same Identifier token, causing a conflict.

The Solution: Remove Identifier from the expression path by creating expressionWithoutIdentifier:

expression {
  expressionWithoutIdentifier | DotGet | Identifier
}

expressionWithoutIdentifier {
  ParenExpr | Word | String | Number | Boolean | Regex | Null
}

Then use expressionWithoutIdentifier in places where we don't want bare identifiers:

consumeToTerminator {
  PipeExpr |
  ambiguousFunctionCall |   // ← Handles standalone identifiers
  DotGet |
  IfExpr |
  FunctionDef |
  Assign |
  BinOp |
  expressionWithoutIdentifier  // ← No bare Identifier here
}

Why This Works: Now standalone identifiers MUST go through ambiguousFunctionCall, which is semantically what we want (they're either function calls or variable references).
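
A test sketch of the intended behavior (node names assumed from the rule comments above):

test('standalone identifier goes through ambiguousFunctionCall', () => {
  expect('my-var').toMatchTree(`
    FunctionCallOrIdentifier
      Identifier my-var
  `)
})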

2. @skip {} Wrapper for DotGet

Location: src/parser/shrimp.grammar lines 176-183

The Problem: DotGet needs to be whitespace-sensitive (no spaces allowed around .), but the global @skip { space } would remove them.

The Solution: Use @skip {} (empty skip) wrapper to disable automatic whitespace skipping:

@skip {} {
  DotGet {
    IdentifierBeforeDot "." Identifier
  }

  String { "'" stringContent* "'" }
}

Why This Matters:

  • obj.prop → Parses as DotGet
  • obj. prop → With skipping disabled, the space breaks the match: obj parses alone and . prop is an error. If whitespace were skipped globally, this would wrongly match DotGet
  • obj .prop → Likewise an error instead of a false DotGet match

3. EOF Handling in item Rule

Location: src/parser/shrimp.grammar lines 54-58

The Problem: How do we handle empty lines and end-of-file without infinite loops?

The Solution: Use alternatives instead of repetition for EOF:

item {
  consumeToTerminator newlineOrSemicolon |  // Statement with newline/semicolon
  consumeToTerminator eof |                 // Statement at end of file
  newlineOrSemicolon                        // Allow blank lines
}

Why Not Just item { (statement | newlineOrSemicolon)+ eof? }?

That would match EOF multiple times (once after each statement), causing parser errors. By making EOF part of an alternative, it's only matched once per item.

4. Params Uses AssignableIdentifier

Location: src/parser/shrimp.grammar lines 153-155

Params {
  AssignableIdentifier*
}

Why This Matters: Function parameters are in "assignable" positions - they're being bound to values when the function is called. Using AssignableIdentifier here:

  1. Makes the grammar explicit about which identifiers create bindings
  2. Enables the tokenizer to use canShift(AssignableIdentifier) to detect param context
  3. Allows the scope tracker to only capture AssignableIdentifier tokens

5. String Interpolation Inside @skip

Location: src/parser/shrimp.grammar lines 181-198

The Problem: String contents need to preserve whitespace, but string interpolation $identifier needs to use the external tokenizer.

The Solution: Put String inside @skip {} and use the external tokenizer for Identifier within interpolation:

@skip {} {
  String { "'" stringContent* "'" }
}

stringContent {
  StringFragment |      // Matches literal text (preserves spaces)
  Interpolation |       // $identifier or $(expr)
  EscapeSeq            // \$, \n, etc.
}

Interpolation {
  "$" Identifier |      // Uses external tokenizer!
  "$" ParenExpr
}

Key Insight: External tokenizers work inside @skip {} blocks! The tokenizer gets called even when skip is disabled.
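
A test sketch of the resulting tree (the exact fragment and delimiter nodes are assumptions):

test('interpolation inside a single-quoted string', () => {
  expect("'hi $name'").toMatchTree(`
    String
      StringFragment hi
      Interpolation
        Identifier name
  `)
})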


Scope Tracking Architecture

Overview

Scope tracking uses Lezer's @context feature to maintain a scope chain during parsing. This enables:

  • Distinguishing obj.prop (property access) from readme.txt (filename)
  • Tracking which variables are in scope for each position in the parse tree

Architecture: Scope vs ScopeContext

Two-Class Design:

// Pure, hashable scope - only variable tracking
class Scope {
  constructor(
    public parent: Scope | null,
    public vars: Set<string>
  ) {}

  has(name: string): boolean
  add(...names: string[]): Scope
  push(): Scope  // Create child scope
  pop(): Scope   // Return to parent
  hash(): number // For incremental parsing
}

// Wrapper with temporary state
export class ScopeContext {
  constructor(
    public scope: Scope,
    public pendingIds: string[] = []
  ) {}
}

Why This Separation?

  1. Scope is pure and hashable - Only contains committed variable bindings, no temporary state
  2. ScopeContext holds temporary state - The pendingIds array captures identifiers during parsing but isn't part of the hash
  3. Hash function only hashes Scope - Incremental parsing only cares about actual scope, not pending identifiers
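
A sketch of how these methods might be implemented, assuming immutable updates (the hash below is illustrative, not the project's actual function):

class Scope {
  constructor(
    public parent: Scope | null,
    public vars: Set<string>
  ) {}

  has(name: string): boolean {
    // Walk the scope chain
    return this.vars.has(name) || (this.parent?.has(name) ?? false)
  }

  add(...names: string[]): Scope {
    // Immutable update: a new Scope sharing the same parent
    return new Scope(this.parent, new Set([...this.vars, ...names]))
  }

  push(): Scope {
    return new Scope(this, new Set())  // child scope
  }

  pop(): Scope {
    return this.parent ?? this  // stay at root rather than crash
  }

  hash(): number {
    // Illustrative: fold sorted names and the parent hash into one number
    let h = this.parent?.hash() ?? 0
    for (const name of [...this.vars].sort())
      for (let i = 0; i < name.length; i++)
        h = ((h << 5) - h + name.charCodeAt(i)) | 0
    return h
  }
}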

How Scope Tracking Works

1. Capture Phase (shift):

When the parser shifts an AssignableIdentifier token, the scope tracker captures its text:

shift(context, term, stack, input) {
  if (term === terms.AssignableIdentifier) {
    // Read the token's text from the input; it spans input.pos to stack.pos
    const text = input.read(input.pos, stack.pos)

    return new ScopeContext(
      context.scope,
      [...context.pendingIds, text]  // Append to pending
    )
  }
  return context
}

2. Commit Phase (reduce):

When the parser reduces to Assign or Params, the scope tracker commits pending identifiers:

reduce(context, term, stack, input) {
  // Assignment: pop last identifier, add to scope
  if (term === terms.Assign && context.pendingIds.length > 0) {
    const varName = context.pendingIds[context.pendingIds.length - 1]!
    return new ScopeContext(
      context.scope.add(varName),      // Add to scope
      context.pendingIds.slice(0, -1)  // Remove from pending
    )
  }

  // Function params: add all identifiers, push new scope
  if (term === terms.Params) {
    const newScope = context.scope.push()
    return new ScopeContext(
      context.pendingIds.length > 0
        ? newScope.add(...context.pendingIds)
        : newScope,
      []  // Clear pending
    )
  }

  // Function exit: pop scope
  if (term === terms.FunctionDef) {
    return new ScopeContext(context.scope.pop(), [])
  }

  return context
}

3. Usage in Tokenizer:

The tokenizer accesses scope to check if identifiers are bound:

const scopeContext = stack.context as ScopeContext | undefined
const scope = scopeContext?.scope

if (scope?.has(identifierText)) {
  // Identifier is in scope - can use in DotGet
  input.acceptToken(IdentifierBeforeDot)
}
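
These hooks plug into Lezer's ContextTracker, which the grammar references via @context. A minimal wiring sketch:

import { ContextTracker } from '@lezer/lr'

// Wiring sketch: the shift/reduce hooks above, plus a hash over the
// committed scope only (see Common Pitfalls below).
export const scopeTracker = new ContextTracker<ScopeContext>({
  start: new ScopeContext(new Scope(null, new Set())),
  shift(context, term, stack, input) { /* capture phase, as above */ return context },
  reduce(context, term) { /* commit phase, as above */ return context },
  hash: context => context.scope.hash(),
})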

Why Only Track AssignableIdentifier?

Before (complex):

  • Tracked ALL identifiers with term === terms.Identifier
  • Used isInParams flag to know which ones to keep
  • Had to manually clear "stale" identifiers after DotGet, FunctionCall, etc.

After (simple):

  • Only track AssignableIdentifier tokens
  • These only appear in Params and Assign (by grammar design)
  • No stale identifiers - they're consumed immediately

Example:

fn x y: echo x end

Scope tracking:

  1. Shift AssignableIdentifier("x") → pending = ["x"]
  2. Shift AssignableIdentifier("y") → pending = ["x", "y"]
  3. Reduce Params → scope = {x, y}, pending = []
  4. Shift Identifier("echo") → not captured (not an AssignableIdentifier)
  5. Shift Identifier("x") → not captured
  6. Reduce FunctionDef → pop scope

No stale identifier clearing needed!


Common Pitfalls

1. Forgetting Surrogate Pairs

Problem: Using input.peek(i) directly gives UTF-16 code units, not Unicode code points.

Solution: Always use getFullCodePoint(input, pos) when working with emoji.

Example:

// ❌ Wrong - breaks on emoji
const ch = input.peek(pos)
if (isEmoji(ch)) { ... }

// ✓ Right - handles surrogate pairs
const ch = getFullCodePoint(input, pos)
if (isEmoji(ch)) { ... }
pos += getCharSize(ch)  // Advance by 1 or 2 code units

2. Adding Pending State to Hash

Problem: Including pendingIds or isInParams in the hash function breaks incremental parsing.

Why? The hash is used to determine if a cached parse tree node can be reused. If the hash includes temporary state that doesn't affect parsing decisions, nodes will be invalidated unnecessarily.

Solution: Only hash the Scope (vars + parent chain), not the ScopeContext wrapper.

// ✓ Right
const hashScope = (context: ScopeContext): number => {
  return context.scope.hash()  // Only hash committed scope
}

// ❌ Wrong
const hashScope = (context: ScopeContext): number => {
  let h = context.scope.hash()
  h = (h << 5) - h + context.pendingIds.length  // Don't do this!
  return h
}

3. Using canShift() Alone for Disambiguation

Problem: stack.canShift(AssignableIdentifier) returns true when BOTH paths are possible (e.g., at statement start).

Why? The GLR parser maintains multiple parse states. If any state can shift the token, canShift() returns true.

Solution: Check BOTH token types and use lookahead when both are possible:

const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)

if (canAssignable && canRegular) {
  // Both possible - need lookahead
  const hasEquals = peekForEquals(input, pos)
  input.acceptToken(hasEquals ? AssignableIdentifier : Identifier)
}
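
peekForEquals above is shorthand; a minimal sketch, using offsets relative to the stream position as elsewhere in this file:

import { InputStream } from '@lezer/lr'

// Skip spaces after the identifier, then report whether '=' follows
const peekForEquals = (input: InputStream, pos: number): boolean => {
  let p = pos
  while (input.peek(p) === 32 /* space */) p++
  return input.peek(p) === 61 /* = */
}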

4. Clearing Pending Identifiers Too Eagerly

Problem: In the old code, we had to clear pending identifiers after DotGet, FunctionCall, etc. to prevent state leakage. This was fragile and easy to forget.

Why This Happened: We were tracking ALL identifiers, not just assignable ones.

Solution: Only track AssignableIdentifier tokens. They only appear in contexts where they'll be consumed (Params, Assign), so no clearing needed.

5. Line Number Confusion in Edit Tool

Problem: The Edit tool displays each line with a numeric prefix (like 5→), but the prefix is part of the display, not of the file's contents.

How to Read:

  • The number before the → is the actual line number
  • Use that number when referencing code in comments or documentation
  • Example: 5→export const foo means export const foo is on line 5

Testing Strategy

Parser Tests

Use the toMatchTree helper to verify parse tree structure:

test('assignment with AssignableIdentifier', () => {
  expect('x = 5').toMatchTree(`
    Assign
      AssignableIdentifier x
      operator =
      Number 5
  `)
})

Key Testing Patterns:

  • Test both token type expectations (Identifier vs AssignableIdentifier)
  • Test scope-aware features (DotGet for in-scope vs Word for out-of-scope)
  • Test edge cases (empty lines, EOF, surrogate pairs)
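
For example, a sketch of a scope-aware test, assuming the tree shape follows the grammar rules shown earlier:

test('DotGet only for in-scope identifiers', () => {
  expect('config = 5\nconfig.path').toMatchTree(`
    Assign
      AssignableIdentifier config
      operator =
      Number 5
    DotGet
      IdentifierBeforeDot config
      Identifier path
  `)
})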

Debugging Parser Issues

  1. Check token types: Run parser on input and examine tree structure
  2. Test canShift(): Add logging to tokenizer to see what canShift() returns
  3. Verify scope state: Log scope contents during parsing
  4. Use GLR visualization: Lezer has tools for visualizing parse states

Further Reading