From 290270dc7b71694f92feb965411317d2f6a5a677 Mon Sep 17 00:00:00 2001
From: Corey Johnson <probablycorey@gmail.com>
Date: Fri, 17 Oct 2025 19:15:43 -0700
Subject: [PATCH] docs: add comprehensive parser architecture documentation

---
 docs/parser-architecture.md | 557 ++++++++++++++++++++++++++++++++++++
 1 file changed, 557 insertions(+)
 create mode 100644 docs/parser-architecture.md

diff --git a/docs/parser-architecture.md b/docs/parser-architecture.md
new file mode 100644
index 0000000..ee0984c
--- /dev/null
+++ b/docs/parser-architecture.md
@@ -0,0 +1,557 @@
+# Shrimp Parser Architecture
+
+This document explains the special cases, tricks, and design decisions in the Shrimp parser and tokenizer.
+
+## Table of Contents
+
+1. [Token Types and Their Purpose](#token-types-and-their-purpose)
+2. [External Tokenizer Tricks](#external-tokenizer-tricks)
+3. [Grammar Special Cases](#grammar-special-cases)
+4. [Scope Tracking Architecture](#scope-tracking-architecture)
+5. [Common Pitfalls](#common-pitfalls)
+6. [Testing Strategy](#testing-strategy)
+7. [Further Reading](#further-reading)
+
+---
+
+## Token Types and Their Purpose
+
+### Four Token Types from External Tokenizer
+
+The external tokenizer (`src/parser/tokenizer.ts`) emits four different token types based on context:
+
+| Token | Purpose | Example |
+|-------|---------|---------|
+| `Identifier` | Regular identifiers in expressions, function calls | `echo`, `x` in `x + 1` |
+| `AssignableIdentifier` | Identifiers on LHS of `=` or in function params | `x` in `x = 5`, params in `fn x y:` |
+| `Word` | Anything else: paths, URLs, @mentions, #hashtags | `./file.txt`, `@user`, `#tag` |
+| `IdentifierBeforeDot` | Identifier that's in scope, followed by `.` | `obj` in `obj.prop` |
+
+### Why We Need Both Identifier Types
+
+**The Problem:** At the start of a statement like `x ...`, the parser doesn't know if it's:
+- An assignment: `x = 5` (needs `AssignableIdentifier`)
+- A function call: `x hello world` (needs `Identifier`)
+
+**The Solution:** The external tokenizer uses a three-way decision:
+
+1. **Only `AssignableIdentifier` can shift** (e.g., in `Params` rule) → emit `AssignableIdentifier`
+2. **Only `Identifier` can shift** (e.g., in function arguments) → emit `Identifier`
+3. **Both can shift** (ambiguous statement start) → peek ahead for `=` to disambiguate
+
+See [`Identifier vs AssignableIdentifier Disambiguation`](#identifier-vs-assignableidentifier-disambiguation) below for implementation details.
+
+---
+
+## External Tokenizer Tricks
+
+### 1. Identifier vs AssignableIdentifier Disambiguation
+
+**Location:** `src/parser/tokenizer.ts` lines 88-118
+
+**The Challenge:** When both `Identifier` and `AssignableIdentifier` are valid (at statement start), how do we choose?
+
+**The Solution:** Three-way branching with lookahead:
+
+```typescript
+const canAssignable = stack.canShift(AssignableIdentifier)
+const canRegular = stack.canShift(Identifier)
+
+if (canAssignable && !canRegular) {
+  // Only AssignableIdentifier valid (e.g., in Params)
+  input.acceptToken(AssignableIdentifier)
+} else if (canRegular && !canAssignable) {
+  // Only Identifier valid (e.g., in function args)
+  input.acceptToken(Identifier)
+} else {
+  // BOTH possible - peek ahead for '='
+  // Skip whitespace, check if next char is '='
+  const nextCh = getFullCodePoint(input, peekPos)
+  if (nextCh === 61 /* = */) {
+    input.acceptToken(AssignableIdentifier)  // It's an assignment
+  } else {
+    input.acceptToken(Identifier)  // It's a function call
+  }
+}
+```
+
+**Key Insight:** `stack.canShift()` returns true for BOTH token types when the grammar has multiple valid paths. We can't just use `canShift()` alone - we need lookahead.
+
+**Why This Works:**
+- `fn x y: ...` → In `Params` rule, only `AssignableIdentifier` can shift → no lookahead needed
+- `echo hello` → Both can shift, but no `=` ahead → emits `Identifier` → parses as `FunctionCall`
+- `x = 5` → Both can shift, finds `=` ahead → emits `AssignableIdentifier` → parses as `Assign`
+
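+The excerpt above elides the whitespace-skipping step that produces `peekPos`. A minimal sketch of that lookahead follows. `PeekableInput` and `inputFrom` are hypothetical stand-ins so the helper can run without a parser, and the choice to skip only spaces and tabs is an assumption; the real tokenizer may differ. Note that this sketch does not distinguish `=` from a longer operator that begins with `=`.
+
+```typescript
+// Hypothetical stand-in for the one method of Lezer's InputStream used here.
+interface PeekableInput {
+  peek(offset: number): number // code unit at offset, or -1 past the end
+}
+
+// Test double: wraps a string so the helper can be exercised on its own.
+const inputFrom = (text: string): PeekableInput => ({
+  peek: (offset) => (offset >= 0 && offset < text.length ? text.charCodeAt(offset) : -1),
+})
+
+// Returns true when the first non-blank character at or after `pos` is '='.
+const peekForEquals = (input: PeekableInput, pos: number): boolean => {
+  let peekPos = pos
+  // Assumption: skip spaces (32) and tabs (9) only, so a newline ends the statement.
+  while (input.peek(peekPos) === 32 || input.peek(peekPos) === 9) peekPos++
+  return input.peek(peekPos) === 61 /* = */
+}
+
+console.log(peekForEquals(inputFrom('x = 5'), 1)) // true
+console.log(peekForEquals(inputFrom('echo hello'), 4)) // false
+```
+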
+### 2. Surrogate Pair Handling for Emoji
+
+**Location:** `src/parser/tokenizer.ts` lines 71-84, `getFullCodePoint()` function
+
+**The Problem:** JavaScript strings use UTF-16, but emoji like 🍤 use code points outside the BMP (Basic Multilingual Plane), requiring surrogate pairs.
+
+**The Solution:** When reading characters, check for high surrogates (0xD800-0xDBFF) and combine them with low surrogates (0xDC00-0xDFFF):
+
+```typescript
+const getFullCodePoint = (input: InputStream, pos: number): number => {
+  const ch = input.peek(pos)
+
+  // Check if this is a high surrogate (0xD800-0xDBFF)
+  if (ch >= 0xd800 && ch <= 0xdbff) {
+    const low = input.peek(pos + 1)
+    // Check if next is low surrogate (0xDC00-0xDFFF)
+    if (low >= 0xdc00 && low <= 0xdfff) {
+      // Combine surrogate pair into full code point
+      return 0x10000 + ((ch & 0x3ff) << 10) + (low & 0x3ff)
+    }
+  }
+
+  return ch
+}
+```
+
+**Why This Matters:** Without this, the 🍤 in `shrimp-🍤` would be read as two separate code units (`<high>` and `<low>`) instead of one code point, so character checks would run on each surrogate half rather than on the emoji itself.
+
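+As a quick check of the arithmetic, the sketch below applies the same formula to 🍤 and compares the result with the built-in `codePointAt`. `combineSurrogates` is a hypothetical name for the formula inside `getFullCodePoint`, and the body of `getCharSize` is an assumption about a helper this document mentions but does not show.
+
+```typescript
+// Combine a UTF-16 surrogate pair into one code point (same arithmetic as getFullCodePoint).
+const combineSurrogates = (high: number, low: number): number =>
+  0x10000 + ((high & 0x3ff) << 10) + (low & 0x3ff)
+
+// Assumed definition: how many UTF-16 code units a code point occupies.
+const getCharSize = (codePoint: number): number => (codePoint > 0xffff ? 2 : 1)
+
+const shrimp = '🍤' // U+1F364, stored as two code units
+const high = shrimp.charCodeAt(0) // 0xd83c
+const low = shrimp.charCodeAt(1) // 0xdf64
+const combined = combineSurrogates(high, low)
+
+console.log(shrimp.length) // 2
+console.log(combined.toString(16)) // 1f364
+console.log(combined === shrimp.codePointAt(0)) // true
+console.log(getCharSize(combined), getCharSize(0x61)) // 2 1
+```
+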
+### 3. Context-Aware Termination for Semicolon and Colon
+
+**Location:** `src/parser/tokenizer.ts` lines 51-57
+
+**The Problem:** How do we parse `basename ./cool;` vs `basename ./cool; 2`?
+
+**The Solution:** Treat `;` and `:` as terminators only when the next character is not a word character (whitespace or EOF, for example):
+
+```typescript
+if (canBeWord && (ch === 59 /* ; */ || ch === 58 /* : */)) {
+  const nextCh = getFullCodePoint(input, pos + 1)
+  if (!isWordChar(nextCh)) break  // It's a terminator
+  // Otherwise, continue consuming as part of the Word
+}
+```
+
+**Examples:**
+- `basename ./cool;` → `;` is followed by EOF → terminates the word at `./cool`
+- `basename ./cool;2` → `;` is followed by `2` → included in word as `./cool;2`
+- `basename ./cool; 2` → `;` is followed by space → terminates at `./cool`, `2` is next arg
+
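+The three examples can be reproduced with a small standalone sketch. `readWord` is a hypothetical helper, not the tokenizer's actual loop, and this `isWordChar` assumes a word character is anything other than whitespace or end of input, which is likely narrower than the real definition.
+
+```typescript
+// Assumption: a word character is anything except whitespace or end of input (-1).
+const isWordChar = (ch: number): boolean =>
+  ch !== -1 && ch !== 32 && ch !== 9 && ch !== 10 && ch !== 13
+
+// Hypothetical helper: read one Word from `text` starting at `start`.
+const readWord = (text: string, start: number): string => {
+  const at = (i: number): number => (i < text.length ? text.charCodeAt(i) : -1)
+  let pos = start
+  while (isWordChar(at(pos))) {
+    const ch = at(pos)
+    // ';' or ':' ends the word only when no word character follows it.
+    if ((ch === 59 /* ; */ || ch === 58 /* : */) && !isWordChar(at(pos + 1))) break
+    pos++
+  }
+  return text.slice(start, pos)
+}
+
+console.log(readWord('basename ./cool;', 9)) // ./cool
+console.log(readWord('basename ./cool;2', 9)) // ./cool;2
+console.log(readWord('basename ./cool; 2', 9)) // ./cool
+```
+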
+### 4. Scope-Aware Property Access (DotGet)
+
+**Location:** `src/parser/tokenizer.ts` lines 19-48
+
+**The Problem:** How do we distinguish `obj.prop` (property access) from `readme.txt` (filename)?
+
+**The Solution:** When we see a `.` after an identifier, check if that identifier is in scope:
+
+```typescript
+if (ch === 46 /* . */ && isValidIdentifier) {
+  // Build identifier text
+  let identifierText = '...'  // (surrogate-pair aware)
+
+  const scopeContext = stack.context as ScopeContext | undefined
+  const scope = scopeContext?.scope
+
+  if (scope?.has(identifierText)) {
+    // In scope - stop here, emit IdentifierBeforeDot
+    // Grammar will parse as DotGet
+    input.acceptToken(IdentifierBeforeDot)
+    return
+  }
+  // Not in scope - continue consuming as Word
+  // Will parse as Word("readme.txt")
+}
+```
+
+**Examples:**
+- `config = {path: "..."}; config.path` → `config` is in scope → parses as `DotGet(IdentifierBeforeDot, Identifier)`
+- `cat readme.txt` → `readme` is not in scope → parses as `Word("readme.txt")`
+
+---
+
+## Grammar Special Cases
+
+### 1. expressionWithoutIdentifier Pattern
+
+**Location:** `src/parser/shrimp.grammar` lines 200-210
+
+**The Problem:** GLR conflict in `consumeToTerminator` rule:
+
+```lezer
+consumeToTerminator {
+  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
+  expression              // → Identifier
+}
+```
+
+When parsing `my-var` at statement level, both paths want the same `Identifier` token, causing a conflict.
+
+**The Solution:** Remove `Identifier` from the `expression` path by creating `expressionWithoutIdentifier`:
+
+```lezer
+expression {
+  expressionWithoutIdentifier | DotGet | Identifier
+}
+
+expressionWithoutIdentifier {
+  ParenExpr | Word | String | Number | Boolean | Regex | Null
+}
+```
+
+Then use `expressionWithoutIdentifier` in places where we don't want bare identifiers:
+
+```lezer
+consumeToTerminator {
+  PipeExpr |
+  ambiguousFunctionCall |   // ← Handles standalone identifiers
+  DotGet |
+  IfExpr |
+  FunctionDef |
+  Assign |
+  BinOp |
+  expressionWithoutIdentifier  // ← No bare Identifier here
+}
+```
+
+**Why This Works:** Now standalone identifiers MUST go through `ambiguousFunctionCall`, which is semantically what we want (they're either function calls or variable references).
+
+### 2. @skip {} Wrapper for DotGet
+
+**Location:** `src/parser/shrimp.grammar` lines 176-183
+
+**The Problem:** DotGet needs to be whitespace-sensitive (no spaces allowed around `.`), but the global `@skip { space }` would silently skip any spaces around the `.`.
+
+**The Solution:** Use `@skip {}` (empty skip) wrapper to disable automatic whitespace skipping:
+
+```lezer
+@skip {} {
+  DotGet {
+    IdentifierBeforeDot "." Identifier
+  }
+
+  String { "'" stringContent* "'" }
+}
+```
+
+**Why This Matters:**
+- `obj.prop` → Parses as `DotGet` ✓
+- `obj. prop` → Not a `DotGet`: parses as `obj` followed by `. prop` (error). Skipping whitespace inside the rule would wrongly accept it
+- `obj .prop` → Not a `DotGet`: parses as `obj` followed by `.prop` (error), because the space before `.` is significant
+
+### 3. EOF Handling in item Rule
+
+**Location:** `src/parser/shrimp.grammar` lines 54-58
+
+**The Problem:** How do we handle empty lines and end-of-file without infinite loops?
+
+**The Solution:** Use alternatives instead of repetition for EOF:
+
+```lezer
+item {
+  consumeToTerminator newlineOrSemicolon |  // Statement with newline/semicolon
+  consumeToTerminator eof |                 // Statement at end of file
+  newlineOrSemicolon                        // Allow blank lines
+}
+```
+
+**Why Not Just `item { (statement | newlineOrSemicolon)+ eof? }`?**
+
+That would match EOF multiple times (once after each statement), causing parser errors. By making EOF part of an alternative, it's only matched once per item.
+
+### 4. Params Uses AssignableIdentifier
+
+**Location:** `src/parser/shrimp.grammar` lines 153-155
+
+```lezer
+Params {
+  AssignableIdentifier*
+}
+```
+
+**Why This Matters:** Function parameters are in "assignable" positions - they're being bound to values when the function is called. Using `AssignableIdentifier` here:
+1. Makes the grammar explicit about which identifiers create bindings
+2. Enables the tokenizer to use `canShift(AssignableIdentifier)` to detect param context
+3. Allows the scope tracker to only capture `AssignableIdentifier` tokens
+
+### 5. String Interpolation Inside @skip {}
+
+**Location:** `src/parser/shrimp.grammar` lines 181-198
+
+**The Problem:** String contents need to preserve whitespace, but string interpolation `$identifier` needs to use the external tokenizer.
+
+**The Solution:** Put `String` inside `@skip {}` and use the external tokenizer for `Identifier` within interpolation:
+
+```lezer
+@skip {} {
+  String { "'" stringContent* "'" }
+}
+
+stringContent {
+  StringFragment |      // Matches literal text (preserves spaces)
+  Interpolation |       // $identifier or $(expr)
+  EscapeSeq            // \$, \n, etc.
+}
+
+Interpolation {
+  "$" Identifier |      // Uses external tokenizer!
+  "$" ParenExpr
+}
+```
+
+**Key Insight:** External tokenizers work inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.
+
+---
+
+## Scope Tracking Architecture
+
+### Overview
+
+Scope tracking uses Lezer's `@context` feature to maintain a scope chain during parsing. This enables:
+- Distinguishing `obj.prop` (property access) from `readme.txt` (filename)
+- Tracking which variables are in scope for each position in the parse tree
+
+### Architecture: Scope vs ScopeContext
+
+**Two-Class Design:**
+
+```typescript
+// Pure, hashable scope - only variable tracking
+class Scope {
+  constructor(
+    public parent: Scope | null,
+    public vars: Set<string>
+  ) {}
+
+  has(name: string): boolean
+  add(...names: string[]): Scope
+  push(): Scope  // Create child scope
+  pop(): Scope   // Return to parent
+  hash(): number // For incremental parsing
+}
+
+// Wrapper with temporary state
+export class ScopeContext {
+  constructor(
+    public scope: Scope,
+    public pendingIds: string[] = []
+  ) {}
+}
+```
+
+**Why This Separation?**
+
+1. **Scope is pure and hashable** - Only contains committed variable bindings, no temporary state
+2. **ScopeContext holds temporary state** - The `pendingIds` array captures identifiers during parsing but isn't part of the hash
+3. **Hash function only hashes Scope** - Incremental parsing only cares about actual scope, not pending identifiers
+
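+The class listing above shows only method signatures. One way the bodies could look is sketched below. Everything inside the methods is an assumption: in particular that `add` returns a new `Scope` instead of mutating, that `pop` on the root returns the root itself, and the specific hash mixing. The actual `scopeTracker.ts` may differ.
+
+```typescript
+class Scope {
+  constructor(
+    public parent: Scope | null,
+    public vars: Set<string>
+  ) {}
+
+  // Look up a name here, then in each enclosing scope.
+  has(name: string): boolean {
+    return this.vars.has(name) || (this.parent?.has(name) ?? false)
+  }
+
+  // Return a new scope at the same level with the extra names (no mutation).
+  add(...names: string[]): Scope {
+    return new Scope(this.parent, new Set(Array.from(this.vars).concat(names)))
+  }
+
+  // Create an empty child scope.
+  push(): Scope {
+    return new Scope(this, new Set<string>())
+  }
+
+  // Return to the parent; the root scope pops to itself (assumption).
+  pop(): Scope {
+    return this.parent ?? this
+  }
+
+  // Order-independent hash of the committed names plus the parent chain.
+  hash(): number {
+    let h = this.parent ? this.parent.hash() : 0
+    for (const name of Array.from(this.vars).sort()) {
+      for (let i = 0; i < name.length; i++) {
+        h = ((h << 5) - h + name.charCodeAt(i)) | 0
+      }
+      h = ((h << 5) - h + 44) | 0 // separator between names
+    }
+    return h
+  }
+}
+
+const root = new Scope(null, new Set(['config']))
+const inner = root.push().add('x', 'y')
+
+console.log(inner.has('config'), inner.has('x'), root.has('x')) // true true false
+console.log(inner.pop() === root) // true
+```
+
+Keeping `Scope` immutable means a `ScopeContext` can share it freely across parse stacks, and two scopes with the same names hash the same regardless of insertion order.
+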
+### How Scope Tracking Works
+
+**1. Capture Phase (shift):**
+
+When the parser shifts an `AssignableIdentifier` token, the scope tracker captures its text:
+
+```typescript
+shift(context, term, stack, input) {
+  if (term === terms.AssignableIdentifier) {
+    // Build text by peeking at input
+    let text = '...'  // (read from input.pos to stack.pos)
+
+    return new ScopeContext(
+      context.scope,
+      [...context.pendingIds, text]  // Append to pending
+    )
+  }
+  return context
+}
+```
+
+**2. Commit Phase (reduce):**
+
+When the parser reduces to `Assign` or `Params`, the scope tracker commits pending identifiers:
+
+```typescript
+reduce(context, term, stack, input) {
+  // Assignment: pop last identifier, add to scope
+  if (term === terms.Assign && context.pendingIds.length > 0) {
+    const varName = context.pendingIds[context.pendingIds.length - 1]!
+    return new ScopeContext(
+      context.scope.add(varName),      // Add to scope
+      context.pendingIds.slice(0, -1)  // Remove from pending
+    )
+  }
+
+  // Function params: add all identifiers, push new scope
+  if (term === terms.Params) {
+    const newScope = context.scope.push()
+    return new ScopeContext(
+      context.pendingIds.length > 0
+        ? newScope.add(...context.pendingIds)
+        : newScope,
+      []  // Clear pending
+    )
+  }
+
+  // Function exit: pop scope
+  if (term === terms.FunctionDef) {
+    return new ScopeContext(context.scope.pop(), [])
+  }
+
+  return context
+}
+```
+
+**3. Usage in Tokenizer:**
+
+The tokenizer accesses scope to check if identifiers are bound:
+
+```typescript
+const scopeContext = stack.context as ScopeContext | undefined
+const scope = scopeContext?.scope
+
+if (scope?.has(identifierText)) {
+  // Identifier is in scope - can use in DotGet
+  input.acceptToken(IdentifierBeforeDot)
+}
+```
+
+### Why Only Track AssignableIdentifier?
+
+**Before (complex):**
+- Tracked ALL identifiers with `term === terms.Identifier`
+- Used `isInParams` flag to know which ones to keep
+- Had to manually clear "stale" identifiers after DotGet, FunctionCall, etc.
+
+**After (simple):**
+- Only track `AssignableIdentifier` tokens
+- These only appear in `Params` and `Assign` (by grammar design)
+- No stale identifiers - they're consumed immediately
+
+**Example:**
+
+```shrimp
+fn x y: echo x end
+```
+
+Scope tracking:
+1. Shift `AssignableIdentifier("x")` → pending = ["x"]
+2. Shift `AssignableIdentifier("y")` → pending = ["x", "y"]
+3. Reduce `Params` → scope = {x, y}, pending = []
+4. Shift `Identifier("echo")` → **not captured** (not AssignableIdentifier)
+5. Shift `Identifier("x")` → **not captured**
+6. Reduce `FunctionDef` → pop scope
+
+No stale identifier clearing needed!
+
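+The six steps can be replayed with a simplified model of the tracker. This is an illustrative sketch only: `TraceEvent`, `TraceContext`, and `step` are hypothetical names, scopes are plain arrays instead of `Scope` objects, and terms are strings instead of Lezer term ids.
+
+```typescript
+// Hypothetical event model for the trace above.
+type TraceEvent =
+  | { kind: 'shift'; term: string; text: string }
+  | { kind: 'reduce'; term: string }
+
+interface TraceContext {
+  scopes: string[][] // innermost scope last
+  pending: string[]
+}
+
+const step = (ctx: TraceContext, ev: TraceEvent): TraceContext => {
+  if (ev.kind === 'shift') {
+    // Only AssignableIdentifier tokens are captured.
+    return ev.term === 'AssignableIdentifier'
+      ? { scopes: ctx.scopes, pending: [...ctx.pending, ev.text] }
+      : ctx
+  }
+  if (ev.term === 'Params') {
+    // Push a new scope holding every pending name, then clear pending.
+    return { scopes: [...ctx.scopes, ctx.pending], pending: [] }
+  }
+  if (ev.term === 'FunctionDef') {
+    // Leaving the function: drop the innermost scope.
+    return { scopes: ctx.scopes.slice(0, -1), pending: [] }
+  }
+  return ctx
+}
+
+// Events for: fn x y: echo x end
+const events: TraceEvent[] = [
+  { kind: 'shift', term: 'AssignableIdentifier', text: 'x' },
+  { kind: 'shift', term: 'AssignableIdentifier', text: 'y' },
+  { kind: 'reduce', term: 'Params' },
+  { kind: 'shift', term: 'Identifier', text: 'echo' },
+  { kind: 'shift', term: 'Identifier', text: 'x' },
+  { kind: 'reduce', term: 'FunctionDef' },
+]
+
+const trace: TraceContext[] = [{ scopes: [[]], pending: [] }]
+for (const ev of events) trace.push(step(trace[trace.length - 1]!, ev))
+
+console.log(JSON.stringify(trace[2])) // {"scopes":[[]],"pending":["x","y"]}
+console.log(JSON.stringify(trace[3])) // {"scopes":[[],["x","y"]],"pending":[]}
+console.log(JSON.stringify(trace[6])) // {"scopes":[[]],"pending":[]}
+```
+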
+---
+
+## Common Pitfalls
+
+### 1. Forgetting Surrogate Pairs
+
+**Problem:** Using `input.peek(i)` directly gives UTF-16 code units, not Unicode code points.
+
+**Solution:** Always read characters with `getFullCodePoint(input, pos)`, and advance by the size of the code point that was read (1 or 2 code units).
+
+**Example:**
+```typescript
+// ❌ Wrong - breaks on emoji
+const ch = input.peek(pos)
+if (isEmoji(ch)) { ... }
+
+// ✓ Right - handles surrogate pairs
+const ch = getFullCodePoint(input, pos)
+if (isEmoji(ch)) { ... }
+pos += getCharSize(ch)  // Advance by 1 or 2 code units
+```
+
+### 2. Adding Pending State to Hash
+
+**Problem:** Including `pendingIds` or `isInParams` in the hash function breaks incremental parsing.
+
+**Why?** The hash is used to determine if a cached parse tree node can be reused. If the hash includes temporary state that doesn't affect parsing decisions, nodes will be invalidated unnecessarily.
+
+**Solution:** Only hash the `Scope` (vars + parent chain), not the `ScopeContext` wrapper.
+
+```typescript
+// ✓ Right
+const hashScope = (context: ScopeContext): number => {
+  return context.scope.hash()  // Only hash committed scope
+}
+
+// ❌ Wrong
+const hashScope = (context: ScopeContext): number => {
+  let h = context.scope.hash()
+  h = (h << 5) - h + context.pendingIds.length  // Don't do this!
+  return h
+}
+```
+
+### 3. Using canShift() Alone for Disambiguation
+
+**Problem:** `stack.canShift(AssignableIdentifier)` returns true when BOTH paths are possible (e.g., at statement start).
+
+**Why?** The GLR parser maintains multiple parse states. If any state can shift the token, `canShift()` returns true.
+
+**Solution:** Check BOTH token types and use lookahead when both are possible:
+
+```typescript
+const canAssignable = stack.canShift(AssignableIdentifier)
+const canRegular = stack.canShift(Identifier)
+
+if (canAssignable && canRegular) {
+  // Both possible - need lookahead
+  const hasEquals = peekForEquals(input, pos)
+  input.acceptToken(hasEquals ? AssignableIdentifier : Identifier)
+}
+```
+
+### 4. Clearing Pending Identifiers Too Eagerly
+
+**Problem:** In the old code, we had to clear pending identifiers after DotGet, FunctionCall, etc. to prevent state leakage. This was fragile and easy to forget.
+
+**Why This Happened:** We were tracking ALL identifiers, not just assignable ones.
+
+**Solution:** Only track `AssignableIdentifier` tokens. They only appear in contexts where they'll be consumed (Params, Assign), so no clearing needed.
+
+### 5. Line Number Confusion in Edit Tool
+
+**Problem:** The Edit tool shows each line with a prefix (like `     5→`). The prefix is display-only and is not part of the file's contents.
+
+**How to Read:**
+- The number before `→` is the actual line number
+- Use that number when referencing code in comments or documentation
+- Example: `     5→export const foo` means the code is on line 5
+
+---
+
+## Testing Strategy
+
+### Parser Tests
+
+Use the `toMatchTree` helper to verify parse tree structure:
+
+```typescript
+test('assignment with AssignableIdentifier', () => {
+  expect('x = 5').toMatchTree(`
+    Assign
+      AssignableIdentifier x
+      operator =
+      Number 5
+  `)
+})
+```
+
+**Key Testing Patterns:**
+- Test both token type expectations (Identifier vs AssignableIdentifier)
+- Test scope-aware features (DotGet for in-scope vs Word for out-of-scope)
+- Test edge cases (empty lines, EOF, surrogate pairs)
+
+### Debugging Parser Issues
+
+1. **Check token types:** Run parser on input and examine tree structure
+2. **Test canShift():** Add logging to tokenizer to see what `canShift()` returns
+3. **Verify scope state:** Log scope contents during parsing
+4. **Trace parse states:** Setting the `LOG=parse` environment variable makes `@lezer/lr` log the parse actions it takes, which helps when tracking down GLR conflicts
+
+---
+
+## Further Reading
+
+- [Lezer System Guide](https://lezer.codemirror.net/docs/guide/)
+- [Lezer API Reference](https://lezer.codemirror.net/docs/ref/)
+- [CLAUDE.md](../CLAUDE.md) - General project guidance
+- [Scope Tracker Source](../src/parser/scopeTracker.ts)
+- [Tokenizer Source](../src/parser/tokenizer.ts)