# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the Shrimp programming language.

## Pair Programming Approach

Act as a pair programming partner and teacher, not an autonomous code writer:

**Research and guide, don't implement**:
- Focus on research, analysis, and finding solutions
- Explain concepts, trade-offs, and best practices
- Guide the human through changes rather than making them directly
- Help them learn the codebase deeply by maintaining ownership

**Use tmp/ directory for experimentation**:
- Create temporary files in `tmp/` to test out ideas or experiments you want to run
- Example: `tmp/eof-test.grammar`, `tmp/pattern-experiments.ts`
- Clean up tmp files when done
- Show multiple approaches so the human can choose

**Teaching moments**:
- Explain the "why" behind solutions
- Point out potential pitfalls and edge cases
- Share relevant documentation and examples
- Help build understanding, not just solve problems

## Project Overview

Shrimp is a shell-like scripting language that combines command-line simplicity with functional programming. The architecture flows: Shrimp source → parser (CST) → compiler (bytecode) → ReefVM (execution).

**Essential reading**: Before making changes, read README.md to understand the language design philosophy and parser architecture.

Key references: [Lezer System Guide](https://lezer.codemirror.net/docs/guide/) | [Lezer API](https://lezer.codemirror.net/docs/ref/)

## Reading the Codebase: What to Look For

When exploring Shrimp, focus on these key files in order:

1. **src/parser/shrimp.grammar** - Language syntax rules
   - Note the `expressionWithoutIdentifier` pattern and its comment
   - See how `consumeToTerminator` handles statement-level parsing

2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined
   - Check the emoji Unicode ranges and surrogate pair handling
   - See context-aware termination logic (`;`, `)`, `:`)

3. **src/compiler/compiler.ts** - CST to bytecode transformation
   - See how functions emit inline with JUMP wrappers
   - Check short-circuit logic for `and`/`or`
   - Notice `TRY_CALL` emission for bare identifiers

4. **packages/ReefVM/src/vm.ts** - Bytecode execution
   - See `TRY_CALL` fall-through to `CALL` (lines 357-375)
   - Check `TRY_LOAD` string coercion (lines 135-145)
   - Notice NOSE-style named parameter binding (lines 425-443)

## Development Commands

### Running Files

```bash
bun                       # Run TypeScript files directly
bun src/server/server.tsx # Start development server
bun dev                   # Start development server (alias)
```

### Testing

```bash
bun test                           # Run all tests
bun test src/parser/parser.test.ts # Run parser tests specifically
bun test --watch                   # Watch mode
```

### Parser Development

```bash
bun generate-parser                # Regenerate parser from grammar
bun test src/parser/parser.test.ts # Test grammar changes
```

### Server

```bash
bun dev # Start playground at http://localhost:3000
```

### Building

No build step required - Bun runs TypeScript directly. Parser auto-regenerates during tests.
## Code Style Preferences

**Early returns over deep nesting**:

```typescript
// ✅ Good
const processToken = (token: Token) => {
  if (!token) return null
  if (token.type !== 'identifier') return null
  return processIdentifier(token)
}

// ❌ Avoid
const processToken = (token: Token) => {
  if (token) {
    if (token.type === 'identifier') {
      return processIdentifier(token)
    }
  }
  return null
}
```

**Arrow functions over function keyword**:

```typescript
// ✅ Good
const parseExpression = (input: string) => {
  // implementation
}

// ❌ Avoid
function parseExpression(input: string) {
  // implementation
}
```

**Code readability over cleverness**:
- Use descriptive variable names
- Write code that explains itself
- Prefer explicit over implicit
- Two simple functions beat one complex function

## Architecture

### Core Components

**parser/** (Lezer-based parsing):
- **shrimp.grammar**: Lezer grammar definition with tokens and rules
- **shrimp.ts**: Auto-generated parser (don't edit directly)
- **tokenizer.ts**: Custom tokenizer for identifier vs word distinction
- **parser.test.ts**: Comprehensive grammar tests using `toMatchTree`

**editor/** (CodeMirror integration):
- Syntax highlighting for Shrimp language
- Language support and autocomplete
- Integration with the parser for real-time feedback

**compiler/** (CST to bytecode):
- Transforms concrete syntax trees into ReefVM bytecode
- Handles function definitions, expressions, and control flow

### Critical Design Decisions

**Whitespace-sensitive parsing**: Spaces distinguish operators from identifiers (`x-1` vs `x - 1`). This enables natural shell-like syntax.
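To make the whitespace rule concrete, here is a small hypothetical sketch (not the actual logic in `src/parser/tokenizer.ts`) of how surrounding spaces can decide whether `-` acts as a binary operator or stays part of a single token:

```typescript
// Hypothetical sketch: a dash is treated as a binary operator only when
// surrounded by spaces, mirroring `x - 1` (subtraction) vs `x-1` (one token).
// This is an illustration, NOT the real Shrimp tokenizer.
const splitsAsOperator = (input: string, dashIndex: number): boolean =>
  input[dashIndex] === '-' &&
  input[dashIndex - 1] === ' ' &&
  input[dashIndex + 1] === ' '

console.log(splitsAsOperator('x - 1', 2)) // true: spaces on both sides
console.log(splitsAsOperator('x-1', 1))   // false: `x-1` stays one token
```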
**Identifier vs Word tokenization**: The custom tokenizer (tokenizer.ts) is sophisticated:
- **Surrogate pair handling**: Processes emoji as full Unicode code points (lines 51-65)
- **Context-aware termination**: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24)
  - This allows `basename ./cool;` to parse correctly
  - But `basename ./cool; 2` treats the semicolon as a terminator
- **GLR state checking**: Uses `stack.canShift(Word)` to decide whether to track identifier validity
- **Permissive Words**: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags)

**Why this matters**: This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths.

**Identifier rules**: Must start with a lowercase letter or emoji; can contain lowercase letters, digits, dashes, and emoji.

**Word rules**: Everything else that isn't whitespace or a delimiter.

**Ambiguous identifier resolution**: Bare identifiers like `myVar` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode.

**How it works**:
- The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152)
- ReefVM checks if the variable is a function at runtime (vm.ts:357-373)
- If it's a function, `TRY_CALL` intentionally falls through to the `CALL` opcode (no break statement)
- If it's not a function or is undefined, it pushes the value/string and returns
- This runtime resolution enables shell-like `echo hello` without quotes

**Unbound symbols become strings**: When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick.

**Expression-oriented design**: Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.
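As a rough illustration, the TRY_CALL/TRY_LOAD resolution described above can be sketched like this (hypothetical `run` helper and `Value` type; the real logic lives in packages/ReefVM/src/vm.ts):

```typescript
// Hypothetical sketch of TRY_CALL / TRY_LOAD resolution, not the real ReefVM.
// TRY_CALL: if the name is bound to a function, behave like CALL;
// otherwise push the bound value, or the bare name itself as a string.
type Value = string | number | ((...args: Value[]) => Value)

const run = (scope: Map<string, Value>, name: string, stack: Value[]): void => {
  const bound = scope.get(name)
  if (typeof bound === 'function') {
    stack.push(bound()) // falls through to CALL behavior
  } else if (bound !== undefined) {
    stack.push(bound) // plain variable reference
  } else {
    stack.push(name) // unbound symbol becomes a string (TRY_LOAD behavior)
  }
}

const scope = new Map<string, Value>([
  ['double', () => 42],
  ['x', 5],
])
const stack: Value[] = []
run(scope, 'double', stack) // bound to a function → called
run(scope, 'x', stack)      // bound to a value → pushed
run(scope, 'hello', stack)  // unbound → pushed as the string 'hello'
console.log(stack)          // [42, 5, 'hello']
```

This is why `echo hello` works without quotes: `echo` resolves to a function and is called, while `hello` is unbound and becomes the string argument.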
**Scope-aware property access (DotGet)**: The parser uses Lezer's `@context` feature to track variable scope at parse time. When it encounters `obj.prop`, it checks if `obj` is in scope:
- **In scope** → parses as `DotGet(Identifier, Identifier)` → compiles to `TRY_LOAD obj; PUSH 'prop'; DOT_GET`
- **Not in scope** → parses as `Word("obj.prop")` → compiles to `PUSH 'obj.prop'` (treated as a file path/string)

Implementation files:
- **src/parser/scopeTracker.ts**: ContextTracker that maintains an immutable scope chain
- **src/parser/tokenizer.ts**: External tokenizer checks `stack.context` to decide if a dot creates a DotGet or a Word
- Scope tracking captures variables from assignments (`x = 5`) and function parameters (`fn x:`)
- See `src/parser/tests/dot-get.test.ts` for comprehensive examples

**Why this matters**: This enables shell-like file paths (`readme.txt`) while supporting dictionary/array access (`config.path`) without quotes, determined entirely at parse time based on lexical scope.

**Array and dict literals**: Square brackets `[]` create both arrays and dicts, distinguished by content:
- **Arrays**: Space/newline/semicolon-separated args that work like calling a function → `[1 2 3]` (call functions using parens, e.g. `[1 (double 4) 200]`)
- **Dicts**: NamedArg syntax (key=value pairs) → `[a=1 b=2]`
- **Empty array**: `[]` (standard empty brackets)
- **Empty dict**: `[=]` (exactly this, no spaces)

Implementation details:
- Grammar rules (shrimp.grammar:194-201): Dict uses `NamedArg` nodes, Array uses `expression` nodes
- The parser distinguishes at parse time based on whether the first element contains `=`
- Both support multiline layout, comments, and nesting
- Separators: spaces, newlines (`\n`), or semicolons (`;`) work interchangeably
- Test files: `src/parser/tests/literals.test.ts` and `src/compiler/tests/literals.test.ts`

**EOF handling**: The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.
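The array-vs-dict decision can be sketched as a tiny classifier (hypothetical `classifyBrackets` helper operating on already-split elements; the real distinction is made by the grammar, not by string inspection):

```typescript
// Hypothetical sketch of the bracket-literal distinction, for illustration only.
// The parser decides based on whether the first element is a NamedArg (key=value).
const classifyBrackets = (elements: string[]): 'array' | 'dict' => {
  if (elements.length === 0) return 'array' // [] is an empty array
  // [=] is an empty dict; [a=1 ...] is a dict: first element contains '='
  return elements[0].includes('=') ? 'dict' : 'array'
}

console.log(classifyBrackets(['1', '2', '3'])) // 'array'  ← [1 2 3]
console.log(classifyBrackets(['a=1', 'b=2']))  // 'dict'   ← [a=1 b=2]
console.log(classifyBrackets([]))              // 'array'  ← []
console.log(classifyBrackets(['=']))           // 'dict'   ← [=]
```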
## Compiler Architecture

**Function compilation strategy**: Functions are compiled inline where they're defined, with JUMP instructions to skip over their bodies during linear execution:

```
JUMP .after_.func_0        # Skip over body during definition
.func_0:                   # Function body label
  (function body code)
  RETURN
.after_.func_0:            # Resume here after jump
MAKE_FUNCTION (x) .func_0  # Create function object with label
```

This approach:
- Emits function bodies inline (no deferred collection)
- Uses JUMP to skip bodies during normal execution flow
- Each function is self-contained at its definition site
- Works seamlessly in REPL mode (important for `vm.appendBytecode()`)
- Allows ReefVM to jump to function bodies by label when called

**Short-circuit logic**: ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using:

```
# For `a and b`:
LOAD a
DUP                 # Duplicate so we can return it if falsy
JUMP_IF_FALSE skip  # If false, skip evaluating b
POP                 # Remove duplicate if we're continuing
LOAD b              # Evaluate right side
skip:
```

See compiler.ts:267-282 for the full implementation. The `or` operator uses `JUMP_IF_TRUE` instead.

**If/else compilation**: The compiler uses label-based jumps:
- `JUMP_IF_FALSE` skips the then-block when the condition is false
- Each branch ends with `JUMP endLabel` to skip remaining branches
- The final label marks where all branches converge
- If there's no else branch, the compiler emits `PUSH null` as the default value

## Grammar Development

### Grammar Structure

The grammar follows this hierarchy:

```
Program → statement*
statement → line newlineOrSemicolon | line eof
line → FunctionCall | FunctionCallOrIdentifier | FunctionDef | Assign | expression
```

Key tokens:
- `newlineOrSemicolon`: `"\n" | ";"`
- `eof`: `@eof`
- `Identifier`: Lowercase/emoji start, assignable variables
- `Word`: Everything else (paths, URLs, etc.)

### Adding Grammar Rules

When modifying the grammar:

1. **Update `src/parser/shrimp.grammar`** with your changes
2. **Run tests** - the parser auto-regenerates during test runs
3. **Add test cases** in `src/parser/parser.test.ts` using `toMatchTree`
4. **Test empty line handling** - ensure EOF works properly

### Test Format

Grammar tests use this pattern:

```typescript
test('function call with args', () => {
  expect('echo hello world').toMatchTree(`
    FunctionCall
      Identifier echo
      PositionalArg
        Word hello
      PositionalArg
        Word world
  `)
})
```

The `toMatchTree` helper compares parser output with the expected CST structure.

### Common Grammar Gotchas

**EOF infinite loops**: Using `@eof` in repeating patterns can match EOF multiple times. The current approach uses explicit statement/newline alternatives.

**Token precedence**: Use `@precedence` to resolve conflicts between similar tokens.

**External tokenizers**: Custom logic in `tokenizer.ts` handles complex cases like the identifier vs word distinction.

**Empty line parsing**: The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling.

## Lezer: Surprising Behaviors

These discoveries came from implementing string interpolation with external tokenizers. See `tmp/string-test4.grammar` for working examples.

### 1. Rule Capitalization Controls Tree Structure

**The most surprising discovery**: Rule names determine whether nodes appear in the parse tree.

**Lowercase rules get inlined** (no tree nodes):

```lezer
statement { assign | expr }  // ❌ No "statement" node
assign { x "=" y }           // ❌ No "assign" node
expr { x | y }               // ❌ No "expr" node
```

**Capitalized rules create tree nodes**:

```lezer
Statement { Assign | Expr }  // ✅ Creates Statement node
Assign { x "=" y }           // ✅ Creates Assign node
Expr { x | y }               // ✅ Creates Expr node
```

**Why this matters**: When debugging a grammar that "doesn't match," check capitalization first. The rules might be matching perfectly—they're just being compiled away!
Example: `x = 42` was parsing as `Program(Identifier,"=",Number)` instead of `Program(Statement(Assign(...)))`. The grammar rules existed and were matching, but they were inlined because they were lowercase.

### 2. @skip {} Wrapper is Essential for Preserving Whitespace

**Initial assumption (wrong)**: You could exclude whitespace from token patterns to avoid needing `@skip {}`.

**Reality**: The `@skip {}` wrapper is absolutely required to preserve whitespace in strings:

```lezer
@skip {} {
  String { "'" StringContent* "'" }
}

@tokens {
  StringFragment { !['\\$]+ }  // Matches everything including spaces
}
```

**Without the wrapper**: All spaces get stripped by the global `@skip { space }`, even though `StringFragment` can match them.

**Test that proved it wrong**: `' spaces '` was being parsed as `"spaces"` (leading/trailing spaces removed) until we added `@skip {}`.

### 3. External Tokenizers Work Inside @skip {} Blocks

**Initial assumption (wrong)**: External tokenizers can't be used inside `@skip {}` blocks, so identifier patterns need to be duplicated as simple tokens.

**Reality**: External tokenizers work perfectly inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.

**Working pattern**:

```lezer
@external tokens tokenizer from "./tokenizer" { Identifier, Word }

@skip {} {
  String { "'" StringContent* "'" }
}

Interpolation {
  "$" Identifier |  // ← Uses external tokenizer!
  "$" "(" expr ")"
}
```

**Test that proved it**: `'hello $name'` correctly calls the external tokenizer for `name` inside the string, creating an `Identifier` token. No duplication needed!

### 4. Single-Character Tokens Can Be Literals

**Initial approach**: Define every single character as a token:

```lezer
@tokens {
  dollar[@name="$"] { "$" }
  backslash[@name="\\"] { "\\" }
}
```

**Simpler approach**: Just use literals in the grammar rules:

```lezer
Interpolation {
  "$" Identifier |  // Literal "$"
  "$" "(" expr ")"
}

EscapeSeq {
  "\\" ("$" | "n" | ...)  // Literal "\\"
}
```

This works fine and reduces boilerplate in the `@tokens` section.

### 5. StringFragment as Simple Token, Not External

For string content, use a simple token pattern instead of handling it in the external tokenizer:

```lezer
@tokens {
  StringFragment { !['\\$]+ }  // Simple pattern: not quote, backslash, or dollar
}
```

The external tokenizer should focus on the Identifier/Word distinction at the top level. String content is simpler and doesn't need the complexity of the external tokenizer.

### Why expressionWithoutIdentifier Exists

The grammar has an unusual pattern: `expressionWithoutIdentifier`. It exists to solve a GLR conflict:

```
consumeToTerminator {
  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
  expression               // → Identifier
}
```

Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains we "gave up trying to use GLR to fix it."

**The solution**: Remove Identifier from the `expression` path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This is pragmatic over theoretical purity.

## Testing Strategy

### Parser Tests (`src/parser/parser.test.ts`)

- **Token types**: Identifier vs Word distinction
- **Function calls**: With and without arguments
- **Expressions**: Binary operations, parentheses, precedence
- **Functions**: Single-line and multiline definitions
- **Whitespace**: Empty lines, mixed delimiters
- **Edge cases**: Ambiguous parsing, incomplete input

Test structure:

```typescript
describe('feature area', () => {
  test('specific case', () => {
    expect(input).toMatchTree(expectedCST)
  })
})
```

When adding language features:

1. Write grammar tests first showing the expected CST structure
2. Update grammar rules to make tests pass
3. Add integration tests showing real usage
4. Test edge cases and error conditions

## Bun Usage

Default to Bun over Node.js/npm:
- Use `bun` instead of `node` or `ts-node`
- Use `bun test` instead of `jest` or `vitest`
- Use `bun install` instead of `npm install`
- Use `bun run