# CLAUDE.md This file provides guidance to Claude Code (claude.ai/code) when working with the Shrimp programming language. ## Pair Programming Approach Act as a pair programming partner and teacher, not an autonomous code writer: **Research and guide, don't implement**: - Focus on research, analysis, and finding solutions - Explain concepts, trade-offs, and best practices - Guide the human through changes rather than making them directly - Help them learn the codebase deeply by maintaining ownership **Use tmp/ directory for experimentation**: - Create temporary files in `tmp/` to test ideas out experiments you want to run. - Example: `tmp/eof-test.grammar`, `tmp/pattern-experiments.ts` - Clean up tmp files when done - Show multiple approaches so the human can choose **Teaching moments**: - Explain the "why" behind solutions - Point out potential pitfalls and edge cases - Share relevant documentation and examples - Help build understanding, not just solve problems ## Project Overview Shrimp is a shell-like scripting language that combines command-line simplicity with functional programming. The architecture flows: Shrimp source → parser (CST) → compiler (bytecode) → ReefVM (execution). **Essential reading**: Before making changes, read README.md to understand the language design philosophy and parser architecture. Key references: [Lezer System Guide](https://lezer.codemirror.net/docs/guide/) | [Lezer API](https://lezer.codemirror.net/docs/ref/) ## Reading the Codebase: What to Look For When exploring Shrimp, focus on these key files in order: 1. **src/parser/shrimp.grammar** - Language syntax rules - Note the `expressionWithoutIdentifier` pattern and its comment - See how `consumeToTerminator` handles statement-level parsing 2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined - Check the emoji Unicode ranges and surrogate pair handling - See context-aware termination logic (`;`, `)`, `:`) 3. **src/compiler/compiler.ts** - CST to bytecode transformation - See how functions become labels in `fnLabels` map - Check short-circuit logic for `and`/`or` (lines 267-282) - Notice `TRY_CALL` emission for bare identifiers (line 152) 4. **packages/ReefVM/src/vm.ts** - Bytecode execution - See `TRY_CALL` fall-through to `CALL` (lines 357-375) - Check `TRY_LOAD` string coercion (lines 135-145) - Notice NOSE-style named parameter binding (lines 425-443) ## Development Commands ### Running Files ```bash bun # Run TypeScript files directly bun src/server/server.tsx # Start development server bun dev # Start development server (alias) ``` ### Testing ```bash bun test # Run all tests bun test src/parser/parser.test.ts # Run parser tests specifically bun test --watch # Watch mode ``` ### Parser Development ```bash bun generate-parser # Regenerate parser from grammar bun test src/parser/parser.test.ts # Test grammar changes ``` ### Server ```bash bun dev # Start playground at http://localhost:3000 ``` ### Building No build step required - Bun runs TypeScript directly. Parser auto-regenerates during tests. ## Code Style Preferences **Early returns over deep nesting**: ```typescript // ✅ Good const processToken = (token: Token) => { if (!token) return null if (token.type !== 'identifier') return null return processIdentifier(token) } // ❌ Avoid const processToken = (token: Token) => { if (token) { if (token.type === 'identifier') { return processIdentifier(token) } } return null } ``` **Arrow functions over function keyword**: ```typescript // ✅ Good const parseExpression = (input: string) => { // implementation } // ❌ Avoid function parseExpression(input: string) { // implementation } ``` **Code readability over cleverness**: - Use descriptive variable names - Write code that explains itself - Prefer explicit over implicit - Two simple functions beat one complex function ## Architecture ### Core Components **parser/** (Lezer-based parsing): - **shrimp.grammar**: Lezer grammar definition with tokens and rules - **shrimp.ts**: Auto-generated parser (don't edit directly) - **tokenizer.ts**: Custom tokenizer for identifier vs word distinction - **parser.test.ts**: Comprehensive grammar tests using `toMatchTree` **editor/** (CodeMirror integration): - Syntax highlighting for Shrimp language - Language support and autocomplete - Integration with the parser for real-time feedback **compiler/** (CST to bytecode): - Transforms concrete syntax trees into ReefVM bytecode - Handles function definitions, expressions, and control flow ### Critical Design Decisions **Whitespace-sensitive parsing**: Spaces distinguish operators from identifiers (`x-1` vs `x - 1`). This enables natural shell-like syntax. **Identifier vs Word tokenization**: The custom tokenizer (tokenizer.ts) is sophisticated: - **Surrogate pair handling**: Processes emoji as full Unicode code points (lines 51-65) - **Context-aware termination**: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24) - This allows `basename ./cool;` to parse correctly - But `basename ./cool; 2` treats the semicolon as a terminator - **GLR state checking**: Uses `stack.canShift(Word)` to decide whether to track identifier validity - **Permissive Words**: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags) **Why this matters**: This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths. **Identifier rules**: Must start with lowercase letter or emoji, can contain lowercase, digits, dashes, and emoji. **Word rules**: Everything else that isn't whitespace or a delimiter. **Ambiguous identifier resolution**: Bare identifiers like `myVar` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode. **How it works**: - The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152) - ReefVM checks if the variable is a function at runtime (vm.ts:357-373) - If it's a function, TRY_CALL intentionally falls through to CALL opcode (no break statement) - If it's not a function or undefined, it pushes the value/string and returns - This runtime resolution enables shell-like "echo hello" without quotes **Unbound symbols become strings**: When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick. **Expression-oriented design**: Everything returns a value - commands, assignments, functions. This enables composition and functional patterns. **EOF handling**: The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops. ## Compiler Architecture **Function compilation strategy**: The compiler doesn't create inline function objects. Instead it: 1. Generates unique labels (`.func_0`, `.func_1`) for each function body (compiler.ts:137) 2. Stores function body instructions in `fnLabels` map during compilation 3. Appends all function bodies to the end of bytecode with RETURN instructions (compiler.ts:36-41) 4. Emits `MAKE_FUNCTION` with parameters and label reference This approach keeps the main program linear and allows ReefVM to jump to function bodies by label. **Short-circuit logic**: ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using: ```typescript // For `a and b`: LOAD a DUP // Duplicate so we can return it if falsy JUMP_IF_FALSE skip // If false, skip evaluating b POP // Remove duplicate if we're continuing LOAD b // Evaluate right side skip: ``` See compiler.ts:267-282 for the full implementation. The `or` operator uses `JUMP_IF_TRUE` instead. **If/else compilation**: The compiler uses label-based jumps: - `JUMP_IF_FALSE` skips the then-block when condition is false - Each branch ends with `JUMP endLabel` to skip remaining branches - The final label marks where all branches converge - If there's no else branch, compiler emits `PUSH null` as the default value ## Grammar Development ### Grammar Structure The grammar follows this hierarchy: ``` Program → statement* statement → line newlineOrSemicolon | line eof line → FunctionCall | FunctionCallOrIdentifier | FunctionDef | Assign | expression ``` Key tokens: - `newlineOrSemicolon`: `"\n" | ";"` - `eof`: `@eof` - `Identifier`: Lowercase/emoji start, assignable variables - `Word`: Everything else (paths, URLs, etc.) ### Adding Grammar Rules When modifying the grammar: 1. **Update `src/parser/shrimp.grammar`** with your changes 2. **Run tests** - the parser auto-regenerates during test runs 3. **Add test cases** in `src/parser/parser.test.ts` using `toMatchTree` 4. **Test empty line handling** - ensure EOF works properly ### Test Format Grammar tests use this pattern: ```typescript test('function call with args', () => { expect('echo hello world').toMatchTree(` FunctionCall Identifier echo PositionalArg Word hello PositionalArg Word world `) }) ``` The `toMatchTree` helper compares parser output with expected CST structure. ### Common Grammar Gotchas **EOF infinite loops**: Using `@eof` in repeating patterns can match EOF multiple times. Current approach uses explicit statement/newline alternatives. **Token precedence**: Use `@precedence` to resolve conflicts between similar tokens. **External tokenizers**: Custom logic in `tokenizers.ts` handles complex cases like identifier vs word distinction. **Empty line parsing**: The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling. ### Why expressionWithoutIdentifier Exists The grammar has an unusual pattern: `expressionWithoutIdentifier`. This exists to solve a GLR conflict: ``` consumeToTerminator { ambiguousFunctionCall | // → FunctionCallOrIdentifier → Identifier expression // → Identifier } ``` Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains we "gave up trying to use GLR to fix it." **The solution**: Remove Identifier from the `expression` path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This is pragmatic over theoretical purity. ## Testing Strategy ### Parser Tests (`src/parser/parser.test.ts`) - **Token types**: Identifier vs Word distinction - **Function calls**: With and without arguments - **Expressions**: Binary operations, parentheses, precedence - **Functions**: Single-line and multiline definitions - **Whitespace**: Empty lines, mixed delimiters - **Edge cases**: Ambiguous parsing, incomplete input Test structure: ```typescript describe('feature area', () => { test('specific case', () => { expect(input).toMatchTree(expectedCST) }) }) ``` When adding language features: 1. Write grammar tests first showing expected CST structure 2. Update grammar rules to make tests pass 3. Add integration tests showing real usage 4. Test edge cases and error conditions ## Bun Usage Default to Bun over Node.js/npm: - Use `bun ` instead of `node ` or `ts-node ` - Use `bun test` instead of `jest` or `vitest` - Use `bun install` instead of `npm install` - Use `bun run