# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the Shrimp programming language.
## Pair Programming Approach
Act as a pair programming partner and teacher, not an autonomous code writer:
**Research and guide, don't implement**:
- Focus on research, analysis, and finding solutions
- Explain concepts, trade-offs, and best practices
- Guide the human through changes rather than making them directly
- Help them learn the codebase deeply by maintaining ownership
**Use tmp/ directory for experimentation**:
- Create temporary files in `tmp/` to test ideas or run experiments
- Example: `tmp/eof-test.grammar`, `tmp/pattern-experiments.ts`
- Clean up tmp files when done
- Show multiple approaches so the human can choose
**Teaching moments**:
- Explain the "why" behind solutions
- Point out potential pitfalls and edge cases
- Share relevant documentation and examples
- Help build understanding, not just solve problems
## Project Overview
Shrimp is a shell-like scripting language that combines command-line simplicity with functional programming. The architecture flows: Shrimp source → parser (CST) → compiler (bytecode) → ReefVM (execution).
**Essential reading**: Before making changes, read README.md to understand the language design philosophy and parser architecture.
Key references: [Lezer System Guide](https://lezer.codemirror.net/docs/guide/) | [Lezer API](https://lezer.codemirror.net/docs/ref/)
## Reading the Codebase: What to Look For
When exploring Shrimp, focus on these key files in order:
1. **src/parser/shrimp.grammar** - Language syntax rules
- Note the `expressionWithoutIdentifier` pattern and its comment
- See how `consumeToTerminator` handles statement-level parsing
2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined
- Check the emoji Unicode ranges and surrogate pair handling
- See context-aware termination logic (`;`, `)`, `:`)
3. **src/compiler/compiler.ts** - CST to bytecode transformation
- See how functions emit inline with JUMP wrappers
- Check short-circuit logic for `and`/`or`
- Notice `TRY_CALL` emission for bare identifiers
4. **packages/ReefVM/src/vm.ts** - Bytecode execution
- See `TRY_CALL` fall-through to `CALL` (lines 357-375)
- Check `TRY_LOAD` string coercion (lines 135-145)
- Notice NOSE-style named parameter binding (lines 425-443)
## Development Commands
### Running Files
```bash
bun <file> # Run TypeScript files directly
bun src/server/server.tsx # Start development server
bun dev # Start development server (alias)
```
### Testing
```bash
bun test # Run all tests
bun test src/parser/parser.test.ts # Run parser tests specifically
bun test --watch # Watch mode
```
### Parser Development
```bash
bun generate-parser # Regenerate parser from grammar
bun test src/parser/parser.test.ts # Test grammar changes
```
### Server
```bash
bun dev # Start playground at http://localhost:3000
```
### Building
No build step required - Bun runs TypeScript directly. Parser auto-regenerates during tests.
## Code Style Preferences
**Early returns over deep nesting**:
```typescript
// ✅ Good
const processToken = (token: Token) => {
  if (!token) return null
  if (token.type !== 'identifier') return null
  return processIdentifier(token)
}

// ❌ Avoid
const processToken = (token: Token) => {
  if (token) {
    if (token.type === 'identifier') {
      return processIdentifier(token)
    }
  }
  return null
}
```
**Arrow functions over function keyword**:
```typescript
// ✅ Good
const parseExpression = (input: string) => {
  // implementation
}

// ❌ Avoid
function parseExpression(input: string) {
  // implementation
}
```
**Code readability over cleverness**:
- Use descriptive variable names
- Write code that explains itself
- Prefer explicit over implicit
- Two simple functions beat one complex function
## Architecture
### Core Components
**parser/** (Lezer-based parsing):
- **shrimp.grammar**: Lezer grammar definition with tokens and rules
- **shrimp.ts**: Auto-generated parser (don't edit directly)
- **tokenizer.ts**: Custom tokenizer for identifier vs word distinction
- **parser.test.ts**: Comprehensive grammar tests using `toMatchTree`
**editor/** (CodeMirror integration):
- Syntax highlighting for Shrimp language
- Language support and autocomplete
- Integration with the parser for real-time feedback
**compiler/** (CST to bytecode):
- Transforms concrete syntax trees into ReefVM bytecode
- Handles function definitions, expressions, and control flow
### Critical Design Decisions
**Whitespace-sensitive parsing**: Spaces distinguish operators from identifiers (`x-1` vs `x - 1`). This enables natural shell-like syntax.
**Identifier vs Word tokenization**: The custom tokenizer (tokenizer.ts) is sophisticated:
- **Surrogate pair handling**: Processes emoji as full Unicode code points (lines 51-65)
- **Context-aware termination**: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24; see the sketch below)
- This allows `basename ./cool;` to parse correctly
- But `basename ./cool; 2` treats the semicolon as a terminator
- **GLR state checking**: Uses `stack.canShift(Word)` to decide whether to track identifier validity
- **Permissive Words**: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags)
**Why this matters**: This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths.
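A minimal sketch of that termination rule (hypothetical helper, not the actual tokenizer.ts API, which also consults GLR state and handles emoji):

```typescript
// Encodes just the rule above: `;`, `)`, `:` end a token
// only when the next character is whitespace.
const terminators = new Set([';', ')', ':'])

const endsToken = (ch: string, next: string | undefined) =>
  terminators.has(ch) && next !== undefined && /\s/.test(next)

endsToken(';', ' ') // → true:  in `basename ./cool; 2` the `;` terminates
endsToken(';', 'x') // → false: `./cool;x` would stay one Word (hypothetical example)
```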
**Identifier rules**: Must start with lowercase letter or emoji, can contain lowercase, digits, dashes, and emoji.
**Word rules**: Everything else that isn't whitespace or a delimiter.
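For quick intuition, a rough regex approximation of these rules (illustrative only — the real check lives in tokenizer.ts and also accepts emoji):

```typescript
// Approximates the Identifier rule above; everything that fails it
// (and isn't whitespace or a delimiter) falls through to Word.
const looksLikeIdentifier = (s: string) => /^[a-z][a-z0-9-]*$/.test(s)

looksLikeIdentifier('my-var2')    // → true  (Identifier)
looksLikeIdentifier('./file.txt') // → false (Word: a path)
looksLikeIdentifier('MyVar')      // → false (Word: uppercase start)
```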
**Ambiguous identifier resolution**: Bare identifiers like `myVar` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode.
**How it works**:
- The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152)
- ReefVM checks if the variable is a function at runtime (vm.ts:357-373)
- If it's a function, TRY_CALL intentionally falls through to CALL opcode (no break statement)
- If it's not a function or undefined, it pushes the value/string and returns
- This runtime resolution enables shell-like "echo hello" without quotes
**Unbound symbols become strings**: When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick.
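A minimal TypeScript sketch of both opcodes' behavior (hypothetical types and helpers, not the actual vm.ts internals):

```typescript
type Value = unknown
type ShrimpFn = (...args: Value[]) => Value

const isFunction = (v: Value): v is ShrimpFn => typeof v === 'function'

// TRY_LOAD: an undefined variable resolves to its own name as a string
const tryLoad = (scope: Map<string, Value>, name: string): Value =>
  scope.has(name) ? scope.get(name) : name

// TRY_CALL: call if the value is a function, otherwise behave like TRY_LOAD
const tryCall = (scope: Map<string, Value>, name: string, stack: Value[]) => {
  const value = tryLoad(scope, name)
  if (isFunction(value)) {
    stack.push(value()) // the real VM falls through to its CALL opcode here
    return
  }
  stack.push(value) // non-function value, or the bare name as a string
}
```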
**Expression-oriented design**: Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.
**Scope-aware property access (DotGet)**: The parser uses Lezer's `@context` feature to track variable scope at parse time. When it encounters `obj.prop`, it checks if `obj` is in scope:
- **In scope** → Parses as `DotGet(Identifier, Identifier)` → compiles to `TRY_LOAD obj; PUSH 'prop'; DOT_GET`
- **Not in scope** → Parses as `Word("obj.prop")` → compiles to `PUSH 'obj.prop'` (treated as file path/string)
Implementation files:
- **src/parser/scopeTracker.ts**: ContextTracker that maintains immutable scope chain
- **src/parser/tokenizer.ts**: External tokenizer checks `stack.context` to decide if dot creates DotGet or Word
- Scope tracking: Captures variables from assignments (`x = 5`) and function parameters (`fn x:`)
- See `src/parser/tests/dot-get.test.ts` for comprehensive examples
**Why this matters**: This enables shell-like file paths (`readme.txt`) while supporting dictionary/array access (`config.path`) without quotes, determined entirely at parse time based on lexical scope.
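Conceptually (a hypothetical helper, not the actual scopeTracker.ts API):

```typescript
// Parse-time decision for a dotted token, per the rules above.
// `scope` stands in for the immutable scope chain from scopeTracker.ts.
const classifyDotted = (text: string, scope: ReadonlySet<string>) => {
  const head = text.split('.')[0]!
  return scope.has(head) ? 'DotGet' : 'Word'
}

classifyDotted('config.path', new Set(['config'])) // → 'DotGet'
classifyDotted('readme.txt', new Set(['config']))  // → 'Word' (file path)
```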
**Array and dict literals**: Square brackets `[]` create both arrays and dicts, distinguished by content:
- **Arrays**: Space/newline/semicolon-separated elements that read like arguments to a function call → `[1 2 3]` (use parens to call functions inside, e.g. `[1 (double 4) 200]`)
- **Dicts**: NamedArg syntax (key=value pairs) → `[a=1 b=2]`
- **Empty array**: `[]` (standard empty brackets)
- **Empty dict**: `[=]` (exactly this, no spaces)
Implementation details:
- Grammar rules (shrimp.grammar:194-201): Dict uses `NamedArg` nodes, Array uses `expression` nodes
- Parser distinguishes at parse time based on whether first element contains `=`
- Both support multiline, comments, and nesting
- Separators: spaces, newlines (`\n`), or semicolons (`;`) work interchangeably
- Test files: `src/parser/tests/literals.test.ts` and `src/compiler/tests/literals.test.ts`
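A sketch of that distinction (hypothetical helper, not the actual parser — it ignores nesting like `(double 4)`):

```typescript
// Brackets yield a Dict when the first element is a key=value pair,
// an Array otherwise; `[]` and `[=]` are the empty cases.
const bracketKind = (contents: string) => {
  const trimmed = contents.trim()
  if (trimmed === '') return 'Array' // []
  if (trimmed === '=') return 'Dict' // [=]
  const first = trimmed.split(/[\s;]+/)[0]!
  return first.includes('=') ? 'Dict' : 'Array'
}

bracketKind('1 2 3')   // → 'Array'
bracketKind('a=1 b=2') // → 'Dict'
```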
**EOF handling**: The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.
## Compiler Architecture
**Function compilation strategy**: Functions are compiled inline where they're defined, with JUMP instructions to skip over their bodies during linear execution:
```
  JUMP .after_.func_0         # Skip over body during definition
.func_0:                      # Function body label
  (function body code)
  RETURN
.after_.func_0:               # Resume here after jump
  MAKE_FUNCTION (x) .func_0   # Create function object with label
```
This approach:
- Emits function bodies inline (no deferred collection)
- Uses JUMP to skip bodies during normal execution flow
- Each function is self-contained at its definition site
- Works seamlessly in REPL mode (important for `vm.appendBytecode()`)
- Allows ReefVM to jump to function bodies by label when called
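In emitter terms, the order looks roughly like this (`emit` and `compileBody` are assumed stand-ins, not the actual compiler.ts API):

```typescript
// Sketch of the inline-emission strategy described above.
const compileFunctionDef = (
  emit: (op: string, arg?: string) => void,
  compileBody: () => void,
  id: number,
) => {
  const bodyLabel = `.func_${id}`
  const afterLabel = `.after_${bodyLabel}`
  emit('JUMP', afterLabel)         // skip the body during linear execution
  emit('LABEL', bodyLabel)         // function body starts here
  compileBody()
  emit('RETURN')
  emit('LABEL', afterLabel)        // execution resumes here after the jump
  emit('MAKE_FUNCTION', bodyLabel) // function object references the body by label
}
```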
**Short-circuit logic**: ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using:
```
# For `a and b`:
  LOAD a
  DUP                 # Duplicate so we can return it if falsy
  JUMP_IF_FALSE skip  # If false, skip evaluating b
  POP                 # Remove duplicate if we're continuing
  LOAD b              # Evaluate right side
skip:
```
See compiler.ts:267-282 for the full implementation. The `or` operator uses `JUMP_IF_TRUE` instead.
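Applying the same pattern, the `or` case would be (inferred from the description above, not copied from compiler.ts):

```
# For `a or b`:
  LOAD a
  DUP                 # Duplicate so we can return it if truthy
  JUMP_IF_TRUE skip   # If true, skip evaluating b
  POP                 # Remove duplicate if we're continuing
  LOAD b              # Evaluate right side
skip:
```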
**If/else compilation**: The compiler uses label-based jumps:
- `JUMP_IF_FALSE` skips the then-block when condition is false
- Each branch ends with `JUMP endLabel` to skip remaining branches
- The final label marks where all branches converge
- If there's no else branch, compiler emits `PUSH null` as the default value
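Putting those rules together, a two-branch `if`/`else` lays out roughly like this (illustrative, not actual compiler output):

```
  LOAD cond
  JUMP_IF_FALSE .else_0  # Skip then-block when condition is false
  (then-block code)
  JUMP .end_0            # Skip remaining branches
.else_0:
  (else-block code)      # Or `PUSH null` when there is no else branch
.end_0:                  # All branches converge here
```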
## Grammar Development
### Grammar Structure
The grammar follows this hierarchy:
```
Program → statement*
statement → line newlineOrSemicolon | line eof
line → FunctionCall | FunctionCallOrIdentifier | FunctionDef | Assign | expression
```
Key tokens:
- `newlineOrSemicolon`: `"\n" | ";"`
- `eof`: `@eof`
- `Identifier`: Lowercase/emoji start, assignable variables
- `Word`: Everything else (paths, URLs, etc.)
### Adding Grammar Rules
When modifying the grammar:
1. **Update `src/parser/shrimp.grammar`** with your changes
2. **Run tests** - the parser auto-regenerates during test runs
3. **Add test cases** in `src/parser/parser.test.ts` using `toMatchTree`
4. **Test empty line handling** - ensure EOF works properly
### Test Format
Grammar tests use this pattern:
```typescript
test('function call with args', () => {
  expect('echo hello world').toMatchTree(`
    FunctionCall
      Identifier echo
      PositionalArg
        Word hello
      PositionalArg
        Word world
  `)
})
```
The `toMatchTree` helper compares parser output with expected CST structure.
### Common Grammar Gotchas
**EOF infinite loops**: Using `@eof` in repeating patterns can match EOF multiple times. Current approach uses explicit statement/newline alternatives.
**Token precedence**: Use `@precedence` to resolve conflicts between similar tokens.
**External tokenizers**: Custom logic in `tokenizer.ts` handles complex cases like identifier vs word distinction.
**Empty line parsing**: The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling.
## Lezer: Surprising Behaviors
These discoveries came from implementing string interpolation with external tokenizers. See `tmp/string-test4.grammar` for working examples.
### 1. Rule Capitalization Controls Tree Structure
**The most surprising discovery**: Rule names determine whether nodes appear in the parse tree.
**Lowercase rules get inlined** (no tree nodes):
```lezer
statement { assign | expr } // ❌ No "statement" node
assign { x "=" y } // ❌ No "assign" node
expr { x | y } // ❌ No "expr" node
```
**Capitalized rules create tree nodes**:
```lezer
Statement { Assign | Expr } // ✅ Creates Statement node
Assign { x "=" y } // ✅ Creates Assign node
Expr { x | y } // ✅ Creates Expr node
```
**Why this matters**: When debugging grammar that "doesn't match," check capitalization first. The rules might be matching perfectly—they're just being compiled away!
Example: `x = 42` was parsing as `Program(Identifier,"=",Number)` instead of `Program(Statement(Assign(...)))`. The grammar rules existed and were matching, but they were inlined because they were lowercase.
### 2. @skip {} Wrapper is Essential for Preserving Whitespace
**Initial assumption (wrong)**: Could exclude whitespace from token patterns to avoid needing `@skip {}`.
**Reality**: The `@skip {}` wrapper is absolutely required to preserve whitespace in strings:
```lezer
@skip {} {
  String { "'" StringContent* "'" }
}

@tokens {
  StringFragment { !['\\$]+ }  // Matches everything including spaces
}
```
**Without the wrapper**: All spaces get stripped by the global `@skip { space }`, even though `StringFragment` can match them.
**Test that proved it wrong**: `' spaces '` was being parsed as `"spaces"` (leading/trailing spaces removed) until we added `@skip {}`.
### 3. External Tokenizers Work Inside @skip {} Blocks
**Initial assumption (wrong)**: External tokenizers can't be used inside `@skip {}` blocks, so identifier patterns need to be duplicated as simple tokens.
**Reality**: External tokenizers work perfectly inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.
**Working pattern**:
```lezer
@external tokens tokenizer from "./tokenizer" { Identifier, Word }

@skip {} {
  String { "'" StringContent* "'" }
}

Interpolation {
  "$" Identifier |   // ← Uses external tokenizer!
  "$" "(" expr ")"
}
```
**Test that proved it**: `'hello $name'` correctly calls the external tokenizer for `name` inside the string, creating an `Identifier` token. No duplication needed!
### 4. Single-Character Tokens Can Be Literals
**Initial approach**: Define every single character as a token:
```lezer
@tokens {
  dollar[@name="$"] { "$" }
  backslash[@name="\\"] { "\\" }
}
```
**Simpler approach**: Just use literals in the grammar rules:
```lezer
Interpolation {
  "$" Identifier |        // Literal "$"
  "$" "(" expr ")"
}

EscapeSeq {
  "\\" ("$" | "n" | ...)  // Literal "\\"
}
```
This works fine and reduces boilerplate in the @tokens section.
### 5. StringFragment as Simple Token, Not External
For string content, use a simple token pattern instead of handling it in the external tokenizer:
```lezer
@tokens {
  StringFragment { !['\\$]+ }  // Simple pattern: not quote, backslash, or dollar
}
```
The external tokenizer should focus on Identifier/Word distinction at the top level. String content is simpler and doesn't need the complexity of the external tokenizer.
### Why expressionWithoutIdentifier Exists
The grammar has an unusual pattern: `expressionWithoutIdentifier`. This exists to solve a GLR conflict:
```
consumeToTerminator {
  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
  expression               // → Identifier
}
```
Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains we "gave up trying to use GLR to fix it."
**The solution**: Remove Identifier from the `expression` path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This is pragmatic over theoretical purity.
## Testing Strategy
### Parser Tests (`src/parser/parser.test.ts`)
- **Token types**: Identifier vs Word distinction
- **Function calls**: With and without arguments
- **Expressions**: Binary operations, parentheses, precedence
- **Functions**: Single-line and multiline definitions
- **Whitespace**: Empty lines, mixed delimiters
- **Edge cases**: Ambiguous parsing, incomplete input
Test structure:
```typescript
describe('feature area', () => {
  test('specific case', () => {
    expect(input).toMatchTree(expectedCST)
  })
})
```
When adding language features:
1. Write grammar tests first showing expected CST structure
2. Update grammar rules to make tests pass
3. Add integration tests showing real usage
4. Test edge cases and error conditions
## Bun Usage
Default to Bun over Node.js/npm:
- Use `bun <file>` instead of `node <file>` or `ts-node <file>`
- Use `bun test` instead of `jest` or `vitest`
- Use `bun install` instead of `npm install`
- Use `bun run <script>` instead of `npm run <script>`
- Bun automatically loads .env, so don't use dotenv
### Bun APIs
- Prefer `Bun.file` over `node:fs`'s readFile/writeFile
- Use `Bun.$` for shell commands instead of execa
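For example (file path is illustrative):

```typescript
import { $ } from 'bun'

// Bun.file instead of node:fs readFile
const grammar = await Bun.file('src/parser/shrimp.grammar').text()

// Bun.$ instead of execa for shell commands
const testOutput = await $`bun test`.text()
```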
## Common Patterns
### Grammar Debugging
When grammar isn't parsing correctly:
1. **Check token precedence** - ensure tokens are recognized correctly
2. **Test simpler cases first** - build up from basic to complex
3. **Use `toMatchTree` output** - see what the parser actually produces
4. **Check external tokenizer** - identifier vs word logic in `tokenizer.ts`
## Common Misconceptions
**"The parser handles unbound symbols as strings"** → False. The _VM_ does this via `TRY_LOAD` opcode. The parser creates `FunctionCallOrIdentifier` nodes; the compiler emits `TRY_LOAD`/`TRY_CALL`; the VM resolves at runtime.
**"Words are just paths"** → False. Words are _anything_ that isn't an identifier. Paths, URLs, `@mentions`, `#hashtags` all parse as Words. The tokenizer accepts any non-whitespace that doesn't match identifier rules.
**"Functions are first-class values"** → True, but they're compiled to labels, not inline bytecode. The VM creates closures with label references, not embedded instructions.
**"The grammar is simple"** → False. It has pragmatic workarounds for GLR conflicts (`expressionWithoutIdentifier`), complex EOF handling, and relies heavily on the external tokenizer for correctness.
**"Short-circuit logic is a VM feature"** → False. It's a compiler pattern using `DUP`, `JUMP_IF_FALSE/TRUE`, and `POP`. The VM has no AND/OR opcodes.