# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with the Shrimp programming language.

## Pair Programming Approach

Act as a pair programming partner and teacher, not an autonomous code writer:

**Research and guide, don't implement**:

- Focus on research, analysis, and finding solutions
- Explain concepts, trade-offs, and best practices
- Guide the human through changes rather than making them directly
- Help them learn the codebase deeply by maintaining ownership

**Use tmp/ directory for experimentation**:

- Create temporary files in `tmp/` to test ideas and run experiments
- Example: `tmp/eof-test.grammar`, `tmp/pattern-experiments.ts`
- Clean up tmp files when done
- Show multiple approaches so the human can choose

**Teaching moments**:

- Explain the "why" behind solutions
- Point out potential pitfalls and edge cases
- Share relevant documentation and examples
- Help build understanding, not just solve problems

## Project Overview

Shrimp is a shell-like scripting language that combines command-line simplicity with functional programming. The architecture flows: Shrimp source → parser (CST) → compiler (bytecode) → ReefVM (execution).

**Essential reading**: Before making changes, read README.md to understand the language design philosophy and parser architecture.

Key references: [Lezer System Guide](https://lezer.codemirror.net/docs/guide/) | [Lezer API](https://lezer.codemirror.net/docs/ref/)

## Reading the Codebase: What to Look For

When exploring Shrimp, focus on these key files in order:

1. **src/parser/shrimp.grammar** - Language syntax rules

   - Note the `expressionWithoutIdentifier` pattern and its comment
   - See how `consumeToTerminator` handles statement-level parsing

2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined

   - Check the emoji Unicode ranges and surrogate pair handling
   - See context-aware termination logic (`;`, `)`, `:`)

3. **src/compiler/compiler.ts** - CST to bytecode transformation

   - See how functions emit inline with JUMP wrappers
   - Check short-circuit logic for `and`/`or`
   - Notice `TRY_CALL` emission for bare identifiers

4. **packages/ReefVM/src/vm.ts** - Bytecode execution

   - See `TRY_CALL` fall-through to `CALL` (lines 357-375)
   - Check `TRY_LOAD` string coercion (lines 135-145)
   - Notice NOSE-style named parameter binding (lines 425-443)

## Development Commands

### Running Files

```bash
bun <file>                 # Run TypeScript files directly
bun src/server/server.tsx  # Start development server
bun dev                    # Start development server (alias)
```

### Testing

```bash
bun test                            # Run all tests
bun test src/parser/parser.test.ts  # Run parser tests specifically
bun test --watch                    # Watch mode
```

### Parser Development

```bash
bun generate-parser                 # Regenerate parser from grammar
bun test src/parser/parser.test.ts  # Test grammar changes
```

### Server

```bash
bun dev  # Start playground at http://localhost:3000
```

### Building

No build step required - Bun runs TypeScript directly. Parser auto-regenerates during tests.

## Code Style Preferences

**Early returns over deep nesting**:

```typescript
// ✅ Good
const processToken = (token: Token) => {
  if (!token) return null
  if (token.type !== 'identifier') return null

  return processIdentifier(token)
}

// ❌ Avoid
const processToken = (token: Token) => {
  if (token) {
    if (token.type === 'identifier') {
      return processIdentifier(token)
    }
  }
  return null
}
```

**Arrow functions over function keyword**:

```typescript
// ✅ Good
const parseExpression = (input: string) => {
  // implementation
}

// ❌ Avoid
function parseExpression(input: string) {
  // implementation
}
```

**Code readability over cleverness**:

- Use descriptive variable names
- Write code that explains itself
- Prefer explicit over implicit
- Two simple functions beat one complex function

## Architecture

### Core Components

**parser/** (Lezer-based parsing):

- **shrimp.grammar**: Lezer grammar definition with tokens and rules
- **shrimp.ts**: Auto-generated parser (don't edit directly)
- **tokenizer.ts**: Custom tokenizer for identifier vs word distinction
- **parser.test.ts**: Comprehensive grammar tests using `toMatchTree`

**editor/** (CodeMirror integration):

- Syntax highlighting for Shrimp language
- Language support and autocomplete
- Integration with the parser for real-time feedback

**compiler/** (CST to bytecode):

- Transforms concrete syntax trees into ReefVM bytecode
- Handles function definitions, expressions, and control flow

### Critical Design Decisions

**Whitespace-sensitive parsing**: Spaces distinguish operators from identifiers: `x-1` is a single identifier (dashes are legal inside identifiers), while `x - 1` is a subtraction. This enables natural shell-like syntax.

**Identifier vs Word tokenization**: The custom tokenizer (tokenizer.ts) is sophisticated:

- **Surrogate pair handling**: Processes emoji as full Unicode code points (lines 51-65)
- **Context-aware termination**: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24)
  - This allows `basename ./cool;` to parse correctly
  - But `basename ./cool; 2` treats the semicolon as a terminator
- **GLR state checking**: Uses `stack.canShift(Word)` to decide whether to track identifier validity
- **Permissive Words**: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags)

**Why this matters**: This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths.

**Identifier rules**: Must start with a lowercase letter or emoji; can contain lowercase letters, digits, dashes, and emoji.

**Word rules**: Everything else that isn't whitespace or a delimiter.
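
The identifier and word rules above can be sketched as a small classifier. This is a simplified illustration, not the code in tokenizer.ts: `decodeCodePoint`, `classify`, and the single emoji range `0x1F300`-`0x1FAFF` are assumptions made for the example, and termination characters are ignored entirely.

```typescript
// Hypothetical helper: decodes one code point at `pos`, combining a UTF-16
// surrogate pair into a single code point when one is present.
const decodeCodePoint = (text: string, pos: number): { code: number; length: number } => {
  const high = text.charCodeAt(pos)
  const isHigh = high >= 0xd800 && high <= 0xdbff
  if (!isHigh || pos + 1 >= text.length) return { code: high, length: 1 }

  const low = text.charCodeAt(pos + 1)
  const isLow = low >= 0xdc00 && low <= 0xdfff
  if (!isLow) return { code: high, length: 1 }

  return { code: (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000, length: 2 }
}

// Assumed emoji range, for illustration only.
const isEmoji = (code: number) => code >= 0x1f300 && code <= 0x1faff
const isLowercase = (code: number) => code >= 97 && code <= 122
const isDigit = (code: number) => code >= 48 && code <= 57
const isDash = (code: number) => code === 45

// Identifier: starts with lowercase/emoji, continues with lowercase, digits,
// dashes, or emoji. Anything else is a Word.
const classify = (text: string): 'Identifier' | 'Word' => {
  if (text.length === 0) return 'Word'

  let pos = 0
  while (pos < text.length) {
    const { code, length } = decodeCodePoint(text, pos)
    const validStart = isLowercase(code) || isEmoji(code)
    const validRest = validStart || isDigit(code) || isDash(code)
    if (pos === 0 ? !validStart : !validRest) return 'Word'
    pos += length
  }

  return 'Identifier'
}
```

Under these assumptions `my-var` and `x-1` classify as identifiers, while `./file.txt` and anything containing an uppercase letter fall through to Word.
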

**Ambiguous identifier resolution**: Bare identifiers like `my-var` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode.

**How it works**:

- The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152)
- ReefVM checks if the variable is a function at runtime (vm.ts:357-373)
- If it's a function, TRY_CALL intentionally falls through to CALL opcode (no break statement)
- If it's bound to a non-function value, it pushes that value and returns; if it's undefined, it pushes the name as a string and returns
- This runtime resolution enables shell-like "echo hello" without quotes

**Unbound symbols become strings**: When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick.
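
The resolution rules above can be sketched in a few lines. This is an illustrative model, not ReefVM's code: `Bindings`, `FunctionValue`, `tryLoad`, and `tryCall` are hypothetical names, and the function call is reduced to a zero-argument `invoke` instead of the real fall-through to `CALL`.

```typescript
// Hypothetical value model for illustration; ReefVM's real types differ.
type FunctionValue = { kind: 'function'; invoke: () => Value }
type Value = string | number | null | FunctionValue
type Bindings = { [name: string]: Value }

const lookup = (bindings: Bindings, name: string): Value | undefined =>
  Object.prototype.hasOwnProperty.call(bindings, name) ? bindings[name] : undefined

const isFunction = (value: Value): value is FunctionValue =>
  typeof value === 'object' && value !== null && value.kind === 'function'

// TRY_LOAD: push the bound value, or the name itself when unbound.
const tryLoad = (bindings: Bindings, name: string): Value => {
  const value = lookup(bindings, name)
  if (value === undefined) return name
  return value
}

// TRY_CALL: call functions, otherwise behave like TRY_LOAD.
const tryCall = (bindings: Bindings, name: string): Value => {
  const value = lookup(bindings, name)
  if (value === undefined) return name
  if (!isFunction(value)) return value
  return value.invoke()
}
```

With `greeting` bound to a string and `answer` bound to a function, `tryCall` returns the string for `greeting`, the function's result for `answer`, and the literal name for any unbound symbol, which is how `echo hello` can treat `hello` as a string.
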

**Expression-oriented design**: Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.

**Scope-aware property access (DotGet)**: The parser uses Lezer's `@context` feature to track variable scope at parse time. When it encounters `obj.prop`, it checks if `obj` is in scope:

- **In scope** → Parses as `DotGet(Identifier, Identifier)` → compiles to `TRY_LOAD obj; PUSH 'prop'; DOT_GET`
- **Not in scope** → Parses as `Word("obj.prop")` → compiles to `PUSH 'obj.prop'` (treated as file path/string)

Implementation files:

- **src/parser/parserScopeContext.ts**: ContextTracker that maintains immutable scope chain
- **src/parser/tokenizer.ts**: External tokenizer checks `stack.context` to decide if dot creates DotGet or Word
- Scope tracking: Captures variables from assignments (`x = 5`) and function parameters (`fn x:`)
- See `src/parser/tests/dot-get.test.ts` for comprehensive examples

**Why this matters**: This enables shell-like file paths (`readme.txt`) while supporting dictionary/array access (`config.path`) without quotes, determined entirely at parse time based on lexical scope.
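
A minimal sketch of the scope check, assuming an immutable chain of name lists. `ScopeChain`, `withName`, `pushScope`, and `classifyDotted` are hypothetical names for illustration; the real implementation lives in a Lezer `ContextTracker` and the external tokenizer.

```typescript
// Hypothetical immutable scope chain: each scope holds its own names and a parent.
type ScopeChain = { readonly names: readonly string[]; readonly parent: ScopeChain | null }

type Classified =
  | { node: 'Word'; text: string }
  | { node: 'DotGet'; object: string; property: string }

// Assignment (`x = 5`): returns a new scope, the old one is left untouched.
const withName = (scope: ScopeChain, name: string): ScopeChain => ({
  names: [...scope.names, name],
  parent: scope.parent,
})

// Function parameters (`fn x:`): open a child scope.
const pushScope = (parent: ScopeChain, params: readonly string[]): ScopeChain => ({
  names: params,
  parent,
})

const inScope = (scope: ScopeChain | null, name: string): boolean => {
  if (!scope) return false
  if (scope.names.indexOf(name) !== -1) return true
  return inScope(scope.parent, name)
}

// `obj.prop` becomes DotGet only when `obj` is a known variable.
const classifyDotted = (scope: ScopeChain, text: string): Classified => {
  const dot = text.indexOf('.')
  if (dot <= 0) return { node: 'Word', text }

  const head = text.slice(0, dot)
  if (!inScope(scope, head)) return { node: 'Word', text }

  return { node: 'DotGet', object: head, property: text.slice(dot + 1) }
}
```

With `config` in scope, `config.path` classifies as `DotGet` while `readme.txt` stays a `Word`; assigning to `readme` first would flip the latter.
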

**Array and dict literals**: Square brackets `[]` create both arrays and dicts, distinguished by content:

- **Arrays**: Space/newline/semicolon-separated args that work like calling a function → `[1 2 3]` (call functions using parens, e.g. `[1 (double 4) 200]`)
- **Dicts**: NamedArg syntax (key=value pairs) → `[a=1 b=2]`
- **Empty array**: `[]` (standard empty brackets)
- **Empty dict**: `[=]` (exactly this, no spaces)

Implementation details:

- Grammar rules (shrimp.grammar:194-201): Dict uses `NamedArg` nodes, Array uses `expression` nodes
- Parser distinguishes at parse time based on whether first element contains `=`
- Both support multiline, comments, and nesting
- Separators: spaces, newlines (`\n`), or semicolons (`;`) work interchangeably
- Test files: `src/parser/tests/literals.test.ts` and `src/compiler/tests/literals.test.ts`

**EOF handling**: The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.

## Compiler Architecture

**Function compilation strategy**: Functions are compiled inline where they're defined, with JUMP instructions to skip over their bodies during linear execution:

```
JUMP .after_.func_0        # Skip over body during definition
.func_0:                   # Function body label
  (function body code)
  RETURN
.after_.func_0:            # Resume here after jump
MAKE_FUNCTION (x) .func_0  # Create function object with label
```

This approach:

- Emits function bodies inline (no deferred collection)
- Uses JUMP to skip bodies during normal execution flow
- Each function is self-contained at its definition site
- Works seamlessly in REPL mode (important for `vm.appendBytecode()`)
- Allows ReefVM to jump to function bodies by label when called
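
The JUMP wrapper layout can be sketched as a tiny emitter. This is a hypothetical illustration that works on arrays of instruction strings; the real compiler walks CST nodes, and `createFunctionEmitter` plus the space-separated parameter list are assumptions.

```typescript
// Hypothetical emitter: each call wraps a function body in the JUMP/label
// layout shown above and hands out a fresh `.func_N` label.
const createFunctionEmitter = () => {
  let nextId = 0

  return (params: string[], body: string[]): string[] => {
    const label = `.func_${nextId}`
    nextId += 1

    return [
      `JUMP .after_${label}`, // skip the body while defining
      `${label}:`,
      ...body,
      'RETURN',
      `.after_${label}:`,
      // Parameter separator is an assumption; the source only shows `(x)`.
      `MAKE_FUNCTION (${params.join(' ')}) ${label}`,
    ]
  }
}
```

Because every function carries its own skip-jump and labels, the emitted chunk can be appended to an existing program without a separate pass to collect function bodies.
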

**Short-circuit logic**: ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using:

```
# For `a and b`:
LOAD a
DUP                 # Duplicate so we can return it if falsy
JUMP_IF_FALSE skip  # If false, skip evaluating b
POP                 # Remove duplicate if we're continuing
LOAD b              # Evaluate right side
skip:
```

See compiler.ts:267-282 for the full implementation. The `or` operator uses `JUMP_IF_TRUE` instead.
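
To see why the `DUP`/`POP` pair leaves exactly one value on the stack, here is a toy emitter and stack machine. Everything in it is an assumption for illustration: the `Instr` shape, relative jump offsets instead of labels, conditional jumps that pop their operand, and JavaScript truthiness standing in for Shrimp's.

```typescript
type Instr =
  | { op: 'PUSH'; value: unknown }
  | { op: 'DUP' }
  | { op: 'POP' }
  | { op: 'JUMP_IF_FALSE'; offset: number }
  | { op: 'JUMP_IF_TRUE'; offset: number }

// Emits: left, DUP, conditional jump over (POP + right), POP, right.
const emitShortCircuit = (kind: 'and' | 'or', left: Instr[], right: Instr[]): Instr[] => {
  const offset = right.length + 1 // skip the POP and the whole right side
  const jump: Instr =
    kind === 'and' ? { op: 'JUMP_IF_FALSE', offset } : { op: 'JUMP_IF_TRUE', offset }

  return [...left, { op: 'DUP' }, jump, { op: 'POP' }, ...right]
}

// Toy interpreter: conditional jumps pop the tested value (an assumption).
const run = (program: Instr[]): unknown => {
  const stack: unknown[] = []
  let pc = 0

  while (pc < program.length) {
    const instr = program[pc]
    pc += 1

    switch (instr.op) {
      case 'PUSH':
        stack.push(instr.value)
        break
      case 'DUP':
        stack.push(stack[stack.length - 1])
        break
      case 'POP':
        stack.pop()
        break
      case 'JUMP_IF_FALSE':
        if (!stack.pop()) pc += instr.offset
        break
      case 'JUMP_IF_TRUE':
        if (stack.pop()) pc += instr.offset
        break
    }
  }

  return stack[stack.length - 1]
}
```

When the jump is taken, the duplicate consumed by the test is gone and the original left value remains as the result; when it is not taken, `POP` discards the left value and the right side supplies the result.
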

**If/else compilation**: The compiler uses label-based jumps:

- `JUMP_IF_FALSE` skips the then-block when condition is false
- Each branch ends with `JUMP endLabel` to skip remaining branches
- The final label marks where all branches converge
- If there's no else branch, compiler emits `PUSH null` as the default value
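
A two-branch sketch of that layout, again over instruction strings. The helper `emitIf` and the label names `.else_N` / `.end_N` are hypothetical; the real compiler's label scheme and its handling of additional branches may differ.

```typescript
// Hypothetical emitter for a single if/else. `id` keeps labels unique.
const emitIf = (
  id: number,
  condition: string[],
  thenBlock: string[],
  elseBlock?: string[],
): string[] => {
  const elseLabel = `.else_${id}`
  const endLabel = `.end_${id}`
  // With no else branch, the expression still has to produce a value.
  const fallback = elseBlock === undefined ? ['PUSH null'] : elseBlock

  return [
    ...condition,
    `JUMP_IF_FALSE ${elseLabel}`, // skip the then-block when false
    ...thenBlock,
    `JUMP ${endLabel}`, // skip the remaining branch
    `${elseLabel}:`,
    ...fallback,
    `${endLabel}:`, // all branches converge here
  ]
}
```

Emitting `PUSH null` for a missing else keeps the if expression-oriented: whichever path runs, one value is left for the surrounding code.
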

## Grammar Development

### Grammar Structure

The grammar follows this hierarchy:

```
Program → statement*
statement → line newlineOrSemicolon | line eof
line → FunctionCall | FunctionCallOrIdentifier | FunctionDef | Assign | expression
```

Key tokens:

- `newlineOrSemicolon`: `"\n" | ";"`
- `eof`: `@eof`
- `Identifier`: Lowercase/emoji start, assignable variables
- `Word`: Everything else (paths, URLs, etc.)

### Adding Grammar Rules

When modifying the grammar:

1. **Update `src/parser/shrimp.grammar`** with your changes
2. **Run tests** - the parser auto-regenerates during test runs
3. **Add test cases** in `src/parser/parser.test.ts` using `toMatchTree`
4. **Test empty line handling** - ensure EOF works properly

### Test Format

Grammar tests use this pattern:

```typescript
test('function call with args', () => {
  expect('echo hello world').toMatchTree(`
    FunctionCall
      Identifier echo
      PositionalArg
        Word hello
      PositionalArg
        Word world
  `)
})
```

The `toMatchTree` helper compares parser output with expected CST structure.

### Common Grammar Gotchas

**EOF infinite loops**: Using `@eof` in repeating patterns can match EOF multiple times. Current approach uses explicit statement/newline alternatives.

**Token precedence**: Use `@precedence` to resolve conflicts between similar tokens.

**External tokenizers**: Custom logic in `tokenizer.ts` handles complex cases like identifier vs word distinction.

**Empty line parsing**: The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling.

## Lezer: Surprising Behaviors

These discoveries came from implementing string interpolation with external tokenizers. See `tmp/string-test4.grammar` for working examples.

### 1. Rule Capitalization Controls Tree Structure

**The most surprising discovery**: Rule names determine whether nodes appear in the parse tree.

**Lowercase rules get inlined** (no tree nodes):

```lezer
statement { assign | expr }  // ❌ No "statement" node
assign { x "=" y }           // ❌ No "assign" node
expr { x | y }               // ❌ No "expr" node
```

**Capitalized rules create tree nodes**:

```lezer
Statement { Assign | Expr }  // ✅ Creates Statement node
Assign { x "=" y }           // ✅ Creates Assign node
Expr { x | y }               // ✅ Creates Expr node
```

**Why this matters**: When debugging grammar that "doesn't match," check capitalization first. The rules might be matching perfectly; they're just being compiled away!

Example: `x = 42` was parsing as `Program(Identifier,"=",Number)` instead of `Program(Statement(Assign(...)))`. The grammar rules existed and were matching, but they were inlined because they were lowercase.

### 2. @skip {} Wrapper is Essential for Preserving Whitespace

**Initial assumption (wrong)**: Could exclude whitespace from token patterns to avoid needing `@skip {}`.

**Reality**: The `@skip {}` wrapper is absolutely required to preserve whitespace in strings:

```lezer
@skip {} {
  String { "'" StringContent* "'" }
}

@tokens {
  StringFragment { !['\\$]+ }  // Matches everything including spaces
}
```

**Without the wrapper**: All spaces get stripped by the global `@skip { space }`, even though `StringFragment` can match them.

**Test that proved it wrong**: `' spaces '` was being parsed as `"spaces"` (leading/trailing spaces removed) until we added `@skip {}`.

### 3. External Tokenizers Work Inside @skip {} Blocks

**Initial assumption (wrong)**: External tokenizers can't be used inside `@skip {}` blocks, so identifier patterns need to be duplicated as simple tokens.

**Reality**: External tokenizers work perfectly inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.

**Working pattern**:

```lezer
@external tokens tokenizer from "./tokenizer" { Identifier, Word }

@skip {} {
  String { "'" StringContent* "'" }
}

Interpolation {
  "$" Identifier |  // ← Uses external tokenizer!
  "$" "(" expr ")"
}
```

**Test that proved it**: `'hello $name'` correctly calls the external tokenizer for `name` inside the string, creating an `Identifier` token. No duplication needed!

### 4. Single-Character Tokens Can Be Literals

**Initial approach**: Define every single character as a token:

```lezer
@tokens {
  dollar[@name="$"] { "$" }
  backslash[@name="\\"] { "\\" }
}
```

**Simpler approach**: Just use literals in the grammar rules:

```lezer
Interpolation {
  "$" Identifier |  // Literal "$"
  "$" "(" expr ")"
}

EscapeSeq {
  "\\" ("$" | "n" | ...)  // Literal "\\"
}
```

This works fine and reduces boilerplate in the @tokens section.

### 5. StringFragment as Simple Token, Not External

For string content, use a simple token pattern instead of handling it in the external tokenizer:

```lezer
@tokens {
  StringFragment { !['\\$]+ }  // Simple pattern: not quote, backslash, or dollar
}
```

The external tokenizer should focus on Identifier/Word distinction at the top level. String content is simpler and doesn't need the complexity of the external tokenizer.

### Why expressionWithoutIdentifier Exists

The grammar has an unusual pattern: `expressionWithoutIdentifier`. This exists to solve a GLR conflict:

```
consumeToTerminator {
  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
  expression               // → Identifier
}
```

Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains we "gave up trying to use GLR to fix it."

**The solution**: Remove Identifier from the `expression` path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This favors pragmatism over theoretical purity.

## Testing Strategy

### Parser Tests (`src/parser/parser.test.ts`)

- **Token types**: Identifier vs Word distinction
- **Function calls**: With and without arguments
- **Expressions**: Binary operations, parentheses, precedence
- **Functions**: Single-line and multiline definitions
- **Whitespace**: Empty lines, mixed delimiters
- **Edge cases**: Ambiguous parsing, incomplete input

Test structure:

```typescript
describe('feature area', () => {
  test('specific case', () => {
    expect(input).toMatchTree(expectedCST)
  })
})
```

When adding language features:

1. Write grammar tests first showing expected CST structure
2. Update grammar rules to make tests pass
3. Add integration tests showing real usage
4. Test edge cases and error conditions

## Bun Usage

Default to Bun over Node.js/npm:

- Use `bun <file>` instead of `node <file>` or `ts-node <file>`
- Use `bun test` instead of `jest` or `vitest`
- Use `bun install` instead of `npm install`
- Use `bun run <script>` instead of `npm run <script>`
- Bun automatically loads .env, so don't use dotenv

### Bun APIs

- Prefer `Bun.file` over `node:fs`'s readFile/writeFile
- Use `Bun.$` for shell commands instead of execa

## Common Patterns

### Grammar Debugging

When grammar isn't parsing correctly:

1. **Check token precedence** - ensure tokens are recognized correctly
2. **Test simpler cases first** - build up from basic to complex
3. **Use `toMatchTree` output** - see what the parser actually produces
4. **Check external tokenizer** - identifier vs word logic in `tokenizer.ts`

## Common Misconceptions

**"The parser handles unbound symbols as strings"** → False. The _VM_ does this via `TRY_LOAD` opcode. The parser creates `FunctionCallOrIdentifier` nodes; the compiler emits `TRY_LOAD`/`TRY_CALL`; the VM resolves at runtime.

**"Words are just paths"** → False. Words are _anything_ that isn't an identifier. Paths, URLs, `@mentions`, `#hashtags` all parse as Words. The tokenizer accepts any non-whitespace that doesn't match identifier rules.

**"Functions are first-class values"** → True, but a function value holds a label reference, not embedded instructions. The body is emitted once, inline at the definition site; `MAKE_FUNCTION` creates a closure that points at that label.

**"The grammar is simple"** → False. It has pragmatic workarounds for GLR conflicts (`expressionWithoutIdentifier`), complex EOF handling, and relies heavily on the external tokenizer for correctness.

**"Short-circuit logic is a VM feature"** → False. It's a compiler pattern using `DUP`, `JUMP_IF_FALSE/TRUE`, and `POP`. The VM has no AND/OR opcodes.