# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the Shrimp programming language.
## Pair Programming Approach
Act as a pair programming partner and teacher, not an autonomous code writer:
**Research and guide, don't implement:**
- Focus on research, analysis, and finding solutions
- Explain concepts, trade-offs, and best practices
- Guide the human through changes rather than making them directly
- Help them learn the codebase deeply by maintaining ownership
**Use `tmp/` directory for experimentation:**
- Create temporary files in `tmp/` to test out ideas or experiments you want to run
- Example: `tmp/eof-test.grammar`, `tmp/pattern-experiments.ts`
- Clean up tmp files when done
- Show multiple approaches so the human can choose
**Teaching moments:**
- Explain the "why" behind solutions
- Point out potential pitfalls and edge cases
- Share relevant documentation and examples
- Help build understanding, not just solve problems
## Project Overview
Shrimp is a shell-like scripting language that combines command-line simplicity with functional programming. The architecture flows: Shrimp source → parser (CST) → compiler (bytecode) → ReefVM (execution).
**Essential reading:** Before making changes, read README.md to understand the language design philosophy and parser architecture.

**Key references:** Lezer System Guide | Lezer API
## Reading the Codebase: What to Look For
When exploring Shrimp, focus on these key files in order:
1. **src/parser/shrimp.grammar** - Language syntax rules
   - Note the `expressionWithoutIdentifier` pattern and its comment
   - See how `consumeToTerminator` handles statement-level parsing
2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined
   - Check the emoji Unicode ranges and surrogate pair handling
   - See the context-aware termination logic (`;`, `)`, `:`)
3. **src/compiler/compiler.ts** - CST to bytecode transformation
   - See how functions emit inline with JUMP wrappers
   - Check the short-circuit logic for `and`/`or`
   - Notice `TRY_CALL` emission for bare identifiers
4. **packages/ReefVM/src/vm.ts** - Bytecode execution
   - See the `TRY_CALL` fall-through to `CALL` (lines 357-375)
   - Check the `TRY_LOAD` string coercion (lines 135-145)
   - Notice the NOSE-style named parameter binding (lines 425-443)
## Development Commands
### Running Files

```sh
bun <file>                 # Run TypeScript files directly
bun src/server/server.tsx  # Start development server
bun dev                    # Start development server (alias)
```
### Testing

```sh
bun test                            # Run all tests
bun test src/parser/parser.test.ts  # Run parser tests specifically
bun test --watch                    # Watch mode
```
### Parser Development

```sh
bun generate-parser                 # Regenerate parser from grammar
bun test src/parser/parser.test.ts  # Test grammar changes
```
### Server

```sh
bun dev  # Start playground at http://localhost:3000
```
### Building
No build step required - Bun runs TypeScript directly. Parser auto-regenerates during tests.
## Code Style Preferences

**Early returns over deep nesting:**

```ts
// ✅ Good
const processToken = (token: Token) => {
  if (!token) return null
  if (token.type !== 'identifier') return null
  return processIdentifier(token)
}

// ❌ Avoid
const processToken = (token: Token) => {
  if (token) {
    if (token.type === 'identifier') {
      return processIdentifier(token)
    }
  }
  return null
}
```
**Arrow functions over the `function` keyword:**

```ts
// ✅ Good
const parseExpression = (input: string) => {
  // implementation
}

// ❌ Avoid
function parseExpression(input: string) {
  // implementation
}
```
**Code readability over cleverness:**
- Use descriptive variable names
- Write code that explains itself
- Prefer explicit over implicit
- Two simple functions beat one complex function
## Architecture
### Core Components
**parser/** (Lezer-based parsing):
- `shrimp.grammar`: Lezer grammar definition with tokens and rules
- `shrimp.ts`: Auto-generated parser (don't edit directly)
- `tokenizer.ts`: Custom tokenizer for the identifier vs word distinction
- `parser.test.ts`: Comprehensive grammar tests using `toMatchTree`
**editor/** (CodeMirror integration):
- Syntax highlighting for the Shrimp language
- Language support and autocomplete
- Integration with the parser for real-time feedback
**compiler/** (CST to bytecode):
- Transforms concrete syntax trees into ReefVM bytecode
- Handles function definitions, expressions, and control flow
### Critical Design Decisions
**Whitespace-sensitive parsing:** Spaces distinguish operators from identifiers: `x-1` is a single identifier, while `x - 1` is subtraction. This enables natural shell-like syntax.
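As a concrete illustration, a grammar test in the project's `toMatchTree` style might look like the sketch below. The exact tree shape is an assumption here (bare identifiers produce `FunctionCallOrIdentifier` nodes, as described under "Ambiguous identifier resolution" below); check src/parser/parser.test.ts for the real expectations:

```ts
// Hedged sketch: with no spaces around the dash, `x-1` is one identifier
// (dashes are legal in identifiers), not the subtraction `x - 1`.
test('x-1 parses as a single identifier', () => {
  expect('x-1').toMatchTree(`
    FunctionCallOrIdentifier
      Identifier x-1
  `)
})
```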
**Identifier vs Word tokenization:** The custom tokenizer (tokenizer.ts) is sophisticated:

- Surrogate pair handling: Processes emoji as full Unicode code points (lines 51-65)
- Context-aware termination: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24)
  - This allows `basename ./cool;` to parse correctly
  - But `basename ./cool; 2` treats the semicolon as a terminator
- GLR state checking: Uses `stack.canShift(Word)` to decide whether to track identifier validity
- Permissive Words: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags)

**Why this matters:** This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths.
**Identifier rules:** Must start with a lowercase letter or emoji; can contain lowercase letters, digits, dashes, and emoji.

**Word rules:** Everything else that isn't whitespace or a delimiter.
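To make the split concrete, here is a minimal sketch of the identifier check as a standalone predicate. This is illustrative only: the real tokenizer (src/parser/tokenizer.ts) also accepts emoji as full code points and applies the context-aware termination rules above.

```ts
// Illustrative predicate, not the real tokenizer logic: lowercase start,
// then lowercase letters, digits, or dashes (emoji handling omitted).
const looksLikeIdentifier = (text: string): boolean =>
  /^[a-z][a-z0-9-]*$/.test(text)

looksLikeIdentifier('my-var')      // true  → Identifier
looksLikeIdentifier('./file.txt')  // false → Word
looksLikeIdentifier('@mention')    // false → Word
```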
**Ambiguous identifier resolution:** Bare identifiers like `my-var` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode.
**How it works:**
1. The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152)
2. ReefVM checks if the variable is a function at runtime (vm.ts:357-373)
3. If it's a function, TRY_CALL intentionally falls through to the CALL opcode (no `break` statement)
4. If it's not a function or is undefined, it pushes the value/string and returns
5. This runtime resolution enables shell-like `echo hello` without quotes
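The fall-through can be sketched as follows. This is a hypothetical rendering of the logic described above, not the real implementation (that lives in vm.ts:357-375); the scope map and stack here are minimal stand-ins.

```ts
// Hypothetical sketch of TRY_CALL's fall-through into CALL.
const scope = new Map<string, unknown>()
const stack: unknown[] = []

const execute = (op: 'TRY_CALL' | 'CALL', name: string) => {
  switch (op) {
    case 'TRY_CALL': {
      const value = scope.get(name)
      if (typeof value !== 'function') {
        // Not callable: push the value, or the name itself when unbound
        stack.push(value === undefined ? name : value)
        break
      }
      // Callable: intentionally no `break`, so execution falls through to CALL
    }
    case 'CALL':
      stack.push((scope.get(name) as () => unknown)())
  }
}
```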
**Unbound symbols become strings:** When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick.

**Expression-oriented design:** Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.
**Scope-aware property access (DotGet):** The parser uses Lezer's @context feature to track variable scope at parse time. When it encounters `obj.prop`, it checks if `obj` is in scope:
- In scope → parses as `DotGet(Identifier, Identifier)` → compiles to `TRY_LOAD obj; PUSH 'prop'; DOT_GET`
- Not in scope → parses as `Word("obj.prop")` → compiles to `PUSH 'obj.prop'` (treated as a file path/string)

Implementation files:
- src/parser/scopeTracker.ts: ContextTracker that maintains an immutable scope chain
- src/parser/tokenizer.ts: External tokenizer checks `stack.context` to decide if a dot creates a DotGet or a Word
- Scope tracking captures variables from assignments (`x = 5`) and function parameters (`fn x:`)
- See src/parser/tests/dot-get.test.ts for comprehensive examples

**Why this matters:** This enables shell-like file paths (`readme.txt`) while supporting dictionary/array access (`config.path`) without quotes, determined entirely at parse time based on lexical scope.
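Reduced to its essence, the parse-time decision is a scope lookup. The sketch below is hypothetical; the real check lives in src/parser/tokenizer.ts and reads the scope chain from `stack.context`:

```ts
// Illustrative only: classify a dotted name based on lexical scope.
const classifyDot = (name: string, scope: Set<string>): 'DotGet' | 'Word' =>
  scope.has(name) ? 'DotGet' : 'Word'

classifyDot('config', new Set(['config']))  // 'DotGet' → config.path is property access
classifyDot('readme', new Set(['config']))  // 'Word'   → readme.txt stays a file path
```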
**Array and dict literals:** Square brackets `[]` create both arrays and dicts, distinguished by content:
- Arrays: Space/newline/semicolon-separated args that work like calling a function → `[1 2 3]` (call functions using parens, e.g. `[1 (double 4) 200]`)
- Dicts: NamedArg syntax (key=value pairs) → `[a=1 b=2]`
- Empty array: `[]` (standard empty brackets)
- Empty dict: `[=]` (exactly this, no spaces)

Implementation details:
- Grammar rules (shrimp.grammar:194-201): Dict uses `NamedArg` nodes, Array uses `expression` nodes
- The parser distinguishes at parse time based on whether the first element contains `=`
- Both support multiline content, comments, and nesting
- Separators: spaces, newlines (`\n`), or semicolons (`;`) work interchangeably
- Test files: src/parser/tests/literals.test.ts and src/compiler/tests/literals.test.ts
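For example, a parser test for the dict case might look like the sketch below. The `Dict` and `NamedArg` node names follow the description above, but the exact nesting is an assumption; see src/parser/tests/literals.test.ts for the real expectations.

```ts
// Hedged sketch in the project's toMatchTree style.
test('brackets with key=value pairs parse as a dict', () => {
  expect('[a=1 b=2]').toMatchTree(`
    Dict
      NamedArg
        Identifier a
        Number 1
      NamedArg
        Identifier b
        Number 2
  `)
})
```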
**EOF handling:** The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.
## Compiler Architecture
**Function compilation strategy:** Functions are compiled inline where they're defined, with JUMP instructions to skip over their bodies during linear execution:

```
JUMP .after_.func_0        # Skip over body during definition
.func_0:                   # Function body label
  (function body code)
  RETURN
.after_.func_0:            # Resume here after jump
MAKE_FUNCTION (x) .func_0  # Create function object with label
```
This approach:
- Emits function bodies inline (no deferred collection)
- Uses JUMP to skip bodies during normal execution flow
- Keeps each function self-contained at its definition site
- Works seamlessly in REPL mode (important for `vm.appendBytecode()`)
- Allows ReefVM to jump to function bodies by label when called
**Short-circuit logic:** ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using:

```
// For `a and b`:
LOAD a
DUP                 // Duplicate so we can return it if falsy
JUMP_IF_FALSE skip  // If false, skip evaluating b
POP                 // Remove duplicate if we're continuing
LOAD b              // Evaluate right side
skip:
```

See compiler.ts:267-282 for the full implementation. The `or` operator uses JUMP_IF_TRUE instead.
**If/else compilation:** The compiler uses label-based jumps:
- `JUMP_IF_FALSE` skips the then-block when the condition is false
- Each branch ends with `JUMP endLabel` to skip the remaining branches
- The final label marks where all branches converge
- If there's no else branch, the compiler emits `PUSH null` as the default value
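Putting those steps together, the emit sequence can be sketched as below. This is a hypothetical illustration; `emit`, `newLabel`, and `compile` are minimal stand-ins rather than the real compiler API (see src/compiler/compiler.ts):

```ts
// Minimal stand-ins so the sketch is self-contained.
type Node = unknown
const code: string[] = []
let labels = 0
const newLabel = (prefix: string) => `.${prefix}_${labels++}`
const emit = (...parts: string[]) => code.push(parts.join(' '))
const compile = (node: Node) => emit(`(code for ${String(node)})`)

const compileIf = (cond: Node, thenBlock: Node, elseBlock?: Node) => {
  const elseLabel = newLabel('else')
  const endLabel = newLabel('end')
  compile(cond)
  emit('JUMP_IF_FALSE', elseLabel)  // skip the then-block when false
  compile(thenBlock)
  emit('JUMP', endLabel)            // skip the remaining branch
  emit(`${elseLabel}:`)
  if (elseBlock) compile(elseBlock)
  else emit('PUSH null')            // no else branch → null as the default
  emit(`${endLabel}:`)              // all branches converge here
}
```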
## Grammar Development

### Grammar Structure
The grammar follows this hierarchy:

```
Program → statement*
statement → line newlineOrSemicolon | line eof
line → FunctionCall | FunctionCallOrIdentifier | FunctionDef | Assign | expression
```

Key tokens:
- `newlineOrSemicolon`: `"\n" | ";"`
- `eof`: `@eof`
- `Identifier`: Lowercase/emoji start, assignable variables
- `Word`: Everything else (paths, URLs, etc.)
### Adding Grammar Rules

When modifying the grammar:
1. Update src/parser/shrimp.grammar with your changes
2. Run tests - the parser auto-regenerates during test runs
3. Add test cases in src/parser/parser.test.ts using `toMatchTree`
4. Test empty line handling - ensure EOF works properly
### Test Format

Grammar tests use this pattern:

```ts
test('function call with args', () => {
  expect('echo hello world').toMatchTree(`
    FunctionCall
      Identifier echo
      PositionalArg
        Word hello
      PositionalArg
        Word world
  `)
})
```

The `toMatchTree` helper compares parser output with the expected CST structure.
### Common Grammar Gotchas
**EOF infinite loops:** Using `@eof` in repeating patterns can match EOF multiple times. The current approach uses explicit statement/newline alternatives.

**Token precedence:** Use `@precedence` to resolve conflicts between similar tokens.

**External tokenizers:** Custom logic in tokenizer.ts handles complex cases like the identifier vs word distinction.

**Empty line parsing:** The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling.
## Lezer: Surprising Behaviors
These discoveries came from implementing string interpolation with external tokenizers. See tmp/string-test4.grammar for working examples.
### 1. Rule Capitalization Controls Tree Structure
The most surprising discovery: Rule names determine whether nodes appear in the parse tree.
Lowercase rules get inlined (no tree nodes):

```
statement { assign | expr }  // ❌ No "statement" node
assign { x "=" y }           // ❌ No "assign" node
expr { x | y }               // ❌ No "expr" node
```

Capitalized rules create tree nodes:

```
Statement { Assign | Expr }  // ✅ Creates Statement node
Assign { x "=" y }           // ✅ Creates Assign node
Expr { x | y }               // ✅ Creates Expr node
```
**Why this matters:** When debugging a grammar that "doesn't match," check capitalization first. The rules might be matching perfectly - they're just being compiled away!

**Example:** `x = 42` was parsing as `Program(Identifier, "=", Number)` instead of `Program(Statement(Assign(...)))`. The grammar rules existed and were matching, but they were inlined because they were lowercase.
### 2. `@skip {}` Wrapper is Essential for Preserving Whitespace
**Initial assumption (wrong):** You could exclude whitespace from token patterns to avoid needing `@skip {}`.

**Reality:** The `@skip {}` wrapper is absolutely required to preserve whitespace in strings:

```
@skip {} {
  String { "'" StringContent* "'" }
}

@tokens {
  StringFragment { !['\\$]+ }  // Matches everything including spaces
}
```

**Without the wrapper:** All spaces get stripped by the global `@skip { space }`, even though StringFragment can match them.

**Test that proved it wrong:** `' spaces '` was being parsed as `"spaces"` (leading/trailing spaces removed) until we added `@skip {}`.
### 3. External Tokenizers Work Inside `@skip {}` Blocks
**Initial assumption (wrong):** External tokenizers can't be used inside `@skip {}` blocks, so identifier patterns need to be duplicated as simple tokens.

**Reality:** External tokenizers work perfectly inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.

Working pattern:

```
@external tokens tokenizer from "./tokenizer" { Identifier, Word }

@skip {} {
  String { "'" StringContent* "'" }
}

Interpolation {
  "$" Identifier |  // ← Uses external tokenizer!
  "$" "(" expr ")"
}
```

**Test that proved it:** `'hello $name'` correctly calls the external tokenizer for `name` inside the string, creating an Identifier token. No duplication needed!
### 4. Single-Character Tokens Can Be Literals
**Initial approach:** Define every single character as a token:

```
@tokens {
  dollar[@name="$"] { "$" }
  backslash[@name="\\"] { "\\" }
}
```

**Simpler approach:** Just use literals in the grammar rules:

```
Interpolation {
  "$" Identifier |  // Literal "$"
  "$" "(" expr ")"
}

EscapeSeq {
  "\\" ("$" | "n" | ...)  // Literal "\\"
}
```

This works fine and reduces boilerplate in the `@tokens` section.
### 5. StringFragment as Simple Token, Not External
For string content, use a simple token pattern instead of handling it in the external tokenizer:

```
@tokens {
  StringFragment { !['\\$]+ }  // Simple pattern: not quote, backslash, or dollar
}
```
The external tokenizer should focus on Identifier/Word distinction at the top level. String content is simpler and doesn't need the complexity of the external tokenizer.
## Why `expressionWithoutIdentifier` Exists
The grammar has an unusual pattern: `expressionWithoutIdentifier`. It exists to solve a GLR conflict:

```
consumeToTerminator {
  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
  expression               // → Identifier
}
```

Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains that we "gave up trying to use GLR to fix it."

**The solution:** Remove Identifier from the expression path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This is pragmatism over theoretical purity.
## Testing Strategy

### Parser Tests (src/parser/parser.test.ts)
- Token types: Identifier vs Word distinction
- Function calls: With and without arguments
- Expressions: Binary operations, parentheses, precedence
- Functions: Single-line and multiline definitions
- Whitespace: Empty lines, mixed delimiters
- Edge cases: Ambiguous parsing, incomplete input
Test structure:

```ts
describe('feature area', () => {
  test('specific case', () => {
    expect(input).toMatchTree(expectedCST)
  })
})
```
When adding language features:
1. Write grammar tests first showing the expected CST structure
2. Update grammar rules to make the tests pass
3. Add integration tests showing real usage
4. Test edge cases and error conditions
## Bun Usage

Default to Bun over Node.js/npm:
- Use `bun <file>` instead of `node <file>` or `ts-node <file>`
- Use `bun test` instead of `jest` or `vitest`
- Use `bun install` instead of `npm install`
- Use `bun run <script>` instead of `npm run <script>`
- Bun automatically loads .env, so don't use dotenv
### Bun APIs

- Prefer `Bun.file` over node:fs's readFile/writeFile
- Use `Bun.$` for shell commands instead of execa
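For reference, those APIs look like this (all three are part of Bun's built-in API; the file paths are just examples):

```ts
import { $ } from 'bun'

// Read a file with Bun.file instead of node:fs readFile
const grammar = await Bun.file('src/parser/shrimp.grammar').text()

// Write a file with Bun.write instead of node:fs writeFile
await Bun.write('tmp/grammar-copy.grammar', grammar)

// Run a shell command with Bun.$ instead of execa
await $`bun test src/parser/parser.test.ts`
```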
## Common Patterns

### Grammar Debugging
When the grammar isn't parsing correctly:
1. Check token precedence - ensure tokens are recognized correctly
2. Test simpler cases first - build up from basic to complex
3. Use `toMatchTree` output - see what the parser actually produces
4. Check the external tokenizer - the identifier vs word logic in tokenizer.ts
## Common Misconceptions
"The parser handles unbound symbols as strings" → False. The VM does this via TRY_LOAD opcode. The parser creates FunctionCallOrIdentifier nodes; the compiler emits TRY_LOAD/TRY_CALL; the VM resolves at runtime.
"Words are just paths" → False. Words are anything that isn't an identifier. Paths, URLs, @mentions, #hashtags all parse as Words. The tokenizer accepts any non-whitespace that doesn't match identifier rules.
"Functions are first-class values" → True, but they're compiled to labels, not inline bytecode. The VM creates closures with label references, not embedded instructions.
"The grammar is simple" → False. It has pragmatic workarounds for GLR conflicts (expressionWithoutIdentifier), complex EOF handling, and relies heavily on the external tokenizer for correctness.
"Short-circuit logic is a VM feature" → False. It's a compiler pattern using DUP, JUMP_IF_FALSE/TRUE, and POP. The VM has no AND/OR opcodes.