# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with the Shrimp programming language.
## Pair Programming Approach
Act as a pair programming partner and teacher, not an autonomous code writer:
**Research and guide, don't implement:**
- Focus on research, analysis, and finding solutions
- Explain concepts, trade-offs, and best practices
- Guide the human through changes rather than making them directly
- Help them learn the codebase deeply by maintaining ownership
**Use `tmp/` directory for experimentation:**
- Create temporary files in `tmp/` to test out ideas or experiments you want to run
- Example: `tmp/eof-test.grammar`, `tmp/pattern-experiments.ts`
- Clean up tmp files when done
- Show multiple approaches so the human can choose
**Teaching moments:**
- Explain the "why" behind solutions
- Point out potential pitfalls and edge cases
- Share relevant documentation and examples
- Help build understanding, not just solve problems
## Project Overview
Shrimp is a shell-like scripting language that combines command-line simplicity with functional programming. The architecture flows: Shrimp source → parser (CST) → compiler (bytecode) → ReefVM (execution).
**Essential reading:** Before making changes, read README.md to understand the language design philosophy and parser architecture.

**Key references:** Lezer System Guide | Lezer API
## Reading the Codebase: What to Look For
When exploring Shrimp, focus on these key files in order:
1. **src/parser/shrimp.grammar** - Language syntax rules
   - Note the `expressionWithoutIdentifier` pattern and its comment
   - See how `consumeToTerminator` handles statement-level parsing
2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined
   - Check the emoji Unicode ranges and surrogate pair handling
   - See the context-aware termination logic (`;`, `)`, `:`)
3. **src/compiler/compiler.ts** - CST to bytecode transformation
   - See how functions emit inline with JUMP wrappers
   - Check the short-circuit logic for `and`/`or`
   - Notice `TRY_CALL` emission for bare identifiers
4. **packages/ReefVM/src/vm.ts** - Bytecode execution
   - See the `TRY_CALL` fall-through to `CALL` (lines 357-375)
   - Check the `TRY_LOAD` string coercion (lines 135-145)
   - Notice the NOSE-style named parameter binding (lines 425-443)
## Development Commands
### Running Files

```sh
bun <file>                 # Run TypeScript files directly
bun src/server/server.tsx  # Start development server
bun dev                    # Start development server (alias)
```
### Testing

```sh
bun test                            # Run all tests
bun test src/parser/parser.test.ts  # Run parser tests specifically
bun test --watch                    # Watch mode
```
### Parser Development

```sh
bun generate-parser                 # Regenerate parser from grammar
bun test src/parser/parser.test.ts  # Test grammar changes
```
### Server

```sh
bun dev  # Start playground at http://localhost:3000
```
### Building
No build step required - Bun runs TypeScript directly. Parser auto-regenerates during tests.
## Code Style Preferences

**Early returns over deep nesting:**

```ts
// ✅ Good
const processToken = (token: Token) => {
  if (!token) return null
  if (token.type !== 'identifier') return null
  return processIdentifier(token)
}

// ❌ Avoid
const processToken = (token: Token) => {
  if (token) {
    if (token.type === 'identifier') {
      return processIdentifier(token)
    }
  }
  return null
}
```
**Arrow functions over the `function` keyword:**

```ts
// ✅ Good
const parseExpression = (input: string) => {
  // implementation
}

// ❌ Avoid
function parseExpression(input: string) {
  // implementation
}
```
**Code readability over cleverness:**
- Use descriptive variable names
- Write code that explains itself
- Prefer explicit over implicit
- Two simple functions beat one complex function
## Architecture
### Core Components
**parser/** (Lezer-based parsing):
- `shrimp.grammar`: Lezer grammar definition with tokens and rules
- `shrimp.ts`: Auto-generated parser (don't edit directly)
- `tokenizer.ts`: Custom tokenizer for the identifier vs word distinction
- `parser.test.ts`: Comprehensive grammar tests using `toMatchTree`
**editor/** (CodeMirror integration):
- Syntax highlighting for the Shrimp language
- Language support and autocomplete
- Integration with the parser for real-time feedback
**compiler/** (CST to bytecode):
- Transforms concrete syntax trees into ReefVM bytecode
- Handles function definitions, expressions, and control flow
### Critical Design Decisions
**Whitespace-sensitive parsing:** Spaces distinguish operators from identifiers: `x-1` is a single identifier, while `x - 1` is subtraction. This enables natural shell-like syntax.
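As a concrete illustration, a grammar test in the project's `toMatchTree` style might look like the sketch below. The exact tree shape is an assumption here (bare identifiers produce `FunctionCallOrIdentifier` nodes, as described under "Ambiguous identifier resolution" below); check src/parser/parser.test.ts for the real expectations:

```ts
// Hedged sketch: with no spaces around the dash, `x-1` is one identifier
// (dashes are legal in identifiers), not the subtraction `x - 1`.
test('x-1 parses as a single identifier', () => {
  expect('x-1').toMatchTree(`
    FunctionCallOrIdentifier
      Identifier x-1
  `)
})
```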
**Identifier vs Word tokenization:** The custom tokenizer (tokenizer.ts) is sophisticated:

- Surrogate pair handling: Processes emoji as full Unicode code points (lines 51-65)
- Context-aware termination: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24)
  - This allows `basename ./cool;` to parse correctly
  - But `basename ./cool; 2` treats the semicolon as a terminator
- GLR state checking: Uses `stack.canShift(Word)` to decide whether to track identifier validity
- Permissive Words: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags)

**Why this matters:** This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths.
**Identifier rules:** Must start with a lowercase letter or emoji; can contain lowercase letters, digits, dashes, and emoji.

**Word rules:** Everything else that isn't whitespace or a delimiter.
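To make the split concrete, here is a minimal sketch of the identifier check as a standalone predicate. This is illustrative only: the real tokenizer (src/parser/tokenizer.ts) also accepts emoji as full code points and applies the context-aware termination rules above.

```ts
// Illustrative predicate, not the real tokenizer logic: lowercase start,
// then lowercase letters, digits, or dashes (emoji handling omitted).
const looksLikeIdentifier = (text: string): boolean =>
  /^[a-z][a-z0-9-]*$/.test(text)

looksLikeIdentifier('my-var')      // true  → Identifier
looksLikeIdentifier('./file.txt')  // false → Word
looksLikeIdentifier('@mention')    // false → Word
```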
**Ambiguous identifier resolution:** Bare identifiers like `my-var` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode.
**How it works:**
1. The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152)
2. ReefVM checks if the variable is a function at runtime (vm.ts:357-373)
3. If it's a function, TRY_CALL intentionally falls through to the CALL opcode (no `break` statement)
4. If it's not a function or is undefined, it pushes the value/string and returns
5. This runtime resolution enables shell-like `echo hello` without quotes
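The fall-through can be sketched as follows. This is a hypothetical rendering of the logic described above, not the real implementation (that lives in vm.ts:357-375); the scope map and stack here are minimal stand-ins.

```ts
// Hypothetical sketch of TRY_CALL's fall-through into CALL.
const scope = new Map<string, unknown>()
const stack: unknown[] = []

const execute = (op: 'TRY_CALL' | 'CALL', name: string) => {
  switch (op) {
    case 'TRY_CALL': {
      const value = scope.get(name)
      if (typeof value !== 'function') {
        // Not callable: push the value, or the name itself when unbound
        stack.push(value === undefined ? name : value)
        break
      }
      // Callable: intentionally no `break`, so execution falls through to CALL
    }
    case 'CALL':
      stack.push((scope.get(name) as () => unknown)())
  }
}
```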
**Unbound symbols become strings:** When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick.

**Expression-oriented design:** Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.
**Scope-aware property access (DotGet):** The parser uses Lezer's @context feature to track variable scope at parse time. When it encounters `obj.prop`, it checks if `obj` is in scope:
- In scope → parses as `DotGet(Identifier, Identifier)` → compiles to `TRY_LOAD obj; PUSH 'prop'; DOT_GET`
- Not in scope → parses as `Word("obj.prop")` → compiles to `PUSH 'obj.prop'` (treated as a file path/string)

Implementation files:
- src/parser/scopeTracker.ts: ContextTracker that maintains an immutable scope chain
- src/parser/tokenizer.ts: External tokenizer checks `stack.context` to decide if a dot creates a DotGet or a Word
- Scope tracking captures variables from assignments (`x = 5`) and function parameters (`fn x:`)
- See src/parser/tests/dot-get.test.ts for comprehensive examples

**Why this matters:** This enables shell-like file paths (`readme.txt`) while supporting dictionary/array access (`config.path`) without quotes, determined entirely at parse time based on lexical scope.
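Reduced to its essence, the parse-time decision is a scope lookup. The sketch below is hypothetical; the real check lives in src/parser/tokenizer.ts and reads the scope chain from `stack.context`:

```ts
// Illustrative only: classify a dotted name based on lexical scope.
const classifyDot = (name: string, scope: Set<string>): 'DotGet' | 'Word' =>
  scope.has(name) ? 'DotGet' : 'Word'

classifyDot('config', new Set(['config']))  // 'DotGet' → config.path is property access
classifyDot('readme', new Set(['config']))  // 'Word'   → readme.txt stays a file path
```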
**Array and dict literals:** Square brackets `[]` create both arrays and dicts, distinguished by content:
- Arrays: Space/newline/semicolon-separated args that work like calling a function → `[1 2 3]` (call functions using parens, e.g. `[1 (double 4) 200]`)
- Dicts: NamedArg syntax (key=value pairs) → `[a=1 b=2]`
- Empty array: `[]` (standard empty brackets)
- Empty dict: `[=]` (exactly this, no spaces)

Implementation details:
- Grammar rules (shrimp.grammar:194-201): Dict uses `NamedArg` nodes, Array uses `expression` nodes
- The parser distinguishes at parse time based on whether the first element contains `=`
- Both support multiline content, comments, and nesting
- Separators: spaces, newlines (`\n`), or semicolons (`;`) work interchangeably
- Test files: src/parser/tests/literals.test.ts and src/compiler/tests/literals.test.ts
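For example, a parser test for the dict case might look like the sketch below. The `Dict` and `NamedArg` node names follow the description above, but the exact nesting is an assumption; see src/parser/tests/literals.test.ts for the real expectations.

```ts
// Hedged sketch in the project's toMatchTree style.
test('brackets with key=value pairs parse as a dict', () => {
  expect('[a=1 b=2]').toMatchTree(`
    Dict
      NamedArg
        Identifier a
        Number 1
      NamedArg
        Identifier b
        Number 2
  `)
})
```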
**EOF handling:** The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.
## Compiler Architecture
**Function compilation strategy:** Functions are compiled inline where they're defined, with JUMP instructions to skip over their bodies during linear execution:

```
JUMP .after_.func_0        # Skip over body during definition
.func_0:                   # Function body label
  (function body code)
  RETURN
.after_.func_0:            # Resume here after jump
MAKE_FUNCTION (x) .func_0  # Create function object with label
```
This approach:
- Emits function bodies inline (no deferred collection)
- Uses JUMP to skip bodies during normal execution flow
- Keeps each function self-contained at its definition site
- Works seamlessly in REPL mode (important for `vm.appendBytecode()`)
- Allows ReefVM to jump to function bodies by label when called
**Short-circuit logic:** ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using:

```
// For `a and b`:
LOAD a
DUP                 // Duplicate so we can return it if falsy
JUMP_IF_FALSE skip  // If false, skip evaluating b
POP                 // Remove duplicate if we're continuing
LOAD b              // Evaluate right side
skip:
```

See compiler.ts:267-282 for the full implementation. The `or` operator uses JUMP_IF_TRUE instead.
**If/else compilation:** The compiler uses label-based jumps:
- `JUMP_IF_FALSE` skips the then-block when the condition is false
- Each branch ends with `JUMP endLabel` to skip the remaining branches
- The final label marks where all branches converge
- If there's no else branch, the compiler emits `PUSH null` as the default value
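Putting those steps together, the emit sequence can be sketched as below. This is a hypothetical illustration; `emit`, `newLabel`, and `compile` are minimal stand-ins rather than the real compiler API (see src/compiler/compiler.ts):

```ts
// Minimal stand-ins so the sketch is self-contained.
type Node = unknown
const code: string[] = []
let labels = 0
const newLabel = (prefix: string) => `.${prefix}_${labels++}`
const emit = (...parts: string[]) => code.push(parts.join(' '))
const compile = (node: Node) => emit(`(code for ${String(node)})`)

const compileIf = (cond: Node, thenBlock: Node, elseBlock?: Node) => {
  const elseLabel = newLabel('else')
  const endLabel = newLabel('end')
  compile(cond)
  emit('JUMP_IF_FALSE', elseLabel)  // skip the then-block when false
  compile(thenBlock)
  emit('JUMP', endLabel)            // skip the remaining branch
  emit(`${elseLabel}:`)
  if (elseBlock) compile(elseBlock)
  else emit('PUSH null')            // no else branch → null as the default
  emit(`${endLabel}:`)              // all branches converge here
}
```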
## Grammar Development

### Grammar Structure
The grammar follows this hierarchy:

```
Program → statement*
statement → line newlineOrSemicolon | line eof
line → FunctionCall | FunctionCallOrIdentifier | FunctionDef | Assign | expression
```

Key tokens:
- `newlineOrSemicolon`: `"\n" | ";"`
- `eof`: `@eof`
- `Identifier`: Lowercase/emoji start, assignable variables
- `Word`: Everything else (paths, URLs, etc.)
### Adding Grammar Rules

When modifying the grammar:
1. Update src/parser/shrimp.grammar with your changes
2. Run tests - the parser auto-regenerates during test runs
3. Add test cases in src/parser/parser.test.ts using `toMatchTree`
4. Test empty line handling - ensure EOF works properly
### Test Format

Grammar tests use this pattern:

```ts
test('function call with args', () => {
  expect('echo hello world').toMatchTree(`
    FunctionCall
      Identifier echo
      PositionalArg
        Word hello
      PositionalArg
        Word world
  `)
})
```

The `toMatchTree` helper compares parser output with the expected CST structure.
### Common Grammar Gotchas
**EOF infinite loops:** Using `@eof` in repeating patterns can match EOF multiple times. The current approach uses explicit statement/newline alternatives.

**Token precedence:** Use `@precedence` to resolve conflicts between similar tokens.

**External tokenizers:** Custom logic in tokenizer.ts handles complex cases like the identifier vs word distinction.

**Empty line parsing:** The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling.
## Lezer: Surprising Behaviors
These discoveries came from implementing string interpolation with external tokenizers. See tmp/string-test4.grammar for working examples.
### 1. Rule Capitalization Controls Tree Structure
The most surprising discovery: Rule names determine whether nodes appear in the parse tree.
Lowercase rules get inlined (no tree nodes):

```
statement { assign | expr }  // ❌ No "statement" node
assign { x "=" y }           // ❌ No "assign" node
expr { x | y }               // ❌ No "expr" node
```

Capitalized rules create tree nodes:

```
Statement { Assign | Expr }  // ✅ Creates Statement node
Assign { x "=" y }           // ✅ Creates Assign node
Expr { x | y }               // ✅ Creates Expr node
```
**Why this matters:** When debugging a grammar that "doesn't match," check capitalization first. The rules might be matching perfectly - they're just being compiled away!

**Example:** `x = 42` was parsing as `Program(Identifier, "=", Number)` instead of `Program(Statement(Assign(...)))`. The grammar rules existed and were matching, but they were inlined because they were lowercase.
### 2. `@skip {}` Wrapper is Essential for Preserving Whitespace
**Initial assumption (wrong):** You could exclude whitespace from token patterns to avoid needing `@skip {}`.

**Reality:** The `@skip {}` wrapper is absolutely required to preserve whitespace in strings:

```
@skip {} {
  String { "'" StringContent* "'" }
}

@tokens {
  StringFragment { !['\\$]+ }  // Matches everything including spaces
}
```

**Without the wrapper:** All spaces get stripped by the global `@skip { space }`, even though StringFragment can match them.

**Test that proved it wrong:** `' spaces '` was being parsed as `"spaces"` (leading/trailing spaces removed) until we added `@skip {}`.
### 3. External Tokenizers Work Inside `@skip {}` Blocks
**Initial assumption (wrong):** External tokenizers can't be used inside `@skip {}` blocks, so identifier patterns need to be duplicated as simple tokens.

**Reality:** External tokenizers work perfectly inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.

Working pattern:

```
@external tokens tokenizer from "./tokenizer" { Identifier, Word }

@skip {} {
  String { "'" StringContent* "'" }
}

Interpolation {
  "$" Identifier |  // ← Uses external tokenizer!
  "$" "(" expr ")"
}
```

**Test that proved it:** `'hello $name'` correctly calls the external tokenizer for `name` inside the string, creating an Identifier token. No duplication needed!
### 4. Single-Character Tokens Can Be Literals
**Initial approach:** Define every single character as a token:

```
@tokens {
  dollar[@name="$"] { "$" }
  backslash[@name="\\"] { "\\" }
}
```

**Simpler approach:** Just use literals in the grammar rules:

```
Interpolation {
  "$" Identifier |  // Literal "$"
  "$" "(" expr ")"
}

EscapeSeq {
  "\\" ("$" | "n" | ...)  // Literal "\\"
}
```

This works fine and reduces boilerplate in the `@tokens` section.
### 5. StringFragment as Simple Token, Not External
For string content, use a simple token pattern instead of handling it in the external tokenizer:

```
@tokens {
  StringFragment { !['\\$]+ }  // Simple pattern: not quote, backslash, or dollar
}
```
The external tokenizer should focus on Identifier/Word distinction at the top level. String content is simpler and doesn't need the complexity of the external tokenizer.
## Why `expressionWithoutIdentifier` Exists
The grammar has an unusual pattern: `expressionWithoutIdentifier`. It exists to solve a GLR conflict:

```
consumeToTerminator {
  ambiguousFunctionCall |  // → FunctionCallOrIdentifier → Identifier
  expression               // → Identifier
}
```

Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains that we "gave up trying to use GLR to fix it."

**The solution:** Remove Identifier from the expression path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This is pragmatism over theoretical purity.
## Testing Strategy

### Parser Tests (src/parser/parser.test.ts)
- Token types: Identifier vs Word distinction
- Function calls: With and without arguments
- Expressions: Binary operations, parentheses, precedence
- Functions: Single-line and multiline definitions
- Whitespace: Empty lines, mixed delimiters
- Edge cases: Ambiguous parsing, incomplete input
Test structure:

```ts
describe('feature area', () => {
  test('specific case', () => {
    expect(input).toMatchTree(expectedCST)
  })
})
```
When adding language features:
1. Write grammar tests first showing the expected CST structure
2. Update grammar rules to make the tests pass
3. Add integration tests showing real usage
4. Test edge cases and error conditions
## Bun Usage

Default to Bun over Node.js/npm:
- Use `bun <file>` instead of `node <file>` or `ts-node <file>`
- Use `bun test` instead of `jest` or `vitest`
- Use `bun install` instead of `npm install`
- Use `bun run <script>` instead of `npm run <script>`
- Bun automatically loads .env, so don't use dotenv
### Bun APIs

- Prefer `Bun.file` over node:fs's readFile/writeFile
- Use `Bun.$` for shell commands instead of execa
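For reference, those APIs look like this (all three are part of Bun's built-in API; the file paths are just examples):

```ts
import { $ } from 'bun'

// Read a file with Bun.file instead of node:fs readFile
const grammar = await Bun.file('src/parser/shrimp.grammar').text()

// Write a file with Bun.write instead of node:fs writeFile
await Bun.write('tmp/grammar-copy.grammar', grammar)

// Run a shell command with Bun.$ instead of execa
await $`bun test src/parser/parser.test.ts`
```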
## Common Patterns

### Grammar Debugging
When the grammar isn't parsing correctly:
1. Check token precedence - ensure tokens are recognized correctly
2. Test simpler cases first - build up from basic to complex
3. Use `toMatchTree` output - see what the parser actually produces
4. Check the external tokenizer - the identifier vs word logic in tokenizer.ts
## Common Misconceptions
"The parser handles unbound symbols as strings" → False. The VM does this via TRY_LOAD opcode. The parser creates FunctionCallOrIdentifier nodes; the compiler emits TRY_LOAD/TRY_CALL; the VM resolves at runtime.
"Words are just paths" → False. Words are anything that isn't an identifier. Paths, URLs, @mentions, #hashtags all parse as Words. The tokenizer accepts any non-whitespace that doesn't match identifier rules.
"Functions are first-class values" → True, but they're compiled to labels, not inline bytecode. The VM creates closures with label references, not embedded instructions.
"The grammar is simple" → False. It has pragmatic workarounds for GLR conflicts (expressionWithoutIdentifier), complex EOF handling, and relies heavily on the external tokenizer for correctness.
"Short-circuit logic is a VM feature" → False. It's a compiler pattern using DUP, JUMP_IF_FALSE/TRUE, and POP. The VM has no AND/OR opcodes.