This commit is contained in:
Corey Johnson 2025-10-12 16:33:53 -07:00
parent fe19191246
commit a53db50b1a
3 changed files with 202 additions and 71 deletions

CLAUDE.md

@ -35,6 +35,31 @@ Shrimp is a shell-like scripting language that combines command-line simplicity
Key references: [Lezer System Guide](https://lezer.codemirror.net/docs/guide/) | [Lezer API](https://lezer.codemirror.net/docs/ref/)
## Reading the Codebase: What to Look For
When exploring Shrimp, focus on these key files in order:
1. **src/parser/shrimp.grammar** - Language syntax rules
- Note the `expressionWithoutIdentifier` pattern and its comment
- See how `consumeToTerminator` handles statement-level parsing
2. **src/parser/tokenizer.ts** - How Identifier vs Word is determined
- Check the emoji Unicode ranges and surrogate pair handling
- See context-aware termination logic (`;`, `)`, `:`)
3. **src/compiler/compiler.ts** - CST to bytecode transformation
- See how functions become labels in `fnLabels` map
- Check short-circuit logic for `and`/`or` (lines 267-282)
- Notice `TRY_CALL` emission for bare identifiers (line 152)
4. **packages/ReefVM/src/vm.ts** - Bytecode execution
- See `TRY_CALL` fall-through to `CALL` (lines 357-375)
- Check `TRY_LOAD` string coercion (lines 135-145)
- Notice NOSE-style named parameter binding (lines 425-443)
## Development Commands
### Running Files
@ -141,14 +166,69 @@ function parseExpression(input: string) {
**Whitespace-sensitive parsing**: Spaces distinguish operators from identifiers (`x-1` vs `x - 1`). This enables natural shell-like syntax.
**Identifier vs Word tokenization**: Custom tokenizer determines if a token is an assignable identifier (lowercase/emoji start) or a non-assignable word (paths, URLs). This allows `./file.txt` without quotes.
**Identifier vs Word tokenization**: The custom tokenizer (tokenizer.ts) is sophisticated:
**Ambiguous identifier resolution**: Bare identifiers like `myVar` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime.
- **Surrogate pair handling**: Processes emoji as full Unicode code points (lines 51-65)
- **Context-aware termination**: Stops at `;`, `)`, `:` only when followed by whitespace (lines 19-24)
- This allows `basename ./cool;` to parse correctly
- But `basename ./cool; 2` treats the semicolon as a terminator
- **GLR state checking**: Uses `stack.canShift(Word)` to decide whether to track identifier validity
- **Permissive Words**: Anything that's not an identifier is a Word (paths, URLs, @mentions, #hashtags)
**Why this matters**: This complexity is what enables shell-like syntax. Without it, you'd need quotes around `./file.txt` or special handling for paths.
**Identifier rules**: Must start with lowercase letter or emoji, can contain lowercase, digits, dashes, and emoji.
**Word rules**: Everything else that isn't whitespace or a delimiter.
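The two rules above can be sketched as a whole-token classifier. This is only an illustration under stated assumptions: `isIdentifier` and `classify` are hypothetical names, the `Extended_Pictographic` Unicode property stands in for the emoji ranges the real tokenizer checks, and the real tokenizer works incrementally on the input stream rather than on complete tokens.

```typescript
// Hypothetical helpers, not the actual tokenizer code.
// Assumption: \p{Extended_Pictographic} approximates "emoji".
const EMOJI = /\p{Extended_Pictographic}/u

function isIdentifier(token: string): boolean {
  // Array.from iterates by code point, so surrogate pairs stay intact.
  const chars = Array.from(token)
  const first = chars[0]
  if (first === undefined) return false

  const isLower = (ch: string) => ch >= 'a' && ch <= 'z'
  const isDigit = (ch: string) => ch >= '0' && ch <= '9'
  const isEmoji = (ch: string) => EMOJI.test(ch)

  // Must start with a lowercase letter or an emoji.
  if (!isLower(first) && !isEmoji(first)) return false

  // May contain lowercase letters, digits, dashes, and emoji.
  return chars.every((ch) => isLower(ch) || isDigit(ch) || ch === '-' || isEmoji(ch))
}

// Anything that is not an identifier is treated as a Word.
function classify(token: string): 'Identifier' | 'Word' {
  return isIdentifier(token) ? 'Identifier' : 'Word'
}
```

Under these assumptions `classify('my-var')` yields `'Identifier'` while `classify('./file.txt')` yields `'Word'`. Emoji built from several code points (variation selectors, ZWJ sequences) are not handled by this simplified check.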
**Ambiguous identifier resolution**: Bare identifiers like `my-var` could be function calls or variable references. The parser creates `FunctionCallOrIdentifier` nodes, resolved at runtime using the `TRY_CALL` opcode.
**How it works**:
- The compiler emits `TRY_CALL varname` for bare identifiers (src/compiler/compiler.ts:152)
- ReefVM checks if the variable is a function at runtime (vm.ts:357-373)
- If it's a function, TRY_CALL intentionally falls through to CALL opcode (no break statement)
- If it's bound to a non-function value, it pushes that value; if it's undefined, it pushes the name as a string. In both cases it returns without calling anything
- This runtime resolution enables shell-like "echo hello" without quotes
**Unbound symbols become strings**: When `TRY_LOAD` encounters an undefined variable, it pushes the variable name as a string (vm.ts:135-145). This is a first-class language feature implemented as a VM opcode, not a parser trick.
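The runtime resolution described above might look roughly like the sketch below. The `Value` and `Resolution` types, the `Map`-based scope, and the function names are all hypothetical; ReefVM's real data structures are not shown in this document and may differ.

```typescript
// Hypothetical value model; ReefVM's real types may differ.
type Value =
  | { type: 'string'; value: string }
  | { type: 'number'; value: number }
  | { type: 'function'; label: string }

type Resolution =
  | { action: 'call'; fn: Value }
  | { action: 'push'; value: Value }

// TRY_LOAD: push the bound value, or the name itself as a string when unbound.
function tryLoad(scope: Map<string, Value>, name: string): Value {
  return scope.get(name) ?? { type: 'string', value: name }
}

// TRY_CALL: call when the name is bound to a function, otherwise act like TRY_LOAD.
function tryCall(scope: Map<string, Value>, name: string): Resolution {
  const bound = scope.get(name)
  if (bound?.type === 'function') return { action: 'call', fn: bound }
  return { action: 'push', value: tryLoad(scope, name) }
}

// Example scope: one function, one plain value, and `hello` left unbound.
const scope = new Map<string, Value>([
  ['greet', { type: 'function', label: '.func_0' }],
  ['count', { type: 'number', value: 3 }],
])
```

With this scope, `tryCall(scope, 'greet')` resolves to a call, `tryCall(scope, 'count')` pushes the number, and `tryCall(scope, 'hello')` pushes the string `hello`.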
**Expression-oriented design**: Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.
**EOF handling**: The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.
## Compiler Architecture
**Function compilation strategy**: The compiler doesn't create inline function objects. Instead it:
1. Generates unique labels (`.func_0`, `.func_1`) for each function body (compiler.ts:137)
2. Stores function body instructions in `fnLabels` map during compilation
3. Appends all function bodies to the end of bytecode with RETURN instructions (compiler.ts:36-41)
4. Emits `MAKE_FUNCTION` with parameters and label reference
This approach keeps the main program linear and allows ReefVM to jump to function bodies by label.
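As a rough illustration of the four steps, the sketch below collects function bodies under generated labels and appends them after the main program. `FunctionTable`, `define`, and `link` are hypothetical names, and the local `ProgramItem` type is a stand-in for the one exported by reefvm, whose exact shape is an assumption here.

```typescript
// Local stand-ins: the real ProgramItem type comes from reefvm and may differ.
type Label = `.${string}`
type ProgramItem = [string, ...unknown[]]

class FunctionTable {
  private fnLabels = new Map<Label, ProgramItem[]>()

  // Steps 1, 2 and 4: allocate a label, store the body, return MAKE_FUNCTION.
  define(params: string[], body: ProgramItem[]): ProgramItem {
    const label: Label = `.func_${this.fnLabels.size}`
    this.fnLabels.set(label, body)
    return ['MAKE_FUNCTION', params, label]
  }

  // Step 3: append every body after the main program, each closed by RETURN.
  link(main: ProgramItem[]): ProgramItem[] {
    const program: ProgramItem[] = [...main, ['HALT']]
    for (const [label, body] of this.fnLabels) {
      program.push([`${label}:`], ...body, ['RETURN'])
    }
    return program
  }
}

// Roughly what `inc = fn x: x + 1 end` could compile to under these assumptions.
const table = new FunctionTable()
const main: ProgramItem[] = [
  table.define(['x'], [['TRY_LOAD', 'x'], ['PUSH', 1], ['ADD']]),
  ['STORE', 'inc'],
]
const program = table.link(main)
```

The main program stays a flat list ending in `HALT`, and the body of `.func_0` only appears after it, which is why the function body never runs unless something jumps to its label.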
**Short-circuit logic**: ReefVM has no AND/OR opcodes. The compiler implements short-circuit evaluation using:
```
// For `a and b`:
LOAD a
DUP // Duplicate so we can return it if falsy
JUMP_IF_FALSE skip // If false, skip evaluating b
POP // Remove duplicate if we're continuing
LOAD b // Evaluate right side
skip:
```
See compiler.ts:267-282 for the full implementation. The `or` operator uses `JUMP_IF_TRUE` instead.
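To check the pattern end to end, here is a small sketch that emits the sequence with relative jump offsets and runs it on a toy stack machine. Everything here is an assumption made for illustration: `compileLogical` and `run` are hypothetical, the jump opcodes are assumed to pop the value they test, the offset is assumed to count the instructions to skip, and truthiness follows JavaScript rules rather than ReefVM's.

```typescript
type Instr = [op: string, arg?: unknown]

// Emit the short-circuit sequence for `left and right` or `left or right`.
function compileLogical(op: 'and' | 'or', left: Instr[], right: Instr[]): Instr[] {
  const jump = op === 'and' ? 'JUMP_IF_FALSE' : 'JUMP_IF_TRUE'
  return [
    ...left,
    ['DUP'], // keep a copy to return if we short-circuit
    [jump, right.length + 1], // skip POP plus the right-hand side
    ['POP'], // discard the left value before evaluating the right
    ...right,
  ]
}

// Toy interpreter covering only the opcodes used above.
function run(programItems: Instr[]): unknown[] {
  const stack: unknown[] = []
  for (let pc = 0; pc < programItems.length; pc++) {
    const instr = programItems[pc]
    if (!instr) break
    const [op, arg] = instr
    switch (op) {
      case 'PUSH':
        stack.push(arg)
        break
      case 'DUP':
        stack.push(stack[stack.length - 1])
        break
      case 'POP':
        stack.pop()
        break
      case 'JUMP_IF_FALSE':
        if (!stack.pop()) pc += arg as number
        break
      case 'JUMP_IF_TRUE':
        if (stack.pop()) pc += arg as number
        break
      default:
        throw new Error(`unknown opcode ${op}`)
    }
  }
  return stack
}
```

Under these assumptions, running `compileLogical('and', [['PUSH', false]], [['PUSH', 2]])` leaves `false` on the stack without evaluating the right side, while the same call with `'or'` leaves `2`.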
**If/else compilation**: The compiler uses label-based jumps:
- `JUMP_IF_FALSE` skips the then-block when condition is false
- Each branch ends with `JUMP endLabel` to skip remaining branches
- The final label marks where all branches converge
- If there's no else branch, compiler emits `PUSH null` as the default value
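The bullets above can be sketched as a single emit function. This is an illustrative reduction, not the compiler's code: `compileIf`, `Branch`, and `Item` are hypothetical names, conditions and bodies are passed in already compiled, and the numeric `JUMP_IF_FALSE` argument is assumed to be a relative offset.

```typescript
type Item = [string, ...unknown[]]

interface Branch {
  condition: Item[]
  body: Item[]
}

let ifLabelCount = 0

// `branches` holds the `if` branch followed by any `else if` branches.
function compileIf(branches: Branch[], elseBody?: Item[]): Item[] {
  const endLabel = `.end_${++ifLabelCount}`
  const fallback: Item[] = [['PUSH', null]] // default value when there is no else
  const out: Item[] = []

  for (const { condition, body } of branches) {
    out.push(...condition)
    out.push(['JUMP_IF_FALSE', body.length + 1]) // skip the body and its JUMP
    out.push(...body)
    out.push(['JUMP', endLabel]) // skip the remaining branches
  }

  out.push(...(elseBody ?? fallback))
  out.push([`${endLabel}:`]) // all branches converge here
  return out
}

// An `if` with a single branch and no else.
const compiled = compileIf([{ condition: [['TRY_LOAD', 'x']], body: [['PUSH', 1]] }])
```

Because every path either falls into the else block or jumps to the end label, exactly one value is left on the stack, which fits the expression-oriented design described earlier.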
## Grammar Development
### Grammar Structure
@ -206,6 +286,21 @@ The `toMatchTree` helper compares parser output with expected CST structure.
**Empty line parsing**: The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling.
### Why expressionWithoutIdentifier Exists
The grammar has an unusual pattern: `expressionWithoutIdentifier`. This exists to solve a GLR conflict:
```
consumeToTerminator {
ambiguousFunctionCall | // → FunctionCallOrIdentifier → Identifier
expression // → Identifier
}
```
Without `expressionWithoutIdentifier`, parsing `my-var` at statement level creates two paths that both want the Identifier token. The grammar comment (shrimp.grammar lines 157-164) explains we "gave up trying to use GLR to fix it."
**The solution**: Remove Identifier from the `expression` path by creating `expressionWithoutIdentifier`, forcing standalone identifiers through `ambiguousFunctionCall`. This favors a pragmatic fix over theoretical purity.
## Testing Strategy
### Parser Tests (`src/parser/parser.test.ts`)
@ -259,3 +354,15 @@ When grammar isn't parsing correctly:
2. **Test simpler cases first** - build up from basic to complex
3. **Use `toMatchTree` output** - see what the parser actually produces
4. **Check external tokenizer** - identifier vs word logic in `tokenizer.ts`
## Common Misconceptions
**"The parser handles unbound symbols as strings"** → False. The _VM_ does this via `TRY_LOAD` opcode. The parser creates `FunctionCallOrIdentifier` nodes; the compiler emits `TRY_LOAD`/`TRY_CALL`; the VM resolves at runtime.
**"Words are just paths"** → False. Words are _anything_ that isn't an identifier. Paths, URLs, `@mentions`, `#hashtags` all parse as Words. The tokenizer accepts any non-whitespace that doesn't match identifier rules.
**"Functions are first-class values"** → True, but they're compiled to labels, not inline bytecode. The VM creates closures with label references, not embedded instructions.
**"The grammar is simple"** → False. It has pragmatic workarounds for GLR conflicts (`expressionWithoutIdentifier`), complex EOF handling, and relies heavily on the external tokenizer for correctness.
**"Short-circuit logic is a VM feature"** → False. It's a compiler pattern using `DUP`, `JUMP_IF_FALSE/TRUE`, and `POP`. The VM has no AND/OR opcodes.


@ -3,7 +3,7 @@ import { parser } from '#parser/shrimp.ts'
import * as terms from '#parser/shrimp.terms'
import type { SyntaxNode, Tree } from '@lezer/common'
import { assert, errorMessage } from '#utils/utils'
import { toBytecode, type Bytecode } from 'reefvm'
import { toBytecode, type Bytecode, type ProgramItem } from 'reefvm'
import {
checkTreeForErrors,
getAllChildren,
@ -15,9 +15,10 @@ import {
getNamedArgParts,
} from '#compiler/utils'
type Label = `.${string}`
export class Compiler {
instructions: string[] = []
fnLabels = new Map<string, string[]>()
instructions: ProgramItem[] = []
fnLabels = new Map<Label, ProgramItem[]>()
ifLabelCount = 0
bytecode: Bytecode
@ -34,13 +35,14 @@ export class Compiler {
// Add the labels
for (const [label, labelInstructions] of this.fnLabels) {
this.instructions.push(`${label}:`)
this.instructions.push(...labelInstructions.map((instr) => ` ${instr}`))
this.instructions.push(' RETURN')
this.instructions.push([`${label}:`])
this.instructions.push(...labelInstructions)
this.instructions.push(['RETURN'])
}
// console.log(`\n🤖 instructions:\n----------------\n${this.instructions.join('\n')}\n\n`)
this.bytecode = toBytecode(this.instructions.join('\n'))
// logInstructions(this.instructions)
this.bytecode = toBytecode(this.instructions)
} catch (error) {
if (error instanceof CompilerError) {
throw new Error(error.toReadableString(input))
@ -60,46 +62,50 @@ export class Compiler {
child = child.nextSibling
}
this.instructions.push('HALT')
this.instructions.push(['HALT'])
}
#compileNode(node: SyntaxNode, input: string): string[] {
#compileNode(node: SyntaxNode, input: string): ProgramItem[] {
const value = input.slice(node.from, node.to)
switch (node.type.id) {
case terms.Number:
return [`PUSH ${value}`]
const number = Number(value)
if (Number.isNaN(number))
throw new CompilerError(`Invalid number literal: ${value}`, node.from, node.to)
return [[`PUSH`, number]]
case terms.String:
const strValue = value.slice(1, -1).replace(/\\/g, '')
return [`PUSH "${strValue}"`]
return [[`PUSH`, strValue]]
case terms.Boolean: {
return [`PUSH ${value}`]
return [[`PUSH`, value === 'true']]
}
case terms.Identifier: {
return [`TRY_LOAD ${value}`]
return [[`TRY_LOAD`, value]]
}
case terms.BinOp: {
const { left, op, right } = getBinaryParts(node)
const instructions: string[] = []
const instructions: ProgramItem[] = []
instructions.push(...this.#compileNode(left, input))
instructions.push(...this.#compileNode(right, input))
const opValue = input.slice(op.from, op.to)
switch (opValue) {
case '+':
instructions.push('ADD')
instructions.push(['ADD'])
break
case '-':
instructions.push('SUB')
instructions.push(['SUB'])
break
case '*':
instructions.push('MUL')
instructions.push(['MUL'])
break
case '/':
instructions.push('DIV')
instructions.push(['DIV'])
break
default:
throw new CompilerError(`Unsupported binary operator: ${opValue}`, op.from, op.to)
@ -110,10 +116,10 @@ export class Compiler {
case terms.Assign: {
const { identifier, right } = getAssignmentParts(node)
const instructions: string[] = []
const instructions: ProgramItem[] = []
instructions.push(...this.#compileNode(right, input))
const identifierName = input.slice(identifier.from, identifier.to)
instructions.push(`STORE ${identifierName}`)
instructions.push(['STORE', identifierName])
return instructions
}
@ -127,23 +133,23 @@ export class Compiler {
case terms.FunctionDef: {
const { paramNames, bodyNode } = getFunctionDefParts(node, input)
const instructions: string[] = []
const functionName = `.func_${this.fnLabels.size}`
const bodyInstructions: string[] = []
if (this.fnLabels.has(functionName)) {
throw new CompilerError(`Function name collision: ${functionName}`, node.from, node.to)
const instructions: ProgramItem[] = []
const functionLabel: Label = `.func_${this.fnLabels.size}`
const bodyInstructions: ProgramItem[] = []
if (this.fnLabels.has(functionLabel)) {
throw new CompilerError(`Function name collision: ${functionLabel}`, node.from, node.to)
}
this.fnLabels.set(functionName, bodyInstructions)
this.fnLabels.set(functionLabel, bodyInstructions)
instructions.push(`MAKE_FUNCTION (${paramNames}) ${functionName}`)
instructions.push(['MAKE_FUNCTION', paramNames, functionLabel])
bodyInstructions.push(...this.#compileNode(bodyNode, input))
return instructions
}
case terms.FunctionCallOrIdentifier: {
return [`TRY_CALL ${value}`]
return [['TRY_CALL', value]]
}
/*
@ -161,7 +167,7 @@ export class Compiler {
*/
case terms.FunctionCall: {
const { identifierNode, namedArgs, positionalArgs } = getFunctionCallParts(node, input)
const instructions: string[] = []
const instructions: ProgramItem[] = []
instructions.push(...this.#compileNode(identifierNode, input))
positionalArgs.forEach((arg) => {
@ -170,13 +176,13 @@ export class Compiler {
namedArgs.forEach((arg) => {
const { name, valueNode } = getNamedArgParts(arg, input)
instructions.push(`PUSH "${name}"`)
instructions.push(['PUSH', name])
instructions.push(...this.#compileNode(valueNode, input))
})
instructions.push(`PUSH ${positionalArgs.length}`)
instructions.push(`PUSH ${namedArgs.length}`)
instructions.push(`CALL`)
instructions.push(['PUSH', positionalArgs.length])
instructions.push(['PUSH', namedArgs.length])
instructions.push(['CALL'])
return instructions
}
@ -193,86 +199,84 @@ export class Compiler {
node,
input
)
const instructions: string[] = []
const instructions: ProgramItem[] = []
instructions.push(...this.#compileNode(conditionNode, input))
this.ifLabelCount++
const elseLabel = `.else_${this.ifLabelCount}`
const endLabel = `.end_${this.ifLabelCount}`
const endLabel: Label = `.end_${this.ifLabelCount}`
const thenBlockInstructions = this.#compileNode(thenBlock, input)
instructions.push(`JUMP_IF_FALSE #${thenBlockInstructions.length + 1}`)
instructions.push(['JUMP_IF_FALSE', thenBlockInstructions.length + 1])
instructions.push(...thenBlockInstructions)
instructions.push(`JUMP ${endLabel}`)
instructions.push(['JUMP', endLabel])
// Else if
elseIfBlocks.forEach(({ conditional, thenBlock }, index) => {
elseIfBlocks.forEach(({ conditional, thenBlock }) => {
instructions.push(...this.#compileNode(conditional, input))
const elseIfInstructions = this.#compileNode(thenBlock, input)
instructions.push(`JUMP_IF_FALSE #${elseIfInstructions.length + 1}`)
instructions.push(['JUMP_IF_FALSE', elseIfInstructions.length + 1])
instructions.push(...elseIfInstructions)
instructions.push(`JUMP ${endLabel}`)
instructions.push(['JUMP', endLabel])
})
// Else
instructions.push(`${elseLabel}:`)
if (elseThenBlock) {
const elseThenInstructions = this.#compileNode(elseThenBlock, input).map((i) => ` ${i}`)
const elseThenInstructions = this.#compileNode(elseThenBlock, input)
instructions.push(...elseThenInstructions)
} else {
instructions.push(` PUSH null`)
instructions.push(['PUSH', null])
}
instructions.push(`${endLabel}:`)
instructions.push([`${endLabel}:`])
return instructions
}
// - `EQ`, `NEQ`, `LT`, `GT`, `LTE`, `GTE` - Pop 2, push boolean
case terms.ConditionalOp: {
const instructions: string[] = []
const instructions: ProgramItem[] = []
const { left, op, right } = getBinaryParts(node)
const leftInstructions: string[] = this.#compileNode(left, input)
const rightInstructions: string[] = this.#compileNode(right, input)
const leftInstructions: ProgramItem[] = this.#compileNode(left, input)
const rightInstructions: ProgramItem[] = this.#compileNode(right, input)
const opValue = input.slice(op.from, op.to)
switch (opValue) {
case '=':
instructions.push(...leftInstructions, ...rightInstructions, 'EQ')
instructions.push(...leftInstructions, ...rightInstructions, ['EQ'])
break
case '!=':
instructions.push(...leftInstructions, ...rightInstructions, 'NEQ')
instructions.push(...leftInstructions, ...rightInstructions, ['NEQ'])
break
case '<':
instructions.push(...leftInstructions, ...rightInstructions, 'LT')
instructions.push(...leftInstructions, ...rightInstructions, ['LT'])
break
case '>':
instructions.push(...leftInstructions, ...rightInstructions, 'GT')
instructions.push(...leftInstructions, ...rightInstructions, ['GT'])
break
case '<=':
instructions.push(...leftInstructions, ...rightInstructions, 'LTE')
instructions.push(...leftInstructions, ...rightInstructions, ['LTE'])
break
case '>=':
instructions.push(...leftInstructions, ...rightInstructions, 'GTE')
instructions.push(...leftInstructions, ...rightInstructions, ['GTE'])
break
case 'and':
instructions.push(...leftInstructions)
instructions.push('DUP')
instructions.push(`JUMP_IF_FALSE #${rightInstructions.length + 1}`)
instructions.push('POP')
instructions.push(['DUP'])
instructions.push(['JUMP_IF_FALSE', rightInstructions.length + 1])
instructions.push(['POP'])
instructions.push(...rightInstructions)
break
case 'or':
instructions.push(...leftInstructions)
instructions.push('PUSH 9')
instructions.push(`JUMP_IF_TRUE #${rightInstructions.length + 1}`)
instructions.push('POP')
instructions.push(['DUP'])
instructions.push(['JUMP_IF_TRUE', rightInstructions.length + 1])
instructions.push(['POP'])
instructions.push(...rightInstructions)
break
@ -289,3 +293,19 @@ export class Compiler {
}
}
}
const logInstructions = (instructions: ProgramItem[]) => {
const instructionsString = instructions
.map((parts) => {
const isPush = parts[0] === 'PUSH'
return parts
.map((part, i) => {
// String(part) also handles null operands such as ['PUSH', null], where part!.toString() would throw
const partAsString = typeof part === 'string' && isPush ? `'${part}'` : String(part)
return i > 0 ? partAsString : part
})
.join(' ')
})
.join('\n')
console.log(`\n🤖 instructions:\n----------------\n${instructionsString}\n\n`)
}


@ -245,16 +245,17 @@ describe('BinOp', () => {
describe('Fn', () => {
test('parses function no parameters', () => {
expect('fn: 1').toMatchTree(`
expect('fn: 1 end').toMatchTree(`
FunctionDef
keyword fn
Params
colon :
Number 1`)
Number 1
end end`)
})
test('parses function with single parameter', () => {
expect('fn x: x + 1').toMatchTree(`
expect('fn x: x + 1 end').toMatchTree(`
FunctionDef
keyword fn
Params
@ -263,11 +264,12 @@ describe('Fn', () => {
BinOp
Identifier x
operator +
Number 1`)
Number 1
end end`)
})
test('parses function with multiple parameters', () => {
expect('fn x y: x * y').toMatchTree(`
expect('fn x y: x * y end').toMatchTree(`
FunctionDef
keyword fn
Params
@ -277,7 +279,8 @@ describe('Fn', () => {
BinOp
Identifier x
operator *
Identifier y`)
Identifier y
end end`)
})
test('parses multiline function with multiple statements', () => {
@ -381,7 +384,7 @@ describe('Assign', () => {
})
test('parses assignment with functions', () => {
expect('add = fn a b: a + b').toMatchTree(`
expect('add = fn a b: a + b end').toMatchTree(`
Assign
Identifier add
operator =
@ -394,7 +397,8 @@ describe('Assign', () => {
BinOp
Identifier a
operator +
Identifier b`)
Identifier b
end end`)
})
})