diff --git a/.gitignore b/.gitignore index 2486e5a..9cb7f68 100644 --- a/.gitignore +++ b/.gitignore @@ -33,4 +33,5 @@ report.[0-9]_.[0-9]_.[0-9]_.[0-9]_.json # Finder (MacOS) folder config .DS_Store -/tmp \ No newline at end of file +/tmp +/docs \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md index 00532ad..97cde30 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -286,6 +286,110 @@ The `toMatchTree` helper compares parser output with expected CST structure. **Empty line parsing**: The grammar structure `(statement | newlineOrSemicolon)+ eof?` allows proper empty line and EOF handling. +## Lezer: Surprising Behaviors + +These discoveries came from implementing string interpolation with external tokenizers. See `tmp/string-test4.grammar` for working examples. + +### 1. Rule Capitalization Controls Tree Structure + +**The most surprising discovery**: Rule names determine whether nodes appear in the parse tree. + +**Lowercase rules get inlined** (no tree nodes): +```lezer +statement { assign | expr } // ❌ No "statement" node +assign { x "=" y } // ❌ No "assign" node +expr { x | y } // ❌ No "expr" node +``` + +**Capitalized rules create tree nodes**: +```lezer +Statement { Assign | Expr } // ✅ Creates Statement node +Assign { x "=" y } // ✅ Creates Assign node +Expr { x | y } // ✅ Creates Expr node +``` + +**Why this matters**: When debugging grammar that "doesn't match," check capitalization first. The rules might be matching perfectly—they're just being compiled away! + +Example: `x = 42` was parsing as `Program(Identifier,"=",Number)` instead of `Program(Statement(Assign(...)))`. The grammar rules existed and were matching, but they were inlined because they were lowercase. + +### 2. @skip {} Wrapper is Essential for Preserving Whitespace + +**Initial assumption (wrong)**: Could exclude whitespace from token patterns to avoid needing `@skip {}`. 
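+
+A sketch of that failed idea (hypothetical `stringChunk` token, not the real grammar): exclude spaces from the content token and assume the global skip rule would then be harmless.
+
+```lezer
+@tokens {
+  // Hypothetical: token matches no spaces, on the (wrong) assumption
+  // that @skip { space } could then be left enabled inside strings.
+  stringChunk { ![ '\\$]+ }
+}
+```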
+
+**Reality**: The `@skip {}` wrapper is required to preserve whitespace in strings:
+
+```lezer
+@skip {} {
+  String { "'" stringContent* "'" }
+}
+
+@tokens {
+  StringFragment { !['\\$]+ }   // Matches everything, including spaces
+}
+```
+
+**Without the wrapper**: all spaces get stripped by the global `@skip { space }`, even though `StringFragment` can match them.
+
+**Test that disproved the assumption**: `' spaces '` was parsed as `"spaces"` (leading/trailing spaces removed) until we added `@skip {}`.
+
+### 3. External Tokenizers Work Inside @skip {} Blocks
+
+**Initial assumption (wrong)**: External tokenizers can't be used inside `@skip {}` blocks, so identifier patterns need to be duplicated as simple tokens.
+
+**Reality**: External tokenizers work fine inside `@skip {}` blocks. The tokenizer is still called while skipping is disabled.
+
+**Working pattern**:
+```lezer
+@external tokens tokenizer from "./tokenizer" { Identifier, Word }
+
+@skip {} {
+  String { "'" stringContent* "'" }
+}
+
+Interpolation {
+  "$" Identifier |        // ← Uses external tokenizer!
+  "$" "(" expr ")"
+}
+```
+
+**Test that proved it**: `'hello $name'` correctly calls the external tokenizer for `name` inside the string, producing an `Identifier` token. No duplication needed.
+
+### 4. Single-Character Tokens Can Be Literals
+
+**Initial approach**: define every single character as a token:
+```lezer
+@tokens {
+  dollar[@name="$"] { "$" }
+  backslash[@name="\\"] { "\\" }
+}
+```
+
+**Simpler approach**: just use literals in the grammar rules:
+```lezer
+Interpolation {
+  "$" Identifier |          // Literal "$"
+  "$" "(" expr ")"
+}
+
+StringEscape {
+  "\\" ("$" | "n" | ...)    // Literal "\\"
+}
+```
+
+This works fine and reduces boilerplate in the `@tokens` section.
+
+### 5. 
StringFragment as Simple Token, Not External + +For string content, use a simple token pattern instead of handling it in the external tokenizer: + +```lezer +@tokens { + StringFragment { !['\\$]+ } // Simple pattern: not quote, backslash, or dollar +} +``` + +The external tokenizer should focus on Identifier/Word distinction at the top level. String content is simpler and doesn't need the complexity of the external tokenizer. + ### Why expressionWithoutIdentifier Exists The grammar has an unusual pattern: `expressionWithoutIdentifier`. This exists to solve a GLR conflict: diff --git a/src/compiler/compiler.test.ts b/src/compiler/compiler.test.ts index 3bd44b8..17cde94 100644 --- a/src/compiler/compiler.test.ts +++ b/src/compiler/compiler.test.ts @@ -76,9 +76,9 @@ describe('compiler', () => { test('function call with named and positional args', () => { expect(`minus = fn a b: a - b end; minus b=2 9`).toEvaluateTo(7) - expect(`minus = fn c d: a - b end; minus 90 b=20`).toEvaluateTo(70) - expect(`minus = fn e f: a - b end; minus a=900 200`).toEvaluateTo(700) - expect(`minus = fn g h: a - b end; minus 2000 a=9000`).toEvaluateTo(7000) + expect(`minus = fn a b: a - b end; minus 90 b=20`).toEvaluateTo(70) + expect(`minus = fn a b: a - b end; minus a=900 200`).toEvaluateTo(700) + expect(`minus = fn a b: a - b end; minus 2000 a=9000`).toEvaluateTo(7000) }) test('function call with no args', () => { diff --git a/src/compiler/compiler.ts b/src/compiler/compiler.ts index 7cdd77d..b2009ac 100644 --- a/src/compiler/compiler.ts +++ b/src/compiler/compiler.ts @@ -16,8 +16,8 @@ import { getPipeExprParts, } from '#compiler/utils' -const DEBUG = false -// const DEBUG = true +// const DEBUG = false +const DEBUG = true type Label = `.${string}` export class Compiler { diff --git a/src/parser/parser.test.ts b/src/parser/parser.test.ts index 60d97ea..7f4472f 100644 --- a/src/parser/parser.test.ts +++ b/src/parser/parser.test.ts @@ -98,7 +98,8 @@ describe('Parentheses', () => { 
expect("('hello')").toMatchTree(` ParenExpr - String hello`) + String + StringFragment hello`) expect('(true)').toMatchTree(` ParenExpr @@ -413,7 +414,8 @@ describe('if/elsif/else', () => { Number 1 colon : ThenBlock - String cool + String + StringFragment cool `) expect('a = if x: 2').toMatchTree(` @@ -624,8 +626,10 @@ describe('pipe expressions', () => { describe('multiline', () => { test('parses multiline strings', () => { expect(`'first'\n'second'`).toMatchTree(` - String first - String second`) + String + StringFragment first + String + StringFragment second`) }) test('parses multiline functions', () => { @@ -689,3 +693,26 @@ end `) }) }) + +describe('string interpolation', () => { + test('string with variable interpolation', () => { + expect("'hello $name'").toMatchTree(` + String + StringFragment ${'hello '} + Interpolation + Identifier name + `) + }) + + test('string with expression interpolation', () => { + expect("'sum is $(a + b)'").toMatchTree(` + String + StringFragment ${'sum is '} + Interpolation + BinOp + Identifier a + operator + + Identifier b + `) + }) +}) diff --git a/src/parser/shrimp.grammar b/src/parser/shrimp.grammar index 4c92174..73f2603 100644 --- a/src/parser/shrimp.grammar +++ b/src/parser/shrimp.grammar @@ -6,11 +6,11 @@ @tokens { @precedence { Number "-" } - + + StringFragment { !['\\$]+ } NamedArgPrefix { $[a-z]+ "=" } Number { "-"? $[0-9]+ ('.' $[0-9]+)? 
} Boolean { "true" | "false" } - String { '\'' ![']* '\'' } newlineOrSemicolon { "\n" | ";" } eof { @eof } space { " " | "\t" } @@ -36,6 +36,7 @@ "*"[@name=operator] "/"[@name=operator] "|"[@name=operator] + } @external tokens tokenizer from "./tokenizer" { Identifier, Word } @@ -160,13 +161,36 @@ BinOp { } ParenExpr { - leftParen (ambiguousFunctionCall | BinOp | expressionWithoutIdentifier | ConditionalOp | PipeExpr) rightParen + leftParen parenContent rightParen +} + +parenContent { + (ambiguousFunctionCall | BinOp | expressionWithoutIdentifier | ConditionalOp | PipeExpr) } expression { expressionWithoutIdentifier | Identifier } +@skip {} { + String { "'" stringContent* "'" } +} + +stringContent { + StringFragment | + Interpolation | + StringEscape +} + +Interpolation { + "$" Identifier | + "$" leftParen parenContent rightParen +} + +StringEscape { + "\\" ("$" | "n" | "t" | "r" | "\\" | "'") +} + // We need expressionWithoutIdentifier to avoid conflicts in consumeToTerminator. // Without this, when parsing "my-var" at statement level, the parser can't decide: // - ambiguousFunctionCall → FunctionCallOrIdentifier → Identifier diff --git a/src/parser/shrimp.grammar.d.ts b/src/parser/shrimp.grammar.d.ts new file mode 100644 index 0000000..248618c --- /dev/null +++ b/src/parser/shrimp.grammar.d.ts @@ -0,0 +1,4 @@ +declare module '*.grammar' { + const content: string + export default content +} diff --git a/src/parser/shrimp.terms.ts b/src/parser/shrimp.terms.ts index b6a093f..965a67d 100644 --- a/src/parser/shrimp.terms.ts +++ b/src/parser/shrimp.terms.ts @@ -11,17 +11,20 @@ export const BinOp = 9, ConditionalOp = 14, String = 23, - Number = 24, - Boolean = 25, - FunctionDef = 26, - Params = 28, - colon = 29, - end = 30, - Underscore = 31, - NamedArg = 32, - NamedArgPrefix = 33, - IfExpr = 35, - ThenBlock = 38, - ElsifExpr = 39, - ElseExpr = 41, - Assign = 43 + StringFragment = 24, + Interpolation = 25, + StringEscape = 26, + Number = 27, + Boolean = 28, + 
FunctionDef = 29, + Params = 31, + colon = 32, + end = 33, + Underscore = 34, + NamedArg = 35, + NamedArgPrefix = 36, + IfExpr = 38, + ThenBlock = 41, + ElsifExpr = 42, + ElseExpr = 44, + Assign = 46 diff --git a/src/parser/shrimp.ts b/src/parser/shrimp.ts index ec565ea..3c14df8 100644 --- a/src/parser/shrimp.ts +++ b/src/parser/shrimp.ts @@ -4,20 +4,20 @@ import {tokenizer} from "./tokenizer" import {highlighting} from "./highlight" export const parser = LRParser.deserialize({ version: 14, - states: ",rQVQTOOO!rQUO'#CdO#SQPO'#CeO#bQPO'#DdO$[QTO'#CcOOQS'#Dh'#DhO$cQPO'#DgO$zQTO'#DkOOQS'#Cv'#CvOOQO'#De'#DeO%SQPO'#DdO%bQTO'#DoOOQO'#DP'#DPOOQO'#Dd'#DdO%iQPO'#DcOOQS'#Dc'#DcOOQS'#DY'#DYQVQTOOOOQS'#Dg'#DgOOQS'#Cb'#CbO%qQTO'#C|OOQS'#Df'#DfOOQS'#DZ'#DZO&OQUO,58{O&oQTO,59sO%bQTO,59PO%bQTO,59PO&|QUO'#CdO(XQPO'#CeO(iQPO,58}O(zQPO,58}O(uQPO,58}O)uQPO,58}OOQS'#D['#D[O)}QTO'#CxO*VQPO,5:VO*[QTO'#D^O*aQPO,58zO*rQPO,5:ZO*yQPO,5:ZOOQS,59},59}OOQS-E7W-E7WOOQS,59h,59hOOQS-E7X-E7XOOQO1G/_1G/_OOQO1G.k1G.kO+OQPO1G.kO%bQTO,59UO%bQTO,59UOOQS1G.i1G.iOOQS-E7Y-E7YO+jQTO1G/qO+zQUO'#CdOOQO,59x,59xOOQO-E7[-E7[O,kQTO1G/uOOQO1G.p1G.pO,{QPO1G.pO-VQPO7+%]O-[QTO7+%^OOQO'#DR'#DROOQO7+%a7+%aO-lQTO7+%bOOQS<dAN>dO%bQTO'#DTOOQO'#D_'#D_O/PQPOAN>hO/[QPO'#DVOOQOAN>hAN>hO/aQPOAN>hO/fQPO,59oO/mQPO,59oOOQO-E7]-E7]OOQOG24SG24SO/rQPOG24SO/wQPO,59qO/|QPO1G/ZOOQOLD)nLD)nO-[QTO1G/]O-lQTO7+$uOOQO7+$w7+$wOOQO<pAN>pO%lQaO'#DWOOQO'#Dc'#DcO/sQPOAN>tO0OQPO'#DYOOQOAN>tAN>tO0TQPOAN>tO0YQPO,59rO0aQPO,59rOOQO-E7a-E7aOOQOG24`G24`O0fQPOG24`O0kQPO,59tO0pQPO1G/^OOQOLD)zLD)zO.kQaO1G/`O.rQaO7+$xOOQO7+$z7+$zOOQO<S[hSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#g2i#g#h>x#h#o2i#o;'S$_;'S;=`$v<%lO$_V>}[hSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#X2i#X#Y?s#Y#o2i#o;'S$_;'S;=`$v<%lO$_V?zYlRhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#o2i#o;'S$_;'S;=`$v<%lO$_V@qYnRhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#o2i#o;'S$_;'S;=`$v<%lO$_VAf[hSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#Y2i#Y#ZB[#Z#o2i#o;'S$_;'S;=`$v<%lO$_VBcYwPhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#o2i#o;'S$_;'S;=`$v<%lO$_^CYY!h
WhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#o2i#o;'S$_;'S;=`$v<%lO$_VC}[hSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#f2i#f#gDs#g#o2i#o;'S$_;'S;=`$v<%lO$_VDzYfRhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#o2i#o;'S$_;'S;=`$v<%lO$_^EqY!jWhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#o2i#o;'S$_;'S;=`$v<%lO$__Fh[!iWhSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#f2i#f#gG^#g#o2i#o;'S$_;'S;=`$v<%lO$_VGc[hSOt$_uw$_x!_$_!_!`2O!`#O$_#P#T$_#T#i2i#i#j>x#j#o2i#o;'S$_;'S;=`$v<%lO$_VH`UuRhSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~HwO!q~", + tokenizers: [0, 1, 2, 3, tokenizer], topRules: {"Program":[0,3]}, - tokenPrec: 693 + tokenPrec: 727 }) diff --git a/src/parser/tokenizer.ts b/src/parser/tokenizer.ts index b180580..9ea9c87 100644 --- a/src/parser/tokenizer.ts +++ b/src/parser/tokenizer.ts @@ -11,14 +11,16 @@ export const tokenizer = new ExternalTokenizer((input: InputStream, stack: Stack while (true) { ch = getFullCodePoint(input, pos) - if (isWhitespace(ch) || ch === -1) break + + // Words and identifiers end at whitespace, single quotes, or end of input. + if (isWhitespace(ch) || ch === 39 /* ' */ || ch === -1) break // Certain characters might end a word or identifier if they are followed by whitespace. // This allows things like `a = hello; 2` or a = (basename ./file.txt) // to work as expected. 
- if ((canBeWord && (ch === 59 /* ; */ || ch === 41)) /* ) */ || ch === 58 /* : */) { + if (canBeWord && (ch === 59 /* ; */ || ch === 41 /* ) */ || ch === 58) /* : */) { const nextCh = getFullCodePoint(input, pos + 1) - if (isWhitespace(nextCh) || nextCh === -1) { + if (isWhitespace(nextCh) || nextCh === 39 /* ' */ || nextCh === -1) { break } } diff --git a/today.md b/today.md index b174618..e69de29 100644 --- a/today.md +++ b/today.md @@ -1,244 +0,0 @@ -# 🌟 Modern Language Inspiration & Implementation Plan - -## Language Research Summary - -### Pipe Operators Across Languages - -| Language | Syntax | Placeholder | Notes | -|----------|--------|-------------|-------| -| **Gleam** | `\|>` | `_` | Placeholder can go anywhere, enables function capture | -| **Elixir** | `\|>` | `&1`, `&2` | Always first arg by default, numbered placeholders | -| **Nushell** | `\|` | structured data | Pipes structured data, not just text | -| **F#** | `\|>` | none | Always first argument | -| **Raku** | `==>` | `*` | Star placeholder for positioning | - -### Conditional Syntax - -| Language | Single-line | Multi-line | Returns Value | -|----------|------------|------------|---------------| -| **Lua** | `if x then y end` | `if..elseif..else..end` | No (statement) | -| **Luau** | `if x then y else z` | Same | Yes (expression) | -| **Ruby** | `x = y if condition` | `if..elsif..else..end` | Yes | -| **Python** | `y if x else z` | `if..elif..else:` | Yes | -| **Gleam** | N/A | `case` expressions | Yes | - -## 🍤 Shrimp Design Decisions - -### Pipe Operator with Placeholder (`|`) - -**Syntax Choice: `|` with `_` placeholder** - -```shrimp -# Basic pipe with placeholder -"hello world" | upcase _ -"log.txt" | tail _ lines=10 - -# Placeholder positioning flexibility -"error.log" | grep "ERROR" _ | head _ 5 -data | process format="json" input=_ - -# Multiple placeholders (future consideration) -value | combine _ _ -``` - -**Why this design:** -- **`|` over `|>`**: Cleaner, more shell-like -- **`_` 
placeholder**: Explicit, readable, flexible positioning -- **Gleam-inspired**: Best of functional programming meets shell scripting - -### Conditionals - -**Multi-line syntax:** -```shrimp -if condition: - expression -elsif other-condition: - expression -else: - expression -end -``` - -**Single-line syntax (expression form):** -```shrimp -result = if x = 5: "five" -# Returns nil when false - -result = if x > 0: "positive" else: "non-positive" -# Explicit else for non-nil guarantee -``` - -**Design choices:** -- **`elsif` not `else if`**: Avoids nested parsing complexity (Ruby-style) -- **`:` after conditions**: Consistent with function definitions -- **`=` for equality**: Context-sensitive (assignment vs comparison) -- **`nil` for no-value**: Short, clear, well-understood -- **Expressions return values**: Everything is an expression philosophy - -## 📝 Implementation Plan - -### Phase 1: Grammar Foundation - -**1.1 Add Tokens** -```grammar -@tokens { - // Existing... - "|" // Pipe operator - "_" // Placeholder - "if" // Conditionals - "elsif" - "else" - "nil" // Null value -} -``` - -**1.2 Precedence Updates** -```grammar -@precedence { - multiplicative @left, - additive @left, - pipe @left, // After arithmetic, before assignment - assignment @right, - call -} -``` - -### Phase 2: Grammar Rules - -**2.1 Pipe Expression** -```grammar -PipeExpr { - expression !pipe "|" PipeTarget -} - -PipeTarget { - FunctionCallWithPlaceholder | - FunctionCall // Error in compiler if no placeholder -} - -FunctionCallWithPlaceholder { - Identifier PlaceholderArg+ -} - -PlaceholderArg { - PositionalArg | NamedArg | Placeholder -} - -Placeholder { - "_" -} -``` - -**2.2 Conditional Expression** -```grammar -Conditional { - SingleLineIf | MultiLineIf -} - -SingleLineIf { - "if" Comparison ":" expression ElseClause? -} - -MultiLineIf { - "if" Comparison ":" newlineOrSemicolon - (line newlineOrSemicolon)* - ElsifClause* - ElseClause? 
- "end" -} - -ElsifClause { - "elsif" Comparison ":" newlineOrSemicolon - (line newlineOrSemicolon)* -} - -ElseClause { - "else" ":" (expression | (newlineOrSemicolon (line newlineOrSemicolon)*)) -} - -Comparison { - expression "=" expression // Context-sensitive in if/elsif -} -``` - -**2.3 Update line rule** -```grammar -line { - PipeExpr | - Conditional | - FunctionCall | - // ... existing rules -} -``` - -### Phase 3: Test Cases - -**Pipe Tests:** -```shrimp -# Basic placeholder -"hello" | upcase _ - -# Named arguments with placeholder -"file.txt" | process _ format="json" - -# Chained pipes -data | filter _ "error" | count _ - -# Placeholder in different positions -5 | subtract 10 _ # 10 - 5 = 5 -``` - -**Conditional Tests:** -```shrimp -# Single line -x = if n = 0: "zero" - -# Single line with else -sign = if n > 0: "positive" else: "negative" - -# Multi-line -if score > 90: - grade = "A" -elsif score > 80: - grade = "B" -else: - grade = "C" -end - -# Nested conditionals -if x > 0: - if y > 0: - quadrant = 1 - end -end -``` - -### Phase 4: Compiler Implementation - -**4.1 PipeExpr Handling** -- Find placeholder position in right side -- Insert left side value at placeholder -- Error if no placeholder found - -**4.2 Conditional Compilation** -- Generate JUMP bytecode for branching -- Handle nil returns for missing else -- Context-aware `=` parsing - -## 🎯 Key Decision Points - -1. **Placeholder syntax**: `_` vs `$` vs `?` → **Choose `_` (Gleam-like)** -2. **Pipe operator**: `|` vs `|>` vs `>>` → **Choose `|` (cleaner)** -3. **Nil naming**: `nil` vs `null` vs `none` → **Choose `nil` (Ruby-like)** -4. **Equality**: Keep `=` context-sensitive or add `==`? → **Keep `=` (simpler)** -5. **Single-line if**: Require else or default nil? → **Default nil (flexible)** - -## 🚀 Next Steps - -1. Update grammar file with new tokens and rules -2. Write comprehensive test cases -3. Implement compiler support for pipes -4. Implement conditional bytecode generation -5. 
Test edge cases and error handling - -This plan combines the best ideas from modern languages while maintaining Shrimp's shell-like simplicity and functional philosophy! \ No newline at end of file