Compare commits


No commits in common. "0f7d3126a2c81d5dddeae3a41ee91a921b8467e4" and "a652f83b638be95784ee0e02289f88aa9adde58b" have entirely different histories.

16 changed files with 237 additions and 921 deletions

View File

@ -195,18 +195,6 @@ function parseExpression(input: string) {
**Expression-oriented design**: Everything returns a value - commands, assignments, functions. This enables composition and functional patterns.
**Scope-aware property access (DotGet)**: The parser uses Lezer's `@context` feature to track variable scope at parse time. When it encounters `obj.prop`, it checks if `obj` is in scope:
- **In scope** → Parses as `DotGet(Identifier, Identifier)` → compiles to `TRY_LOAD obj; PUSH 'prop'; DOT_GET`
- **Not in scope** → Parses as `Word("obj.prop")` → compiles to `PUSH 'obj.prop'` (treated as file path/string)
Implementation files:
- **src/parser/scopeTracker.ts**: ContextTracker that maintains immutable scope chain
- **src/parser/tokenizer.ts**: External tokenizer checks `stack.context` to decide if dot creates DotGet or Word
- Scope tracking: Captures variables from assignments (`x = 5`) and function parameters (`fn x:`)
- See `src/parser/tests/dot-get.test.ts` for comprehensive examples
**Why this matters**: This enables shell-like file paths (`readme.txt`) while supporting dictionary/array access (`config.path`) without quotes, determined entirely at parse time based on lexical scope.
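To make the compiled form concrete, here is a minimal sketch of the instructions emitted for `config.path` when `config` is in scope, using the `ProgramItem` tuples from the compiler excerpt later in this diff (literal values illustrative):
```typescript
// Sketch: bytecode for `config.path` (config in scope), per the DotGet scheme above.
const instructions: ProgramItem[] = [
  ['TRY_LOAD', 'config'], // resolve the object from the current scope
  ['PUSH', 'path'],       // push the property name as a string
  ['DOT_GET'],            // pop name + object, push the property value
]
```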
**EOF handling**: The grammar uses `(statement | newlineOrSemicolon)+ eof?` to handle empty lines and end-of-file without infinite loops.
## Compiler Architecture

View File

@ -1,557 +0,0 @@
# Shrimp Parser Architecture
This document explains the special cases, tricks, and design decisions in the Shrimp parser and tokenizer.
## Table of Contents
1. [Token Types and Their Purpose](#token-types-and-their-purpose)
2. [External Tokenizer Tricks](#external-tokenizer-tricks)
3. [Grammar Special Cases](#grammar-special-cases)
4. [Scope Tracking Architecture](#scope-tracking-architecture)
5. [Common Pitfalls](#common-pitfalls)
---
## Token Types and Their Purpose
### Four Token Types from External Tokenizer
The external tokenizer (`src/parser/tokenizer.ts`) emits four different token types based on context:
| Token | Purpose | Example |
|-------|---------|---------|
| `Identifier` | Regular identifiers in expressions, function calls | `echo`, `x` in `x + 1` |
| `AssignableIdentifier` | Identifiers on LHS of `=` or in function params | `x` in `x = 5`, params in `fn x y:` |
| `Word` | Anything else: paths, URLs, @mentions, #hashtags | `./file.txt`, `@user`, `#tag` |
| `IdentifierBeforeDot` | Identifier that's in scope, followed by `.` | `obj` in `obj.prop` |
### Why We Need Both Identifier Types
**The Problem:** At the start of a statement like `x ...`, the parser doesn't know if it's:
- An assignment: `x = 5` (needs `AssignableIdentifier`)
- A function call: `x hello world` (needs `Identifier`)
**The Solution:** The external tokenizer uses a three-way decision:
1. **Only `AssignableIdentifier` can shift** (e.g., in `Params` rule) → emit `AssignableIdentifier`
2. **Only `Identifier` can shift** (e.g., in function arguments) → emit `Identifier`
3. **Both can shift** (ambiguous statement start) → peek ahead for `=` to disambiguate
See [`Identifier vs AssignableIdentifier Disambiguation`](#identifier-vs-assignableidentifier-disambiguation) below for implementation details.
---
## External Tokenizer Tricks
### 1. Identifier vs AssignableIdentifier Disambiguation
**Location:** `src/parser/tokenizer.ts` lines 88-118
**The Challenge:** When both `Identifier` and `AssignableIdentifier` are valid (at statement start), how do we choose?
**The Solution:** Three-way branching with lookahead:
```typescript
const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)

if (canAssignable && !canRegular) {
  // Only AssignableIdentifier valid (e.g., in Params)
  input.acceptToken(AssignableIdentifier)
} else if (canRegular && !canAssignable) {
  // Only Identifier valid (e.g., in function args)
  input.acceptToken(Identifier)
} else {
  // BOTH possible - skip whitespace, then check if the next char is '='
  let peekPos = 0
  while (isWhiteSpace(getFullCodePoint(input, peekPos))) {
    peekPos += getCharSize(getFullCodePoint(input, peekPos))
  }
  const nextCh = getFullCodePoint(input, peekPos)
  if (nextCh === 61 /* = */) {
    input.acceptToken(AssignableIdentifier) // It's an assignment
  } else {
    input.acceptToken(Identifier) // It's a function call
  }
}
```
**Key Insight:** `stack.canShift()` returns true for BOTH token types when the grammar has multiple valid paths. We can't just use `canShift()` alone - we need lookahead.
**Why This Works:**
- `fn x y: ...` → In `Params` rule, only `AssignableIdentifier` can shift → no lookahead needed
- `echo hello` → Both can shift, but no `=` ahead → emits `Identifier` → parses as `FunctionCall`
- `x = 5` → Both can shift, finds `=` ahead → emits `AssignableIdentifier` → parses as `Assign`
### 2. Surrogate Pair Handling for Emoji
**Location:** `src/parser/tokenizer.ts` lines 71-84, `getFullCodePoint()` function
**The Problem:** JavaScript strings use UTF-16, but emoji like 🍤 use code points outside the BMP (Basic Multilingual Plane), requiring surrogate pairs.
**The Solution:** When reading characters, check for high surrogates (0xD800-0xDBFF) and combine them with low surrogates (0xDC00-0xDFFF):
```typescript
const getFullCodePoint = (input: InputStream, pos: number): number => {
  const ch = input.peek(pos)
  // Check if this is a high surrogate (0xD800-0xDBFF)
  if (ch >= 0xd800 && ch <= 0xdbff) {
    const low = input.peek(pos + 1)
    // Check if next is low surrogate (0xDC00-0xDFFF)
    if (low >= 0xdc00 && low <= 0xdfff) {
      // Combine surrogate pair into full code point
      return 0x10000 + ((ch & 0x3ff) << 10) + (low & 0x3ff)
    }
  }
  return ch
}
```
**Why This Matters:** Without this, the emoji in `shrimp-🍤` would be read as two separate UTF-16 code units (the high and low surrogate) instead of a single code point, so character classification and word boundaries would land mid-character.
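The tokenizer must also advance by the right number of UTF-16 code units after reading a code point. A minimal sketch of the `getCharSize` helper used throughout the tokenizer (assumed to live next to `getFullCodePoint`):
```typescript
// Sketch: width of a code point in UTF-16 code units.
// Astral code points (> 0xFFFF, e.g. emoji) occupy two units (a surrogate pair).
const getCharSize = (ch: number): number => (ch > 0xffff ? 2 : 1)
```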
### 3. Context-Aware Termination for Semicolon and Colon
**Location:** `src/parser/tokenizer.ts` lines 51-57
**The Problem:** How do we parse `basename ./cool;` vs `basename ./cool; 2`?
**The Solution:** Only treat `;` and `:` as terminators when the next character is not a word character (i.e., whitespace, `)`, or EOF):
```typescript
if (canBeWord && (ch === 59 /* ; */ || ch === 58 /* : */)) {
  const nextCh = getFullCodePoint(input, pos + 1)
  if (!isWordChar(nextCh)) break // It's a terminator
  // Otherwise, continue consuming as part of the Word
}
```
**Examples:**
- `basename ./cool;` → `;` is followed by EOF → terminates the word at `./cool`
- `basename ./cool;2` → `;` is followed by `2` → included in word as `./cool;2`
- `basename ./cool; 2` → `;` is followed by space → terminates at `./cool`, `2` is next arg (tree sketch below)
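In `toMatchTree` form (adapted from the semicolon tests elsewhere in this diff; the `Number 2` node for the trailing statement is assumed):
```typescript
test('word terminated by semicolon + space', () => {
  expect('a = hello; 2').toMatchTree(`
    Assign
      AssignableIdentifier a
      operator =
      FunctionCallOrIdentifier
        Identifier hello
    Number 2`)
})
```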
### 4. Scope-Aware Property Access (DotGet)
**Location:** `src/parser/tokenizer.ts` lines 19-48
**The Problem:** How do we distinguish `obj.prop` (property access) from `readme.txt` (filename)?
**The Solution:** When we see a `.` after an identifier, check if that identifier is in scope:
```typescript
if (ch === 46 /* . */ && isValidIdentifier) {
  // Build identifier text
  let identifierText = '...' // (surrogate-pair aware)

  const scopeContext = stack.context as ScopeContext | undefined
  const scope = scopeContext?.scope
  if (scope?.has(identifierText)) {
    // In scope - stop here, emit IdentifierBeforeDot
    // Grammar will parse as DotGet
    input.acceptToken(IdentifierBeforeDot)
    return
  }
  // Not in scope - continue consuming as Word
  // Will parse as Word("readme.txt")
}
```
**Examples:**
- `config = {path: "..."}; config.path` → `config` is in scope → parses as `DotGet(IdentifierBeforeDot, Identifier)`
- `cat readme.txt` → `readme` is not in scope → parses as `Word("readme.txt")` (test sketch below)
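A sketch in the repo's `toMatchTree` test style, mirroring the DotGet tests in this diff:
```typescript
test('in-scope identifier becomes DotGet', () => {
  expect('obj = 5; obj.prop').toMatchTree(`
    Assign
      AssignableIdentifier obj
      operator =
      Number 5
    DotGet
      IdentifierBeforeDot obj
      Identifier prop`)
})
```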
---
## Grammar Special Cases
### 1. expressionWithoutIdentifier Pattern
**Location:** `src/parser/shrimp.grammar` lines 200-210
**The Problem:** GLR conflict in `consumeToTerminator` rule:
```lezer
consumeToTerminator {
  ambiguousFunctionCall | // → FunctionCallOrIdentifier → Identifier
  expression              // → Identifier
}
```
When parsing `my-var` at statement level, both paths want the same `Identifier` token, causing a conflict.
**The Solution:** Remove `Identifier` from the `expression` path by creating `expressionWithoutIdentifier`:
```lezer
expression {
  expressionWithoutIdentifier | DotGet | Identifier
}

expressionWithoutIdentifier {
  ParenExpr | Word | String | Number | Boolean | Regex | Null
}
```
Then use `expressionWithoutIdentifier` in places where we don't want bare identifiers:
```lezer
consumeToTerminator {
  PipeExpr |
  ambiguousFunctionCall |     // ← Handles standalone identifiers
  DotGet |
  IfExpr |
  FunctionDef |
  Assign |
  BinOp |
  expressionWithoutIdentifier // ← No bare Identifier here
}
```
**Why This Works:** Now standalone identifiers MUST go through `ambiguousFunctionCall`, which is semantically what we want (they're either function calls or variable references).
### 2. @skip {} Wrapper for DotGet
**Location:** `src/parser/shrimp.grammar` lines 176-183
**The Problem:** DotGet needs to be whitespace-sensitive (no spaces allowed around `.`), but the global `@skip { space }` would remove them.
**The Solution:** Use `@skip {}` (empty skip) wrapper to disable automatic whitespace skipping:
```lezer
@skip {} {
  DotGet {
    IdentifierBeforeDot "." Identifier
  }

  String { "'" stringContent* "'" }
}
```
**Why This Matters:**
- `obj.prop` → Parses as `DotGet`
- `obj. prop` → Would parse as `obj` followed by `. prop` (error) if whitespace was skipped
- `obj .prop` → Would parse as `obj` followed by `.prop` (error) if whitespace was skipped
### 3. EOF Handling in item Rule
**Location:** `src/parser/shrimp.grammar` lines 54-58
**The Problem:** How do we handle empty lines and end-of-file without infinite loops?
**The Solution:** Use alternatives instead of repetition for EOF:
```lezer
item {
  consumeToTerminator newlineOrSemicolon | // Statement with newline/semicolon
  consumeToTerminator eof |                // Statement at end of file
  newlineOrSemicolon                       // Allow blank lines
}
```
**Why Not Just `item { (statement | newlineOrSemicolon)+ eof? }`?**
That would match EOF multiple times (once after each statement), causing parser errors. By making EOF part of an alternative, it's only matched once per item.
### 4. Params Uses AssignableIdentifier
**Location:** `src/parser/shrimp.grammar` lines 153-155
```lezer
Params {
  AssignableIdentifier*
}
```
**Why This Matters:** Function parameters are in "assignable" positions - they're being bound to values when the function is called. Using `AssignableIdentifier` here:
1. Makes the grammar explicit about which identifiers create bindings
2. Enables the tokenizer to use `canShift(AssignableIdentifier)` to detect param context
3. Allows the scope tracker to only capture `AssignableIdentifier` tokens (example below)
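For example, from the function-definition tests in this diff (the tail of the tree, from `operator +` through `end`, is reconstructed from the grammar):
```typescript
test('params are AssignableIdentifiers', () => {
  expect('add = fn a b: a + b end').toMatchTree(`
    Assign
      AssignableIdentifier add
      operator =
      FunctionDef
        keyword fn
        Params
          AssignableIdentifier a
          AssignableIdentifier b
        colon :
        BinOp
          Identifier a
          operator +
          Identifier b
        end end`)
})
```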
### 5. String Interpolation Inside @skip {}
**Location:** `src/parser/shrimp.grammar` lines 181-198
**The Problem:** String contents need to preserve whitespace, but string interpolation `$identifier` needs to use the external tokenizer.
**The Solution:** Put `String` inside `@skip {}` and use the external tokenizer for `Identifier` within interpolation:
```lezer
@skip {} {
  String { "'" stringContent* "'" }
}

stringContent {
  StringFragment | // Matches literal text (preserves spaces)
  Interpolation |  // $identifier or $(expr)
  EscapeSeq        // \$, \n, etc.
}

Interpolation {
  "$" Identifier | // Uses external tokenizer!
  "$" ParenExpr
}
```
**Key Insight:** External tokenizers work inside `@skip {}` blocks! The tokenizer gets called even when skip is disabled.
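As a sketch (exact node layout assumed from the node names in this grammar), an interpolated identifier parses inside the String node:
```typescript
test('string interpolation uses the external tokenizer', () => {
  expect("'$name'").toMatchTree(`
    String
      Interpolation
        Identifier name`)
})
```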
---
## Scope Tracking Architecture
### Overview
Scope tracking uses Lezer's `@context` feature to maintain a scope chain during parsing. This enables:
- Distinguishing `obj.prop` (property access) from `readme.txt` (filename)
- Tracking which variables are in scope for each position in the parse tree
### Architecture: Scope vs ScopeContext
**Two-Class Design:**
```typescript
// Pure, hashable scope - only variable tracking
class Scope {
  constructor(
    public parent: Scope | null,
    public vars: Set<string>
  ) {}

  has(name: string): boolean
  add(...names: string[]): Scope
  push(): Scope  // Create child scope
  pop(): Scope   // Return to parent
  hash(): number // For incremental parsing
}

// Wrapper with temporary state
export class ScopeContext {
  constructor(
    public scope: Scope,
    public pendingIds: string[] = []
  ) {}
}
```
**Why This Separation?**
1. **Scope is pure and hashable** - Only contains committed variable bindings, no temporary state
2. **ScopeContext holds temporary state** - The `pendingIds` array captures identifiers during parsing but isn't part of the hash
3. **Hash function only hashes Scope** - Incremental parsing only cares about actual scope, not pending identifiers
### How Scope Tracking Works
**1. Capture Phase (shift):**
When the parser shifts an `AssignableIdentifier` token, the scope tracker captures its text:
```typescript
shift(context, term, stack, input) {
  if (term === terms.AssignableIdentifier) {
    // Build text by peeking at input
    let text = '...' // (read from input.pos to stack.pos)

    return new ScopeContext(
      context.scope,
      [...context.pendingIds, text] // Append to pending
    )
  }
  return context
}
```
**2. Commit Phase (reduce):**
When the parser reduces to `Assign` or `Params`, the scope tracker commits pending identifiers:
```typescript
reduce(context, term, stack, input) {
  // Assignment: pop last identifier, add to scope
  if (term === terms.Assign && context.pendingIds.length > 0) {
    const varName = context.pendingIds[context.pendingIds.length - 1]!
    return new ScopeContext(
      context.scope.add(varName),     // Add to scope
      context.pendingIds.slice(0, -1) // Remove from pending
    )
  }

  // Function params: add all identifiers, push new scope
  if (term === terms.Params) {
    const newScope = context.scope.push()
    return new ScopeContext(
      context.pendingIds.length > 0
        ? newScope.add(...context.pendingIds)
        : newScope,
      [] // Clear pending
    )
  }

  // Function exit: pop scope
  if (term === terms.FunctionDef) {
    return new ScopeContext(context.scope.pop(), [])
  }

  return context
}
```
**3. Usage in Tokenizer:**
The tokenizer accesses scope to check if identifiers are bound:
```typescript
const scopeContext = stack.context as ScopeContext | undefined
const scope = scopeContext?.scope

if (scope?.has(identifierText)) {
  // Identifier is in scope - can use in DotGet
  input.acceptToken(IdentifierBeforeDot)
}
```
### Why Only Track AssignableIdentifier?
**Before (complex):**
- Tracked ALL identifiers with `term === terms.Identifier`
- Used `isInParams` flag to know which ones to keep
- Had to manually clear "stale" identifiers after DotGet, FunctionCall, etc.
**After (simple):**
- Only track `AssignableIdentifier` tokens
- These only appear in `Params` and `Assign` (by grammar design)
- No stale identifiers - they're consumed immediately
**Example:**
```shrimp
fn x y: echo x end
```
Scope tracking:
1. Shift `AssignableIdentifier("x")` → pending = ["x"]
2. Shift `AssignableIdentifier("y")` → pending = ["x", "y"]
3. Reduce `Params` → scope = {x, y}, pending = []
4. Shift `Identifier("echo")` → **not captured** (not AssignableIdentifier)
5. Shift `Identifier("x")` → **not captured**
6. Reduce `FunctionDef` → pop scope
No stale identifier clearing needed!
---
## Common Pitfalls
### 1. Forgetting Surrogate Pairs
**Problem:** Using `input.peek(i)` directly gives UTF-16 code units, not Unicode code points.
**Solution:** Always use `getFullCodePoint(input, pos)` when working with emoji.
**Example:**
```typescript
// ❌ Wrong - breaks on emoji
const ch = input.peek(pos)
if (isEmoji(ch)) { ... }

// ✓ Right - handles surrogate pairs
const ch = getFullCodePoint(input, pos)
if (isEmoji(ch)) { ... }
pos += getCharSize(ch) // Advance by 1 or 2 code units
```
### 2. Adding Pending State to Hash
**Problem:** Including `pendingIds` or `isInParams` in the hash function breaks incremental parsing.
**Why?** The hash is used to determine if a cached parse tree node can be reused. If the hash includes temporary state that doesn't affect parsing decisions, nodes will be invalidated unnecessarily.
**Solution:** Only hash the `Scope` (vars + parent chain), not the `ScopeContext` wrapper.
```typescript
// ✓ Right
const hashScope = (context: ScopeContext): number => {
  return context.scope.hash() // Only hash committed scope
}

// ❌ Wrong
const hashScope = (context: ScopeContext): number => {
  let h = context.scope.hash()
  h = (h << 5) - h + context.pendingIds.length // Don't do this!
  return h
}
```
### 3. Using canShift() Alone for Disambiguation
**Problem:** `stack.canShift(AssignableIdentifier)` returns true when BOTH paths are possible (e.g., at statement start).
**Why?** The GLR parser maintains multiple parse states. If any state can shift the token, `canShift()` returns true.
**Solution:** Check BOTH token types and use lookahead when both are possible:
```typescript
const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)

if (canAssignable && canRegular) {
  // Both possible - need lookahead
  const hasEquals = peekForEquals(input, pos)
  input.acceptToken(hasEquals ? AssignableIdentifier : Identifier)
}
```
### 4. Clearing Pending Identifiers Too Eagerly
**Problem:** In the old code, we had to clear pending identifiers after DotGet, FunctionCall, etc. to prevent state leakage. This was fragile and easy to forget.
**Why This Happened:** We were tracking ALL identifiers, not just assignable ones.
**Solution:** Only track `AssignableIdentifier` tokens. They only appear in contexts where they'll be consumed (Params, Assign), so no clearing needed.
### 5. Line Number Confusion in Edit Tool
**Problem:** The Edit tool displays each line with a numeric prefix (like ` 5→`), and it's easy to mistake the prefix for file content or misread the numbering.
**How to Read:**
- The number before `→` is the actual line number; the prefix itself is not part of the file
- Use that number when referencing code in comments or documentation
- Example: ` 5→export const foo` means the code is on line 5
---
## Testing Strategy
### Parser Tests
Use the `toMatchTree` helper to verify parse tree structure:
```typescript
test('assignment with AssignableIdentifier', () => {
  expect('x = 5').toMatchTree(`
    Assign
      AssignableIdentifier x
      operator =
      Number 5
  `)
})
```
**Key Testing Patterns:**
- Test both token type expectations (Identifier vs AssignableIdentifier)
- Test scope-aware features (DotGet for in-scope vs Word for out-of-scope)
- Test edge cases (empty lines, EOF, surrogate pairs); see the sketch below
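For the surrogate-pair case, a hedged sketch mirroring the assignment tree above (emoji are valid identifier starts per the tokenizer):
```typescript
test('emoji identifier assignment', () => {
  expect('🍤 = 5').toMatchTree(`
    Assign
      AssignableIdentifier 🍤
      operator =
      Number 5`)
})
```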
### Debugging Parser Issues
1. **Check token types:** Run parser on input and examine tree structure
2. **Test canShift():** Add logging to the tokenizer to see what `canShift()` returns (sketch after this list)
3. **Verify scope state:** Log scope contents during parsing
4. **Use GLR visualization:** Lezer has tools for visualizing parse states
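For step 2, a minimal logging sketch inside the external tokenizer callback (names from tokenizer.ts):
```typescript
// Debug sketch: dump what the GLR stack will accept before choosing a token.
console.log('canShift Identifier:', stack.canShift(Identifier))
console.log('canShift AssignableIdentifier:', stack.canShift(AssignableIdentifier))
console.log('scope context:', stack.context)
```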
---
## Further Reading
- [Lezer System Guide](https://lezer.codemirror.net/docs/guide/)
- [Lezer API Reference](https://lezer.codemirror.net/docs/ref/)
- [CLAUDE.md](../CLAUDE.md) - General project guidance
- [Scope Tracker Source](../src/parser/scopeTracker.ts)
- [Tokenizer Source](../src/parser/tokenizer.ts)

View File

@ -9,7 +9,6 @@ import {
getAllChildren,
getAssignmentParts,
getBinaryParts,
getDotGetParts,
getFunctionCallParts,
getFunctionDefParts,
getIfExprParts,
@ -18,8 +17,8 @@ import {
getStringParts,
} from '#compiler/utils'
const DEBUG = false
// const DEBUG = true
// const DEBUG = false
const DEBUG = true
type Label = `.${string}`
@ -190,19 +189,6 @@ export class Compiler {
return [[`TRY_LOAD`, value]]
}
case terms.Word: {
return [['PUSH', value]]
}
case terms.DotGet: {
const { objectName, propertyName } = getDotGetParts(node, input)
const instructions: ProgramItem[] = []
instructions.push(['TRY_LOAD', objectName])
instructions.push(['PUSH', propertyName])
instructions.push(['DOT_GET'])
return instructions
}
case terms.BinOp: {
const { left, op, right } = getBinaryParts(node)
const instructions: ProgramItem[] = []

View File

@ -213,7 +213,7 @@ describe('Regex', () => {
})
})
describe.skip('native functions', () => {
describe.only('native functions', () => {
test('print function', () => {
const add = (x: number, y: number) => x + y
expect(`add 5 9`).toEvaluateTo(14, { add })

View File

@ -40,9 +40,9 @@ export const getAssignmentParts = (node: SyntaxNode) => {
const children = getAllChildren(node)
const [left, equals, right] = children
if (!left || left.type.id !== terms.AssignableIdentifier) {
if (!left || left.type.id !== terms.Identifier) {
throw new CompilerError(
`Assign left child must be an AssignableIdentifier, got ${left ? left.type.name : 'none'}`,
`Assign left child must be an Identifier, got ${left ? left.type.name : 'none'}`,
node.from,
node.to
)
@ -70,9 +70,9 @@ export const getFunctionDefParts = (node: SyntaxNode, input: string) => {
}
const paramNames = getAllChildren(paramsNode).map((param) => {
if (param.type.id !== terms.AssignableIdentifier) {
if (param.type.id !== terms.Identifier) {
throw new CompilerError(
`FunctionDef params must be AssignableIdentifiers, got ${param.type.name}`,
`FunctionDef params must be Identifiers, got ${param.type.name}`,
param.from,
param.to
)
@ -198,37 +198,3 @@ export const getStringParts = (node: SyntaxNode, input: string) => {
return { parts, hasInterpolation: parts.length > 0 }
}
export const getDotGetParts = (node: SyntaxNode, input: string) => {
const children = getAllChildren(node)
const [object, property] = children
if (children.length !== 2) {
throw new CompilerError(
`DotGet expected 2 identifier children, got ${children.length}`,
node.from,
node.to
)
}
if (object.type.id !== terms.IdentifierBeforeDot) {
throw new CompilerError(
`DotGet object must be an IdentifierBeforeDot, got ${object.type.name}`,
object.from,
object.to
)
}
if (property.type.id !== terms.Identifier) {
throw new CompilerError(
`DotGet property must be an Identifier, got ${property.type.name}`,
property.from,
property.to
)
}
const objectName = input.slice(object.from, object.to)
const propertyName = input.slice(property.from, property.to)
return { objectName, propertyName }
}

View File

@ -1,11 +1,42 @@
import { ContextTracker, InputStream } from '@lezer/lr'
import { ContextTracker } from '@lezer/lr'
import * as terms from './shrimp.terms'
export class Scope {
constructor(public parent: Scope | null, public vars = new Set<string>()) {}
constructor(
public parent: Scope | null,
public vars: Set<string>,
public pendingIdentifiers: string[] = [],
public isInParams: boolean = false
) {}
has(name: string): boolean {
return this.vars.has(name) || (this.parent?.has(name) ?? false)
return this.vars.has(name) ?? this.parent?.has(name)
}
add(...names: string[]): Scope {
const newVars = new Set(this.vars)
names.forEach((name) => newVars.add(name))
return new Scope(this.parent, newVars, [], this.isInParams)
}
push(): Scope {
return new Scope(this, new Set(), [], false)
}
pop(): Scope {
return this.parent ?? new Scope(null, new Set(), [], false)
}
withPendingIdentifiers(ids: string[]): Scope {
return new Scope(this.parent, this.vars, ids, this.isInParams)
}
withIsInParams(value: boolean): Scope {
return new Scope(this.parent, this.vars, this.pendingIdentifiers, value)
}
clearPending(): Scope {
return new Scope(this.parent, this.vars, [], this.isInParams)
}
hash(): number {
@ -20,77 +51,76 @@ export class Scope {
h = (h << 5) - h + this.parent.hash()
h |= 0
}
// Include pendingIdentifiers and isInParams in hash
h = (h << 5) - h + this.pendingIdentifiers.length
h = (h << 5) - h + (this.isInParams ? 1 : 0)
h |= 0
return h
}
// Static methods that return new Scopes (immutable operations)
static add(scope: Scope, ...names: string[]): Scope {
const newVars = new Set(scope.vars)
names.forEach((name) => newVars.add(name))
return new Scope(scope.parent, newVars)
}
push(): Scope {
return new Scope(this, new Set())
export const trackScope = new ContextTracker<Scope>({
start: new Scope(null, new Set(), [], false),
shift(context, term, stack, input) {
// Track fn keyword to enter param capture mode
if (term === terms.Fn) {
return context.withIsInParams(true).withPendingIdentifiers([])
}
pop(): Scope {
return this.parent ?? this
}
}
// Tracker context that combines Scope with temporary pending identifiers
class TrackerContext {
constructor(public scope: Scope, public pendingIds: string[] = []) {}
}
// Extract identifier text from input stream
const readIdentifierText = (input: InputStream, start: number, end: number): string => {
// Capture identifiers
if (term === terms.Identifier) {
// Build text by peeking backwards from stack.pos to input.pos
let text = ''
const start = input.pos
const end = stack.pos
for (let i = start; i < end; i++) {
const offset = i - input.pos
const ch = input.peek(offset)
if (ch === -1) break
text += String.fromCharCode(ch)
}
return text
// Capture ALL identifiers when in params
if (context.isInParams) {
return context.withPendingIdentifiers([...context.pendingIdentifiers, text])
}
export const trackScope = new ContextTracker<TrackerContext>({
start: new TrackerContext(new Scope(null, new Set())),
shift(context, term, stack, input) {
if (term !== terms.AssignableIdentifier) return context
const text = readIdentifierText(input, input.pos, stack.pos)
return new TrackerContext(context.scope, [...context.pendingIds, text])
},
reduce(context, term) {
// Add assignment variable to scope
if (term === terms.Assign) {
const varName = context.pendingIds.at(-1)
if (!varName) return context
return new TrackerContext(Scope.add(context.scope, varName), context.pendingIds.slice(0, -1))
// Capture FIRST identifier for assignments
else if (context.pendingIdentifiers.length === 0) {
return context.withPendingIdentifiers([text])
}
// Push new scope and add all parameters
if (term === terms.Params) {
let newScope = context.scope.push()
if (context.pendingIds.length > 0) {
newScope = Scope.add(newScope, ...context.pendingIds)
}
return new TrackerContext(newScope, [])
}
// Pop scope when exiting function
if (term === terms.FunctionDef) {
return new TrackerContext(context.scope.pop(), [])
}
return context
},
hash: (context) => context.scope.hash(),
reduce(context, term, stack, input) {
// Add assignment variable to scope
if (term === terms.Assign && context.pendingIdentifiers.length > 0) {
return context.add(context.pendingIdentifiers[0]!)
}
// Push new scope and add parameters
if (term === terms.Params) {
const newScope = context.push()
if (context.pendingIdentifiers.length > 0) {
return newScope.add(...context.pendingIdentifiers).withIsInParams(false)
}
return newScope.withIsInParams(false)
}
// Pop scope when exiting function
if (term === terms.FunctionDef) {
return context.pop()
}
// Clear stale identifiers after non-assignment statements
if (term === terms.DotGet || term === terms.FunctionCallOrIdentifier || term === terms.FunctionCall) {
return context.clearPending()
}
return context
},
hash: (context) => context.hash(),
})

View File

@ -43,7 +43,7 @@
}
@external tokens tokenizer from "./tokenizer" { Identifier, AssignableIdentifier, Word, IdentifierBeforeDot }
@external tokens tokenizer from "./tokenizer" { Identifier, Word, IdentifierBeforeDot }
@precedence {
pipe @left,
@ -151,11 +151,11 @@ ConditionalOp {
}
Params {
AssignableIdentifier*
Identifier*
}
Assign {
AssignableIdentifier "=" consumeToTerminator
Identifier "=" consumeToTerminator
}
BinOp {

View File

@ -1,36 +1,35 @@
// This file was generated by lezer-generator. You probably shouldn't edit it.
export const
Identifier = 1,
AssignableIdentifier = 2,
Word = 3,
IdentifierBeforeDot = 4,
Program = 5,
PipeExpr = 6,
FunctionCall = 7,
PositionalArg = 8,
ParenExpr = 9,
FunctionCallOrIdentifier = 10,
BinOp = 11,
ConditionalOp = 16,
String = 25,
StringFragment = 26,
Interpolation = 27,
EscapeSeq = 28,
Number = 29,
Boolean = 30,
Regex = 31,
Null = 32,
DotGet = 33,
FunctionDef = 34,
Fn = 35,
Params = 36,
colon = 37,
end = 38,
Underscore = 39,
NamedArg = 40,
NamedArgPrefix = 41,
IfExpr = 43,
ThenBlock = 46,
ElsifExpr = 47,
ElseExpr = 49,
Assign = 51
Word = 2,
IdentifierBeforeDot = 3,
Program = 4,
PipeExpr = 5,
FunctionCall = 6,
PositionalArg = 7,
ParenExpr = 8,
FunctionCallOrIdentifier = 9,
BinOp = 10,
ConditionalOp = 15,
String = 24,
StringFragment = 25,
Interpolation = 26,
EscapeSeq = 27,
Number = 28,
Boolean = 29,
Regex = 30,
Null = 31,
DotGet = 32,
FunctionDef = 33,
Fn = 34,
Params = 35,
colon = 36,
end = 37,
Underscore = 38,
NamedArg = 39,
NamedArgPrefix = 40,
IfExpr = 42,
ThenBlock = 45,
ElsifExpr = 46,
ElseExpr = 48,
Assign = 50

View File

@ -5,21 +5,21 @@ import {trackScope} from "./scopeTracker"
import {highlighting} from "./highlight"
export const parser = LRParser.deserialize({
version: 14,
states: ".jQVQaOOO#XQbO'#CfO$RQPO'#CgO$aQPO'#DmO$xQaO'#CeO%gOSO'#CuOOQ`'#Dq'#DqO%uOPO'#C}O%zQPO'#DpO&cQaO'#D|OOQ`'#DO'#DOOOQO'#Dn'#DnO&kQPO'#DmO&yQaO'#EQOOQO'#DX'#DXO'hQPO'#DaOOQO'#Dm'#DmO'mQPO'#DlOOQ`'#Dl'#DlOOQ`'#Db'#DbQVQaOOOOQ`'#Dp'#DpOOQ`'#Cd'#CdO'uQaO'#DUOOQ`'#Do'#DoOOQ`'#Dc'#DcO(PQbO,58}O&yQaO,59RO&yQaO,59RO)XQPO'#CgO)iQPO,59PO)zQPO,59PO)uQPO,59PO*uQPO,59PO*}QaO'#CwO+VQWO'#CxOOOO'#Du'#DuOOOO'#Dd'#DdO+kOSO,59aOOQ`,59a,59aO+yO`O,59iOOQ`'#De'#DeO,OQaO'#DQO,WQPO,5:hO,]QaO'#DgO,bQPO,58|O,sQPO,5:lO,zQPO,5:lO-PQaO,59{OOQ`,5:W,5:WOOQ`-E7`-E7`OOQ`,59p,59pOOQ`-E7a-E7aOOQO1G.m1G.mO-^QPO1G.mO&yQaO,59WO&yQaO,59WOOQ`1G.k1G.kOOOO,59c,59cOOOO,59d,59dOOOO-E7b-E7bOOQ`1G.{1G.{OOQ`1G/T1G/TOOQ`-E7c-E7cO-xQaO1G0SO!QQbO'#CfOOQO,5:R,5:ROOQO-E7e-E7eO.YQaO1G0WOOQO1G/g1G/gOOQO1G.r1G.rO.jQPO1G.rO.tQPO7+%nO.yQaO7+%oOOQO'#DZ'#DZOOQO7+%r7+%rO/ZQaO7+%sOOQ`<<IY<<IYO/qQPO'#DfO/vQaO'#EPO0^QPO<<IZOOQO'#D['#D[O0cQPO<<I_OOQ`,5:Q,5:QOOQ`-E7d-E7dOOQ`AN>uAN>uO&yQaO'#D]OOQO'#Dh'#DhO0nQPOAN>yO0yQPO'#D_OOQOAN>yAN>yO1OQPOAN>yO1TQPO,59wO1[QPO,59wOOQO-E7f-E7fOOQOG24eG24eO1aQPOG24eO1fQPO,59yO1kQPO1G/cOOQOLD*PLD*PO.yQaO1G/eO/ZQaO7+$}OOQO7+%P7+%POOQO<<Hi<<Hi",
stateData: "1v~O!_OS~OPPOQ_ORUOSVOmUOnUOoUOpUOsXO|]O!fSO!hTO!rbO~OPeORUOSVOmUOnUOoUOpUOsXOwfOygO!fSO!hTOzYX!rYX!vYX!gYXvYX~O[!dX]!dX^!dX_!dXa!dXb!dXc!dXd!dXe!dXf!dXg!dXh!dX~P!QO[kO]kO^lO_lO~O[kO]kO^lO_lO!r!aX!v!aXv!aX~OPPORUOSVOmUOnUOoUOpUO!fSO!hTO~OjtO!hwO!jrO!ksO~O!oxO~O[!dX]!dX^!dX_!dX!r!aX!v!aXv!aX~OQyOutP~Oz|O!r!aX!v!aXv!aX~OPeORUOSVOmUOnUOoUOpUO!fSO!hTO~Oa!QO~O!r!RO!v!RO~OsXOw!TO~P&yOsXOwfOygOzVa!rVa!vVa!gVavVa~P&yOa!XOb!XOc!XOd!XOe!XOf!XOg!YOh!YO~O[kO]kO^lO_lO~P(mO[kO]kO^lO_lO!g!ZO~O!g!ZO[!dX]!dX^!dX_!dXa!dXb!dXc!dXd!dXe!dXf!dXg!dXh!dX~Oz|O!g!ZO~OP![O!fSO~O!h!]O!j!]O!k!]O!l!]O!m!]O!n!]O~OjtO!h!_O!jrO!ksO~OP!`O~OQyOutX~Ou!bO~OP!cO~Oz|O!rUa!vUa!gUavUa~Ou!fO~P(mOu!fO~OQ_OsXO|]O~P$xO[kO]kO^Zi_Zi!rZi!vZi!gZivZi~OQ_OsXO|]O!r!kO~P$xOQ_OsXO|]O!r!nO~P$xO!g`iu`i~P(mOv!oO~OQ_OsXO|]Ov!sP~P$xOQ_OsXO|]Ov!sP!Q!sP!S!sP~P$xO!r!uO~OQ_OsXO|]Ov!sX!Q!sX!S!sX~P$xOv!wO~Ov!|O!Q!xO!S!{O~Ov#RO!Q!xO!S!{O~Ou#TO~Ov#RO~Ou#UO~P(mOu#UO~Ov#VO~O!r#WO~O!r#XO~Om_o]o~",
goto: "+m!vPPPPPP!w#W#f#k#W$VPPPP$lPPPPPPPP$xP%a%aPPPP%e&OP&dPPP#fPP&gP&s&v'PP'TP&g'Z'a'h'n't'}(UPPP([(`(t)W)]*WPPP*sPPPPPP*w*wP+X+a+ad`Od!Q!b!f!k!n!q#W#XRpSiZOSd|!Q!b!f!k!n!q#W#XVhPj!czUOPS]dgjkl!Q!X!Y!b!c!f!k!n!q!x#W#XR![rdROd!Q!b!f!k!n!q#W#XQnSQ!VkR!WlQpSQ!P]Q!h!YR#P!x{UOPS]dgjkl!Q!X!Y!b!c!f!k!n!q!x#W#XTtTvdWOd!Q!b!f!k!n!q#W#XgePS]gjkl!X!Y!c!xd`Od!Q!b!f!k!n!q#W#XUfPj!cR!TgR{Xe`Od!Q!b!f!k!n!q#W#XR!m!fQ!t!nQ#Y#WR#Z#XT!y!t!zQ!}!tR#S!zQdOR!SdSjP!cR!UjQvTR!^vQzXR!azW!q!k!n#W#XR!v!qS}[qR!e}Q!z!tR#Q!zTcOdSaOdQ!g!QQ!j!bQ!l!fZ!p!k!n!q#W#Xd[Od!Q!b!f!k!n!q#W#XQqSR!d|ViPj!cdQOd!Q!b!f!k!n!q#W#XUfPj!cQmSQ!O]Q!TgQ!VkQ!WlQ!h!XQ!i!YR#O!xdWOd!Q!b!f!k!n!q#W#XdeP]gjkl!X!Y!c!xRoSTuTvmYOPdgj!Q!b!c!f!k!n!q#W#XQ!r!kV!s!n#W#Xe^Od!Q!b!f!k!n!q#W#X",
nodeNames: "⚠ Identifier AssignableIdentifier Word IdentifierBeforeDot Program PipeExpr FunctionCall PositionalArg ParenExpr FunctionCallOrIdentifier BinOp operator operator operator operator ConditionalOp operator operator operator operator operator operator operator operator String StringFragment Interpolation EscapeSeq Number Boolean Regex Null DotGet FunctionDef keyword Params colon end Underscore NamedArg NamedArgPrefix operator IfExpr keyword ThenBlock ThenBlock ElsifExpr keyword ElseExpr keyword Assign",
maxTerm: 84,
states: ".jQVQaOOO#UQbO'#CeO#fQPO'#CfO#tQPO'#DlO$wQaO'#CdO%OOSO'#CtOOQ`'#Dp'#DpO%^OPO'#C|O%cQPO'#DoO%zQaO'#D{OOQ`'#C}'#C}OOQO'#Dm'#DmO&SQPO'#DlO&bQaO'#EPOOQO'#DW'#DWOOQO'#Dl'#DlO&iQPO'#DkOOQ`'#Dk'#DkOOQ`'#Da'#DaQVQaOOOOQ`'#Do'#DoOOQ`'#Cc'#CcO&qQaO'#DTOOQ`'#Dn'#DnOOQ`'#Db'#DbO'OQbO,58|O'oQaO,59zO&bQaO,59QO&bQaO,59QO'|QbO'#CeO)XQPO'#CfO)iQPO,59OO)zQPO,59OO)uQPO,59OO*uQPO,59OO*}QaO'#CvO+VQWO'#CwOOOO'#Dt'#DtOOOO'#Dc'#DcO+kOSO,59`OOQ`,59`,59`O+yO`O,59hOOQ`'#Dd'#DdO,OQaO'#DPO,WQPO,5:gO,]QaO'#DfO,bQPO,58{O,sQPO,5:kO,zQPO,5:kOOQ`,5:V,5:VOOQ`-E7_-E7_OOQ`,59o,59oOOQ`-E7`-E7`OOQO1G/f1G/fOOQO1G.l1G.lO-PQPO1G.lO&bQaO,59VO&bQaO,59VOOQ`1G.j1G.jOOOO,59b,59bOOOO,59c,59cOOOO-E7a-E7aOOQ`1G.z1G.zOOQ`1G/S1G/SOOQ`-E7b-E7bO-kQaO1G0RO-{QbO'#CeOOQO,5:Q,5:QOOQO-E7d-E7dO.lQaO1G0VOOQO1G.q1G.qO.|QPO1G.qO/WQPO7+%mO/]QaO7+%nOOQO'#DY'#DYOOQO7+%q7+%qO/mQaO7+%rOOQ`<<IX<<IXO0TQPO'#DeO0YQaO'#EOO0pQPO<<IYOOQO'#DZ'#DZO0uQPO<<I^OOQ`,5:P,5:POOQ`-E7c-E7cOOQ`AN>tAN>tO&bQaO'#D[OOQO'#Dg'#DgO1QQPOAN>xO1]QPO'#D^OOQOAN>xAN>xO1bQPOAN>xO1gQPO,59vO1nQPO,59vOOQO-E7e-E7eOOQOG24dG24dO1sQPOG24dO1xQPO,59xO1}QPO1G/bOOQOLD*OLD*OO/]QaO1G/dO/mQaO7+$|OOQO7+%O7+%OOOQO<<Hh<<Hh",
stateData: "2Y~O!^OS~OPPOQUORVOlUOmUOnUOoUOrXO{]O!eSO!gTO!qaO~OPdOQUORVOlUOmUOnUOoUOrXOveOxfO!eSO!gTOZ!cX[!cX]!cX^!cXyXX~O`jO!qXX!uXXuXX~P}OZkO[kO]lO^lO~OZkO[kO]lO^lO!q!`X!u!`Xu!`X~OQUORVOlUOmUOnUOoUO!eSO!gTO~OPmO~P$]OiuO!gxO!isO!jtO~O!nyO~OZ!cX[!cX]!cX^!cX!q!`X!u!`Xu!`X~OPzOtsP~Oy}O!q!`X!u!`Xu!`X~OPdO~P$]O!q!RO!u!RO~OPdOrXOv!TO~P$]OPdOrXOveOxfOyUa!qUa!uUa!fUauUa~P$]OPPOrXO{]O~P$]O`!cXa!cXb!cXc!cXd!cXe!cXf!cXg!cX!fXX~P}O`!YOa!YOb!YOc!YOd!YOe!YOf!ZOg!ZO~OZkO[kO]lO^lO~P(mOZkO[kO]lO^lO!f![O~O!f![OZ!cX[!cX]!cX^!cX`!cXa!cXb!cXc!cXd!cXe!cXf!cXg!cX~Oy}O!f![O~OP!]O!eSO~O!g!^O!i!^O!j!^O!k!^O!l!^O!m!^O~OiuO!g!`O!isO!jtO~OP!aO~OPzOtsX~Ot!cO~OP!dO~Oy}O!qTa!uTa!fTauTa~Ot!gO~P(mOt!gO~OZkO[kO]Yi^Yi!qYi!uYi!fYiuYi~OPPOrXO{]O!q!kO~P$]OPdOrXOveOxfOyXX!qXX!uXX!fXXuXX~P$]OPPOrXO{]O!q!nO~P$]O!f_it_i~P(mOu!oO~OPPOrXO{]Ou!rP~P$]OPPOrXO{]Ou!rP!P!rP!R!rP~P$]O!q!uO~OPPOrXO{]Ou!rX!P!rX!R!rX~P$]Ou!wO~Ou!|O!P!xO!R!{O~Ou#RO!P!xO!R!{O~Ot#TO~Ou#RO~Ot#UO~P(mOt#UO~Ou#VO~O!q#WO~O!q#XO~Ol^n[n~",
goto: "+v!uPPPPP!v#V#e#k#V$WPPPP$mPPPPPPPP$yP%c%cPPPP%g&RP&hPPP#ePP&kP&w&z'TP'XP&k'_'e'm's'y(S(ZPPP(a(e(y)])c*_PPP*{PPPPPP+P+PP+b+j+jd_Ocj!c!g!k!n!q#W#XRqSiZOScj}!c!g!k!n!q#W#XXgPim!d|UOPS]cfijklm!Y!Z!c!d!g!k!n!q!x#W#XR!]sdROcj!c!g!k!n!q#W#XQoSQ!WkR!XlQqSQ!Q]Q!h!ZR#P!x}UOPS]cfijklm!Y!Z!c!d!g!k!n!q!x#W#XTuTwdWOcj!c!g!k!n!q#W#XidPS]fiklm!Y!Z!d!xd_Ocj!c!g!k!n!q#W#XWePim!dR!TfR|Xe_Ocj!c!g!k!n!q#W#XR!m!gQ!t!nQ#Y#WR#Z#XT!y!t!zQ!}!tR#S!zQcOR!ScUiPm!dR!UiQwTR!_wQ{XR!b{W!q!k!n#W#XR!v!qS!O[rR!f!OQ!z!tR#Q!zTbOcS`OcQ!VjQ!j!cQ!l!gZ!p!k!n!q#W#Xd[Ocj!c!g!k!n!q#W#XQrSR!e}XhPim!ddQOcj!c!g!k!n!q#W#XWePim!dQnSQ!P]Q!TfQ!WkQ!XlQ!h!YQ!i!ZR#O!xdWOcj!c!g!k!n!q#W#XfdP]fiklm!Y!Z!d!xRpSTvTwoYOPcfijm!c!d!g!k!n!q#W#XQ!r!kV!s!n#W#Xe^Ocj!c!g!k!n!q#W#X",
nodeNames: "⚠ Identifier Word IdentifierBeforeDot Program PipeExpr FunctionCall PositionalArg ParenExpr FunctionCallOrIdentifier BinOp operator operator operator operator ConditionalOp operator operator operator operator operator operator operator operator String StringFragment Interpolation EscapeSeq Number Boolean Regex Null DotGet FunctionDef keyword Params colon end Underscore NamedArg NamedArgPrefix operator IfExpr keyword ThenBlock ThenBlock ElsifExpr keyword ElseExpr keyword Assign",
maxTerm: 83,
context: trackScope,
nodeProps: [
["closedBy", 37,"end"],
["openedBy", 38,"colon"]
["closedBy", 36,"end"],
["openedBy", 37,"colon"]
],
propSources: [highlighting],
skippedNodes: [0],
repeatNodeCount: 7,
tokenData: "!&X~R!SOX$_XY$|YZ%gZp$_pq$|qr&Qrt$_tu'Yuw$_wx'_xy'dyz'}z{(h{|)R|}$_}!O)l!O!P,b!P!Q,{!Q![*]![!]5j!]!^%g!^!_6T!_!`7_!`!a7x!a#O$_#O#P9S#P#R$_#R#S9X#S#T$_#T#U9r#U#X;W#X#Y=m#Y#ZDs#Z#];W#]#^JO#^#b;W#b#cKp#c#d! Y#d#f;W#f#g!!z#g#h;W#h#i!#q#i#o;W#o#p$_#p#q!%i#q;'S$_;'S;=`$v<%l~$_~O$_~~!&SS$dUjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_S$yP;=`<%l$__%TUjS!_ZOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V%nUjS!rROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V&VWjSOt$_uw$_x!_$_!_!`&o!`#O$_#P;'S$_;'S;=`$v<%lO$_V&vUbRjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~'_O!j~~'dO!h~V'kUjS!fROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V(UUjS!gROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V(oU[RjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V)YU^RjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V)sWjS_ROt$_uw$_x!Q$_!Q![*]![#O$_#P;'S$_;'S;=`$v<%lO$_V*dYjSmROt$_uw$_x!O$_!O!P+S!P!Q$_!Q![*]![#O$_#P;'S$_;'S;=`$v<%lO$_V+XWjSOt$_uw$_x!Q$_!Q![+q![#O$_#P;'S$_;'S;=`$v<%lO$_V+xWjSmROt$_uw$_x!Q$_!Q![+q![#O$_#P;'S$_;'S;=`$v<%lO$_T,iU!oPjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V-SWjS]ROt$_uw$_x!P$_!P!Q-l!Q#O$_#P;'S$_;'S;=`$v<%lO$_V-q^jSOY.mYZ$_Zt.mtu/puw.mwx/px!P.m!P!Q$_!Q!}.m!}#O4c#O#P2O#P;'S.m;'S;=`5d<%lO.mV.t^jSoROY.mYZ$_Zt.mtu/puw.mwx/px!P.m!P!Q2e!Q!}.m!}#O4c#O#P2O#P;'S.m;'S;=`5d<%lO.mR/uXoROY/pZ!P/p!P!Q0b!Q!}/p!}#O1P#O#P2O#P;'S/p;'S;=`2_<%lO/pR0eP!P!Q0hR0mUoR#Z#[0h#]#^0h#a#b0h#g#h0h#i#j0h#m#n0hR1SVOY1PZ#O1P#O#P1i#P#Q/p#Q;'S1P;'S;=`1x<%lO1PR1lSOY1PZ;'S1P;'S;=`1x<%lO1PR1{P;=`<%l1PR2RSOY/pZ;'S/p;'S;=`2_<%lO/pR2bP;=`<%l/pV2jWjSOt$_uw$_x!P$_!P!Q3S!Q#O$_#P;'S$_;'S;=`$v<%lO$_V3ZbjSoROt$_uw$_x#O$_#P#Z$_#Z#[3S#[#]$_#]#^3S#^#a$_#a#b3S#b#g$_#g#h3S#h#i$_#i#j3S#j#m$_#m#n3S#n;'S$_;'S;=`$v<%lO$_V4h[jSOY4cYZ$_Zt4ctu1Puw4cwx1Px#O4c#O#P1i#P#Q.m#Q;'S4c;'S;=`5^<%lO4cV5aP;=`<%l4cV5gP;=`<%l.mT5qUjSuPOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V6[WcRjSOt$_uw$_x!_$_!_!`6t!`#O$_#P;'S$_;'S;=`$v<%lO$_V6{UdRjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V7fUaRjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V8PWeRjSOt$_uw$_x!_$_!_!`8i!`#O$_#P;'S$_;'S;=`$v<%lO$_V8pUfRjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~9XO!k~V9`UjSwROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V9w[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#b;W#b#c;{#c#o;W#o;'S$_;'S;=`$v<%lO$_U:tUyQjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_U;]YjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_V<Q[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#W;W#W#X<v#X#o;W#o;'S$_;'S;=`$v<%lO$_V<}YgRjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_V=r^jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#a>n#a#b;W#b#cCR#c#o;W#o;'S$_;'S;=`$v<%lO$_V>s[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#g;W#g#h?i#h#o;W#o;'S$_;'S;=`$v<%lO$_V?n^jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#X;W#X#Y@j#Y#];W#]#^Aa#^#o;W#o;'S$_;'S;=`$v<%lO$_V@qY!SPjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VAf[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#Y;W#Y#ZB[#Z#o;W#o;'S$_;'S;=`$v<%lO$_VBcY!QPjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VCW[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#W;W#W#XC|#X#o;W#o;'S$_;'S;=`$v<%lO$_VDTYjSvROt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VDx]jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#UEq#U#b;W#b#cIX#c#o;W#o;'S$_;'S;=`$v<%lO$_VEv[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#aFl#a#o;W#o;'S$_;'S;=`$v<%lO$_VFq[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#g;W#g#hGg#h#o;W#o;'S$_;'S;=`$v<%lO$_VGl[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#X;W#X#YHb#Y#o;W#o;'S$_;'S;=`$v<%lO$_VHiYnRjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VI`YsRjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VJT[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#Y;W#Y#ZJy#Z#o;W#o;'S$_;'S;=`$v<%lO$_VKQY|PjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$__Kw[!lWjSOt$_uw$_x!_$_!_!`:m!
`#O$_#P#T$_#T#i;W#i#jLm#j#o;W#o;'S$_;'S;=`$v<%lO$_VLr[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#aMh#a#o;W#o;'S$_;'S;=`$v<%lO$_VMm[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#aNc#a#o;W#o;'S$_;'S;=`$v<%lO$_VNjYpRjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_V! _[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#f;W#f#g!!T#g#o;W#o;'S$_;'S;=`$v<%lO$_V!![YhRjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_^!#RY!nWjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$__!#x[!mWjSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#f;W#f#g!$n#g#o;W#o;'S$_;'S;=`$v<%lO$_V!$s[jSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#i;W#i#jGg#j#o;W#o;'S$_;'S;=`$v<%lO$_V!%pUzRjSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~!&XO!v~",
tokenData: "!&X~R!SOX$_XY$|YZ%gZp$_pq$|qr&Qrt$_tu'Yuw$_wx'_xy'dyz'}z{(h{|)R|}$_}!O)l!O!P,b!P!Q,{!Q![*]![!]5j!]!^%g!^!_6T!_!`7_!`!a7x!a#O$_#O#P9S#P#R$_#R#S9X#S#T$_#T#U9r#U#X;W#X#Y=m#Y#ZDs#Z#];W#]#^JO#^#b;W#b#cKp#c#d! Y#d#f;W#f#g!!z#g#h;W#h#i!#q#i#o;W#o#p$_#p#q!%i#q;'S$_;'S;=`$v<%l~$_~O$_~~!&SS$dUiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_S$yP;=`<%l$__%TUiS!^ZOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V%nUiS!qROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V&VWiSOt$_uw$_x!_$_!_!`&o!`#O$_#P;'S$_;'S;=`$v<%lO$_V&vUaRiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~'_O!i~~'dO!g~V'kUiS!eROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V(UUiS!fROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V(oUZRiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V)YU]RiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V)sWiS^ROt$_uw$_x!Q$_!Q![*]![#O$_#P;'S$_;'S;=`$v<%lO$_V*dYiSlROt$_uw$_x!O$_!O!P+S!P!Q$_!Q![*]![#O$_#P;'S$_;'S;=`$v<%lO$_V+XWiSOt$_uw$_x!Q$_!Q![+q![#O$_#P;'S$_;'S;=`$v<%lO$_V+xWiSlROt$_uw$_x!Q$_!Q![+q![#O$_#P;'S$_;'S;=`$v<%lO$_T,iU!nPiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V-SWiS[ROt$_uw$_x!P$_!P!Q-l!Q#O$_#P;'S$_;'S;=`$v<%lO$_V-q^iSOY.mYZ$_Zt.mtu/puw.mwx/px!P.m!P!Q$_!Q!}.m!}#O4c#O#P2O#P;'S.m;'S;=`5d<%lO.mV.t^iSnROY.mYZ$_Zt.mtu/puw.mwx/px!P.m!P!Q2e!Q!}.m!}#O4c#O#P2O#P;'S.m;'S;=`5d<%lO.mR/uXnROY/pZ!P/p!P!Q0b!Q!}/p!}#O1P#O#P2O#P;'S/p;'S;=`2_<%lO/pR0eP!P!Q0hR0mUnR#Z#[0h#]#^0h#a#b0h#g#h0h#i#j0h#m#n0hR1SVOY1PZ#O1P#O#P1i#P#Q/p#Q;'S1P;'S;=`1x<%lO1PR1lSOY1PZ;'S1P;'S;=`1x<%lO1PR1{P;=`<%l1PR2RSOY/pZ;'S/p;'S;=`2_<%lO/pR2bP;=`<%l/pV2jWiSOt$_uw$_x!P$_!P!Q3S!Q#O$_#P;'S$_;'S;=`$v<%lO$_V3ZbiSnROt$_uw$_x#O$_#P#Z$_#Z#[3S#[#]$_#]#^3S#^#a$_#a#b3S#b#g$_#g#h3S#h#i$_#i#j3S#j#m$_#m#n3S#n;'S$_;'S;=`$v<%lO$_V4h[iSOY4cYZ$_Zt4ctu1Puw4cwx1Px#O4c#O#P1i#P#Q.m#Q;'S4c;'S;=`5^<%lO4cV5aP;=`<%l4cV5gP;=`<%l.mT5qUiStPOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V6[WbRiSOt$_uw$_x!_$_!_!`6t!`#O$_#P;'S$_;'S;=`$v<%lO$_V6{UcRiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V7fU`RiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V8PWdRiSOt$_uw$_x!_$_!_!`8i!`#O$_#P;'S$_;'S;=`$v<%lO$_V8pUeRiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~9XO!j~V9`UiSvROt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_V9w[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#b;W#b#c;{#c#o;W#o;'S$_;'S;=`$v<%lO$_U:tUxQiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_U;]YiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_V<Q[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#W;W#W#X<v#X#o;W#o;'S$_;'S;=`$v<%lO$_V<}YfRiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_V=r^iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#a>n#a#b;W#b#cCR#c#o;W#o;'S$_;'S;=`$v<%lO$_V>s[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#g;W#g#h?i#h#o;W#o;'S$_;'S;=`$v<%lO$_V?n^iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#X;W#X#Y@j#Y#];W#]#^Aa#^#o;W#o;'S$_;'S;=`$v<%lO$_V@qY!RPiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VAf[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#Y;W#Y#ZB[#Z#o;W#o;'S$_;'S;=`$v<%lO$_VBcY!PPiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VCW[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#W;W#W#XC|#X#o;W#o;'S$_;'S;=`$v<%lO$_VDTYiSuROt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VDx]iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#UEq#U#b;W#b#cIX#c#o;W#o;'S$_;'S;=`$v<%lO$_VEv[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#aFl#a#o;W#o;'S$_;'S;=`$v<%lO$_VFq[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#g;W#g#hGg#h#o;W#o;'S$_;'S;=`$v<%lO$_VGl[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#X;W#X#YHb#Y#o;W#o;'S$_;'S;=`$v<%lO$_VHiYmRiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VI`YrRiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_VJT[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#Y;W#Y#ZJy#Z#o;W#o;'S$_;'S;=`$v<%lO$_VKQY{PiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$__Kw[!kWiSOt$_uw$_x!_$_!_!`:m!
`#O$_#P#T$_#T#i;W#i#jLm#j#o;W#o;'S$_;'S;=`$v<%lO$_VLr[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#aMh#a#o;W#o;'S$_;'S;=`$v<%lO$_VMm[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#`;W#`#aNc#a#o;W#o;'S$_;'S;=`$v<%lO$_VNjYoRiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_V! _[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#f;W#f#g!!T#g#o;W#o;'S$_;'S;=`$v<%lO$_V!![YgRiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$_^!#RY!mWiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#o;W#o;'S$_;'S;=`$v<%lO$__!#x[!lWiSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#f;W#f#g!$n#g#o;W#o;'S$_;'S;=`$v<%lO$_V!$s[iSOt$_uw$_x!_$_!_!`:m!`#O$_#P#T$_#T#i;W#i#jGg#j#o;W#o;'S$_;'S;=`$v<%lO$_V!%pUyRiSOt$_uw$_x#O$_#P;'S$_;'S;=`$v<%lO$_~!&XO!u~",
tokenizers: [0, 1, 2, 3, tokenizer],
topRules: {"Program":[0,5]},
tokenPrec: 768
topRules: {"Program":[0,4]},
tokenPrec: 786
})

View File

@ -10,7 +10,7 @@ describe('null', () => {
test('parses null in assignments', () => {
expect('a = null').toMatchTree(`
Assign
AssignableIdentifier a
Identifier a
operator =
Null null`)
})
@ -212,11 +212,11 @@ describe('newlines', () => {
expect(`x = 5
y = 2`).toMatchTree(`
Assign
AssignableIdentifier x
Identifier x
operator =
Number 5
Assign
AssignableIdentifier y
Identifier y
operator =
Number 2`)
})
@ -224,11 +224,11 @@ y = 2`).toMatchTree(`
test('parses statements separated by semicolons', () => {
expect(`x = 5; y = 2`).toMatchTree(`
Assign
AssignableIdentifier x
Identifier x
operator =
Number 5
Assign
AssignableIdentifier y
Identifier y
operator =
Number 2`)
})
@ -236,7 +236,7 @@ y = 2`).toMatchTree(`
test('parses statement with word and a semicolon', () => {
expect(`a = hello; 2`).toMatchTree(`
Assign
AssignableIdentifier a
Identifier a
operator =
FunctionCallOrIdentifier
Identifier hello
@ -248,7 +248,7 @@ describe('Assign', () => {
test('parses simple assignment', () => {
expect('x = 5').toMatchTree(`
Assign
AssignableIdentifier x
Identifier x
operator =
Number 5`)
})
@ -256,7 +256,7 @@ describe('Assign', () => {
test('parses assignment with addition', () => {
expect('x = 5 + 3').toMatchTree(`
Assign
AssignableIdentifier x
Identifier x
operator =
BinOp
Number 5
@ -267,13 +267,13 @@ describe('Assign', () => {
test('parses assignment with functions', () => {
expect('add = fn a b: a + b end').toMatchTree(`
Assign
AssignableIdentifier add
Identifier add
operator =
FunctionDef
keyword fn
Params
AssignableIdentifier a
AssignableIdentifier b
Identifier a
Identifier b
colon :
BinOp
Identifier a
@ -287,7 +287,7 @@ describe('DotGet whitespace sensitivity', () => {
test('no whitespace - DotGet works when identifier in scope', () => {
expect('basename = 5; basename.prop').toMatchTree(`
Assign
AssignableIdentifier basename
Identifier basename
operator =
Number 5
DotGet
@ -298,7 +298,7 @@ describe('DotGet whitespace sensitivity', () => {
test('space before dot - NOT DotGet, parses as division', () => {
expect('basename = 5; basename / prop').toMatchTree(`
Assign
AssignableIdentifier basename
Identifier basename
operator =
Number 5
BinOp

View File

@ -19,7 +19,7 @@ describe('if/elsif/else', () => {
expect('a = if x: 2').toMatchTree(`
Assign
AssignableIdentifier a
Identifier a
operator =
IfExpr
keyword if

View File

@ -17,7 +17,7 @@ describe('DotGet', () => {
test('obj.prop is DotGet when obj is assigned', () => {
expect('obj = 5; obj.prop').toMatchTree(`
Assign
AssignableIdentifier obj
Identifier obj
operator =
Number 5
DotGet
@ -31,7 +31,7 @@ describe('DotGet', () => {
FunctionDef
keyword fn
Params
AssignableIdentifier config
Identifier config
colon :
DotGet
IdentifierBeforeDot config
@ -45,7 +45,7 @@ describe('DotGet', () => {
FunctionDef
keyword fn
Params
AssignableIdentifier x
Identifier x
colon :
DotGet
IdentifierBeforeDot x
@ -63,8 +63,8 @@ end`).toMatchTree(`
FunctionDef
keyword fn
Params
AssignableIdentifier x
AssignableIdentifier y
Identifier x
Identifier y
colon :
DotGet
IdentifierBeforeDot x
@ -84,7 +84,7 @@ end`).toMatchTree(`
FunctionDef
keyword fn
Params
AssignableIdentifier x
Identifier x
colon :
DotGet
IdentifierBeforeDot x
@ -92,7 +92,7 @@ end`).toMatchTree(`
FunctionDef
keyword fn
Params
AssignableIdentifier y
Identifier y
colon :
DotGet
IdentifierBeforeDot y
@ -105,7 +105,7 @@ end`).toMatchTree(`
test('dot get works as function argument', () => {
expect('config = 42; echo config.path').toMatchTree(`
Assign
AssignableIdentifier config
Identifier config
operator =
Number 42
FunctionCall
@ -120,7 +120,7 @@ end`).toMatchTree(`
test('mixed file paths and dot get', () => {
expect('config = 42; cat readme.txt; echo config.path').toMatchTree(`
Assign
AssignableIdentifier config
Identifier config
operator =
Number 42
FunctionCall

View File

@ -72,7 +72,7 @@ describe('Fn', () => {
FunctionDef
keyword fn
Params
AssignableIdentifier x
Identifier x
colon :
BinOp
Identifier x
@ -86,8 +86,8 @@ describe('Fn', () => {
FunctionDef
keyword fn
Params
AssignableIdentifier x
AssignableIdentifier y
Identifier x
Identifier y
colon :
BinOp
Identifier x
@ -104,8 +104,8 @@ end`).toMatchTree(`
FunctionDef
keyword fn
Params
AssignableIdentifier x
AssignableIdentifier y
Identifier x
Identifier y
colon :
BinOp
Identifier x

View File

@ -21,16 +21,16 @@ describe('multiline', () => {
add 3 4
`).toMatchTree(`
Assign
AssignableIdentifier add
Identifier add
operator =
FunctionDef
keyword fn
Params
AssignableIdentifier a
AssignableIdentifier b
Identifier a
Identifier b
colon :
Assign
AssignableIdentifier result
Identifier result
operator =
BinOp
Identifier a
@ -63,8 +63,8 @@ end
FunctionDef
keyword fn
Params
AssignableIdentifier x
AssignableIdentifier y
Identifier x
Identifier y
colon :
FunctionCallOrIdentifier
Identifier x

View File

@ -50,7 +50,7 @@ describe('pipe expressions', () => {
test('pipe expression in assignment', () => {
expect('result = echo hello | grep h').toMatchTree(`
Assign
AssignableIdentifier result
Identifier result
operator =
PipeExpr
FunctionCall
@ -77,7 +77,7 @@ describe('pipe expressions', () => {
FunctionDef
keyword fn
Params
AssignableIdentifier x
Identifier x
colon :
FunctionCallOrIdentifier
Identifier x

View File

@ -1,107 +1,63 @@
import { ExternalTokenizer, InputStream, Stack } from '@lezer/lr'
import { Identifier, AssignableIdentifier, Word, IdentifierBeforeDot } from './shrimp.terms'
import { Identifier, Word, IdentifierBeforeDot } from './shrimp.terms'
import type { Scope } from './scopeTracker'
// The only chars that can't be words are whitespace, apostrophes, closing parens, and EOF.
export const tokenizer = new ExternalTokenizer(
(input: InputStream, stack: Stack) => {
const ch = getFullCodePoint(input, 0)
let ch = getFullCodePoint(input, 0)
console.log(`🌭 checking char ${String.fromCodePoint(ch)}`)
if (!isWordChar(ch)) return
const isValidStart = isLowercaseLetter(ch) || isEmoji(ch)
let pos = getCharSize(ch)
let isValidIdentifier = isLowercaseLetter(ch) || isEmoji(ch)
const canBeWord = stack.canShift(Word)
// Consume all word characters, tracking if it remains a valid identifier
const { pos, isValidIdentifier, stoppedAtDot } = consumeWordToken(
input,
isValidStart,
canBeWord
)
while (true) {
ch = getFullCodePoint(input, pos)
// Check if we should emit IdentifierBeforeDot for property access
if (stoppedAtDot) {
const dotGetToken = checkForDotGet(input, stack, pos)
if (dotGetToken) {
input.advance(pos)
input.acceptToken(dotGetToken)
} else {
// Not in scope - continue consuming the dot as part of the word
const afterDot = consumeRestOfWord(input, pos + 1, canBeWord)
input.advance(afterDot)
input.acceptToken(Word)
}
return
}
// Advance past the token we consumed
input.advance(pos)
// Choose which token to emit
if (isValidIdentifier) {
const token = chooseIdentifierToken(input, stack)
input.acceptToken(token)
} else {
input.acceptToken(Word)
}
},
{ contextual: true }
)
// Build identifier text from input stream, handling surrogate pairs for emoji
const buildIdentifierText = (input: InputStream, length: number): string => {
let text = ''
for (let i = 0; i < length; i++) {
// Check for dot and scope - property access detection
if (ch === 46 /* . */ && isValidIdentifier) {
// Build identifier text by peeking character by character
let identifierText = ''
for (let i = 0; i < pos; i++) {
const charCode = input.peek(i)
if (charCode === -1) break
// Handle surrogate pairs for emoji (UTF-16 encoding)
if (charCode >= 0xd800 && charCode <= 0xdbff && i + 1 < length) {
// Handle surrogate pairs for emoji
if (charCode >= 0xd800 && charCode <= 0xdbff && i + 1 < pos) {
const low = input.peek(i + 1)
if (low >= 0xdc00 && low <= 0xdfff) {
text += String.fromCharCode(charCode, low)
identifierText += String.fromCharCode(charCode, low)
i++ // Skip the low surrogate
continue
}
}
text += String.fromCharCode(charCode)
}
return text
identifierText += String.fromCharCode(charCode)
}
// Consume word characters, tracking if it remains a valid identifier
// Returns the position after consuming, whether it's a valid identifier, and if we stopped at a dot
const consumeWordToken = (
input: InputStream,
isValidStart: boolean,
canBeWord: boolean
): { pos: number; isValidIdentifier: boolean; stoppedAtDot: boolean } => {
let pos = getCharSize(getFullCodePoint(input, 0))
let isValidIdentifier = isValidStart
let stoppedAtDot = false
const scope = stack.context as Scope | undefined
while (true) {
const ch = getFullCodePoint(input, pos)
// Stop at dot if we have a valid identifier (might be property access)
if (ch === 46 /* . */ && isValidIdentifier) {
stoppedAtDot = true
break
if (scope?.has(identifierText)) {
// In scope - stop here, let grammar parse property access
input.advance(pos)
input.acceptToken(IdentifierBeforeDot)
return
}
// Not in scope - continue consuming as Word (fall through)
}
// Stop if we hit a non-word character
if (!isWordChar(ch)) break
// Context-aware termination: semicolon/colon can end a word if followed by whitespace
// This allows `hello; 2` to parse correctly while `hello;world` stays as one word
// Certain characters might end a word or identifier if they are followed by whitespace.
// This allows things like `a = hello; 2` or `if x: y` to parse correctly.
if (canBeWord && (ch === 59 /* ; */ || ch === 58) /* : */) {
const nextCh = getFullCodePoint(input, pos + 1)
if (!isWordChar(nextCh)) break
}
// Track identifier validity: must be lowercase, digit, dash, or emoji
if (!isLowercaseLetter(ch) && !isDigit(ch) && ch !== 45 /* - */ && !isEmoji(ch)) {
// Track identifier validity
if (!isLowercaseLetter(ch) && !isDigit(ch) && ch !== 45 && !isEmoji(ch)) {
if (!canBeWord) break
isValidIdentifier = false
}
@ -109,73 +65,21 @@ const consumeWordToken = (
pos += getCharSize(ch)
}
return { pos, isValidIdentifier, stoppedAtDot }
}
input.advance(pos)
input.acceptToken(isValidIdentifier ? Identifier : Word)
},
{ contextual: true }
)
// Consume the rest of a word after we've decided not to treat a dot as DotGet
// Used when we have "file.txt" - we already consumed "file", now consume ".txt"
const consumeRestOfWord = (input: InputStream, startPos: number, canBeWord: boolean): number => {
let pos = startPos
while (true) {
const ch = getFullCodePoint(input, pos)
// Stop if we hit a non-word character
if (!isWordChar(ch)) break
// Context-aware termination for semicolon/colon
if (canBeWord && (ch === 59 /* ; */ || ch === 58) /* : */) {
const nextCh = getFullCodePoint(input, pos + 1)
if (!isWordChar(nextCh)) break
}
pos += getCharSize(ch)
}
return pos
}
// Check if this identifier is in scope (for property access detection)
// Returns IdentifierBeforeDot token if in scope, null otherwise
const checkForDotGet = (input: InputStream, stack: Stack, pos: number): number | null => {
const identifierText = buildIdentifierText(input, pos)
const context = stack.context as { scope: { has(name: string): boolean } } | undefined
// If identifier is in scope, this is property access (e.g., obj.prop)
// If not in scope, it should be consumed as a Word (e.g., file.txt)
return context?.scope.has(identifierText) ? IdentifierBeforeDot : null
}
// Decide between AssignableIdentifier and Identifier using grammar state + peek-ahead
const chooseIdentifierToken = (input: InputStream, stack: Stack): number => {
const canAssignable = stack.canShift(AssignableIdentifier)
const canRegular = stack.canShift(Identifier)
// Only one option is valid - use it
if (canAssignable && !canRegular) return AssignableIdentifier
if (canRegular && !canAssignable) return Identifier
// Both possible (ambiguous context) - peek ahead for '=' to disambiguate
// This happens at statement start where both `x = 5` (assign) and `echo x` (call) are valid
let peekPos = 0
while (true) {
const ch = getFullCodePoint(input, peekPos)
if (isWhiteSpace(ch)) {
peekPos += getCharSize(ch)
} else {
break
}
}
const nextCh = getFullCodePoint(input, peekPos)
return nextCh === 61 /* = */ ? AssignableIdentifier : Identifier
}
// Character classification helpers
const isWhiteSpace = (ch: number): boolean => {
return ch === 32 /* space */ || ch === 9 /* tab */ || ch === 13 /* \r */
return ch === 32 /* space */ || ch === 10 /* \n */ || ch === 9 /* tab */ || ch === 13 /* \r */
}
const isWordChar = (ch: number): boolean => {
return !isWhiteSpace(ch) && ch !== 10 /* \n */ && ch !== 41 /* ) */ && ch !== -1 /* EOF */
const closingParen = ch === 41 /* ) */
const eof = ch === -1
return !isWhiteSpace(ch) && !closingParen && !eof
}
const isLowercaseLetter = (ch: number): boolean => {
@ -199,7 +103,7 @@ const getFullCodePoint = (input: InputStream, pos: number): number => {
}
}
return ch
return ch // Single code unit
}
const isEmoji = (ch: number): boolean => {