Reduce per-Token allocation in the lexer by chrisdp · Pull Request #1712 · rokucommunity/brighterscript

chrisdp · 2026-05-14T00:04:24Z

Summary

Five small lexer/parser changes targeting per-Token wrapper allocations on lexer-bound builds. Net effect on a namespace-bound ukor build is within run-to-run noise; the wins should concentrate on codebases where the lexer is the dominant allocator (e.g. peacock, where the report from the PR #1685 follow-up shows ~270 MB of lexer self-heap).

Changes

FixedTokenText canonical strings. TokenKind.ts exposes a Partial<Record<TokenKind, string>> mapping ~42 punctuation/operator token kinds to their canonical text (Dot → '.', LeftParen → '(', etc.). Lexer.addToken uses the canonical instance instead of allocating a fresh source.slice wrapper for these kinds. Excludes keywords (text varies by case in source) and identifiers / literals / comments (text is content-dependent).
LexerTextCache for Newline + Whitespace. Lazy intern table for the two kinds with bounded-but-not-fixed text variety (3 valid newline forms, ~50 typical indent patterns in real source). The table is module-scoped; its retained memory is trivial.
Lexer.whitespace() fast path when includeWhitespace=false. The default path previously built a full Whitespace Token (with Range tree of 5 nested objects) only to immediately pop() it from the stream. The fast path now computes the canonical text directly and updates start/lineBegin/columnBegin to match what addToken's sync() would have done. Saves a Token + Range allocation per whitespace run on default builds. This is the change most likely to move peak RSS on peacock-shaped projects.
EOF token field order. Lexer.scan()'s EOF token construction now matches addToken's field order (kind, text, isReserved, range, leadingWhitespace), keeping all lexer-produced Tokens on a single V8 hidden class.
Parser-synthesized Token shapes. Four synthesis sites in Parser.ts (the recovered Function token and three compound-assignment Equal tokens) are normalized to the lexer's 5-field shape with isReserved: false and leadingWhitespace: '' to keep V8 representation type consistent with the lexer's writes.
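The first two changes can be illustrated with a minimal sketch. This is not the code from the PR: the `TokenKind` object, the two-entry `FixedTokenText` table, the `textCache` map, and the `internText` / `tokenText` helpers are simplified, hypothetical stand-ins, and the real `Lexer.addToken` may be structured differently (for instance, it may avoid slicing before the cache lookup).

```typescript
// Hypothetical, heavily reduced token kinds (the real TokenKind is much larger).
const TokenKind = {
    Dot: 'Dot',
    LeftParen: 'LeftParen',
    Identifier: 'Identifier',
    Whitespace: 'Whitespace',
    Newline: 'Newline'
} as const;
type TokenKind = typeof TokenKind[keyof typeof TokenKind];

// Kinds whose text never varies share one canonical string per kind.
// Only two entries are shown here; the PR describes roughly 42.
const FixedTokenText: Partial<Record<TokenKind, string>> = {
    Dot: '.',
    LeftParen: '('
};

// Lazy intern table for kinds with a small, open-ended set of texts.
// Assumption: a plain Map keyed by the text itself.
const textCache = new Map<string, string>();

function internText(text: string): string {
    const cached = textCache.get(text);
    if (cached !== undefined) {
        return cached;
    }
    textCache.set(text, text);
    return text;
}

// Picks the text for a token spanning source[start, end).
function tokenText(kind: TokenKind, source: string, start: number, end: number): string {
    const fixed = FixedTokenText[kind];
    if (fixed !== undefined) {
        // Fixed-text kinds never touch the source string.
        return fixed;
    }
    const text = source.slice(start, end);
    if (kind === TokenKind.Whitespace || kind === TokenKind.Newline) {
        // Repeated indent and newline patterns collapse onto one cached entry.
        return internText(text);
    }
    // Identifiers, literals, and comments keep their content-dependent text.
    return text;
}
```

String identity is not observable from JavaScript, so callers see the same values either way; any benefit is limited to what the engine has to allocate and retain.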

Test plan

All 2910 specs pass
Heap profile on a lexer-heavy codebase to confirm the per-token wrapper savings show up where expected — the build used during development is namespace-bound so the signal isn't as visible there
Sanity check that consumers reading token.range, token.text, token.leadingWhitespace still behave identically (the lexer's externally-visible Token contract is unchanged; only internal allocation patterns changed)

Five small changes targeting per-Token wrapper allocations on lexer-bound builds (~270 MB of self-heap on the peacock build from PR #1685 follow-up data). Net effect on a namespace-bound ukor build is within run-to-run noise; the wins concentrate on codebases where the lexer is the dominant allocator.

- `FixedTokenText` in `TokenKind.ts` maps ~42 punctuation/operator token kinds to canonical string instances. `Lexer.addToken` uses the canonical string for these kinds instead of allocating a fresh `source.slice` wrapper per token.
- `LexerTextCache` lazily interns `Newline` and `Whitespace` token text (3 valid newline forms, ~50 typical indent patterns in real source). Avoids the global token-text interner cost since identifier and literal kinds are excluded.
- `Lexer.whitespace()` skips `Token` + `Range` allocation entirely when `includeWhitespace=false` (the default). Previously the function built a full Whitespace token only to immediately pop it from the stream; now it computes the canonical text and updates `start/lineBegin/columnBegin` inline to match what `addToken`'s `sync()` would have done.
- EOF token construction in `Lexer.scan()` and four parser-synthesized token sites in `Parser.ts` (the recovered `Function` token and three compound-assignment `Equal` tokens) are normalized to the lexer's 5-field shape (`kind, text, isReserved, range, leadingWhitespace`). Keeps Tokens monomorphic in V8 so downstream `token.text`/`token.range` accesses stay on the fast path.

Tests: 2910 passing.
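The whitespace fast path and the single token shape can be sketched together on a toy lexer. Everything below is an illustrative assumption rather than the PR's code: the `createToken` and `lex` functions, the flat four-number `Range`, the `'Word'` kind, and the single-line input are all hypothetical simplifications, and the handling of `leadingWhitespace` when whitespace tokens are emitted is a guess.

```typescript
// Simplified stand-in for the lexer's Range (single-line input only in this sketch).
interface Range {
    startLine: number;
    startColumn: number;
    endLine: number;
    endColumn: number;
}

interface Token {
    kind: string;
    text: string;
    isReserved: boolean;
    range: Range;
    leadingWhitespace: string;
}

// One construction site: every token gets the same five fields in the same order,
// which is the condition for the objects to share a shape in the engine.
function createToken(kind: string, text: string, range: Range, leadingWhitespace: string): Token {
    return { kind, text, isReserved: false, range, leadingWhitespace };
}

function isSpaceChar(ch: string): boolean {
    return ch === ' ' || ch === '\t';
}

// Toy lexer: splits one line into whitespace runs and "Word" runs.
function lex(source: string, includeWhitespace = false): Token[] {
    const tokens: Token[] = [];
    let pos = 0;
    let leading = '';
    while (pos < source.length) {
        const start = pos;
        const inSpace = isSpaceChar(source[pos]);
        while (pos < source.length && isSpaceChar(source[pos]) === inSpace) {
            pos++;
        }
        const text = source.slice(start, pos);
        if (inSpace) {
            leading = text;
            if (!includeWhitespace) {
                // Fast path: the position has already advanced, and no Token
                // or Range object is created for this whitespace run.
                continue;
            }
        }
        const range: Range = { startLine: 0, startColumn: start, endLine: 0, endColumn: pos };
        tokens.push(createToken(inSpace ? 'Whitespace' : 'Word', text, range, inSpace ? '' : leading));
        if (!inSpace) {
            leading = '';
        }
    }
    // The end-of-file token goes through the same factory as every other token.
    const eofRange: Range = { startLine: 0, startColumn: pos, endLine: 0, endColumn: pos };
    tokens.push(createToken('Eof', '', eofRange, leading));
    return tokens;
}
```

With the default `includeWhitespace = false`, `lex('a  b')` yields two `Word` tokens and an `Eof` token; the second `Word` still reports the skipped run through `leadingWhitespace` and starts at column 3, so positions match what the slower path would have produced.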

rokucommunity-bot · 2026-05-14T00:06:50Z

Hey there! I just built a new temporary npm package based on f20bb0d. You can download it here or install it by running the following command:

npm install https://github.com/rokucommunity/brighterscript/releases/download/v0.0.0-packages/brighterscript-0.72.1-perf-lexer-allocation-cleanup.20260514000517.tgz

TwitchBronBron

lgtm. once we test this, i'm good to merge.

chrisdp added the create-package label (create a temporary npm package on every commit) on May 14, 2026

TwitchBronBron approved these changes May 14, 2026

chrisdp wants to merge 1 commit into master from perf/lexer-allocation-cleanup
2 participants