Reduce per-Token allocation in the lexer #1712
Draft
chrisdp wants to merge 1 commit into
Five small changes targeting per-Token wrapper allocations on lexer-bound builds (~270 MB of self-heap on the peacock build from PR #1685 follow-up data). Net effect on a namespace-bound ukor build is within run-to-run noise; the wins concentrate on codebases where the lexer is the dominant allocator.
- `FixedTokenText` in `TokenKind.ts` maps ~42 punctuation/operator token kinds to canonical string instances. `Lexer.addToken` uses the canonical string for these kinds instead of allocating a fresh `source.slice` wrapper per token.
- `LexerTextCache` lazily interns `Newline` and `Whitespace` token text (3 valid newline forms, ~50 typical indent patterns in real source). Avoids the global token-text interner cost since identifier and literal kinds are excluded.
- `Lexer.whitespace()` skips `Token` + `Range` allocation entirely when `includeWhitespace=false` (the default). Previously the function built a full Whitespace token only to immediately pop it from the stream; now it computes the canonical text and updates `start/lineBegin/columnBegin` inline to match what `addToken`'s `sync()` would have done.
- EOF token construction in `Lexer.scan()` and four parser-synthesized token sites in `Parser.ts` (the recovered `Function` token and three compound-assignment `Equal` tokens) are normalized to the lexer's 5-field shape (`kind, text, isReserved, range, leadingWhitespace`). Keeps Tokens monomorphic in V8 so downstream `token.text`/`token.range` accesses stay on the fast path.

Tests: 2910 passing.
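As a rough illustration of the first two bullets, here is a minimal sketch of canonical fixed text plus a lazy intern table. It is not the PR's actual code: the trimmed `TokenKind` enum, the `internText` and `tokenText` helpers, and the `Map`-based cache are assumptions made for the example; only the names `FixedTokenText` and `LexerTextCache` come from the description above.

```typescript
// Hypothetical subset of token kinds; the real TokenKind enum is much larger.
enum TokenKind {
    Dot = 'Dot',
    LeftParen = 'LeftParen',
    Newline = 'Newline',
    Whitespace = 'Whitespace',
    Identifier = 'Identifier'
}

// Kinds whose text never varies map to one shared string instance.
const FixedTokenText: Partial<Record<TokenKind, string>> = {
    [TokenKind.Dot]: '.',
    [TokenKind.LeftParen]: '('
};

// Lazy intern table for kinds with a small, bounded set of distinct texts
// (stand-in for the LexerTextCache named in the description).
const lexerTextCache = new Map<string, string>();

function internText(text: string): string {
    const cached = lexerTextCache.get(text);
    if (cached !== undefined) {
        return cached; // reuse the first instance seen with this content
    }
    lexerTextCache.set(text, text);
    return text;
}

// Pick the text for a token spanning source[start, end).
function tokenText(kind: TokenKind, source: string, start: number, end: number): string {
    const fixed = FixedTokenText[kind];
    if (fixed !== undefined) {
        return fixed; // punctuation/operators: no slice at all
    }
    const text = source.slice(start, end);
    if (kind === TokenKind.Newline || kind === TokenKind.Whitespace) {
        return internText(text); // the fresh slice can be collected; the cached copy is kept
    }
    return text; // identifiers, literals: content-dependent, not interned
}

const source = '    x    y';
tokenText(TokenKind.Whitespace, source, 0, 4);
tokenText(TokenKind.Whitespace, source, 5, 9);
console.log(lexerTextCache.size); // both indent runs share one cache entry
```

String identity is not observable from JavaScript, so the sketch can only show that repeated whitespace runs resolve to a single cache entry; whether that reduces heap usage depends on how the engine represents sliced strings.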
Contributor
Hey there! I just built a new temporary npm package based on f20bb0d. You can download it here or install it by running the following command: `npm install https://github.com/rokucommunity/brighterscript/releases/download/v0.0.0-packages/brighterscript-0.72.1-perf-lexer-allocation-cleanup.20260514000517.tgz`
TwitchBronBron
approved these changes
May 14, 2026
Member
TwitchBronBron
left a comment
lgtm. once we test this, i'm good to merge.
Summary
Five small lexer/parser changes targeting per-Token wrapper allocations on lexer-bound builds. Net effect on a namespace-bound ukor build is within run-to-run noise; the wins should concentrate on codebases where the lexer is the dominant allocator (e.g. peacock, where the report from PR #1685 follow-up shows ~270 MB of lexer self-heap).
Changes
- `FixedTokenText` canonical strings. `TokenKind.ts` exposes a `Partial<Record<TokenKind, string>>` mapping ~42 punctuation/operator token kinds to their canonical text (`Dot` → `'.'`, `LeftParen` → `'('`, etc.). `Lexer.addToken` uses the canonical instance instead of allocating a fresh `source.slice` wrapper for these kinds. Excludes keywords (text varies by case in source) and identifiers / literals / comments (text is content-dependent).
- `LexerTextCache` for Newline + Whitespace. Lazy intern table for the two kinds with bounded-but-not-fixed text variety (3 valid newline forms, ~50 typical indent patterns in real source). Module-scope, retention is trivial.
- `Lexer.whitespace()` fast path when `includeWhitespace=false`. The default path previously built a full Whitespace Token (with `Range` tree of 5 nested objects) only to immediately `pop()` it from the stream. The fast path now computes the canonical text directly and updates `start/lineBegin/columnBegin` to match what `addToken`'s `sync()` would have done. Saves a Token + Range allocation per whitespace run on default builds. This is the change most likely to move peak RSS on peacock-shaped projects.
- EOF token field order. `Lexer.scan()`'s EOF token construction now matches `addToken`'s field order (`kind, text, isReserved, range, leadingWhitespace`), keeping all lexer-produced Tokens on a single V8 hidden class.
- Parser-synthesized Token shapes. Four synthesis sites in `Parser.ts` (the recovered `Function` token and three compound-assignment `Equal` tokens) are normalized to the lexer's 5-field shape with `isReserved: false` and `leadingWhitespace: ''` to keep V8 representation type consistent with the lexer's writes.

Test plan
- `token.range`, `token.text`, `token.leadingWhitespace` still behave identically (the lexer's externally-visible Token contract is unchanged; only internal allocation patterns changed)
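
The two shape-normalization changes (EOF field order and parser-synthesized tokens) can be sketched with a single construction helper. This is an illustrative sketch only: `createToken`, the flat `Range` interface, and the string `kind` values are hypothetical and simplified, not the project's actual definitions. The idea is that every token is created with the same five fields in the same order, which is the usual precondition for objects sharing a V8 hidden class.

```typescript
// Hypothetical, flattened Range; the real Range nests more objects.
interface Range {
    startLine: number;
    startColumn: number;
    endLine: number;
    endColumn: number;
}

interface Token {
    kind: string;
    text: string;
    isReserved: boolean;
    range: Range;
    leadingWhitespace: string;
}

// One construction site: same five fields, same order, same value types.
// Defaults mirror the values described for parser-synthesized tokens.
function createToken(
    kind: string,
    text: string,
    range: Range,
    isReserved = false,
    leadingWhitespace = ''
): Token {
    return { kind, text, isReserved, range, leadingWhitespace };
}

const range: Range = { startLine: 0, startColumn: 0, endLine: 0, endColumn: 3 };

// A lexer-style token and a parser-synthesized token built the same way.
const lexed = createToken('Identifier', 'foo', range, false, ' ');
const synthesized = createToken('Equal', '=', range);

// Hidden classes are not observable from plain JavaScript; matching key
// order is only a proxy for the two objects having the same shape.
const sameShape = Object.keys(lexed).join(',') === Object.keys(synthesized).join(',');
console.log(sameShape); // true
```

Routing every synthesis site through one helper also makes it harder for a later edit to reintroduce a token with a missing or reordered field.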