Reduce per-Token allocation in the lexer #1712

Draft
chrisdp wants to merge 1 commit into master from perf/lexer-allocation-cleanup


chrisdp commented May 14, 2026

Summary

Five small lexer/parser changes that target per-Token wrapper allocations on lexer-bound builds. Net effect on a namespace-bound ukor build is within run-to-run noise; the wins should concentrate on codebases where the lexer is the dominant allocator (e.g. peacock, where the follow-up report on PR #1685 shows ~270 MB of lexer self-heap).

Changes

  1. FixedTokenText canonical strings. TokenKind.ts exposes a Partial<Record<TokenKind, string>> mapping ~42 punctuation/operator token kinds to their canonical text (Dot → '.', LeftParen → '(', etc.). Lexer.addToken uses the canonical instance instead of allocating a fresh source.slice wrapper for these kinds. Excludes keywords (text varies by case in source) and identifiers / literals / comments (text is content-dependent).

  2. LexerTextCache for Newline + Whitespace. Lazy intern table for the two kinds with bounded-but-not-fixed text variety (3 valid newline forms, ~50 typical indent patterns in real source). The table lives at module scope, and because so few distinct strings occur, its retention cost is trivial.

  3. Lexer.whitespace() fast path when includeWhitespace=false. The default path previously built a full Whitespace Token (with Range tree of 5 nested objects) only to immediately pop() it from the stream. The fast path now computes the canonical text directly and updates start/lineBegin/columnBegin to match what addToken's sync() would have done. Saves a Token + Range allocation per whitespace run on default builds. This is the change most likely to move peak RSS on peacock-shaped projects.

  4. EOF token field order. Lexer.scan()'s EOF token construction now matches addToken's field order (kind, text, isReserved, range, leadingWhitespace), keeping all lexer-produced Tokens on a single V8 hidden class.

  5. Parser-synthesized Token shapes. Four synthesis sites in Parser.ts (the recovered Function token and three compound-assignment Equal tokens) are normalized to the lexer's 5-field shape with isReserved: false and leadingWhitespace: '' to keep V8 representation type consistent with the lexer's writes.
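
For readers who want a picture of changes 1 and 2, here is a minimal sketch of the idea. It is not the code in this PR: the reduced `TokenKind` enum and the `internText` and `tokenText` helpers are hypothetical names, and the real `Lexer.addToken` may avoid the intermediate slice in ways this sketch does not.

```typescript
// Hypothetical, reduced enum; the real TokenKind has many more members.
enum TokenKind {
    Dot = 'Dot',
    LeftParen = 'LeftParen',
    RightParen = 'RightParen',
    Identifier = 'Identifier',
    Whitespace = 'Whitespace',
    Newline = 'Newline'
}

// Kinds whose text never varies map to one shared string each.
const FixedTokenText: Partial<Record<TokenKind, string>> = {
    [TokenKind.Dot]: '.',
    [TokenKind.LeftParen]: '(',
    [TokenKind.RightParen]: ')'
};

// Lazy intern table for kinds with a small, bounded set of distinct texts.
const lexerTextCache = new Map<string, string>();

// Returns the first-seen string with the same contents (hypothetical helper).
function internText(text: string): string {
    const cached = lexerTextCache.get(text);
    if (cached !== undefined) {
        return cached;
    }
    lexerTextCache.set(text, text);
    return text;
}

// Picks the token text for a source span (hypothetical helper).
function tokenText(kind: TokenKind, source: string, start: number, end: number): string {
    const fixed = FixedTokenText[kind];
    if (fixed !== undefined) {
        return fixed; // fixed-text kinds never slice the source
    }
    const slice = source.slice(start, end);
    if (kind === TokenKind.Whitespace || kind === TokenKind.Newline) {
        return internText(slice); // bounded variety: reuse the cached string
    }
    return slice; // identifiers, literals, comments: content-dependent
}
```

Keywords are deliberately left out of the fixed map in this sketch for the same reason given above: their text varies by case in source, so a single canonical string would change what `token.text` reports.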

Test plan

  • All 2910 specs pass
  • Heap profile on a lexer-heavy codebase to confirm the per-token wrapper savings show up where expected — the build used during development is namespace-bound so the signal isn't as visible there
  • Sanity check that consumers reading token.range, token.text, token.leadingWhitespace still behave identically (the lexer's externally-visible Token contract is unchanged; only internal allocation patterns changed)

Five small changes targeting per-Token wrapper allocations on lexer-bound
builds (~270 MB of self-heap on the peacock build from PR #1685 follow-up
data). Net effect on a namespace-bound ukor build is within run-to-run
noise; the wins concentrate on codebases where the lexer is the dominant
allocator.

- `FixedTokenText` in `TokenKind.ts` maps ~42 punctuation/operator token
  kinds to canonical string instances. `Lexer.addToken` uses the canonical
  string for these kinds instead of allocating a fresh `source.slice`
  wrapper per token.

- `LexerTextCache` lazily interns `Newline` and `Whitespace` token text
  (3 valid newline forms, ~50 typical indent patterns in real source).
  Avoids the global token-text interner cost since identifier and literal
  kinds are excluded.

- `Lexer.whitespace()` skips `Token` + `Range` allocation entirely when
  `includeWhitespace=false` (the default). Previously the function built
  a full Whitespace token only to immediately pop it from the stream;
  now it computes the canonical text and updates `start/lineBegin/columnBegin`
  inline to match what `addToken`'s `sync()` would have done.

- EOF token construction in `Lexer.scan()` and four parser-synthesized
  token sites in `Parser.ts` (the recovered `Function` token and three
  compound-assignment `Equal` tokens) are normalized to the lexer's
  5-field shape (`kind, text, isReserved, range, leadingWhitespace`).
  Keeps Tokens monomorphic in V8 so downstream `token.text`/`token.range`
  accesses stay on the fast path.
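
A rough sketch of the last two bullets, under stated assumptions: `MiniLexer`, `createToken`, and `rangeOf` are hypothetical names, the input is a single line, and only spaces and tabs count as whitespace. It shows the fast path advancing position state without building a `Token` or `Range`, and a single factory that writes the five fields in one order.

```typescript
interface Position { line: number; character: number }
interface Range { start: Position; end: Position }
interface Token {
    kind: string;
    text: string;
    isReserved: boolean;
    range: Range;
    leadingWhitespace: string;
}

// One factory so every token gets the same fields in the same order.
function createToken(kind: string, text: string, range: Range, isReserved = false, leadingWhitespace = ''): Token {
    return { kind, text, isReserved, range, leadingWhitespace };
}

class MiniLexer {
    public tokens: Token[] = [];
    private start = 0;
    private current = 0;
    private line = 0; // single-line input only in this sketch
    private column = 0;
    private leadingWhitespace = '';

    constructor(private source: string, private includeWhitespace = false) {}

    public scan(): Token[] {
        while (this.current < this.source.length) {
            this.start = this.current;
            if (this.isBlank(this.source[this.current])) {
                this.whitespace();
            } else {
                this.word();
            }
        }
        // EOF goes through the same factory as every other token.
        this.tokens.push(createToken('Eof', '', this.rangeOf(0), false, this.leadingWhitespace));
        return this.tokens;
    }

    private isBlank(ch: string): boolean {
        return ch === ' ' || ch === '\t';
    }

    private rangeOf(length: number): Range {
        return {
            start: { line: this.line, character: this.column },
            end: { line: this.line, character: this.column + length }
        };
    }

    private whitespace(): void {
        while (this.current < this.source.length && this.isBlank(this.source[this.current])) {
            this.current++;
        }
        const text = this.source.slice(this.start, this.current);
        if (this.includeWhitespace) {
            // Slow path: the caller asked for Whitespace tokens.
            this.tokens.push(createToken('Whitespace', text, this.rangeOf(text.length)));
        }
        // Both paths: remember the text and advance the column.
        // With includeWhitespace=false, no Token or Range is created.
        this.leadingWhitespace = text;
        this.column += text.length;
    }

    private word(): void {
        while (this.current < this.source.length && !this.isBlank(this.source[this.current])) {
            this.current++;
        }
        const text = this.source.slice(this.start, this.current);
        this.tokens.push(createToken('Identifier', text, this.rangeOf(text.length), false, this.leadingWhitespace));
        this.leadingWhitespace = '';
        this.column += text.length;
    }
}
```

In this sketch the non-whitespace tokens come out with the same ranges whether or not `includeWhitespace` is set, which is the property the fast path has to preserve.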

Tests: 2910 passing.
chrisdp added the create-package label ("create a temporary npm package on every commit") on May 14, 2026
@rokucommunity-bot

Hey there! I just built a new temporary npm package based on f20bb0d. You can download it here or install it by running the following command:

npm install https://github.com/rokucommunity/brighterscript/releases/download/v0.0.0-packages/brighterscript-0.72.1-perf-lexer-allocation-cleanup.20260514000517.tgz


TwitchBronBron left a comment


lgtm. once we test this, i'm good to merge.
