fix(csv-parse): align trim with ECMAScript whitespace (fix #482)#483

Merged
wdavidw merged 2 commits into adaltas:master from vishwakt:fix/csv-parse-trim-ecmascript-whitespace
May 11, 2026
@vishwakt
Contributor

Closes #482.

Aligns trim / ltrim / rtrim with String.prototype.trim(). Previously only \r, \n, \f, \t and space were treated as trimmable, so anything else JS considers whitespace (U+3000 from the issue, plus NBSP, vertical tab, Ogham space mark, U+2000-U+200A, line/paragraph separators, ZWNBSP) passed through untrimmed.
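For reference, a minimal sketch of the character set involved. It lists the code points that the ECMAScript specification classifies as WhiteSpace or LineTerminator and checks each one against String.prototype.trim(). The constant names are illustrative and are not taken from the csv-parse source.

```javascript
// Code points stripped by String.prototype.trim():
// ECMAScript WhiteSpace plus LineTerminator.
const ES_TRIM_CODE_POINTS = [
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, // tab, LF, VT, FF, CR, space
  0x00a0, // NBSP
  0x1680, // Ogham space mark
  ...Array.from({ length: 11 }, (_, i) => 0x2000 + i), // U+2000-U+200A
  0x2028, 0x2029, // line / paragraph separators
  0x202f, 0x205f, // narrow NBSP, medium mathematical space
  0x3000, // ideographic (full-width) space
  0xfeff, // ZWNBSP / BOM
];

// The five characters the description says were trimmed before this change.
const PREVIOUSLY_TRIMMED = new Set([0x000d, 0x000a, 0x000c, 0x0009, 0x0020]);

// Everything in the ES set that used to pass through untrimmed.
const newlyTrimmed = ES_TRIM_CODE_POINTS.filter(
  (cp) => !PREVIOUSLY_TRIMMED.has(cp),
);

// Sanity check: each code point on its own trims to the empty string.
const allStrippedByTrim = ES_TRIM_CODE_POINTS.every(
  (cp) => String.fromCodePoint(cp).trim() === "",
);

console.log(ES_TRIM_CODE_POINTS.length, newlyTrimmed.length, allStrippedByTrim);
```

The list has 25 entries, which lines up with the "5 to 25 trim chars" figure quoted under Changes.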

Changes

  • __isCharTrimable now matches the full ES2015+ whitespace + line terminator set.
  • Added a first-byte Uint8Array lookup so non-whitespace bytes bail out in O(1). Without this, going from 5 to 25 trim chars slowed the trim path by ~10% on clean ASCII data.
  • Codepoints that can't be represented in the parser's encoding (e.g. U+3000 under latin1, which Node encodes as ?) are filtered at init time so literal ? bytes aren't trimmed under non-Unicode encodings.
  • Bumped needMoreDataSize to cover the longest trim char (up to 3 bytes in UTF-8) so multi-byte whitespace split across stream writes still gets caught.
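The lookup table and the encoding filter described above can be sketched roughly as follows. This is an illustration, not the code from this PR: buildTrimTable and trimLengthAt are hypothetical names, and the round-trip comparison is only one assumed way to detect characters the encoding cannot represent.

```javascript
// Hypothetical helper: encode each trim character once at init time and
// record its first byte in a 256-entry table.
function buildTrimTable(codePoints, encoding) {
  const sequences = [];
  const firstByte = new Uint8Array(256);
  let maxLength = 0;
  for (const cp of codePoints) {
    const ch = String.fromCodePoint(cp);
    const bytes = Buffer.from(ch, encoding);
    // Assumed filter: drop characters that do not survive an
    // encode/decode round trip in this encoding.
    if (bytes.toString(encoding) !== ch) continue;
    sequences.push(bytes);
    firstByte[bytes[0]] = 1;
    maxLength = Math.max(maxLength, bytes.length);
  }
  return { sequences, firstByte, maxLength };
}

// Hypothetical helper: byte length of the trim character starting at
// `pos`, or 0 if there is none.
function trimLengthAt(buf, pos, table) {
  // Fast path: a single table lookup rejects most non-whitespace bytes.
  if (!table.firstByte[buf[pos]]) return 0;
  for (const seq of table.sequences) {
    const end = pos + seq.length;
    if (end <= buf.length && buf.compare(seq, 0, seq.length, pos, end) === 0) {
      return seq.length;
    }
  }
  return 0;
}

// A small subset of the trim set, enough to show both encodings.
const SAMPLE_CODE_POINTS = [0x0009, 0x0020, 0x00a0, 0x3000];
const utf8Table = buildTrimTable(SAMPLE_CODE_POINTS, "utf8");
const latin1Table = buildTrimTable(SAMPLE_CODE_POINTS, "latin1");

console.log(utf8Table.maxLength); // longest encoded trim character, in bytes
console.log(trimLengthAt(Buffer.from("?", "latin1"), 0, latin1Table));
```

In this sketch maxLength plays the role that the description assigns to needMoreDataSize: it is the longest encoded trim character, which is 3 bytes for U+3000 in UTF-8.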

Tests

New unicode whitespace block under Option trim:

  • trim U+3000, U+000B, U+00A0 individually
  • mixed ES whitespace at field boundaries
  • ? is not trimmed under latin1 (covers the encoding filter)
  • U+3000 split across parser.write() calls still gets trimmed
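The split-write case can be illustrated with a toy leading-whitespace trimmer. This is a sketch under simplifying assumptions (only U+3000 is trimmed, only at the start of the input, and the payload is ASCII); createLeadingTrimmer and NEED_MORE_DATA_SIZE are hypothetical names, not the parser's internals.

```javascript
const IDEOGRAPHIC_SPACE = Buffer.from("\u3000", "utf8"); // bytes e3 80 80
// Hypothetical lookahead: the length of the longest trim sequence.
const NEED_MORE_DATA_SIZE = IDEOGRAPHIC_SPACE.length;

function createLeadingTrimmer() {
  let pending = Buffer.alloc(0); // bytes held back from earlier writes
  let trimming = true; // still inside the leading whitespace run
  return {
    write(chunk, isLast = false) {
      const buf = Buffer.concat([pending, chunk]);
      let pos = 0;
      while (trimming) {
        // Too few bytes left to rule out a trim sequence that continues
        // in the next chunk: hold them back and wait for more data.
        if (!isLast && buf.length - pos < NEED_MORE_DATA_SIZE) {
          pending = buf.subarray(pos);
          return "";
        }
        const candidate = buf.subarray(pos, pos + IDEOGRAPHIC_SPACE.length);
        if (candidate.equals(IDEOGRAPHIC_SPACE)) {
          pos += IDEOGRAPHIC_SPACE.length;
        } else {
          trimming = false;
        }
      }
      pending = Buffer.alloc(0);
      // Simplification: assumes the remaining payload is ASCII.
      return buf.subarray(pos).toString("utf8");
    },
  };
}

// U+3000 followed by "abc", split after the second byte of U+3000.
const input = Buffer.from("\u3000abc", "utf8");
const trimmer = createLeadingTrimmer();
const output =
  trimmer.write(input.subarray(0, 2)) + trimmer.write(input.subarray(2), true);
console.log(JSON.stringify(output));
```

With a lookahead shorter than the longest trim sequence, the first write would see only two unmatched bytes and stop trimming, which is the failure the split-write test guards against.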

Full suite: 591 passing, 3 pending (same as master).

Perf

Quick local bench (200k rows, 10 cols), median of 5 runs:

| Scenario        | master      | this branch |
|-----------------|-------------|-------------|
| no trim, clean  | 416k rows/s | 425k rows/s |
| no trim, padded | 311k rows/s | 320k rows/s |
| trim, clean     | 352k rows/s | 356k rows/s |
| trim, padded    | 214k rows/s | 255k rows/s |

The "trim, padded" case benefits the most (214k to 255k rows/s, roughly 19% faster) because the first-byte table is consulted constantly while scanning past runs of trim chars.
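For anyone wanting to reproduce numbers of this kind, a median-of-N harness might look like the sketch below. It is not the script used for the table above; median, benchRowsPerSecond and the stand-in workload are all hypothetical, and a real run would call the parser instead of String.prototype.trim().

```javascript
// Median of a list of numbers (mean of the two middle values for even lengths).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run `work` several times and report the median throughput in rows/s.
function benchRowsPerSecond(work, rows, repeats = 5) {
  const rates = [];
  for (let i = 0; i < repeats; i++) {
    const start = process.hrtime.bigint();
    work();
    const elapsedNs = Number(process.hrtime.bigint() - start);
    rates.push(rows / (elapsedNs / 1e9));
  }
  return median(rates);
}

// Stand-in workload: padded fields trimmed with String.prototype.trim().
const ROWS = 20000;
const COLS = 10;
const paddedRow = Array.from({ length: COLS }, (_, i) => `  value${i}  `);
const rate = benchRowsPerSecond(() => {
  let total = 0;
  for (let r = 0; r < ROWS; r++) {
    for (const field of paddedRow) total += field.trim().length;
  }
  return total;
}, ROWS);

console.log(`${Math.round(rate / 1000)}k rows/s`);
```

Taking the median rather than the mean keeps a single slow run (GC pause, cold JIT) from skewing the reported rate.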

@wdavidw wdavidw force-pushed the fix/csv-parse-trim-ecmascript-whitespace branch from 15c4120 to e683b04 Compare May 11, 2026 21:27
@wdavidw wdavidw merged commit d9f724c into adaltas:master May 11, 2026
@wdavidw
Member

wdavidw commented May 11, 2026

Thank you @vishwakt for your contribution, well done

Development

Successfully merging this pull request may close these issues.

Add u{3000} (a.k.a full-width space) to trimable characters.