fix(csv-parse): align trim with ECMAScript whitespace (fix #482) #483
Merged
wdavidw merged 2 commits on May 11, 2026
Thank you @vishwakt for your contribution, well done
Closes #482.
Aligns `trim`/`ltrim`/`rtrim` with `String.prototype.trim()`. Previously only `\r`, `\n`, `\f`, `\t` and space were treated as trimmable, so anything else JS considers whitespace (the case from the issue, plus NBSP, vertical tab, ogham, U+2000-U+200A, line/paragraph separators, ZWNBSP) passed through.

Changes
- `__isCharTrimable` now matches the full ES2015+ whitespace + line terminator set.
- A first-byte `Uint8Array` lookup lets non-whitespace bytes bail out in O(1). Without this, going from 5 to 25 trim chars slowed the trim path by ~10% on clean ASCII data.
- Trim chars that the configured encoding cannot represent (for example under `latin1`, where Node encodes them as `?`) are filtered at init time so literal `?` bytes aren't trimmed under non-Unicode encodings.
- `needMoreDataSize` now covers the longest trim char (up to 3 bytes in UTF-8) so multi-byte whitespace split across stream writes still gets caught.

Tests
New `unicode whitespace` block under `Option trim`:
- `?` is not trimmed under `latin1` (covers the encoding filter)
- multi-byte whitespace split across `parser.write()` calls still gets trimmed

Full suite: 591 passing, 3 pending (same as master).
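For readers who want to see why the split-write case needs extra care, here is a minimal sketch, not the parser's actual code. `LeadingTrimmer`, `TRIM_SEQUENCES`, `startsWithAt` and `isPartialAt` are hypothetical names invented for this illustration, and only four trim chars are listed to keep it short. The idea is that when a chunk ends with bytes that could be the start of a multi-byte trim char, those bytes are held back until the next write.

```javascript
// Illustrative subset of trim chars, encoded as UTF-8 byte sequences.
// U+3000 (ideographic space) is 3 bytes: e3 80 80.
const TRIM_SEQUENCES = ["\t", " ", "\u00a0", "\u3000"].map((c) =>
  Buffer.from(c, "utf8"),
);

// True when seq matches completely at buf[pos].
function startsWithAt(buf, pos, seq) {
  if (pos + seq.length > buf.length) return false;
  for (let i = 0; i < seq.length; i++) {
    if (buf[pos + i] !== seq[i]) return false;
  }
  return true;
}

// True when the bytes from pos to the end of buf are a proper prefix of seq,
// i.e. the rest of the char may arrive in the next chunk.
function isPartialAt(buf, pos, seq) {
  const remaining = buf.length - pos;
  if (remaining === 0 || remaining >= seq.length) return false;
  for (let i = 0; i < remaining; i++) {
    if (buf[pos + i] !== seq[i]) return false;
  }
  return true;
}

// Hypothetical helper: strips leading trim chars from a stream of chunks.
class LeadingTrimmer {
  constructor() {
    this.pending = Buffer.alloc(0); // bytes held back from earlier writes
    this.done = false; // true once the first non-trim byte has been seen
  }

  write(chunk) {
    if (this.done) return chunk;
    const buf = Buffer.concat([this.pending, chunk]);
    this.pending = Buffer.alloc(0);
    let pos = 0;
    while (pos < buf.length) {
      const match = TRIM_SEQUENCES.find((seq) => startsWithAt(buf, pos, seq));
      if (match) {
        pos += match.length;
        continue;
      }
      if (TRIM_SEQUENCES.some((seq) => isPartialAt(buf, pos, seq))) {
        // Possibly a split multi-byte trim char: wait for more data.
        this.pending = buf.subarray(pos);
        return Buffer.alloc(0);
      }
      this.done = true;
      return buf.subarray(pos);
    }
    return Buffer.alloc(0);
  }

  // Flush whatever was held back when the stream ends.
  end() {
    const rest = this.pending;
    this.pending = Buffer.alloc(0);
    this.done = true;
    return rest;
  }
}

// Usage: U+3000 split after its first byte is still trimmed.
const bytes = Buffer.from("\u3000 a", "utf8");
const trimmer = new LeadingTrimmer();
const out = Buffer.concat([
  trimmer.write(bytes.subarray(0, 1)),
  trimmer.write(bytes.subarray(1)),
  trimmer.end(),
]);
console.log(JSON.stringify(out.toString("utf8"))); // "a"
```

Without the hold-back step, the lone `e3` byte in the first chunk would be treated as ordinary data and the whitespace would leak into the field.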
Perf
Quick local bench (200k rows, 10 cols), median of 5 runs:
`trim, padded` benefits the most because the first-byte table fires constantly when scanning past trim chars.
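The first-byte table can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `TRIM_CHARS`, `buildTrimTable` and `trimLengthAt` are hypothetical names, and the encoding filter here uses a round-trip check as a stand-in for whatever the parser does at init time.

```javascript
// The 25 code points stripped by String.prototype.trim():
// ECMAScript WhiteSpace plus LineTerminator.
const TRIM_CHARS = [
  "\t", "\n", "\v", "\f", "\r", " ", "\u00a0", "\u1680",
  "\u2000", "\u2001", "\u2002", "\u2003", "\u2004", "\u2005",
  "\u2006", "\u2007", "\u2008", "\u2009", "\u200a",
  "\u2028", "\u2029", "\u202f", "\u205f", "\u3000", "\ufeff",
];

// Hypothetical helper: encode each trim char once and record its first byte.
function buildTrimTable(encoding = "utf8") {
  const sequences = [];
  const firstByte = new Uint8Array(256); // 1 = some trim char starts with this byte
  for (const ch of TRIM_CHARS) {
    const bytes = Buffer.from(ch, encoding);
    // Assumption: skip chars that do not survive an encode/decode round trip,
    // so their substitute bytes are never registered as trimmable.
    if (bytes.toString(encoding) !== ch) continue;
    sequences.push(bytes);
    firstByte[bytes[0]] = 1;
  }
  return { sequences, firstByte };
}

// Returns the byte length of the trim char starting at pos, or 0 if none.
function trimLengthAt(table, buf, pos) {
  // Fast path: one array lookup rejects most non-whitespace bytes.
  if (table.firstByte[buf[pos]] !== 1) return 0;
  outer: for (const seq of table.sequences) {
    if (pos + seq.length > buf.length) continue;
    for (let i = 0; i < seq.length; i++) {
      if (buf[pos + i] !== seq[i]) continue outer;
    }
    return seq.length;
  }
  return 0;
}

// Usage: U+3000 is three bytes in UTF-8, "a" is not trimmable.
const utf8Table = buildTrimTable("utf8");
const sample = Buffer.from("\u3000a", "utf8");
console.log(trimLengthAt(utf8Table, sample, 0)); // 3
console.log(trimLengthAt(utf8Table, sample, 3)); // 0
```

On clean ASCII data almost every byte fails the single lookup, so the cost stays flat as the trim set grows. On padded data the lookup succeeds often, and the full sequence comparison then runs for each trim char that is skipped.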