fix: Respect typed-array byteOffset in bigNumsToStrings#439
Open
Karakatiza666 wants to merge 1 commit into
Open
fix: Respect typed-array byteOffset in bigNumsToStrings#439Karakatiza666 wants to merge 1 commit into
Karakatiza666 wants to merge 1 commit into
Conversation
'new Uint32Array(values.buffer)' ignored byteOffset/byteLength, so any 64-bit DATA or OFFSET backed by a subarray view (every binary-IPC-loaded vector, every sliced vector) was serialized from the wrong window. Affects all callers: visitInt (Int64/Uint64), visitDate (ms), visitTimestamp, visitTime (>=us), visitDecimal, visitDuration, visitLargeUtf8, visitLargeBinary, visitLargeList. Add test/unit/ipc/writer/json-offset-views-tests.ts: each type built twice (fresh array vs .subarray() view inside a padded buffer) must serialize to identical JSON. Reverting the fix fails 17/18.
9 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR was co-authored with Claude Code.
Fix incorrect 64-bit JSON IPC output for tables loaded from binary Arrow
The bug
When the JSON IPC writer serializes a column whose values are stored as 64-bit words (offsets or data), it reads the buffer's underlying ArrayBuffer directly without honoring the typed-array's byteOffset / byteLength. If the column happens to be a view into a larger shared buffer rather than a freshly-allocated array, the writer reads from the wrong window of memory and emits garbage for the affected DATA / OFFSET strings.
When it actually triggers
Only on the flow "read data in binary Arrow IPC, write data in JSON IPC." That's the only public path that produces vectors backed by offset views — the binary IPC reader carves every buffer out of a single shared byte array, so anything loaded from a file, stream, or another Arrow implementation arrives at the JSON writer in exactly the at-risk shape.
Tables built directly in JS memory (e.g. via the existing JSON round-trip tests, or tableFromArrays) are not affected, which is why the bug must have gone unnoticed: the RecordBatchJSONWriter test suite synthesizes its inputs in memory and never sees the offset-view state. The flow is mostly used in cross-language integration testing, not by typical end users — which may be the second reason it lingered.
What's affected
Any column whose values are 64-bit words processed through the writer's bignum helper:
Fixed types unaffected: 32-bit ints/floats, DateDay, TimeSecond/Millisecond, regular Utf8/Binary, anything that doesn't go through the 64-bit serialization path.
Test coverage
A new
test/unit/ipc/writer/json-offset-views-tests.tsadds 16 focused unit tests — one per type whose JSON serialization goes through the buggy 64-bit helper. Each test takes a freshly-generated vector from the shared generate-test-data helpers, rewraps every backing typed array at a nonzero byteOffset, and asserts the JSON writer produces the same output for the rewrapped vector as for the original. The list of covered types is a flat table at the top of the file; adding a future affected type is a one-line addition.The existing
RecordBatchJSONWritersuite intest/unit/ipc/writer/json-writer-tests.tswas also extended: it now routes its generated source tables through binary IPC before the JSON round-trip, so the JSON writer sees the offset-view typed arrays that real-world inputs always carry. This widens end-to-end coverage to every type the random/dictionary table generators produce, on the user-facing API path.The two layers are complementary:
The focused unit tests pinpoint exactly which type fails on regression and stay sound even if the binary IPC reader's memory layout ever changes — they construct offset views directly, without depending on
tableFromIPC.The extended round-trip suite catches the same regression on a much broader type matrix through the actual public flow a user would use.
With the fix reverted, all 16 focused tests fail and over 100 cases in the extended round-trip suite fail. With the fix applied, all tests across 46 suites pass.
The new test file is not strictly necessary right now, but it makes catching a regression more robust.