fix(py): report element count in buffer.shape, not byte length (#405)#545
Open
LeSingh1 wants to merge 1 commit into
Open
fix(py): report element count in buffer.shape, not byte length (#405)#545LeSingh1 wants to merge 1 commit into
buffer.shape, not byte length (#405)#545LeSingh1 wants to merge 1 commit into
Conversation
PEP 3118 (buffer protocol) requires `prod(shape) * itemsize == nbytes`. The previous implementation pointed `view.shape` at the same storage as `view.len` (the byte length), so `memoryview(buf).shape` was reported as the number of bytes rather than the number of tokens, and `memoryview(buf).tolist()` returned 4x as many entries as expected (one per byte of each u32 token) on consumers that respect `shape`. NumPy is unaffected because `np.frombuffer` computes the count from `len / itemsize` directly, which is why this slipped through `Encoding.encode_to_numpy` — but any CPython consumer that goes through the buffer's `shape` (e.g. `memoryview(...).tolist()`, `array.array.frombuffer` on Python >=3.13) reads the wrong count. Fix: store the element count in a stable `Box<Py_ssize_t>` on the `TiktokenBuffer` itself, and point `view.shape` at it. The box's memory lives as long as the Python object (which is kept alive via `view.obj`), so the pointer remains valid for the buffer view's lifetime. Fixes openai#405.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #405.
TiktokenBuffer.__getbuffer__pointedview.shapeat the samePy_ssize_tstorage asview.len, so the shape it reported was the byte length rather than the element count. PEP 3118 requiresprod(shape) * itemsize == nbytes, and any consumer that respectsshape(most prominently CPython's ownmemoryview(...).tolist()) reads the wrong count.NumPy is unaffected because
np.frombufferderives the count fromlen / itemsizedirectly — which is how this slipped throughEncoding.encode_to_numpy.Before
After
Implementation
The shape value needs storage that outlives the function call (the buffer view's lifetime is tied to the Python object via
view.obj, not the function frame). I added aBox<Py_ssize_t>field toTiktokenBufferholding the element count, and pointview.shapeat it. The box lives as long asself, so the pointer is valid for the buffer view's lifetime.Tests
Added a regression test
test_tiktoken_buffer_shape_is_element_countintests/test_misc.py:Verification
cargo check --all-featuresclean (only pre-existing unrelated pyo3 deprecation warning).pytest tests/test_misc.py::test_tiktoken_buffer_shape_is_element_count -vpasses after the fix; the same test fails onmainbefore the fix (prod((8,)) * 4 == 8→32 != 8).