Skip to content

fix(py): report element count in buffer.shape, not byte length (#405)#545

Open
LeSingh1 wants to merge 1 commit into
openai:mainfrom
LeSingh1:fix-buffer-protocol-shape-element-count
Open

fix(py): report element count in buffer.shape, not byte length (#405)#545
LeSingh1 wants to merge 1 commit into
openai:mainfrom
LeSingh1:fix-buffer-protocol-shape-element-count

Conversation

@LeSingh1
Copy link
Copy Markdown

Summary

Fixes #405.

TiktokenBuffer.__getbuffer__ pointed view.shape at the same Py_ssize_t storage as view.len, so the shape it reported was the byte length rather than the element count. PEP 3118 requires prod(shape) * itemsize == nbytes, and any consumer that respects shape (most prominently CPython's own memoryview(...).tolist()) reads the wrong count.

NumPy is unaffected because np.frombuffer derives the count from len / itemsize directly — which is how this slipped through Encoding.encode_to_numpy.

Before

>>> import tiktoken
>>> enc = tiktoken.get_encoding("gpt2")
>>> buf = enc._core_bpe.encode_to_tiktoken_buffer("hello world", frozenset())
>>> mv = memoryview(buf)
>>> mv.shape          # bytes, not tokens
(8,)
>>> mv.tolist()       # expands each token to 4 byte-sized ints
[31373, 0, 0, 0, 995, 0, 0, 0]

After

>>> mv.shape
(2,)
>>> mv.tolist()
[31373, 995]

Implementation

The shape value needs storage that outlives the function call (the buffer view's lifetime is tied to the Python object via view.obj, not the function frame). I added a Box<Py_ssize_t> field to TiktokenBuffer holding the element count, and point view.shape at it. The box lives as long as self, so the pointer is valid for the buffer view's lifetime.

Tests

Added a regression test test_tiktoken_buffer_shape_is_element_count in tests/test_misc.py:

def test_tiktoken_buffer_shape_is_element_count():
    enc = tiktoken.get_encoding("gpt2")
    expected = enc.encode("hello world")
    buf = enc._core_bpe.encode_to_tiktoken_buffer("hello world", frozenset())
    mv = memoryview(buf)
    assert math.prod(mv.shape) * mv.itemsize == mv.nbytes  # PEP 3118 invariant
    assert mv.tolist() == expected

Verification

cargo check --all-features clean (only pre-existing unrelated pyo3 deprecation warning).
pytest tests/test_misc.py::test_tiktoken_buffer_shape_is_element_count -v passes after the fix; the same test fails on main before the fix (prod((8,)) * 4 == 832 != 8).

PEP 3118 (buffer protocol) requires `prod(shape) * itemsize == nbytes`.
The previous implementation pointed `view.shape` at the same storage as
`view.len` (the byte length), so `memoryview(buf).shape` was reported as
the number of bytes rather than the number of tokens, and
`memoryview(buf).tolist()` returned 4x as many entries as expected (one
per byte of each u32 token) on consumers that respect `shape`.

NumPy is unaffected because `np.frombuffer` computes the count from
`len / itemsize` directly, which is why this slipped through
`Encoding.encode_to_numpy` — but any CPython consumer that goes through
the buffer's `shape` (e.g. `memoryview(...).tolist()`,
`array.array.frombuffer` on Python >=3.13) reads the wrong count.

Fix: store the element count in a stable `Box<Py_ssize_t>` on the
`TiktokenBuffer` itself, and point `view.shape` at it. The box's
memory lives as long as the Python object (which is kept alive via
`view.obj`), so the pointer remains valid for the buffer view's
lifetime.

Fixes openai#405.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug in buffer protocol implementation

1 participant