fix(py): report element count in `buffer.shape`, not byte length (#405) by LeSingh1 · Pull Request #545 · openai/tiktoken

LeSingh1 · 2026-05-18T07:04:52Z

Summary

Fixes #405.

TiktokenBuffer.__getbuffer__ pointed view.shape at the same Py_ssize_t storage as view.len, so the shape it reported was the byte length rather than the element count. PEP 3118 requires prod(shape) * itemsize == nbytes, and any consumer that respects shape (most prominently CPython's own memoryview(...).tolist()) reads the wrong count.

NumPy is unaffected because np.frombuffer derives the count from len / itemsize directly — which is how this slipped through Encoding.encode_to_numpy.

Before

>>> import tiktoken
>>> enc = tiktoken.get_encoding("gpt2")
>>> buf = enc._core_bpe.encode_to_tiktoken_buffer("hello world", frozenset())
>>> mv = memoryview(buf)
>>> mv.shape          # bytes, not tokens
(8,)
>>> mv.tolist()       # expands each token to 4 byte-sized ints
[31373, 0, 0, 0, 995, 0, 0, 0]

After

>>> mv.shape
(2,)
>>> mv.tolist()
[31373, 995]

Implementation

The shape value needs storage that outlives the function call (the buffer view's lifetime is tied to the Python object via view.obj, not the function frame). I added a Box<Py_ssize_t> field to TiktokenBuffer holding the element count, and point view.shape at it. The box lives as long as self, so the pointer is valid for the buffer view's lifetime.

Tests

Added a regression test test_tiktoken_buffer_shape_is_element_count in tests/test_misc.py:

def test_tiktoken_buffer_shape_is_element_count():
    enc = tiktoken.get_encoding("gpt2")
    expected = enc.encode("hello world")
    buf = enc._core_bpe.encode_to_tiktoken_buffer("hello world", frozenset())
    mv = memoryview(buf)
    assert math.prod(mv.shape) * mv.itemsize == mv.nbytes  # PEP 3118 invariant
    assert mv.tolist() == expected

Verification

cargo check --all-features clean (only pre-existing unrelated pyo3 deprecation warning).
pytest tests/test_misc.py::test_tiktoken_buffer_shape_is_element_count -v passes after the fix; the same test fails on main before the fix (prod((8,)) * 4 == 8 → 32 != 8).

PEP 3118 (buffer protocol) requires `prod(shape) * itemsize == nbytes`. The previous implementation pointed `view.shape` at the same storage as `view.len` (the byte length), so `memoryview(buf).shape` was reported as the number of bytes rather than the number of tokens, and `memoryview(buf).tolist()` returned 4x as many entries as expected (one per byte of each u32 token) on consumers that respect `shape`. NumPy is unaffected because `np.frombuffer` computes the count from `len / itemsize` directly, which is why this slipped through `Encoding.encode_to_numpy` — but any CPython consumer that goes through the buffer's `shape` (e.g. `memoryview(...).tolist()`, `array.array.frombuffer` on Python >=3.13) reads the wrong count. Fix: store the element count in a stable `Box<Py_ssize_t>` on the `TiktokenBuffer` itself, and point `view.shape` at it. The box's memory lives as long as the Python object (which is kept alive via `view.obj`), so the pointer remains valid for the buffer view's lifetime. Fixes openai#405.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(py): report element count in `buffer.shape`, not byte length (#405)#545

fix(py): report element count in `buffer.shape`, not byte length (#405)#545
LeSingh1 wants to merge 1 commit into
openai:mainfrom
LeSingh1:fix-buffer-protocol-shape-element-count

LeSingh1 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

LeSingh1 commented May 18, 2026

Summary

Before

After

Implementation

Tests

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant