Skip to content

OOM-path hardening: ~45 sites crash on allocation failure (most critically: ZstdError NULL at module init cascades into ~50 error paths) #298

@devdanzin

Description

@devdanzin

Summary

~45 sites across module init, type creation, and the decompression state machine either don't check the return value of an allocating API, or check it in a way that returns NULL without setting an exception. On OOM (or memory pressure in constrained environments), the resulting behavior ranges from SystemError: returned NULL without setting an exception (fatal on debug builds) to dereferencing NULL pointers and crashing.

The most catastrophic of these is Category 1: ZstdError going NULL at module init. Every subsequent PyErr_Format(ZstdError, ...) across ~50 error paths would then pass NULL as the exception class. The failure is silent at import time (see Category 2).

Impact

  • Severity: Crashes / SystemError on OOM. Latent on programs that don't hit OOM during normal operation.
  • Reachability: OOM conditions, memory-constrained containers, processes with ulimit, small-VM CI, memory-pressure chaos-testing harnesses.
  • Version: 0.25.0 (commit 7a77a75).
  • Platform: Platform-independent.

Category 1: ZstdError NULL at module init — catastrophic

If PyErr_NewException("zstandard.ZstdError", ...) fails during module init (OOM), ZstdError stays NULL. Every subsequent PyErr_Format(ZstdError, ...) across ~50 error sites then passes NULL as the exception class → crash. Coupled with Category 2 (failure doesn't propagate out of module init), import appears to succeed and the crash deferred until the first error surfaces.

Site: c-ext/constants.c:28.

Fix:

ZstdError = PyErr_NewException("zstandard.ZstdError", NULL, NULL);
if (!ZstdError) {
    return -1;            /* propagate out of constants_module_init */
}

Plus every caller of constants_module_init up the chain must check and propagate.

Category 2: zstd_module_init continues after sub-init failures

16 type-init sub-calls in zstd_module_init; if any fails the module still loads. The features set also leaks on early returns from module init.

Fix pattern:

if (compressor_module_init(m) < 0) return NULL;
if (decompressor_module_init(m) < 0) return NULL;
/* ... 14 more ... */

Category 3: 19 PyModule_AddObject without ref-guard

PyModule_AddObject steals the reference on success and leaks it on failure. pythoncapi_compat.h (already included in the codebase) provides PyModule_AddObjectRef which doesn't steal.

Fix:

/* was: */ PyModule_AddObject(m, "X", X);
/* use: */ if (PyModule_AddObjectRef(m, "X", X) < 0) { Py_DECREF(X); goto err; }
           Py_DECREF(X);   /* AddObjectRef doesn't steal */

For type-registration sites specifically, PyModule_AddType(m, &T) (3.10+) is simpler.

Category 4: 3 unchecked PyType_FromSpec in bufferutil_module_init

On OOM, PyType_FromSpec returns NULL. On Python < 3.9 the code immediately dereferences NULL (->tp_as_buffer); on 3.9+ NULL is passed to PyType_Ready which also crashes.

Sites: c-ext/bufferutil.c:531, 544, 557.

Fix:

BufferWithSegments_Type = (PyTypeObject *)PyType_FromSpec(&BufferWithSegments_Spec);
if (!BufferWithSegments_Type) return -1;

Category 5: Unchecked PyTuple_New in constants_module_init

PyTuple_New(0) returns NULL on OOM; PyTuple_SET_ITEM then writes to NULL.

Site: c-ext/constants.c.

Fix: NULL-check the return before PyTuple_SET_ITEM.

Category 6: 4 PyMem_Malloc / PyMem_Realloc without PyErr_NoMemory

When PyMem_Malloc / PyMem_Realloc returns NULL, the code takes goto finally which returns NULL without setting an exception → SystemError.

Sites: c-ext/decompressor.c:704, 739 (PyMem_Malloc), :800, 839 (PyMem_Realloc) — all inside decompress_content_dict_chain.

Fix:

buf = PyMem_Malloc(size);
if (!buf) { PyErr_NoMemory(); goto finally; }

new_buf = PyMem_Realloc(buf, new_size);
if (!new_buf) { PyErr_NoMemory(); goto finally; }
buf = new_buf;

Category 7: Py_DECREF(NULL) via PyLong_FromSsize_t OOM in copy_stream

PyLong_FromSsize_t(totalRead) can return NULL on OOM. Current code stores NULL into a tuple via PyTuple_SET_ITEM(result, 0, totalRead); the subsequent Py_DECREF(result) traverses the tuple and dereferences NULL. Same pattern in the decompressor copy_stream.

Sites: c-ext/compressor.c:411-418; c-ext/decompressor.c:247-251.

Fix:

PyObject *totalReadObj  = PyLong_FromSsize_t(totalRead);
PyObject *totalWriteObj = PyLong_FromSsize_t(totalWrite);
if (!totalReadObj || !totalWriteObj) {
    Py_XDECREF(totalReadObj);
    Py_XDECREF(totalWriteObj);
    goto except;
}
result = PyTuple_New(2);
if (!result) {
    Py_DECREF(totalReadObj);
    Py_DECREF(totalWriteObj);
    goto except;
}
PyTuple_SET_ITEM(result, 0, totalReadObj);
PyTuple_SET_ITEM(result, 1, totalWriteObj);

Category 8: Unchecked PyBytes_FromStringAndSize for unused_data

Silent data loss on OOM.

Site: c-ext/decompressobj.c:97.

Fix: NULL-check the return and propagate PyErr_NoMemory / return NULL with exception set.

Suggested PR shape

Most impactful is Category 1 (it cascades into ~50 other error paths). Categories 2-8 are independent and small. Happy to do either one big hardening PR or one PR per category — probably the big PR reads better because OOM hardening is best evaluated holistically.

Methodology

Found via cext-review-toolkit (Tree-sitter-based static analysis with structured naive/informed review passes). All 8 categories reviewed statically; faithfully reproducing OOM requires _testcapi.set_nomemory or libfiu-style allocator injection, both of which are feasible but out of scope for the live-reproducer pass that covered this analysis. Happy to open a PR — the fixes are mechanical and collectively ~100 lines of diff.

Discovery, root-cause analysis, and issue drafting were performed by Claude Code and reviewed by a human before filing.

Full report

Complete multi-agent analysis (48 FIX findings across 13 categories, plus a reproducer appendix): https://gist.github.com/devdanzin/b86039ac097141579590c1a0f3a43605

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions