9 changes: 6 additions & 3 deletions .github/scripts/build-cuda.sh
```diff
@@ -9,13 +9,16 @@ set -xeuo pipefail
 if [[ -v cuda_targets ]]; then
     build_capability="${cuda_targets}"
 elif [ "${build_arch}" = "aarch64" ]; then
-    build_capability="75;80;90"
+    # sm87 (Jetson Orin) is aarch64-only and needs an explicit cubin: only
+    # the latest capability gets PTX, so sm87 hardware cannot JIT from
+    # sm90+ PTX. See bitsandbytes-foundation/bitsandbytes#1930.
+    build_capability="75;80;87;90"
 
     # CUDA 12.8-12.9: Add sm100/sm120
-    [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="75;80;90;100;120"
+    [[ "${cuda_version}" == 12.8.* || "${cuda_version}" == 12.9.* ]] && build_capability="75;80;87;90;100;120"
 
     # CUDA 13.0+: Add sm100/sm110/sm120
-    [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;90;100;110;120;121"
+    [[ "${cuda_version}" == 13.*.* ]] && build_capability="75;80;87;90;100;110;120;121"
 else
     # By default, target Pascal through Hopper.
     build_capability="60;70;75;80;86;89;90"
```
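To make the PTX/cubin reasoning in the new comment concrete, here is a minimal sketch of the simplified fat-binary model the comment relies on: every listed capability ships a native cubin, only the highest capability ships PTX, and a device can JIT PTX only if its capability is at or above the PTX arch. `runnable_arch` is an illustrative helper, not bitsandbytes code, and the model deliberately ignores finer points such as same-major-version cubin compatibility.

```python
# Illustrative sketch (not bitsandbytes build code) of the simplified
# fat-binary model described in the build-script comment: each listed
# capability gets a native cubin, only the highest gets PTX, and a
# device can JIT PTX only if its capability >= the PTX arch.
def runnable_arch(device_cc: int, build_capabilities: list[int]) -> str | None:
    if device_cc in build_capabilities:
        return f"cubin sm_{device_cc}"  # exact-match native binary
    ptx_arch = max(build_capabilities)  # only the top arch ships PTX
    if device_cc >= ptx_arch:
        return f"JIT from compute_{ptx_arch} PTX"
    return None  # no usable code path on this device

# Jetson Orin (sm_87) vs. the old aarch64 list: no exact cubin, and
# 87 < 90 means it cannot JIT the sm_90 PTX -> None (issue #1930).
print(runnable_arch(87, [75, 80, 90]))      # None
print(runnable_arch(87, [75, 80, 87, 90]))  # cubin sm_87
```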
4 changes: 2 additions & 2 deletions docs/source/installation.mdx
```diff
@@ -52,8 +52,8 @@ The currently distributed `bitsandbytes` packages are built with the following c
 | **Linux x86-64** | 11.8 - 12.6 | GCC 11.2 | sm60, sm70, sm75, sm80, sm86, sm89, sm90
 | **Linux x86-64** | 12.8 - 12.9 | GCC 11.2 | sm70, sm75, sm80, sm86, sm89, sm90, sm100, sm120
 | **Linux x86-64** | 13.0 | GCC 11.2 | sm75, sm80, sm86, sm89, sm90, sm100, sm120
-| **Linux aarch64** | 11.8 - 12.6 | GCC 11.2 | sm75, sm80, sm90
-| **Linux aarch64** | 12.8 - 13.0 | GCC 11.2 | sm75, sm80, sm90, sm100, sm110, sm120, sm121
+| **Linux aarch64** | 11.8 - 12.6 | GCC 11.2 | sm75, sm80, sm87, sm90
+| **Linux aarch64** | 12.8 - 13.0 | GCC 11.2 | sm75, sm80, sm87, sm90, sm100, sm110, sm120, sm121
 | **Windows x86-64** | 11.8 - 12.6 | MSVC 19.43+ (VS2022) | sm50, sm60, sm75, sm80, sm86, sm89, sm90
 | **Windows x86-64** | 12.8 - 12.9 | MSVC 19.43+ (VS2022) | sm70, sm75, sm80, sm86, sm89, sm90, sm100, sm120
 | **Windows x86-64** | 13.0 | MSVC 19.43+ (VS2022) | sm75, sm80, sm86, sm89, sm90, sm100, sm120
```
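To find which row of this table applies to a given machine, one can query the device's compute capability through PyTorch. A small sketch, assuming a CUDA-enabled PyTorch install:

```python
import torch

# Query the running GPU's compute capability to find the matching
# sm-level in the table above.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"sm{major}{minor}")  # e.g. "sm87" on a Jetson Orin
    print(torch.version.cuda)   # CUDA version PyTorch was built against
```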
103 changes: 103 additions & 0 deletions tests/test_linear4bit_sm87_multishape_regression.py
```python
"""Regression test for #1936: sm_87 multi-shape Linear4bit reboot.

Bug history
-----------
On bnb 0.46.1 running on an NVIDIA Jetson Orin (sm_87) at nvpmodel
MAXN_SUPER, the host machine reboots when three Linear4bit layers
are constructed and forwarded in sequence with the following recipe:

- shape order: monotonically increasing by output-feature product (A → B → C below)
- quant_type: NF4
- quant_storage: torch.bfloat16 (FSDP-compatible)
- compute_dtype: torch.bfloat16
- compress_statistics (double_quant): True
- no inter-layer memory hygiene (no `del layer` or `empty_cache`)
- batch size: 1

The fault is overwhelmingly cold-start-specific (~78% reboot rate
at cold start, 0% at warm state across N=29+ warm-state samples).
bnb 0.49.2 also reboots at cold start (N=1); the original "fixed
in 0.49.2" framing was a warm-state artifact and was retracted in
a 2026-05-05 comment on the issue. A 256x256 NF4 forward executed
as the first GPU op after boot closes the cold-start race window
(N=3 verified). This test runs the original failing recipe on
sm_87 so that any future regression surfaces in CI; CI runners
targeting sm_87 are expected to provide a cold state (e.g., by
rebooting before the test run) for it to fire reliably.

References
----------
- Issue: https://github.com/bitsandbytes-foundation/bitsandbytes/issues/1936
- Reproduced N=4 across two physically separate Jetson Orin Nano
  Super 8GB units; the fault travels with the silicon and the bnb
  build, not a single defective board.
- A 13-test orthogonal-axis bisection found six axes (shape order,
  quant_type, quant_storage, compute_dtype, double_quant, hygiene),
  a change along any one of which is independently sufficient to
  prevent the fault. The recipe in this test is the unique
  intersection that triggers it on 0.46.1.

Caveats
-------
- On a *broken* bnb the test crashes the host (system reboot), not
  just the test runner; the failure mode is OS-level. pytest cannot
  capture this, so absence of test output IS the regression signal.
- The bug is timing-sensitive (race condition); lower-power
  nvpmodel modes (15W / 25W) prevent it. The test does not enforce
  a power mode; sm_87 CI is expected to run at MAXN.
- Skipped on all non-sm_87 hardware. sm_87 is exclusive to the
  Jetson Orin family (Nano / NX / AGX).
"""

import pytest
import torch

import bitsandbytes as bnb

pytestmark = pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability() != (8, 7),
    reason="Regression test for sm_87 (NVIDIA Jetson Orin family) only",
)


# Real-world shapes from Llama-3 / Qwen-2.5 / Mistral lm_head dimensions,
# selected to match the historical #1936 reproducer. Monotonically
# increasing by output-feature product is load-bearing: bisection found
# that all five non-ABC permutations (ACB, BAC, BCA, CAB, CBA) pass cleanly.
# Do not substitute toy shapes here without re-validating against the
# historical repro.
SHAPES_ABC = [
    (4096, 32768),   # A
    (4096, 128256),  # B
    (3584, 152064),  # C
]


def test_linear4bit_multishape_bf16_storage_no_fault():
    """Run the #1936 recipe; assert each forward completes without fault.

    Cold-start race: at a sm_87 cold state on a buggy bnb, the host
    reboots before the third forward returns. At warm state (any prior
    bnb NF4 op in the same session) every bnb version tested passes;
    see the module docstring for the cold/warm characterization.
    """
    for in_features, out_features in SHAPES_ABC:
        layer = bnb.nn.Linear4bit(
            in_features,
            out_features,
            bias=False,
            compute_dtype=torch.bfloat16,
            compress_statistics=True,  # double_quant
            quant_type="nf4",
            quant_storage=torch.bfloat16,  # FSDP-compatible storage path
        ).to("cuda")

        x = torch.randn(1, in_features, dtype=torch.bfloat16, device="cuda")
        y = layer(x)

        assert y.shape == (1, out_features)
        assert y.dtype == torch.bfloat16

        # Deliberately no `del layer`, `torch.cuda.empty_cache()`, or
        # `torch.cuda.synchronize()` between iterations. Any of those
        # individually prevents the historical fault and would mask a
        # real regression in CI.
```
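The module docstring notes that a 256x256 NF4 forward run as the first GPU op after boot closed the cold-start race window (N=3 verified). A warm-up guard built on that observation might look like the sketch below; `warmup_nf4` is a hypothetical helper for illustration, not a bitsandbytes API, and the mitigation itself is only as validated as the N=3 samples reported on the issue.

```python
import torch

import bitsandbytes as bnb


def warmup_nf4() -> None:
    """Run a tiny NF4 forward as the first GPU op after boot.

    Hypothetical mitigation distilled from the #1936 discussion: a
    256x256 NF4 forward executed cold closed the race window (N=3).
    Illustrative sketch only; not an official bitsandbytes API.
    """
    layer = bnb.nn.Linear4bit(
        256,
        256,
        bias=False,
        compute_dtype=torch.bfloat16,
        quant_type="nf4",
    ).to("cuda")
    layer(torch.randn(1, 256, dtype=torch.bfloat16, device="cuda"))
    torch.cuda.synchronize()  # ensure the kernel actually executed
```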