
Commit c102506

Release v0.3.0: NF4 packed storage, QLoRA pipeline example
- Add QLoRA Support section to README with NF4 quantization, packed storage, QLoRA base model, and full-stack compression docs
- Update repo URL to github.com/vlora-dev/vlora
- Update API reference for new quantize/save_quantized/qlora_info APIs
- Add CHANGELOG.md covering all 0.3.0 changes
- Add VLoRACallback integration test proving loadings actually change after forward+backward+on_step_end (the old callback was a no-op)
1 parent ef57eb6

3 files changed: 215 additions & 12 deletions

File tree:
- CHANGELOG.md
- README.md
- tests/test_huggingface.py

CHANGELOG.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Changelog

All notable changes to this project will be documented in this file.

Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.3.0] - 2026-03-30

### Added

- **NF4 quantization** — 4-bit NormalFloat quantization from QLoRA (Dettmers et al., 2023). `subspace.quantize(method="nf4")` uses 16 quantile levels optimized for normally-distributed weights, with per-block absmax scaling. Lower error than symmetric int4.
- **Double quantization** — quantize per-block NF4 scales to FP8 via `double_quant=True`, reducing scale overhead from 0.5 to ~0.127 bits/param.
- **NF4 packed storage** — `subspace.save_quantized()` packs components as uint8 (two 4-bit indices per byte) for ~7x disk savings. `SharedSubspace.load()` auto-detects format.
- **QLoRA-aware VLoRAModel** — `compute_dtype` parameter for mixed-precision LoRA computation with quantized base models; `qlora_info` property for base model introspection.
- **`full_stack_compression()`** — report combined base model quantization + adapter compression savings.
- **`quantize_loadings` parameter** — optionally quantize per-task loadings (not just components).
- **`nf4_pack` / `nf4_unpack`** — low-level ops for 4-bit packing to uint8 (see the sketch after this list).
- **Layer shapes stored in metadata** — `reconstruct()` uses stored shapes instead of deriving from `numel() // rank`, supporting per-layer rank configs.
- **`__repr__` on core objects** — `SharedSubspace`, `TaskProjection`, `LoRAWeights` now print useful info.
- **`adaptive_k` preserved through `absorb()`** — subspaces built with `adaptive_k=True` retain that setting after absorption.
- QLoRA + vLoRA pipeline example (`examples/qlora_pipeline.py`).
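
For intuition, the nibble packing behind `nf4_pack`/`nf4_unpack` looks roughly like this — a minimal sketch assuming an even-length stream of 4-bit code indices (function bodies illustrative, not the library's implementation):

```python
import torch

def pack_nibbles(indices: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit code indices (values 0-15) two-per-byte into uint8."""
    flat = indices.to(torch.uint8).flatten()
    # Even positions fill the high nibble, odd positions the low nibble.
    return (flat[0::2] << 4) | flat[1::2]

def unpack_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """Invert pack_nibbles, recovering the original index stream."""
    return torch.stack([packed >> 4, packed & 0x0F], dim=1).flatten()

codes = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_nibbles(pack_nibbles(codes)), codes)
```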
21+
22+
### Fixed
23+
- **`absorb_incremental` re-projection bug** — existing tasks were having loadings padded/truncated instead of properly re-projected when the basis rotated. Now reconstructs from old basis and projects onto updated basis.
24+
- **`VLoRACallback` was a no-op** — the HF Trainer callback created an optimizer but never stepped it. Now registers differentiable forward hooks so the Trainer's backward pass produces gradients on loadings, and steps the optimizer in `on_step_end`.
25+
- **TIES merge normalization**`n / contributor_count` over-scaled output when elements were trimmed. Fixed to `1 / contributor_count`.
26+
- **`__version__` mismatch**`__init__.py` said 0.1.0 while `pyproject.toml` said 0.2.1.
27+
- **`check_tensor_health` never called** — imported but unused; now wired up after SVD in `from_adapters`.
28+
- **Task ID collision**`absorb()` and `absorb_incremental()` now warn when overwriting an existing task ID.
29+
- **Filesystem-unsafe task IDs**`save()` now sanitizes task IDs for filenames (handles `/`, `:`, spaces) with a mapping in metadata for lossless round-trip.
30+
- **`from_adapters_streaming` missing validation** — now checks `len(task_ids) == len(adapter_paths)`.
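
The `absorb_incremental` fix boils down to one idea: loadings are coordinates relative to a basis, so when the basis changes they must be recomputed, not resized. A standalone sketch with assumed names and shapes (not the library's code):

```python
import torch

D, k_old, k_new = 128, 3, 4

U_old = torch.linalg.qr(torch.randn(D, k_old))[0].T  # (k_old, D) orthonormal rows
U_new = torch.linalg.qr(torch.randn(D, k_new))[0].T  # (k_new, D), the updated basis
c_old = torch.randn(k_old)                           # a task's loadings in the old basis

# The old bug: pad/truncate c_old to length k_new — those coordinates are
# meaningless once the basis has rotated.

# The fix: reconstruct the flattened weights, then project onto the new basis.
w_approx = c_old @ U_old   # (D,) reconstruction from the old subspace
c_new = U_new @ w_approx   # (k_new,) loadings in the updated basis
```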

### Changed

- **`gram_schmidt` uses QR factorization** — replaced the O(k^2 * D) inner loop with `torch.linalg.qr` for better performance and numerical stability (see the sketch after this list).
- **VLoRAModel caches module handles** — `_apply_hooks` no longer scans all `named_modules()` on every task switch.
- **VLoRAModel inference hooks wrapped in `torch.no_grad()`** — prevents unnecessary autograd tracking.
- **NF4 quantization uses `torch.bucketize`** — replaced the O(N*16) distance broadcast with binary search, reducing memory from O(N*16) to O(N).
- **`_LORA_KEY_RE` handles multi-adapter PEFT format** — supports `base_model.model.{layer}.lora_A.{adapter_name}.weight`.
- **`save_adapter` no longer hardcodes `CAUSAL_LM`** — task type left for PEFT to infer.
- Repo URL updated to `github.com/vlora-dev/vlora`.
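
The `gram_schmidt` change is a standard swap: QR yields an orthonormal basis for the same span, vectorized and stabler on nearly-dependent rows. A sketch of the QR route for row-stored components (function name and row convention assumed, not the library's code):

```python
import torch

def gram_schmidt_qr(rows: torch.Tensor) -> torch.Tensor:
    """Orthonormalize the rows of a (k, D) matrix via reduced QR on its transpose."""
    # torch.linalg.qr orthonormalizes *columns*, so operate on rows.T.
    Q, _ = torch.linalg.qr(rows.T, mode="reduced")  # Q: (D, k), orthonormal columns
    return Q.T                                      # back to (k, D) orthonormal rows

V = torch.randn(5, 64)
U = gram_schmidt_qr(V)
assert torch.allclose(U @ U.T, torch.eye(5), atol=1e-5)
```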

## [0.2.1] - 2026-02-10

Initial public release on PyPI as `vlora-dev`.

### Added

- `SharedSubspace` — 3-step algorithm: from_adapters, project, absorb (sketched after this list)
- `VLoRAModel` — inference wrapper with forward hooks
- `SubspaceTrainer` — loadings-only training
- `TaskRouter` — per-input adapter routing
- `task_arithmetic`, `ties_merge`, `dare_merge` — adapter merging
- Analysis tools: similarity matrix, clustering, outlier detection
- CLI with 9 commands
- HuggingFace Trainer integration via `VLoRACallback`
- Streaming and incremental subspace construction
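
For orientation, here is a compressed sketch of that three-step flow; the toy adapters are built the way this commit's test helper builds them (shapes and `num_components` arbitrary):

```python
import torch
from vlora.io import LoRAWeights
from vlora.subspace import SharedSubspace

layers = ["layer.0.q_proj", "layer.0.v_proj"]

# Input: a few rank-4 LoRA adapters over the same layers.
adapters = [
    LoRAWeights(
        layer_names=layers,
        lora_a={l: torch.randn(4, 32) for l in layers},
        lora_b={l: torch.randn(32, 4) for l in layers},
        rank=4,
    )
    for _ in range(3)
]

# Steps 1-2: SVD the stacked adapters into a shared basis, projecting each
# task down to a small per-layer loadings vector.
subspace = SharedSubspace.from_adapters(adapters, num_components=2)

# Step 3: fold a later adapter into the existing basis under a new task ID.
subspace.absorb(adapters[-1], task_id="extra_task")
```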

README.md

Lines changed: 82 additions & 6 deletions
@@ -16,7 +16,7 @@ pip install vlora-dev

Or from source:
```bash
-git clone https://github.com/tveseli/vlora.git
+git clone https://github.com/vlora-dev/vlora.git
cd vlora
pip install -e ".[dev]"
```
@@ -101,6 +101,77 @@ output = model(input_ids)
print(model.available_tasks) # ["task_0", "task_1", ...]
```

## QLoRA Support

vLoRA has first-class support for [QLoRA](https://arxiv.org/abs/2305.14314) workflows. QLoRA compresses the **base model** (FP16 → 4-bit NF4), while vLoRA compresses the **adapter space** — these are orthogonal and stack multiplicatively.

### NF4 Quantization

Quantize subspace components using the same NF4 data type from QLoRA — 16 quantile levels optimized for normally-distributed weights:

```python
# NF4 quantization (better than symmetric int4 for normal-ish weights)
subspace.quantize(method="nf4")

# With double quantization (quantize the per-block scales too)
subspace.quantize(method="nf4", double_quant=True)

# Also quantize loadings (effective when loadings are approximately normal)
subspace.quantize(method="nf4", quantize_loadings=True)
```
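
For intuition about what `method="nf4"` does per block: scale by the block's absmax, snap each value to the nearest of 16 fixed levels, store the 4-bit index plus the scale. A self-contained sketch — the levels are the published QLoRA NF4 constants (rounded here), the block size is an assumed default, and the `torch.bucketize` lookup mirrors the optimization noted in this release's changelog:

```python
import torch

# The 16 NF4 quantile levels from the QLoRA paper (rounded to four decimals).
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def nf4_roundtrip(w: torch.Tensor, block_size: int = 64) -> torch.Tensor:
    """Quantize-dequantize with per-block absmax NF4 (illustrative sketch).

    Assumes w.numel() is divisible by block_size.
    """
    blocks = w.flatten().reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normed = blocks / scales  # every block now lies in [-1, 1]
    # Snap each value to the nearest level via binary search over midpoints
    # (O(N log 16) instead of an O(N * 16) distance broadcast).
    midpoints = (NF4_LEVELS[1:] + NF4_LEVELS[:-1]) / 2
    idx = torch.bucketize(normed, midpoints)  # one 4-bit code per value, 0..15
    return (NF4_LEVELS[idx] * scales).reshape(w.shape)

w = torch.randn(4, 64)
print((w - nf4_roundtrip(w)).abs().mean())  # small reconstruction error
```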

### Packed NF4 Storage

Save the subspace in packed 4-bit format for ~7× disk savings:

```python
# Save: packs components as uint8 (two 4-bit values per byte)
subspace.save_quantized("shared_subspace/")

# Load: auto-detects format, dequantizes on the fly
subspace = SharedSubspace.load("shared_subspace/")
```
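
Where the ~7× figure (rather than a clean 8× for FP32 → 4-bit) comes from: per-block absmax scales add overhead. Assuming, for illustration, a 32-bit scale per 64-value block, storage is 4 + 32/64 = 4.5 bits per value, and 32 / 4.5 ≈ 7.1×; the exact ratio depends on block size and metadata.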

### QLoRA Base Model

`VLoRAModel` works with quantized base models loaded via bitsandbytes:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from vlora import VLoRAModel, SharedSubspace

# Load 4-bit base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained("model-name", quantization_config=bnb_config)

# Wrap with vLoRA — compute_dtype ensures LoRA math runs in BF16
subspace = SharedSubspace.load("shared_subspace/")
model = VLoRAModel(base_model, subspace, compute_dtype=torch.bfloat16)

print(model.qlora_info) # {'quantized': True, 'method': 'nf4', ...}
model.set_task("task_0")
output = model(input_ids)
```
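
A note on the `compute_dtype` choice: bitsandbytes dequantizes NF4 weights to `bnb_4bit_compute_dtype` for each matmul, so running the LoRA-side math in the same dtype (BF16 here) keeps the hook outputs dtype-compatible with the base layer's activations and avoids silent FP32 upcasts. How `VLoRAModel` casts internally is the library's detail; matching the two dtypes is the safe default.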

### Full-Stack Compression

Report combined savings across base model quantization and adapter compression:

```python
stats = subspace.full_stack_compression(
    base_model_params=7_000_000_000,  # 7B model
    base_model_bits=16,  # original FP16
    quantized_bits=4,  # QLoRA NF4
)
# → {'total_compression_ratio': 4.0, 'total_original_bytes': 14.0 GB, ...}
```
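
The example numbers are easy to verify by hand: 7e9 params × 2 bytes (FP16) = 14 GB of original base weights, and 16-bit → 4-bit gives 16 / 4 = 4×, which matches the reported `total_compression_ratio` when adapter storage is negligible next to the base model.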

See [`examples/qlora_pipeline.py`](examples/qlora_pipeline.py) for a complete end-to-end example.

## Training in the Subspace

Train only the loadings vector (k params per layer) instead of full LoRA matrices — 100×+ parameter reduction:
@@ -183,8 +254,10 @@ merged = dare_merge(adapters, drop_rate=0.5, seed=42)
# Adaptive k: different components per layer based on explained variance
subspace = SharedSubspace.from_adapters(adapters, adaptive_k=True, variance_threshold=0.9)

-# Quantize components for smaller memory footprint
-subspace.quantize(bits=8) # or bits=4
+# Quantize components — symmetric (int8/int4) or NF4
+subspace.quantize(bits=8) # symmetric int8
+subspace.quantize(method="nf4") # NF4 4-bit (better for normal weights)
+subspace.quantize(method="nf4", double_quant=True) # + quantize the scales

# Check compression stats
stats = subspace.compression_stats()
@@ -231,14 +304,16 @@ subspace.to(device="cuda", dtype=torch.float16)
- `.absorb(adapter, task_id)` — Incorporate + recompute (full SVD)
- `.absorb_incremental(adapter, task_id)` — Fast incremental update
- `.get_trainable_params(task_id)` — For training integration
-- `.quantize(bits=8)` — Quantize components (int8/int4)
+- `.quantize(bits=8, method="symmetric")` — Quantize components (int8/int4/NF4)
- `.compression_stats()` — Compression ratio and parameter counts
+- `.full_stack_compression(base_model_params)` — Combined base + adapter stats
- `.to(device, dtype)` — Move tensors to device/dtype
-- `.save(path)` / `.load(path)` — Serialization
+- `.save(path)` / `.save_quantized(path)` / `.load(path)` — Serialization (NF4-packed auto-detected)

### Model Integration

-- **`VLoRAModel(base_model, subspace, lora_alpha=None)`** — Inference wrapper with forward hooks
+- **`VLoRAModel(base_model, subspace, lora_alpha=None, compute_dtype=None)`** — Inference wrapper with forward hooks
+- `.qlora_info` — Base model quantization metadata
- `.set_task(task_id)` — Switch adapter (cached)
- `.clear_task()` — Remove adapter
- `.available_tasks` — List task IDs

@@ -289,6 +364,7 @@ subspace.to(device="cuda", dtype=torch.float16)

- `compute_svd`, `project_onto_subspace`, `reconstruct_from_subspace`
- `gram_schmidt`, `explained_variance_ratio`, `select_num_components`
- `incremental_svd_update`
+- `nf4_quantize_dequantize`, `nf4_pack`, `nf4_unpack` — NF4 quantization (QLoRA)

## Benchmarks — Real-World Adapters
294370

tests/test_huggingface.py

Lines changed: 79 additions & 6 deletions
@@ -1,28 +1,52 @@
"""Tests for vlora.integrations.huggingface — HF Trainer callback."""

import torch
+import torch.nn as nn
import pytest

from vlora.io import LoRAWeights
from vlora.subspace import SharedSubspace
from vlora.training import orthogonal_init


+LAYERS = ["layer.0.q_proj", "layer.0.v_proj"]
+DIM = 32
+RANK = 4
+
+
def _make_subspace():
    """Create a small subspace for testing."""
-    layers = ["layer.0.q_proj", "layer.0.v_proj"]
-    shared_a = {l: torch.randn(3, 4 * 32) for l in layers}
-    shared_b = {l: torch.randn(3, 32 * 4) for l in layers}
+    shared_a = {l: torch.randn(3, RANK * DIM) for l in LAYERS}
+    shared_b = {l: torch.randn(3, DIM * RANK) for l in LAYERS}

    adapters = []
    for i in range(3):
-        lora_a = {l: (torch.randn(3) @ shared_a[l]).reshape(4, 32) for l in layers}
-        lora_b = {l: (torch.randn(3) @ shared_b[l]).reshape(32, 4) for l in layers}
-        adapters.append(LoRAWeights(layer_names=layers, lora_a=lora_a, lora_b=lora_b, rank=4))
+        lora_a = {l: (torch.randn(3) @ shared_a[l]).reshape(RANK, DIM) for l in LAYERS}
+        lora_b = {l: (torch.randn(3) @ shared_b[l]).reshape(DIM, RANK) for l in LAYERS}
+        adapters.append(LoRAWeights(layer_names=LAYERS, lora_a=lora_a, lora_b=lora_b, rank=RANK))

    return SharedSubspace.from_adapters(adapters, num_components=2)


+class _TinyModel(nn.Module):
+    """Minimal model with named Linear layers matching the subspace."""
+
+    def __init__(self):
+        super().__init__()
+        # Build nested structure so named_modules() produces "layer.0.q_proj" etc.
+        layer_0 = nn.Module()
+        layer_0.add_module("q_proj", nn.Linear(DIM, DIM, bias=False))
+        layer_0.add_module("v_proj", nn.Linear(DIM, DIM, bias=False))
+        layer = nn.Module()
+        layer.add_module("0", layer_0)
+        self.add_module("layer", layer)
+
+    def forward(self, x):
+        x = self.layer._modules["0"].q_proj(x)
+        x = self.layer._modules["0"].v_proj(x)
+        return x
+
+
class TestVLoRACallbackImport:
    def test_import_without_transformers(self):
        """VLoRACallback should be importable even without transformers."""
@@ -110,3 +134,52 @@ def test_callback_logs_metrics(self):
        vlora_logs = [l for l in state.log_history if "vlora/loadings_norm" in l]
        assert len(vlora_logs) == 1
        assert vlora_logs[0]["vlora/loadings_norm"] >= 0
+
+    def test_callback_actually_trains_loadings(self):
+        """Verify loadings change after forward+backward+on_step_end.
+
+        This is the critical integration test: the old callback was a
+        no-op that never stepped its optimizer. The new callback registers
+        differentiable hooks so the Trainer's backward pass produces
+        gradients on loadings, and on_step_end steps the optimizer.
+        """
+        from vlora.integrations.huggingface import VLoRACallback
+        from transformers import TrainerState, TrainerControl, TrainingArguments
+
+        sub = _make_subspace()
+        orthogonal_init(sub, "test_task")
+
+        # Snapshot loadings before training
+        initial_loadings = {
+            l: sub.tasks["test_task"].loadings_a[l].clone()
+            for l in LAYERS
+        }
+
+        callback = VLoRACallback(sub, "test_task", lr=1e-2, log_every=999)
+        args = TrainingArguments(output_dir="/tmp/test", use_cpu=True)
+        state = TrainerState()
+        state.global_step = 1
+        control = TrainerControl()
+
+        # on_train_begin registers differentiable hooks on the model
+        model = _TinyModel()
+        callback.on_train_begin(args, state, control, model=model)
+
+        # Simulate a training step: forward → loss → backward
+        x = torch.randn(2, DIM)
+        output = model(x)
+        loss = output.sum()
+        loss.backward()
+
+        # on_step_end should step the loadings optimizer
+        callback.on_step_end(args, state, control)
+
+        # Write back and check loadings changed
+        callback.trainer.write_back()
+        changed = False
+        for l in LAYERS:
+            if not torch.equal(sub.tasks["test_task"].loadings_a[l], initial_loadings[l]):
+                changed = True
+                break
+
+        assert changed, "Loadings did not change after training step — optimizer may not be stepping"
