
Commit c102506

Release v0.3.0: NF4 packed storage, QLoRA pipeline example
- Add QLoRA Support section to README with NF4 quantization, packed storage, QLoRA base model, and full-stack compression docs
- Update repo URL to github.com/vlora-dev/vlora
- Update API reference for new quantize/save_quantized/qlora_info APIs
- Add CHANGELOG.md covering all 0.3.0 changes
- Add VLoRACallback integration test proving loadings actually change after forward+backward+on_step_end (the old callback was a no-op)
1 parent ef57eb6

3 files changed: 215 additions & 12 deletions

File tree:
- CHANGELOG.md
- README.md
- tests/test_huggingface.py

CHANGELOG.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Changelog

All notable changes to this project will be documented in this file.

Format follows [Keep a Changelog](https://keepachangelog.com/).

## [0.3.0] - 2026-03-30

### Added

- **NF4 quantization** — 4-bit NormalFloat quantization from QLoRA (Dettmers et al., 2023). `subspace.quantize(method="nf4")` uses 16 quantile levels optimized for normally-distributed weights, with per-block absmax scaling. Lower error than symmetric int4.
- **Double quantization** — quantize per-block NF4 scales to FP8 via `double_quant=True`, reducing scale overhead from 0.5 to ~0.127 bits/param.
- **NF4 packed storage** — `subspace.save_quantized()` packs components as uint8 (two 4-bit indices per byte) for ~7x disk savings. `SharedSubspace.load()` auto-detects format.
- **QLoRA-aware VLoRAModel** — `compute_dtype` parameter for mixed-precision LoRA computation with quantized base models; `qlora_info` property for base model introspection.
- **`full_stack_compression()`** — report combined base model quantization + adapter compression savings.
- **`quantize_loadings` parameter** — optionally quantize per-task loadings (not just components).
- **`nf4_pack` / `nf4_unpack`** — low-level ops for 4-bit packing to uint8 (see the sketch after this list).
- **Layer shapes stored in metadata** — `reconstruct()` uses stored shapes instead of deriving from `numel() // rank`, supporting per-layer rank configs.
- **`__repr__` on core objects** — `SharedSubspace`, `TaskProjection`, `LoRAWeights` now print useful info.
- **`adaptive_k` preserved through `absorb()`** — subspaces built with `adaptive_k=True` retain that setting after absorption.
- QLoRA + vLoRA pipeline example (`examples/qlora_pipeline.py`).
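
For intuition, the nibble packing behind `nf4_pack`/`nf4_unpack` looks roughly like this — a minimal sketch assuming an even-length stream of 4-bit code indices (function bodies illustrative, not the library's implementation):

```python
import torch

def pack_nibbles(indices: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit code indices (values 0-15) two-per-byte into uint8."""
    flat = indices.to(torch.uint8).flatten()
    # Even positions fill the high nibble, odd positions the low nibble.
    return (flat[0::2] << 4) | flat[1::2]

def unpack_nibbles(packed: torch.Tensor) -> torch.Tensor:
    """Invert pack_nibbles, recovering the original index stream."""
    return torch.stack([packed >> 4, packed & 0x0F], dim=1).flatten()

codes = torch.randint(0, 16, (8,), dtype=torch.uint8)
assert torch.equal(unpack_nibbles(pack_nibbles(codes)), codes)
```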
21+
22+
### Fixed
23+
- **`absorb_incremental` re-projection bug** — existing tasks were having loadings padded/truncated instead of properly re-projected when the basis rotated. Now reconstructs from old basis and projects onto updated basis.
24+
- **`VLoRACallback` was a no-op** — the HF Trainer callback created an optimizer but never stepped it. Now registers differentiable forward hooks so the Trainer's backward pass produces gradients on loadings, and steps the optimizer in `on_step_end`.
25+
- **TIES merge normalization**`n / contributor_count` over-scaled output when elements were trimmed. Fixed to `1 / contributor_count`.
26+
- **`__version__` mismatch**`__init__.py` said 0.1.0 while `pyproject.toml` said 0.2.1.
27+
- **`check_tensor_health` never called** — imported but unused; now wired up after SVD in `from_adapters`.
28+
- **Task ID collision**`absorb()` and `absorb_incremental()` now warn when overwriting an existing task ID.
29+
- **Filesystem-unsafe task IDs**`save()` now sanitizes task IDs for filenames (handles `/`, `:`, spaces) with a mapping in metadata for lossless round-trip.
30+
- **`from_adapters_streaming` missing validation** — now checks `len(task_ids) == len(adapter_paths)`.
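
The `absorb_incremental` fix boils down to one idea: loadings are coordinates relative to a basis, so when the basis changes they must be recomputed, not resized. A standalone sketch with assumed names and shapes (not the library's code):

```python
import torch

D, k_old, k_new = 128, 3, 4

U_old = torch.linalg.qr(torch.randn(D, k_old))[0].T  # (k_old, D) orthonormal rows
U_new = torch.linalg.qr(torch.randn(D, k_new))[0].T  # (k_new, D), the updated basis
c_old = torch.randn(k_old)                           # a task's loadings in the old basis

# The old bug: pad/truncate c_old to length k_new — those coordinates are
# meaningless once the basis has rotated.

# The fix: reconstruct the flattened weights, then project onto the new basis.
w_approx = c_old @ U_old   # (D,) reconstruction from the old subspace
c_new = U_new @ w_approx   # (k_new,) loadings in the updated basis
```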

### Changed

- **`gram_schmidt` uses QR factorization** — replaced the O(k^2 * D) inner loop with `torch.linalg.qr` for better performance and numerical stability (see the sketch after this list).
- **VLoRAModel caches module handles** — `_apply_hooks` no longer scans all `named_modules()` on every task switch.
- **VLoRAModel inference hooks wrapped in `torch.no_grad()`** — prevents unnecessary autograd tracking.
- **NF4 quantization uses `torch.bucketize`** — replaced the O(N*16) distance broadcast with binary search, reducing memory from O(N*16) to O(N).
- **`_LORA_KEY_RE` handles multi-adapter PEFT format** — supports `base_model.model.{layer}.lora_A.{adapter_name}.weight`.
- **`save_adapter` no longer hardcodes `CAUSAL_LM`** — task type left for PEFT to infer.
- Repo URL updated to `github.com/vlora-dev/vlora`.
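
The `gram_schmidt` change is a standard swap: QR yields an orthonormal basis for the same span, vectorized and stabler on nearly-dependent rows. A sketch of the QR route for row-stored components (function name and row convention assumed, not the library's code):

```python
import torch

def gram_schmidt_qr(rows: torch.Tensor) -> torch.Tensor:
    """Orthonormalize the rows of a (k, D) matrix via reduced QR on its transpose."""
    # torch.linalg.qr orthonormalizes *columns*, so operate on rows.T.
    Q, _ = torch.linalg.qr(rows.T, mode="reduced")  # Q: (D, k), orthonormal columns
    return Q.T                                      # back to (k, D) orthonormal rows

V = torch.randn(5, 64)
U = gram_schmidt_qr(V)
assert torch.allclose(U @ U.T, torch.eye(5), atol=1e-5)
```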

## [0.2.1] - 2026-02-10

Initial public release on PyPI as `vlora-dev`.

### Added

- `SharedSubspace` — 3-step algorithm: from_adapters, project, absorb (sketched after this list)
- `VLoRAModel` — inference wrapper with forward hooks
- `SubspaceTrainer` — loadings-only training
- `TaskRouter` — per-input adapter routing
- `task_arithmetic`, `ties_merge`, `dare_merge` — adapter merging
- Analysis tools: similarity matrix, clustering, outlier detection
- CLI with 9 commands
- HuggingFace Trainer integration via `VLoRACallback`
- Streaming and incremental subspace construction
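
For orientation, here is a compressed sketch of that three-step flow; the toy adapters are built the way this commit's test helper builds them (shapes and `num_components` arbitrary):

```python
import torch
from vlora.io import LoRAWeights
from vlora.subspace import SharedSubspace

layers = ["layer.0.q_proj", "layer.0.v_proj"]

# Input: a few rank-4 LoRA adapters over the same layers.
adapters = [
    LoRAWeights(
        layer_names=layers,
        lora_a={l: torch.randn(4, 32) for l in layers},
        lora_b={l: torch.randn(32, 4) for l in layers},
        rank=4,
    )
    for _ in range(3)
]

# Steps 1-2: SVD the stacked adapters into a shared basis, projecting each
# task down to a small per-layer loadings vector.
subspace = SharedSubspace.from_adapters(adapters, num_components=2)

# Step 3: fold a later adapter into the existing basis under a new task ID.
subspace.absorb(adapters[-1], task_id="extra_task")
```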

README.md

Lines changed: 82 additions & 6 deletions
@@ -16,7 +16,7 @@ pip install vlora-dev

Or from source:
```bash
-git clone https://github.com/tveseli/vlora.git
+git clone https://github.com/vlora-dev/vlora.git
cd vlora
pip install -e ".[dev]"
```
@@ -101,6 +101,77 @@ output = model(input_ids)
print(model.available_tasks) # ["task_0", "task_1", ...]
```

## QLoRA Support

vLoRA has first-class support for [QLoRA](https://arxiv.org/abs/2305.14314) workflows. QLoRA compresses the **base model** (FP16 → 4-bit NF4), while vLoRA compresses the **adapter space** — these are orthogonal and stack multiplicatively.

### NF4 Quantization

Quantize subspace components using the same NF4 data type from QLoRA — 16 quantile levels optimized for normally-distributed weights:

```python
# NF4 quantization (better than symmetric int4 for normal-ish weights)
subspace.quantize(method="nf4")

# With double quantization (quantize the per-block scales too)
subspace.quantize(method="nf4", double_quant=True)

# Also quantize loadings (effective when loadings are approximately normal)
subspace.quantize(method="nf4", quantize_loadings=True)
```
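
For intuition about what `method="nf4"` does per block: scale by the block's absmax, snap each value to the nearest of 16 fixed levels, store the 4-bit index plus the scale. A self-contained sketch — the levels are the published QLoRA NF4 constants (rounded here), the block size is an assumed default, and the `torch.bucketize` lookup mirrors the optimization noted in this release's changelog:

```python
import torch

# The 16 NF4 quantile levels from the QLoRA paper (rounded to four decimals).
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def nf4_roundtrip(w: torch.Tensor, block_size: int = 64) -> torch.Tensor:
    """Quantize-dequantize with per-block absmax NF4 (illustrative sketch).

    Assumes w.numel() is divisible by block_size.
    """
    blocks = w.flatten().reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normed = blocks / scales  # every block now lies in [-1, 1]
    # Snap each value to the nearest level via binary search over midpoints
    # (O(N log 16) instead of an O(N * 16) distance broadcast).
    midpoints = (NF4_LEVELS[1:] + NF4_LEVELS[:-1]) / 2
    idx = torch.bucketize(normed, midpoints)  # one 4-bit code per value, 0..15
    return (NF4_LEVELS[idx] * scales).reshape(w.shape)

w = torch.randn(4, 64)
print((w - nf4_roundtrip(w)).abs().mean())  # small reconstruction error
```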

### Packed NF4 Storage

Save the subspace in packed 4-bit format for ~7× disk savings:

```python
# Save: packs components as uint8 (two 4-bit values per byte)
subspace.save_quantized("shared_subspace/")

# Load: auto-detects format, dequantizes on the fly
subspace = SharedSubspace.load("shared_subspace/")
```
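
Where the ~7× figure (rather than a clean 8× for FP32 → 4-bit) comes from: per-block absmax scales add overhead. Assuming, for illustration, a 32-bit scale per 64-value block, storage is 4 + 32/64 = 4.5 bits per value, and 32 / 4.5 ≈ 7.1×; the exact ratio depends on block size and metadata.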

### QLoRA Base Model

`VLoRAModel` works with quantized base models loaded via bitsandbytes:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from vlora import VLoRAModel, SharedSubspace

# Load 4-bit base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained("model-name", quantization_config=bnb_config)

# Wrap with vLoRA — compute_dtype ensures LoRA math runs in BF16
subspace = SharedSubspace.load("shared_subspace/")
model = VLoRAModel(base_model, subspace, compute_dtype=torch.bfloat16)

print(model.qlora_info) # {'quantized': True, 'method': 'nf4', ...}
model.set_task("task_0")
output = model(input_ids)
```
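
A note on the `compute_dtype` choice: bitsandbytes dequantizes NF4 weights to `bnb_4bit_compute_dtype` for each matmul, so running the LoRA-side math in the same dtype (BF16 here) keeps the hook outputs dtype-compatible with the base layer's activations and avoids silent FP32 upcasts. How `VLoRAModel` casts internally is the library's detail; matching the two dtypes is the safe default.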

### Full-Stack Compression

Report combined savings across base model quantization and adapter compression:

```python
stats = subspace.full_stack_compression(
    base_model_params=7_000_000_000,  # 7B model
    base_model_bits=16,  # original FP16
    quantized_bits=4,  # QLoRA NF4
)
# → {'total_compression_ratio': 4.0, 'total_original_bytes': 14.0 GB, ...}
```
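
The example numbers are easy to verify by hand: 7e9 params × 2 bytes (FP16) = 14 GB of original base weights, and 16-bit → 4-bit gives 16 / 4 = 4×, which matches the reported `total_compression_ratio` when adapter storage is negligible next to the base model.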

See [`examples/qlora_pipeline.py`](examples/qlora_pipeline.py) for a complete end-to-end example.

## Training in the Subspace

Train only the loadings vector (k params per layer) instead of full LoRA matrices — 100×+ parameter reduction:
@@ -183,8 +254,10 @@ merged = dare_merge(adapters, drop_rate=0.5, seed=42)
# Adaptive k: different components per layer based on explained variance
subspace = SharedSubspace.from_adapters(adapters, adaptive_k=True, variance_threshold=0.9)

-# Quantize components for smaller memory footprint
-subspace.quantize(bits=8) # or bits=4
+# Quantize components — symmetric (int8/int4) or NF4
+subspace.quantize(bits=8) # symmetric int8
+subspace.quantize(method="nf4") # NF4 4-bit (better for normal weights)
+subspace.quantize(method="nf4", double_quant=True) # + quantize the scales

# Check compression stats
stats = subspace.compression_stats()
@@ -231,14 +304,16 @@ subspace.to(device="cuda", dtype=torch.float16)
- `.absorb(adapter, task_id)` — Incorporate + recompute (full SVD)
- `.absorb_incremental(adapter, task_id)` — Fast incremental update
- `.get_trainable_params(task_id)` — For training integration
-- `.quantize(bits=8)` — Quantize components (int8/int4)
+- `.quantize(bits=8, method="symmetric")` — Quantize components (int8/int4/NF4)
- `.compression_stats()` — Compression ratio and parameter counts
+- `.full_stack_compression(base_model_params)` — Combined base + adapter stats
- `.to(device, dtype)` — Move tensors to device/dtype
-- `.save(path)` / `.load(path)` — Serialization
+- `.save(path)` / `.save_quantized(path)` / `.load(path)` — Serialization (NF4-packed auto-detected)

### Model Integration

-- **`VLoRAModel(base_model, subspace, lora_alpha=None)`** — Inference wrapper with forward hooks
+- **`VLoRAModel(base_model, subspace, lora_alpha=None, compute_dtype=None)`** — Inference wrapper with forward hooks
+- `.qlora_info` — Base model quantization metadata
- `.set_task(task_id)` — Switch adapter (cached)
- `.clear_task()` — Remove adapter
- `.available_tasks` — List task IDs

@@ -289,6 +364,7 @@ subspace.to(device="cuda", dtype=torch.float16)

- `compute_svd`, `project_onto_subspace`, `reconstruct_from_subspace`
- `gram_schmidt`, `explained_variance_ratio`, `select_num_components`
- `incremental_svd_update`
+- `nf4_quantize_dequantize`, `nf4_pack`, `nf4_unpack` — NF4 quantization (QLoRA)

## Benchmarks — Real-World Adapters
294370

tests/test_huggingface.py

Lines changed: 79 additions & 6 deletions
@@ -1,28 +1,52 @@
"""Tests for vlora.integrations.huggingface — HF Trainer callback."""

import torch
+import torch.nn as nn
import pytest

from vlora.io import LoRAWeights
from vlora.subspace import SharedSubspace
from vlora.training import orthogonal_init


+LAYERS = ["layer.0.q_proj", "layer.0.v_proj"]
+DIM = 32
+RANK = 4
+
+
def _make_subspace():
    """Create a small subspace for testing."""
-    layers = ["layer.0.q_proj", "layer.0.v_proj"]
-    shared_a = {l: torch.randn(3, 4 * 32) for l in layers}
-    shared_b = {l: torch.randn(3, 32 * 4) for l in layers}
+    shared_a = {l: torch.randn(3, RANK * DIM) for l in LAYERS}
+    shared_b = {l: torch.randn(3, DIM * RANK) for l in LAYERS}

    adapters = []
    for i in range(3):
-        lora_a = {l: (torch.randn(3) @ shared_a[l]).reshape(4, 32) for l in layers}
-        lora_b = {l: (torch.randn(3) @ shared_b[l]).reshape(32, 4) for l in layers}
-        adapters.append(LoRAWeights(layer_names=layers, lora_a=lora_a, lora_b=lora_b, rank=4))
+        lora_a = {l: (torch.randn(3) @ shared_a[l]).reshape(RANK, DIM) for l in LAYERS}
+        lora_b = {l: (torch.randn(3) @ shared_b[l]).reshape(DIM, RANK) for l in LAYERS}
+        adapters.append(LoRAWeights(layer_names=LAYERS, lora_a=lora_a, lora_b=lora_b, rank=RANK))

    return SharedSubspace.from_adapters(adapters, num_components=2)


+class _TinyModel(nn.Module):
+    """Minimal model with named Linear layers matching the subspace."""
+
+    def __init__(self):
+        super().__init__()
+        # Build nested structure so named_modules() produces "layer.0.q_proj" etc.
+        layer_0 = nn.Module()
+        layer_0.add_module("q_proj", nn.Linear(DIM, DIM, bias=False))
+        layer_0.add_module("v_proj", nn.Linear(DIM, DIM, bias=False))
+        layer = nn.Module()
+        layer.add_module("0", layer_0)
+        self.add_module("layer", layer)
+
+    def forward(self, x):
+        x = self.layer._modules["0"].q_proj(x)
+        x = self.layer._modules["0"].v_proj(x)
+        return x
+
+
class TestVLoRACallbackImport:
    def test_import_without_transformers(self):
        """VLoRACallback should be importable even without transformers."""
@@ -110,3 +134,52 @@ def test_callback_logs_metrics(self):
        vlora_logs = [l for l in state.log_history if "vlora/loadings_norm" in l]
        assert len(vlora_logs) == 1
        assert vlora_logs[0]["vlora/loadings_norm"] >= 0
+
+    def test_callback_actually_trains_loadings(self):
+        """Verify loadings change after forward+backward+on_step_end.
+
+        This is the critical integration test: the old callback was a
+        no-op that never stepped its optimizer. The new callback registers
+        differentiable hooks so the Trainer's backward pass produces
+        gradients on loadings, and on_step_end steps the optimizer.
+        """
+        from vlora.integrations.huggingface import VLoRACallback
+        from transformers import TrainerState, TrainerControl, TrainingArguments
+
+        sub = _make_subspace()
+        orthogonal_init(sub, "test_task")
+
+        # Snapshot loadings before training
+        initial_loadings = {
+            l: sub.tasks["test_task"].loadings_a[l].clone()
+            for l in LAYERS
+        }
+
+        callback = VLoRACallback(sub, "test_task", lr=1e-2, log_every=999)
+        args = TrainingArguments(output_dir="/tmp/test", use_cpu=True)
+        state = TrainerState()
+        state.global_step = 1
+        control = TrainerControl()
+
+        # on_train_begin registers differentiable hooks on the model
+        model = _TinyModel()
+        callback.on_train_begin(args, state, control, model=model)
+
+        # Simulate a training step: forward → loss → backward
+        x = torch.randn(2, DIM)
+        output = model(x)
+        loss = output.sum()
+        loss.backward()
+
+        # on_step_end should step the loadings optimizer
+        callback.on_step_end(args, state, control)
+
+        # Write back and check loadings changed
+        callback.trainer.write_back()
+        changed = False
+        for l in LAYERS:
+            if not torch.equal(sub.tasks["test_task"].loadings_a[l], initial_loadings[l]):
+                changed = True
+                break
+
+        assert changed, "Loadings did not change after training step — optimizer may not be stepping"
