Release v0.3.0: NF4 packed storage, QLoRA pipeline example
- Add QLoRA Support section to README with NF4 quantization,
packed storage, QLoRA base model, and full-stack compression docs
- Update repo URL to github.com/vlora-dev/vlora
- Update API reference for new quantize/save_quantized/qlora_info APIs
- Add CHANGELOG.md covering all 0.3.0 changes
- Add VLoRACallback integration test proving loadings actually change
after forward+backward+on_step_end (the old callback was a no-op)
# Changelog

All notable changes to this project will be documented in this file.

Format follows [Keep a Changelog](https://keepachangelog.com/).
## [0.3.0] - 2026-03-30
### Added
- **NF4 quantization** — 4-bit NormalFloat quantization from QLoRA (Dettmers et al., 2023). `subspace.quantize(method="nf4")` uses 16 quantile levels optimized for normally-distributed weights, with per-block absmax scaling. Lower error than symmetric int4.
- **Double quantization** — quantize per-block NF4 scales to FP8 via `double_quant=True`, reducing scale overhead from 0.5 to ~0.127 bits/param.
- **NF4 packed storage** — `subspace.save_quantized()` packs components as uint8 (two 4-bit indices per byte) for ~7x disk savings. `SharedSubspace.load()` auto-detects the format (see the usage sketch after this list).
- **QLoRA-aware VLoRAModel** — `compute_dtype` parameter for mixed-precision LoRA computation with quantized base models; `qlora_info` property for base model introspection.
- **`full_stack_compression()`** — report combined base model quantization + adapter compression savings.
- **`quantize_loadings` parameter** — optionally quantize per-task loadings (not just components).
- **`nf4_pack` / `nf4_unpack`** — low-level ops for 4-bit packing to uint8.
- **Layer shapes stored in metadata** — `reconstruct()` uses stored shapes instead of deriving from `numel() // rank`, supporting per-layer rank configs.
- **`__repr__` on core objects** — `SharedSubspace`, `TaskProjection`, `LoRAWeights` now print useful info.
- **`adaptive_k` preserved through `absorb()`** — subspaces built with `adaptive_k=True` retain that setting after absorption.
- QLoRA + vLoRA pipeline example (`examples/qlora_pipeline.py`).
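
A minimal sketch of how these additions compose. The `vlora` import path, the `VLoRAModel` constructor signature, and `full_stack_compression()` living on the model are assumptions here; see the API reference for exact signatures.

```python
import torch
from vlora import SharedSubspace, VLoRAModel  # import path is an assumption

# Quantize components (and optionally loadings) to NF4, then save packed.
subspace = SharedSubspace.load("subspace/")
subspace.quantize(method="nf4", double_quant=True, quantize_loadings=True)
subspace.save_quantized("subspace_nf4/")     # uint8-packed, ~7x smaller on disk

# load() auto-detects the packed format on the way back in.
subspace = SharedSubspace.load("subspace_nf4/")

# Mixed-precision LoRA math over a quantized base model. `base_model` is a
# placeholder for an NF4-quantized backbone; this constructor is an assumption.
model = VLoRAModel(base_model, subspace, compute_dtype=torch.bfloat16)
print(model.qlora_info)                  # base-model quantization introspection
print(model.full_stack_compression())    # combined base + adapter savings
```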
### Fixed
- **`absorb_incremental` re-projection bug** — existing tasks had their loadings padded/truncated instead of properly re-projected when the basis rotated. Now reconstructs from the old basis and projects onto the updated basis (see the sketch after this list).
- **`VLoRACallback` was a no-op** — the HF Trainer callback created an optimizer but never stepped it. Now registers differentiable forward hooks so the Trainer's backward pass produces gradients on loadings, and steps the optimizer in `on_step_end`.
- **TIES merge normalization** — `n / contributor_count` over-scaled the output when elements were trimmed. Fixed to `1 / contributor_count` (also sketched below).
- **`__version__` mismatch** — `__init__.py` said 0.1.0 while `pyproject.toml` said 0.2.1.
- **`check_tensor_health` never called** — imported but unused; now wired up after SVD in `from_adapters`.
- **Task ID collision** — `absorb()` and `absorb_incremental()` now warn when overwriting an existing task ID.
- **Filesystem-unsafe task IDs** — `save()` now sanitizes task IDs for filenames (handling `/`, `:`, and spaces), with a mapping in metadata for a lossless round-trip.
- **`from_adapters_streaming` missing validation** — now checks `len(task_ids) == len(adapter_paths)`.
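
Minimal sketches of the two algorithmic fixes above, with illustrative names that are not vLoRA's internals:

```python
import torch

# absorb_incremental fix: when the shared basis rotates from U_old (D x k_old)
# to U_new (D x k_new), both with orthonormal columns, existing loadings are
# reconstructed in the full space and projected onto the new basis, rather
# than being padded or truncated to the new rank.
def reproject(loadings_old: torch.Tensor, U_old: torch.Tensor,
              U_new: torch.Tensor) -> torch.Tensor:
    return U_new.T @ (U_old @ loadings_old)

# TIES fix: after trimming, each surviving element is averaged over the
# adapters that actually contribute to it (1 / contributor_count, not
# n / contributor_count, which over-scaled the output).
def ties_normalize(summed: torch.Tensor,
                   contributor_count: torch.Tensor) -> torch.Tensor:
    return summed / contributor_count.clamp_min(1)
```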
### Changed
- **`gram_schmidt` uses QR factorization** — replaced the O(k^2 * D) inner loop with `torch.linalg.qr` for better performance and numerical stability (see the sketch after this list).
- **VLoRAModel caches module handles** — `_apply_hooks` no longer scans all `named_modules()` on every task switch.
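
A quick sketch of what the QR swap buys, assuming basis vectors are stored as columns:

```python
import torch

# One reduced QR call replaces the O(k^2 * D) classical Gram-Schmidt loop and
# is markedly more stable for ill-conditioned inputs.
M = torch.randn(4096, 32)   # k=32 basis vectors in a D=4096-dim space
Q, _ = torch.linalg.qr(M)   # Q: (4096, 32) with orthonormal columns
assert torch.allclose(Q.T @ Q, torch.eye(32), atol=1e-5)
```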
## QLoRA Support

vLoRA has first-class support for [QLoRA](https://arxiv.org/abs/2305.14314) workflows. QLoRA compresses the **base model** (FP16 → 4-bit NF4), while vLoRA compresses the **adapter space** — these are orthogonal and stack multiplicatively.
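For intuition: with double quantization the base model costs roughly 4 + 0.127 ≈ 4.13 bits per parameter, about a 3.9× reduction from FP16's 16 bits, and whatever factor vLoRA saves on the adapter space multiplies on top of that.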
### NF4 Quantization
Quantize subspace components using the same NF4 data type from QLoRA — 16 quantile levels optimized for normally-distributed weights:
```python
# NF4 quantization (better than symmetric int4 for normal-ish weights)
subspace.quantize(method="nf4")

# With double quantization (quantize the per-block scales too)
subspace.quantize(method="nf4", double_quant=True)
```
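
Under the hood, NF4 maps each weight to the nearest of 16 fixed levels after per-block absmax scaling, and packed storage fits two 4-bit indices per byte. A self-contained sketch of the idea, with the code book rounded from the QLoRA paper; vLoRA's actual kernels and `nf4_pack` layout may differ:

```python
import torch

# Approximate NF4 code book: 16 quantile levels of N(0, 1) normalized to
# [-1, 1] (rounded; see Dettmers et al., 2023).
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def nf4_quantize(x: torch.Tensor, block_size: int = 64):
    """Per-block absmax NF4 quantization (illustrative only)."""
    blocks = x.flatten().view(-1, block_size)   # assumes numel % block_size == 0
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    normed = blocks / scales                    # each block now lies in [-1, 1]
    idx = (normed.unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1)
    return idx.to(torch.uint8), scales.squeeze(1)

def nf4_pack(idx: torch.Tensor) -> torch.Tensor:
    """Pack two 4-bit indices per byte (assumes an even element count)."""
    flat = idx.flatten()
    return (flat[0::2] << 4) | flat[1::2]

x = torch.randn(4, 256)
idx, scales = nf4_quantize(x)
packed = nf4_pack(idx)
assert packed.numel() == x.numel() // 2   # two weights per byte on disk
```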