Full Alignment with the Original Research Paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764)
Version: 1.0
Date: March 17, 2026
Author: Grok (on behalf of sharpninja / BitNet-b1.58-Sharp team)
Status: Reference Blueprint – Copy-paste ready for your project wiki or docs/roadmap.md
Important Notes Before Starting
- This repository now targets only the paper-aligned transformer path. Any earlier toy or bigram prototype is treated as retired legacy code and is not part of the supported runtime surface.
- Zero C# source code appears anywhere in this document – only architecture, pseudologic, UML, formulas, and process.
- All diagrams use Mermaid (native GitHub rendering).
- Exact fidelity to paper: absmean quantization, per-token 8-bit activations, BitLinear everywhere, LLaMA-style decoder-only Transformer, STE gradients, no biases, RMSNorm + SwiGLU + RoPE.
- Target starting scale: 4-layer, 256-dim, 32k-vocab “nano” model (~30 M parameters) that fits in <200 MB RAM and trains on a single consumer CPU/GPU in hours.
- Executive Summary & Success Criteria
- Prerequisites & Repository Setup
- Overall Architecture – High-Level UML
- Phase 0: Documentation & Project Realignment (1–2 days)
- Phase 1: Exact BitLinear Implementation (3–5 days)
- Phase 2: Tiny Transformer Skeleton (7–10 days)
- Phase 3: Training Loop with STE & Data Pipeline (10–14 days)
- Phase 4: Inference Engine, Serialization & Benchmarks (5–7 days)
- Phase 5: Validation, Testing & Paper Alignment Checklist (3 days)
- Full UML Catalog (Object & Logic Examples)
- Risk Register & Mitigation
- Timeline, Milestones & Effort Estimates
- Future Extensions
Goal: Transform the current bigram toy into the canonical .NET reference implementation of the exact BitNet b1.58 architecture described in the paper.
Paper-Exact Requirements (must be met 100%)
- Every linear projection → BitLinear with ternary weights {-1, 0, +1}.
- Quantization formula (a reference sketch follows this list):
$$\gamma = \frac{1}{nm} \sum_{i,j} |W_{ij}| \quad \text{(absmean across all }nm\text{ weights)}$$
$$W_q = \text{RoundClip}\left(\frac{W}{\gamma + \epsilon}, -1, 1\right)$$
where ε = 1e-6 and RoundClip rounds to the nearest integer, then clamps to [-1, 1].
- Training: straight-through estimator (STE) – the forward pass uses quantized weights, the backward pass propagates the full-precision gradient.
- Activations: signed 8-bit per-token scaling (no zero-point).
- Architecture: LLaMA-identical components (RMSNorm, SwiGLU FFN, RoPE, no biases, decoder-only).
- No external Python dependencies after Phase 3 (TorchSharp allowed only as optional bridge).
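The quantization rules above can be pinned down with a short reference sketch (Python/NumPy, used here purely as pseudologic; the function names are illustrative and not part of the .NET API):

```python
import numpy as np

EPS = 1e-6   # epsilon from the paper's RoundClip formula
QB = 127     # signed 8-bit activation bound

def quantize_weights_absmean(w: np.ndarray):
    """Ternarize a full-precision weight matrix with the absmean rule."""
    gamma = np.abs(w).mean()                              # absmean over all n*m entries
    w_q = np.clip(np.round(w / (gamma + EPS)), -1, 1)     # RoundClip -> {-1, 0, +1}
    return w_q.astype(np.int8), float(gamma)

def quantize_activations_per_token(x: np.ndarray):
    """Per-token absmax scaling into [-QB, QB]; no zero-point."""
    scale = QB / np.maximum(np.abs(x).max(axis=-1, keepdims=True), EPS)
    x_q = np.clip(np.round(x * scale), -QB, QB).astype(np.int8)
    return x_q, scale
```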
Success Criteria (measurable)
- Perplexity on a RedPajama subset within 5% of the paper-reported 700 M-model baseline (perplexity defined below).
- Memory footprint ≤ 1.58 bits/parameter (verified by weight histogram).
- Inference latency on CPU < 2× fp16 LLaMA equivalent (nano model).
- Model file loads in llama.cpp BitNet fork without modification.
- 100% test coverage on quantization, STE, and forward pass.
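For reference, the perplexity used in the criterion above is the exponentiated mean next-token negative log-likelihood over the held-out tokens:

$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)\right)$$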
- .NET 10 SDK (global.json already pinned).
- Optional: TorchSharp NuGet (for tensor/autograd in Phase 3; can be removed later).
- Datasets: 1 % RedPajama sample (or TinyStories 10 M tokens) – download script in Phase 3.
- Branch strategy: `main` = paper-aligned; `feature/bitlinear` etc. for PRs.
- New folders to create:
  - `src/BitNetSharp.Core/Layers/`
  - `src/BitNetSharp.Core/Models/`
  - `src/BitNetSharp.Core/Quantization/`
  - `src/BitNetSharp.Core/Training/`
  - `src/BitNetSharp.Core/Utils/` (RoPE, RMSNorm, SwiGLU)
- Archive current bigram files into `archive/2026-03-bigram-prototype`.
classDiagram
direction TB
class BitNetTransformer {
+int NumLayers
+int Dim
+int VocabSize
+TokenEmbedding[] Embeddings
+BitNetLayer[] Layers
+BitLinear OutputHead
+forward(inputTokens: int[]) Tensor
}
class BitNetLayer {
+RMSNorm PreAttnNorm
+MultiHeadAttention Attn
+RMSNorm PreFFNNorm
+SwiGLUFeedForward FFN
}
class BitLinear {
+float Gamma
-sbyte[][] TernaryWeights
+quantize(fullPrecision: float[][])
+forward(activations: float[][]) : float[][]
+backwardSTE(grad: float[][]) : float[][]
}
class RMSNorm {
+float Epsilon
}
class SwiGLUFeedForward {
+BitLinear GateProj
+BitLinear UpProj
+BitLinear DownProj
}
class MultiHeadAttention {
+BitLinear QProj
+BitLinear KProj
+BitLinear VProj
+BitLinear OProj
+RoPE Rotator
}
BitNetTransformer --> BitNetLayer : contains N
BitNetLayer --> BitLinear : uses 7×
BitNetLayer --> RMSNorm
BitNetLayer --> SwiGLUFeedForward
BitNetLayer --> MultiHeadAttention
Objectives
Rebrand and set expectations; archive old model.
Detailed Steps
- Rename repo description to “.NET Reference Implementation of BitNet b1.58 (arXiv:2402.17764)”.
- Replace root README.md with new template (status banner, paper link, quick-start after Phase 4).
- Create `docs/paper-alignment.md` containing the paper-alignment table and the success criteria above.
- Add `docs/architecture-overview.md` with the high-level UML above + 3 more diagrams (see Section 10).
- Update `SUMMARY.md` (GitBook) with new navigation: Architecture → BitLinear → Transformer → Training.
- Archive bigram code + update .gitignore for any temporary checkpoints.
- Add LICENSE header to every new file stub.
- Create GitHub issue template “Paper-Alignment-Task”.
- Add badges: .NET 10 | arXiv 2402.17764 | WIP.
- Commit as “chore: Phase 0 alignment – documentation baseline”.
Effort: 4–6 hours.
Deliverable: Repo now screams “this is the paper, not the toy”.
Objectives
Implement the single most important primitive exactly as Section 2 of the paper.
Detailed Steps
- Create abstract base `Module` (tensor-in/tensor-out contract).
- Implement `BitLinear` class with attributes: Gamma (absmean scale), TernaryWeights (sbyte 2D), optional ScaleCache.
- Implement `QuantizeFromFullPrecision` using the exact absmean + RoundClip formula (include epsilon = 1e-6).
- Forward pass: ternary matrix multiplication + per-token activation scaling to the signed 8-bit range [-Q_b, Q_b] where Q_b = 127.
- Backward pass: STE – pass the full-precision gradient through unchanged, ignoring the quantization step (see the sketch after this list).
- Add `ToFullPrecision()` helper for debugging.
- Unit-test matrix: 10 random FP32 matrices → verify the ternary histogram matches expectation (roughly one third each of -1/0/+1 for Gaussian-initialized weights).
- Add weight-distribution histogram logger (reuse your existing visualizer).
- Create `BitLinearConfig` record (dimIn, dimOut, bias=false).
- Integration test: replace a single dense matrix multiply with BitLinear and verify numerical equivalence within 1e-4 against the dense reference before quantization (after quantization, check shapes and finite values only).
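A minimal sketch of the STE forward/backward contract, written in PyTorch-style Python only to make the gradient behavior concrete (the detach trick is one common way to realize STE; the C# implementation may wire this differently):

```python
import torch

def ste_weight_quant(w: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Forward: dequantized ternary weights. Backward: identity w.r.t. w."""
    gamma = w.abs().mean()
    w_ternary = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)
    w_deq = w_ternary * gamma                  # scale back so magnitudes stay comparable
    # Straight-through estimator: the value of w_deq in the forward pass,
    # but gradients flow to the full-precision master weights unchanged.
    return w + (w_deq - w).detach()
```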
UML – BitLinear Object Model
classDiagram
class BitLinear {
+float Gamma
-sbyte[][] TernaryWeights
-float[][] ScaleCache
+QuantizeFromFullPrecision(fullW: float[][])
+Forward(inputAct: float[][]) : float[][]
+BackwardSTE(gradOutput: float[][]) : float[][]
+GetTernaryStats() : {minus1: int, zero: int, plus1: int}
}
UML – Quantization Logic Sequence
sequenceDiagram
participant Trainer
participant BitLinear
participant AbsMeanCalculator
participant RoundClip
Trainer->>BitLinear: QuantizeFromFullPrecision(fullW)
BitLinear->>AbsMeanCalculator: Compute γ = mean(|W|)
AbsMeanCalculator-->>BitLinear: γ
BitLinear->>RoundClip: W / (γ + ε)
RoundClip-->>BitLinear: clipped ternary
BitLinear->>BitLinear: Store TernaryWeights & Gamma
Objectives
Assemble LLaMA-identical decoder block using BitLinear everywhere.
Detailed Steps
- Implement `RMSNorm` (paper exact: epsilon = 1e-5).
- Implement `RoPE` rotator (apply to Q/K only – 50-line math, no code here).
- Implement `SwiGLUFeedForward` with three BitLinear projections.
- Implement `MultiHeadAttention` with four BitLinear (Q, K, V, O) + RoPE + scaled dot-product.
- Implement `BitNetLayer` composing PreAttnNorm → Attn → Residual Add → PreFFNNorm → SwiGLU → Residual Add (pre-norm, as in LLaMA; a pseudologic sketch follows this list).
- Implement `BitNetTransformer` with TokenEmbedding (FP32) + N layers + output BitLinear head.
- Add config class `BitNetConfig` mirroring nano-LLaMA (layers=4, dim=256, heads=8, vocab=32000).
- Stub `forward` method that chains embeddings → layers → logits.
- Add shape-validation assertions at every layer boundary.
- Create integration test: random input tokens → verify output tensor shape and non-NaN values.
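A pseudologic sketch (NumPy) of the pre-norm residual wiring above; `rms_norm`, `swiglu_ffn`, and the weight arguments are placeholders for the components listed in the steps, not the eventual C# API:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm: no mean subtraction, no bias (paper / LLaMA convention).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # down( SiLU(gate(x)) * up(x) ); each projection is a BitLinear in the real model.
    gate, up = x @ w_gate, x @ w_up
    silu = gate * (1.0 / (1.0 + np.exp(-gate)))
    return (silu * up) @ w_down

def bitnet_layer_forward(x, attn, ffn, norm_attn_w, norm_ffn_w):
    # Pre-norm residual flow: normalize, transform, then add the residual.
    h = x + attn(rms_norm(x, norm_attn_w))
    return h + ffn(rms_norm(h, norm_ffn_w))
```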
UML – Single Layer Logic Flow (Activity Diagram)
flowchart TD
A[Input Tokens] --> B[Token Embedding]
B --> C[RMSNorm Pre-Attn]
C --> D[MultiHeadAttention<br/>Q/K/V/O = BitLinear + RoPE]
D --> E[Residual Add]
E --> F[RMSNorm Pre-FFN]
F --> G[SwiGLU FFN<br/>3× BitLinear]
G --> H[Residual Add]
H --> I[Output Logits via BitLinear Head]
Objectives
Full next-token prediction training matching paper Section 4.
Detailed Steps
- Implement `DataLoader` for tokenized RedPajama (batch, seqLen=2048, packing).
- Implement `CrossEntropyLoss` with STE wrapper.
- Create `Trainer` class with AdamW (paper defaults: lr=3e-4, weight-decay=0.1).
- In each training step (sketched after this list):
- Forward quantized
- Compute loss
- Backward through STE
- Optimizer step
- Periodic re-quantize every 100 steps (paper trick).
- Add gradient clipping (norm=1.0).
- Logging: perplexity, weight sparsity, ternary ratio, learning rate.
- Checkpointing: save Gamma + TernaryWeights + optimizer state every epoch.
- Validation split evaluation every 500 steps.
- Early-stop on validation plateau.
- Support mixed-precision (FP32 weights during training).
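One training step, sketched in PyTorch-style Python to fix the order of operations (the repo's `Trainer` would express the same flow in .NET; `model` and the batch layout are assumptions):

```python
import torch
import torch.nn.functional as F

def train_step(model, tokens, targets, optimizer, max_grad_norm=1.0):
    # tokens, targets: (batch, seq_len) int64; targets are tokens shifted by one position.
    logits = model(tokens)                               # forward runs every BitLinear quantized
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()                                      # STE delivers full-precision gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

# Paper-default optimizer (assumed): AdamW, lr=3e-4, weight_decay=0.1
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
```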
UML – Training Loop Sequence
sequenceDiagram
participant DataLoader
participant BitNetTransformer
participant Optimizer
participant STEWrapper
DataLoader->>BitNetTransformer: batchTokens
BitNetTransformer->>BitNetTransformer: Forward (all BitLinear)
BitNetTransformer-->>Loss: logits
Loss->>STEWrapper: backward
STEWrapper-->>Optimizer: full-precision grads
Optimizer->>BitNetTransformer: step + re-quantize
Objectives
Production-ready inference + llama.cpp compatibility.
Detailed Steps
- Implement `InferenceEngine` with KV-cache (per-layer).
- Add greedy / top-p / temperature sampling.
- Implement binary serialization: header (magic, config, vocab) + per-layer Gamma + packed ternary matrix (a packing sketch follows this list).
- Create converter utility from HuggingFace BitNet checkpoints (if published).
- Export to custom `.bitnet` format + optional GGUF patch script.
- Benchmark suite: latency, tokens/sec, memory vs. the fp16 baseline on the same hardware.
- CLI commands: `infer`, `chat`, `benchmark`.
- Add ONNX export stub (Phase 5).
- Integrate Microsoft Agent Framework hooks (reuse your existing host).
- Performance regression test suite (target <2× fp16 latency).
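To ground the ≤ 1.58 bits/parameter target, here is one possible packing scheme, shown only as a sketch of the size math: five ternary values per byte, since 3⁵ = 243 ≤ 256, i.e. 1.6 bits per weight. The on-disk layout must ultimately match whatever the llama.cpp BitNet fork expects.

```python
import numpy as np

_POWERS = np.array([1, 3, 9, 27, 81], dtype=np.int16)  # base-3 digit weights

def pack_ternary(w_q: np.ndarray) -> bytes:
    """Pack a flat int8 array of {-1, 0, +1} into 5 trits per byte (1.6 bits/weight)."""
    trits = (w_q.astype(np.int16) + 1).ravel()           # map {-1,0,1} -> {0,1,2}
    pad = (-len(trits)) % 5
    trits = np.concatenate([trits, np.zeros(pad, dtype=np.int16)])
    return (trits.reshape(-1, 5) @ _POWERS).astype(np.uint8).tobytes()

def unpack_ternary(data: bytes, count: int) -> np.ndarray:
    """Inverse of pack_ternary; count = number of original weights."""
    vals = np.frombuffer(data, dtype=np.uint8).astype(np.int16)
    trits = np.stack([(vals // p) % 3 for p in _POWERS], axis=1).ravel()
    return (trits[:count] - 1).astype(np.int8)
```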
Checklist (must all be green)
- Quantization histogram matches paper Figure 2.
- Per-token activation scaling verified 8-bit.
- No bias parameters anywhere.
- RMSNorm + SwiGLU + RoPE exact.
- STE gradient flow tested numerically.
- Perplexity within 5% of paper baseline on same data.
- Model loads in official bitnet.cpp fork.
- 95%+ unit test coverage.
Testing Strategy
- Unit: BitLinear, RMSNorm, RoPE.
- Integration: full forward/backward on 1-layer model.
- E2E: train 1 epoch on TinyStories, generate 100 tokens.
Component Diagram (High-Level System)
flowchart LR
Core[BitNetSharp.Core] --> Layers[Layers]
Layers --> Training[Training]
Layers --> Inference[Inference]
Training --> Utils["Utils (RoPE, RMSNorm)"]
Additional Sequence: Inference with KV-Cache
sequenceDiagram
participant Engine
participant Layer1..N
Engine->>Layer1..N: token + KVCache
Layer1..N->>Layer1..N: BitLinear QKV + RoPE + Attention
Layer1..N-->>Engine: newKV
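A decoding-loop sketch tying the KV-cache sequence above to the Phase 4 sampling options; `model.forward_step` is a hypothetical single-token step returning next-token logits plus the updated per-layer cache, not the eventual `InferenceEngine` API:

```python
import numpy as np

def top_p_sample(logits: np.ndarray, p: float = 0.9, temperature: float = 1.0) -> int:
    # Softmax with temperature, then sample from the smallest prefix covering mass p.
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(-probs)
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:cutoff]
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

def generate(model, prompt_tokens, max_new_tokens, greedy=False):
    tokens, cache = list(prompt_tokens), None
    for _ in range(max_new_tokens):
        # First call feeds the whole prompt; later calls feed only the newest token.
        step_input = tokens if cache is None else tokens[-1:]
        logits, cache = model.forward_step(step_input, cache)
        tokens.append(int(np.argmax(logits)) if greedy else top_p_sample(logits))
    return tokens
```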
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Numerical instability in STE | Medium | High | Use gradient clipping + FP32 master weights |
| Memory explosion at scale | High | Medium | Start at 4-layer nano; add sparsity later |
| RoPE implementation drift | Low | High | Unit test against known PyTorch reference |
| Serialization incompatibility | Medium | High | Match llama.cpp BitNet format exactly |
| Training divergence | High | High | Use paper hyperparameters + warm-up |
- Week 1: Phase 0 + Phase 1 complete → “BitLinear Done” milestone
- Week 3: Phase 2 complete → “Nano Transformer Skeleton” milestone
- Week 5: Phase 3 complete → “Training Works” milestone
- Week 6: Phase 4 + 5 complete → “Paper-Aligned v1.0” release
Total estimated effort: 35–45 working days (part-time possible at roughly 10–20 hours per week).
- Scale to 700 M / 3 B parameters
- GPU kernels via ComputeSharp or TorchSharp CUDA
- Sparse ternary representation
- Full RedPajama 100 B token training
- ONNX Runtime export
- Quantized fine-tuning support