# Implementation Plan v2: Full Paper Alignment + Comprehensive Training Datasets Strategy

Target paper: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764)

Version: 2.0 · Date: March 19, 2026 · Status: Ready-to-execute blueprint
Note: This plan has been superseded by Implementation Plan v3. It is retained here for historical reference.
Key additions in v2 over v1: Dedicated training datasets strategy (Section 3), tiered development approach, download/prep pseudocode, and expanded UML catalog.
- Executive Summary & Success Criteria
- Prerequisites & Repository Setup
- Training Datasets Strategy
- Overall Architecture – High-Level UML
- Phase 0: Documentation & Realignment (1–2 days)
- Phase 1: Exact BitLinear Implementation (3–5 days)
- Phase 2: Tiny Transformer Skeleton (7–10 days)
- Phase 3: Training Loop + Data Pipeline (14–21 days)
- Phase 4: Inference, Serialization & Benchmarks (7–10 days)
- Phase 5: Validation, Testing & Paper Alignment Checklist (5 days)
- Full UML Catalog
- Risk Register & Mitigation
- Timeline, Milestones & Effort Estimates
- Future Extensions
## Executive Summary & Success Criteria

Goal: Deliver the first canonical pure-C# reference implementation of BitNet b1.58 that matches the paper exactly in architecture, quantization, and training procedure – including reproducible results on the paper's datasets.
Paper alignment:

- Quantization: $$\gamma = \frac{1}{n \times m} \sum_{i,j} |W_{ij}|, \quad W_q = \text{RoundClip}\!\left(\frac{W}{\gamma + \epsilon},\ -1,\ 1\right)$$ where $\epsilon = 10^{-6}$.
- Training: 100B tokens on RedPajama (paper default), STE gradients, AdamW.
- Activations: signed int8 per-token scaling to $[-127, 127]$.
- Architecture: decoder-only LLaMA with BitLinear replacements.
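For intuition, here is a hand-computed example of the absmean quantizer on a small weight block (illustrative values, not from the paper):

$$W = \begin{pmatrix} 0.40 & -1.20 \\ 0.05 & 0.90 \end{pmatrix}, \qquad \gamma = \frac{0.40 + 1.20 + 0.05 + 0.90}{4} = 0.6375$$

$$\frac{W}{\gamma + \epsilon} \approx \begin{pmatrix} 0.63 & -1.88 \\ 0.08 & 1.41 \end{pmatrix} \;\xrightarrow{\text{RoundClip}}\; W_q = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}$$

The $-1.88$ entry rounds to $-2$ and is then clipped to $-1$; the near-zero entry becomes $0$, the third state that makes the weights 1.58-bit ($\log_2 3 \approx 1.58$).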
Success criteria:

- Perplexity within 5% of the paper's 700M/3B baselines on WikiText2 + C4.
- Zero-shot scores on ARC-Easy, HellaSwag, etc., matching paper tables.
- Model loads unmodified in llama.cpp BitNet fork.
- Memory ≤ 1.58 bits/parameter verified by histogram.
## Prerequisites & Repository Setup

- .NET 10 SDK (global.json pinned).
- Optional: TorchSharp (Phase 3 only), Microsoft.ML.Tokenizers (for LLaMA-style vocab).
- Datasets (detailed in Section 3).
- New folders: `src/BitNetSharp.Core/Datasets/`, `src/BitNetSharp.Core/DataLoaders/`.
- Archive any remaining bigram files.
## Training Datasets Strategy

This section provides exact, reproducible instructions matching the paper's 100B-token RedPajama training while enabling fast iteration on a laptop.
### Tier 1: TinyStories (~10M tokens, laptop scale)

- Dataset: TinyStories (full dataset, ~2.2M stories, ~10M tokens after tokenization).
- Why: matches the paper's "small-scale validation" spirit; trains in minutes on CPU; ideal for verifying STE and quantization stability.
- Source: Hugging Face – `roneneldan/TinyStories` (or raw JSONL from the together.ai mirror).
- Preprocessing (a pack-and-shard sketch follows this list):
  - Download the JSONL (stories + instructions).
  - Concatenate all text fields.
  - Tokenize with the LLaMA-3 32k vocab (Microsoft.ML.Tokenizers).
  - Pack sequences to exactly 2048 tokens (paper context length).
  - Save as binary `.bin` shards (8 shards total).
- Target tokens: 10M (one full pass = 1 epoch).
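A minimal sketch of the pack-and-shard step, assuming documents arrive already tokenized as `int[]`; `ShardWriter`, the `shard_NNN.bin` naming, and the int32-per-token little-endian layout are illustrative choices, not fixed by this plan:

```csharp
using System.Collections.Generic;
using System.IO;

// Packs a stream of tokenized documents into fixed 2048-token sequences
// and writes them to numbered binary shards (one int32 per token).
static class ShardWriter
{
    const int SeqLen = 2048;
    const int TokensPerShard = 1_250_000; // ~10M tokens / 8 shards

    public static void WriteShards(IEnumerable<int[]> tokenizedDocs, string dir)
    {
        Directory.CreateDirectory(dir);
        var buffer = new List<int>(SeqLen);
        int shard = 0, written = 0;
        BinaryWriter writer = NewShard(dir, shard);

        foreach (var doc in tokenizedDocs)
            foreach (int token in doc)
            {
                buffer.Add(token);
                if (buffer.Count < SeqLen) continue;

                foreach (int t in buffer) writer.Write(t); // flush one packed sequence
                written += SeqLen;
                buffer.Clear();

                if (written >= TokensPerShard) // roll over to the next shard
                {
                    writer.Dispose();
                    writer = NewShard(dir, ++shard);
                    written = 0;
                }
            }
        writer.Dispose(); // any partial sequence left in `buffer` is dropped
    }

    static BinaryWriter NewShard(string dir, int index) =>
        new(File.Create(Path.Combine(dir, $"shard_{index:D3}.bin")));
}
```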
### Tier 2: SlimPajama / FineWeb-Edu (10B tokens)

- Dataset: SlimPajama (subset of RedPajama, 627B tokens total; take a 10B-token slice) OR FineWeb-Edu (high-quality 1.3T-token filtered web data).
- Why: direct subset of the paper's RedPajama family; enables perplexity close to paper baselines.
- Source: Hugging Face – `cerebras/SlimPajama-627B` (filtered to 10B tokens) or `HuggingFaceFW/fineweb-edu` (score ≥ 3.0).
- Preprocessing (identical to Tier 1, plus deduplication):
  - Filter: keep only English, score ≥ 3.0 (FineWeb).
  - Deduplicate (exact + MinHash 5-gram).
  - Tokenize + pack to 2048-token sequences.
  - Create 100 shards of 100M tokens each.
- Target tokens: 10B (matches early paper experiments).
### Tier 3: RedPajama (100B tokens, paper-exact)

- Dataset: RedPajama-v2 (full 30T-token version) OR the original RedPajama-1T.
- Exact paper match: "We pre-trained the models on the RedPajama dataset for 100 billion tokens."
- Source: Together.ai RedPajama (or Hugging Face `togethercomputer/RedPajama-Data-1T`).
- Preprocessing (paper-identical):
  - Multi-source mixture (CommonCrawl, C4, GitHub, Books, Wikipedia, ArXiv, StackExchange).
  - Quality filtering + deduplication as in the original RedPajama pipeline.
  - Tokenize with the same 32k LLaMA tokenizer.
  - Pack to 2048 tokens with attention mask.
  - Store as 1000+ binary shards (100M tokens each).
- Target tokens: 100B (exact paper) or 2T (StableLM-3B recipe extension).
### Tokenizer

- Use the LLaMA-3 32k vocab (exact match to the modern LLaMA baselines used in paper reproductions).
- Fallback: SentencePiece or a tiktoken port.
- Special tokens: `<|endoftext|>`, `<|unk|>`.
### Data Loader

- `BitNetDataLoader`: yields batches of shape `(batch_size, 2048)`; supports streaming from disk shards, shuffling, and packing (a streaming sketch follows this list).
- Validation split: 1% held out (WikiText2 + C4, per the paper's evaluation).
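A minimal streaming-loader sketch matching the shard layout assumed in the Tier 1 sketch above; the class shape is illustrative (no prefetching or epoch handling), with shard-level shuffling as a simplification:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Streams fixed-length training batches from the binary shards produced above
// (int32 little-endian tokens, 2048 per sequence).
sealed class BitNetDataLoader
{
    const int SeqLen = 2048;
    readonly string[] _shards;
    readonly int _batchSize;
    readonly Random _rng = new(42); // fixed seed for reproducible shard order

    public BitNetDataLoader(string dir, int batchSize)
    {
        _shards = Directory.GetFiles(dir, "*.bin");
        _batchSize = batchSize;
    }

    // Yields batches of shape (batch_size, 2048).
    public IEnumerable<int[][]> Batches()
    {
        var batch = new List<int[]>(_batchSize);
        foreach (var shard in _shards.OrderBy(_ => _rng.Next())) // shuffle shard order
        {
            using var reader = new BinaryReader(File.OpenRead(shard));
            long sequences = reader.BaseStream.Length / (SeqLen * sizeof(int));
            for (long s = 0; s < sequences; s++)
            {
                var seq = new int[SeqLen];
                for (int i = 0; i < SeqLen; i++) seq[i] = reader.ReadInt32();
                batch.Add(seq);
                if (batch.Count == _batchSize)
                {
                    yield return batch.ToArray();
                    batch.Clear();
                }
            }
        }
    }
}
```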
### Download & Preparation Scripts

- One PowerShell/Bash script per tier that downloads, tokenizes, and shards.
- Store all prepared data in the `data/` folder (gitignore large files).
## Overall Architecture – High-Level UML

```mermaid
classDiagram
    direction TB
    class BitNetTransformer {
        +BitNetConfig Config
        +TokenEmbedding Embeddings
        +BitNetLayer[] Layers
        +BitLinear OutputHead
        +forward(tokens: int[batch, seq]) Tensor
    }
    class BitNetLayer {
        +RMSNorm PreAttnNorm
        +MultiHeadAttention Attn
        +RMSNorm PreFFNNorm
        +SwiGLUFeedForward FFN
    }
    BitNetTransformer --> BitNetLayer : contains N
    BitNetLayer --> BitLinear : 7 instances
```
## Phase 0: Documentation & Realignment (1–2 days)

Goal: Ensure all documentation, folder layout, and README accurately reflect the paper-aligned BitNet b1.58 path before any new code is written.

Steps:

- Archive `implementation-plan.md` → `implementation-plan-v1.md` (done as part of this update).
- Create this document (`implementation-plan-v2.md`) as the active blueprint.
- Update `docs/SUMMARY.md` to list both the v1 (archived) and v2 (active) plans.
- Update the dataset preparation section of `docs/README.md` with Tier 1/2/3 links.
- Verify `.gitignore` excludes `data/` and large binary shards.
- Create stub folders: `src/BitNetSharp.Core/Datasets/`, `src/BitNetSharp.Core/DataLoaders/`.
- Add `data/` to `.gitignore` if not already present.
- Pin `global.json` to the .NET 10 SDK (a minimal example follows this list).
- Validate that `dotnet build` and `dotnet test` pass cleanly from the repo root.
- Tag `v0.0-phase0-complete` on completion.
- Update `docs/architecture.md` with any component changes since v1.
- Add dataset download links to `docs/README.md`.
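A minimal `global.json` for the SDK pin above; the version string `10.0.100` is an assumption and should be replaced with the actual installed .NET 10 SDK version:

```json
{
  "sdk": {
    "version": "10.0.100",
    "rollForward": "latestPatch"
  }
}
```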
## Phase 1: Exact BitLinear Implementation (3–5 days)

Goal: Implement BitLinear exactly as specified in Section 2 of the paper. (A quantization sketch follows the step list.)

Steps:

- Define the `BitLinear` class in `src/BitNetSharp.Core/`, extending the base linear layer.
- Implement weight quantization: absmean scaling $\gamma = \text{mean}(|W|)$, then `RoundClip`.
- Implement `RoundClip(x, -1, 1)` = `clamp(round(x), -1, 1)` – ternary {-1, 0, +1}.
- Implement per-token activation quantization: signed int8 scaling to $[-127, 127]$.
- Add the $\epsilon = 10^{-6}$ guard in the $\gamma$ denominator.
- Ensure no biases on any `BitLinear` layer (paper requirement).
- Write unit tests asserting the ternary histogram (≥ 99% of weights in {-1, 0, +1}).
- Write unit tests asserting the per-token activation range stays within $[-127, 127]$.
- Verify against the scalar examples in Section 2 of the paper.
- Profile memory: confirm ≤ 1.58 bits/parameter via weight-histogram logging.
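A sketch of the two quantizers from Section 2 of the paper, written over plain `float[]` for clarity; `BitQuant` is a hypothetical helper name, and the real `BitLinear` would operate on tensors while keeping a float "shadow" copy of W for training:

```csharp
using System;

static class BitQuant
{
    const float Eps = 1e-6f;

    // Absmean weight quantization: gamma = mean(|W|), then RoundClip to {-1, 0, +1}.
    public static (sbyte[] Wq, float Gamma) QuantizeWeights(float[] w)
    {
        float gamma = 0f;
        foreach (float v in w) gamma += MathF.Abs(v);
        gamma /= w.Length;

        var wq = new sbyte[w.Length];
        for (int i = 0; i < w.Length; i++)
            wq[i] = (sbyte)Math.Clamp(MathF.Round(w[i] / (gamma + Eps)), -1f, 1f);
        return (wq, gamma);
    }

    // Per-token activation quantization: scale one token's vector by 127/absmax
    // and clip to the signed int8 range [-127, 127].
    public static (sbyte[] Xq, float Scale) QuantizeActivations(float[] x)
    {
        float absMax = Eps;
        foreach (float v in x) absMax = MathF.Max(absMax, MathF.Abs(v));

        float scale = 127f / absMax;
        var xq = new sbyte[x.Length];
        for (int i = 0; i < x.Length; i++)
            xq[i] = (sbyte)Math.Clamp(MathF.Round(x[i] * scale), -127f, 127f);
        return (xq, scale);
    }
}
```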
## Phase 2: Tiny Transformer Skeleton (7–10 days)

Goal: Assemble the full decoder-only LLaMA-style transformer using only BitLinear projections. Nano scale: 4 layers, dim=256, heads=8, vocab=32k, ~30M params.

Steps:

- Implement `RMSNorm` (no bias, learnable scale only).
- Implement `RotaryPositionalEmbedding` (RoPE) for positions up to 2048.
- Implement `MultiHeadAttention` with Q/K/V/O projections as `BitLinear`.
- Implement `SwiGLUFeedForward` with gate/up/down projections as `BitLinear`.
- Implement `BitNetLayer` = `RMSNorm` + attention + `RMSNorm` + FFN (pre-norm).
- Implement `TokenEmbedding` (standard float32, not quantized).
- Implement `BitNetTransformer` stacking N `BitNetLayer`s plus the output head.
- Add a `BitNetConfig` record: layers, dim, heads, ffn_mult, vocab_size, max_seq_len (a sketch follows this list).
- Write shape-assertion unit tests for each sub-component.
- Test RoPE against reference vectors from the LLaMA paper.
- Run a forward pass with random weights and assert output shape = `(batch, seq, vocab)`.
- Log the ternary weight histogram after initialization.
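A sketch of the `BitNetConfig` record from the step list above; the field names come from this plan, the `Nano` preset from the Phase 2 scale target, and the SwiGLU `FfnMult` default (8/3 ≈ 2.67) is an assumption borrowed from common LLaMA-style configs:

```csharp
// Configuration record for the transformer skeleton.
public sealed record BitNetConfig(
    int Layers,
    int Dim,
    int Heads,
    float FfnMult,     // FFN hidden size = Dim * FfnMult (SwiGLU)
    int VocabSize,
    int MaxSeqLen)
{
    // Nano scale from Phase 2: 4 layers, dim=256, heads=8, 32k vocab, 2048 context.
    public static BitNetConfig Nano => new(
        Layers: 4, Dim: 256, Heads: 8, FfnMult: 2.667f,
        VocabSize: 32_000, MaxSeqLen: 2048);

    public int HeadDim => Dim / Heads;
}
```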
## Phase 3: Training Loop + Data Pipeline (14–21 days)

Goal: End-to-end training from Tier 1 (TinyStories) through Tier 3 (RedPajama, 100B tokens), with full STE gradient support.

Detailed steps (18 sub-steps):

- Implement `BitNetDataLoader` for Tier 1: stream JSONL, tokenize, pack to 2048 tokens, emit batches.
- Implement the shard-based disk format (binary `.bin`): write/read helpers.
- Implement the Tier 2 data loader with a MinHash deduplication filter.
- Implement the Tier 3 multi-source shard loader with weighted mixing.
- Add an STE wrapper in the `BitLinear` backward pass: pass gradients through the quantizer unchanged.
- Implement the AdamW optimizer: lr=3e-4, warm-up 2000 steps, cosine decay, weight decay=0.1.
- Implement gradient clipping (norm=1.0).
- Implement cross-entropy loss on next-token prediction.
- Implement periodic re-quantization every 100 steps (refresh ternary weights from the float shadow copy).
- Add the training loop: forward → loss → STE backward → AdamW step → re-quantize (a sketch follows this list).
- Add perplexity logging (moving average over 1000 steps).
- Add a ternary-ratio metric: fraction of weights exactly in {-1, 0, +1}.
- Add weight-histogram logging every 500 steps.
- Implement checkpointing: save $\gamma$ values + ternary weights every 1B tokens.
- Tier 1 first: run full TinyStories training; verify non-divergence and falling perplexity.
- Tier 2: scale to 10B SlimPajama/FineWeb-Edu tokens; confirm perplexity approaches paper baselines.
- Tier 3: full 100B-token RedPajama run on a cluster.
- Validate on WikiText2 + C4 after every 10B tokens.
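A minimal sketch of the STE pass-through, AdamW update, and 100-step re-quantization described above, for a single `BitLinear` weight matrix. It reuses the hypothetical `BitQuant` helper from the Phase 1 sketch; `SteBitLinearTrainer`, the beta values (0.9/0.95), and the flat `float[]` layout are illustrative assumptions, and the warm-up/cosine schedule is omitted:

```csharp
using System;

sealed class SteBitLinearTrainer
{
    readonly float[] _shadowW;   // full-precision master copy, updated by AdamW
    readonly float[] _m, _v;     // AdamW first/second moments
    public sbyte[] TernaryW;     // quantized view used in the forward pass
    public float Gamma;          // absmean scale paired with TernaryW
    int _step;

    public SteBitLinearTrainer(float[] init)
    {
        _shadowW = (float[])init.Clone();
        _m = new float[init.Length];
        _v = new float[init.Length];
        (TernaryW, Gamma) = BitQuant.QuantizeWeights(_shadowW);
    }

    // STE: treat dL/dW as equal to dL/dWq (the quantizer's Jacobian is
    // replaced by identity), then take a standard AdamW step on the shadow.
    public void Step(float[] gradWq, float lr = 3e-4f,
                     float beta1 = 0.9f, float beta2 = 0.95f,
                     float eps = 1e-8f, float weightDecay = 0.1f)
    {
        _step++;
        for (int i = 0; i < _shadowW.Length; i++)
        {
            float g = gradWq[i]; // straight-through: gradient passes unchanged
            _m[i] = beta1 * _m[i] + (1 - beta1) * g;
            _v[i] = beta2 * _v[i] + (1 - beta2) * g * g;
            float mHat = _m[i] / (1 - MathF.Pow(beta1, _step));
            float vHat = _v[i] / (1 - MathF.Pow(beta2, _step));
            // Decoupled weight decay acts on the float shadow weights.
            _shadowW[i] -= lr * (mHat / (MathF.Sqrt(vHat) + eps)
                                 + weightDecay * _shadowW[i]);
        }
        if (_step % 100 == 0) // periodic re-quantization from the shadow copy
            (TernaryW, Gamma) = BitQuant.QuantizeWeights(_shadowW);
    }
}
```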
Training loop:

```mermaid
flowchart TD
    A[Dataset Shards<br/>TinyStories / RedPajama] --> B[BitNetDataLoader<br/>Pack 2048 tokens]
    B --> C[BitNetTransformer<br/>Forward quantized]
    C --> D[CrossEntropyLoss]
    D --> E[STE Backward]
    E --> F[AdamW Step + Re-quantize]
    F --> G[Log Perplexity & Histogram]
    G --> B
```
## Phase 4: Inference, Serialization & Benchmarks (7–10 days)

Goal: Efficient inference path, GGUF export, and BenchmarkDotNet measurement.

Steps:

- Implement greedy and top-p/top-k sampling.
- Implement a KV-cache for auto-regressive decoding.
- Implement a fused ternary multiply-accumulate kernel (SIMD via `System.Numerics`; a sketch follows this list).
- Implement GGUF serialization compatible with the llama.cpp BitNet fork.
- Write a round-trip test: serialize → load → assert identical logits on a sample input.
- Add BenchmarkDotNet benchmarks for tokens/second, memory footprint, and perplexity.
- Compare against the `traditional-local` baseline using the existing `--compare-model` flag.
- Measure and report bits/parameter from the saved checkpoint histogram.
- Validate that the GGUF file loads in the official `bitnet.cpp` without modification.
- Publish the benchmark report via the existing `benchmark-report.yml` CI workflow.
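A sketch of the ternary multiply-accumulate idea, assuming row-major `sbyte` ternary weights; `TernaryKernel` and the placement of the `gamma` dequantization are illustrative choices, and a production kernel would pack the weights more tightly than one byte per trit:

```csharp
using System.Numerics;

// Because Wq ∈ {-1, 0, +1}, each dot product reduces to "add x where w = +1,
// subtract x where w = -1" – no real multiplies. The SIMD path below simply
// multiplies by the {-1, 0, +1} lanes via System.Numerics.Vector<float>;
// the scalar tail shows the branch-based add/subtract form.
static class TernaryKernel
{
    // y[r] = gamma * sum_c Wq[r, c] * x[c], with Wq stored row-major as sbyte.
    public static void MatVec(sbyte[] wq, float[] x, float[] y,
                              int rows, int cols, float gamma)
    {
        int width = Vector<float>.Count;
        var lane = new float[width];

        for (int r = 0; r < rows; r++)
        {
            int rowOff = r * cols;
            var vAcc = Vector<float>.Zero;
            int c = 0;
            for (; c + width <= cols; c += width)
            {
                for (int k = 0; k < width; k++) lane[k] = wq[rowOff + c + k];
                vAcc += new Vector<float>(lane) * new Vector<float>(x, c);
            }
            float acc = Vector.Sum(vAcc);
            for (; c < cols; c++) // scalar tail: pure add/subtract
            {
                sbyte w = wq[rowOff + c];
                if (w == 1) acc += x[c];
                else if (w == -1) acc -= x[c];
            }
            y[r] = gamma * acc; // dequantize once per output element
        }
    }
}
```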
## Phase 5: Validation, Testing & Paper Alignment Checklist (5 days)

Goal: Confirm every paper claim is met before tagging a v1.0 release.

Checklist (must all pass):

- Quantization histogram matches paper Figure 2 (ternary distribution).
- 100B-token RedPajama perplexity within 5% of paper baselines.
- Zero-shot scores match paper Table 3 (ARC, HellaSwag, WinoGrande, PIQA, StoryCloze).
- Model file loads in the official `bitnet.cpp` without modification.
- All existing `dotnet test` tests continue to pass.
- No biases present in any `BitLinear` layer (verified by parameter inspection).
- RMSNorm, SwiGLU, and RoPE unit tests pass with paper-reference values.
- Memory ≤ 1.58 bits/parameter confirmed by checkpoint histogram.
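A minimal xUnit sketch of the ternary-histogram check required in Phase 1 and revisited here; it assumes the hypothetical `BitQuant` helper from the Phase 1 sketch, and the ≥ 99% threshold comes from that phase's test requirement:

```csharp
using System;
using System.Linq;
using Xunit;

public class TernaryHistogramTests
{
    [Fact]
    public void QuantizedWeights_AreTernary()
    {
        // Gaussian-ish random weights via Box–Muller, scaled like a typical init.
        var rng = new Random(1234);
        var w = new float[65536];
        for (int i = 0; i < w.Length; i += 2)
        {
            double u1 = 1.0 - rng.NextDouble(), u2 = rng.NextDouble();
            double r = Math.Sqrt(-2.0 * Math.Log(u1));
            w[i] = (float)(r * Math.Cos(2 * Math.PI * u2)) * 0.02f;
            w[i + 1] = (float)(r * Math.Sin(2 * Math.PI * u2)) * 0.02f;
        }

        var (wq, gamma) = BitQuant.QuantizeWeights(w);

        Assert.True(gamma > 0);
        // Every quantized weight must be exactly -1, 0, or +1 (Phase 1: ≥ 99%).
        double ternaryRatio = wq.Count(v => v is -1 or 0 or 1) / (double)wq.Length;
        Assert.True(ternaryRatio >= 0.99);
    }
}
```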
## Full UML Catalog

BitLinear weight quantization (absmean → RoundClip):

```mermaid
flowchart LR
    W[Float Weights W] --> ABS[Element-wise absolute value]
    ABS --> MEAN[Mean → γ]
    MEAN --> SCALE[W · 1/γ]
    SCALE --> ROUND[Round to nearest integer]
    ROUND --> CLIP[Clip to -1, 0, +1]
    CLIP --> Wq[Ternary Weights Wq]
```

STE gradient flow:

```mermaid
flowchart LR
    G_out[Upstream gradient ∂L/∂Wq] --> STE{STE: pass-through}
    STE --> G_in[∂L/∂W unchanged]
    G_in --> OPT[AdamW update on float W]
    OPT --> RQ[Re-quantize W → Wq every 100 steps]
```

Per-token activation quantization:

```mermaid
flowchart LR
    X[Input activation x] --> ABSMAX[Per-token absmax → η]
    ABSMAX --> SCALE2[x · 127/η]
    SCALE2 --> CLIP2[Clip to -127, 127]
    CLIP2 --> Xq[Signed int8 activation Xq]
```

Dataset pipeline:

```mermaid
flowchart TD
    SRC[Raw dataset JSONL/Parquet] --> FILTER[Language + quality filter]
    FILTER --> DEDUP[Deduplication MinHash]
    DEDUP --> TOK[Tokenize LLaMA-3 32k vocab]
    TOK --> PACK[Pack to 2048-token sequences]
    PACK --> SHARD[Write binary shards .bin]
    SHARD --> LOADER[BitNetDataLoader streaming]
    LOADER --> BATCH[Yield batch_size × 2048 batches]
```
## Risk Register & Mitigation

| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | RedPajama download size (1T+ tokens) | High | High | Start with Tier 1 TinyStories; gate Tier 3 on cluster availability |
| 2 | STE gradient instability / loss divergence | Medium | High | Validate on Tier 1 first; monitor ternary ratio and weight histograms |
| 3 | GGUF format drift from bitnet.cpp | Medium | Medium | Pin bitnet.cpp commit; add round-trip test in CI |
| 4 | .NET 10 TorchSharp API changes | Low | Medium | Abstract behind IOptimizer interface; pin package version |
| 5 | Perplexity gap > 5% from paper | Medium | High | Audit quantization constants ($\gamma$, $\epsilon$) against the paper |
| 6 | Tokenizer mismatch | Low | High | Use Microsoft.ML.Tokenizers with exact LLaMA-3 vocab file; test round-trip |
## Timeline, Milestones & Effort Estimates

| Week | Phase | Milestone | Deliverable |
|---|---|---|---|
| 1 | Phase 0 + Tier 1 data prep | Data Ready | TinyStories shards on disk |
| 3 | Phase 1–2 complete | Nano Model Runs | BitLinear + transformer forward pass |
| 6 | Phase 3 Tier 2 training | Paper-Like Training | 10B-token checkpoint |
| 9 | Phase 3 Tier 3 + Phase 4–5 | v1.0 Release | 100B-token model + GGUF |
Total estimated effort: 45–60 days (parallelize dataset prep with code development).
Scale targets:
- Nano (start): 4 layers, dim=256, heads=8, ~30M params
- Small: 700M params
- Medium: 3B params
## Future Extensions

- 2T-token StableLM recipe extension.
- GPU kernels via ComputeSharp.
- Full evaluation suite (ARC, HellaSwag, WinoGrande, PIQA, StoryCloze).
- ONNX export.
- Sparse ternary packing for further memory reduction.
- Distributed training sharding.