Follow-up: Chain-Bucket Speculative Decoding Extension for 2–3× Additional Speedup Full Paper Alignment + Advanced Inference Optimizations "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (arXiv:2402.17764)
Version: 3.0 (Follow-up to v2.0) Date: March 19, 2026 Status: Ready-to-execute blueprint
Note: This supersedes Implementation Plan v2. The v2 plan is retained for historical reference.
Key addition in v3: Phase 6 – Chain-Bucket Speculative Decoding – a pure inference-time optimization that delivers 2–3× effective speedup on top of BitNet's native gains by turning frequent token chains into single-byte lookups. All previous phases are unchanged.
- Executive Summary & Success Criteria
- Prerequisites & Repository Setup
- Training Datasets Strategy (Unchanged – Reused for Bucket Mining)
- Overall Architecture – High-Level UML
- Phase 0: Documentation & Realignment (1–2 days)
- Phase 1: Exact BitLinear Implementation (3–5 days)
- Phase 2: Tiny Transformer Skeleton (7–10 days)
- Phase 3: Training Loop + Data Pipeline (14–21 days)
- Phase 4: Inference, Serialization & Benchmarks (7–10 days)
- Phase 5: Validation, Testing & Paper Alignment Checklist (5 days)
- Phase 6: Chain-Bucket Speculative Decoding (New – 7–10 days)
- Phase 7: Full Integration, Testing & Benchmarks (5–7 days)
- Full UML Catalog
- Risk Register & Mitigation
- Timeline, Milestones & Effort Estimates
- Future Extensions
(Identical to v2.0, with one new success criterion.)
Goal: Deliver the first canonical pure-C# reference implementation of BitNet b1.58 that matches the paper exactly in architecture, quantization, and training procedure – including reproducible results on the paper's datasets – and achieves measurable inference speedup via chain-bucket speculative decoding.
-
Quantization:
$$\gamma = \frac{1}{n \times m} \sum_{i,j} |W_{ij}|, \quad W_q = \text{RoundClip}!\left(\frac{W}{\gamma + \epsilon}, -1, 1\right)$$ where
$\epsilon = 10^{-6}$ . -
Training: 100B tokens on RedPajama (paper default), STE gradients, AdamW.
-
Activations: Signed int8 per-token scaling to
$[-127, 127]$ . -
Architecture: Decoder-only LLaMA with BitLinear replacements.
- Perplexity within 5% of paper's 700M/3B baselines on WikiText2 + C4.
- Zero-shot scores on ARC-Easy, HellaSwag, etc., matching paper tables.
- Model loads unmodified in llama.cpp BitNet fork.
- Memory ≤ 1.58 bits/parameter verified by histogram.
- New: Chain-bucket acceptance rate ≥ 65% on Tier 2 validation set → measured 2.0–3.0× tokens-per-second uplift vs. single-token greedy decoding.
(Identical to v2.0.)
New folder: src/BitNetSharp.Core/Inference/ChainBuckets/
(All three tiers – TinyStories, SlimPajama/FineWeb-Edu, full RedPajama – are now also used for offline n-gram mining to populate the 256 chain buckets. No new data is required. See v2 plan for full detail.)
Critical reuse note: The existing BitNetDataLoader shards produced in Phase 3 feed directly into the ChainMiner in Phase 6. No separate download or preprocessing step is needed.
(Identical to v2.0. The new ChainBucketTable component plugs into InferenceEngine only; it does not touch the training path.)
classDiagram
direction TB
class BitNetTransformer {
+BitNetConfig Config
+TokenEmbedding Embeddings
+BitNetLayer[] Layers
+BitLinear OutputHead
+forward(tokens: int[batch, seq]) Tensor
}
class BitNetLayer {
+RMSNorm PreAttnNorm
+MultiHeadAttention Attn
+RMSNorm PreFFNNorm
+SwiGLUFeedForward FFN
}
class InferenceEngine {
+ChainBucketTable Buckets
+generate(prompt: int[]) int[]
}
class ChainBucketTable {
+ChainEntry[256] Entries
+lookup(context: int[]) ChainEntry?
+load(path: string) void
}
BitNetTransformer --> BitNetLayer : contains N
BitNetLayer --> BitLinear : 7 instances
InferenceEngine --> ChainBucketTable : uses
InferenceEngine --> BitNetTransformer : verifies chains
(Unchanged from v2.0, with one additional README line.)
Add one line to docs/README.md: "Phase 6 adds chain-bucket speculative decoding for massive inference speedup."
(Unchanged from v2.0 – see v2 plan.)
(Unchanged from v2.0 – see v2 plan.)
(Unchanged from v2.0 – see v2 plan.)
(Unchanged from v2.0. Phase 6 extends the InferenceEngine created here – see v2 plan.)
(Unchanged from v2.0 – see v2 plan.)
Objectives
Implement "one byte = chain of likely tokens" using 256 pre-mined buckets. This turns frequent n-grams into single-byte lookups + parallel verification, delivering 2–3× effective speedup with zero quality loss.
| Tier | Dataset | Duration | Output |
|---|---|---|---|
| 1 | TinyStories | 1 day | Initial 256 buckets for rapid prototyping |
| 2 | SlimPajama / FineWeb-Edu | 2 days | Full 256-bucket table with frequency + conditional probability scores |
| 3 | RedPajama 100B | 3–4 days (cluster) | Production-grade buckets covering chat, code, and long-context patterns |
- Scan entire dataset with sliding window (n = 2 to n = 8).
- Count frequency of each sequence.
- Score each candidate: frequency × average model probability (from a preliminary BitNet run).
- Cluster top candidates into exactly 256 buckets using greedy prefix-tree packing.
- Store:
byte ChainID → TokenID[] chain + float confidence. - Output:
chain-buckets.bin(header + 256 entries, < 50 KB total).
- Define
ChainEntryrecord:byte Id,int[] Tokens,float Confidence. - Implement
ChainBucketTable: load 256 entries fromchain-buckets.bin; exposeLookup(ReadOnlySpan<int> context). - Add serialization/deserialization for
chain-buckets.binwith a fully specified binary format for cross-implementation stability:- Endianness: All multi-byte values are little-endian.
- Header (fixed 12 bytes):
char[4] Magic= ASCII"CHNB"(0x43, 0x48, 0x4E, 0x42).uint16 Version=0x0001(allows future format evolution).uint16 EntryCount=256(must be 256 for v1).uint16 MaxChainLength=8(implementers MUST NOT write longer chains).uint16 Reserved=0(writer MUST set to 0; readers MUST ignore).
- Entries (repeated
EntryCounttimes, in ascendingChainIDorder 0–255):uint8 ChainID(0–255).uint8 Reserved=0(align to 2 bytes; written as 0, ignored on read).uint16 TokenCount(0–MaxChainLength; readers MUST rejectTokenCount > MaxChainLengthas a format error).int32[TokenCount] Tokens— token IDs in the main model tokenizer's ID space.float32 Confidence— IEEE 754 single-precision, little-endian.
- Footer (4 bytes):
uint32 Crc32— CRC32 of all preceding bytes (header + all entries), using polynomial 0xEDB88320. Readers MUST verify and reject mismatches.
- This layout MUST be mirrored exactly in both the .NET implementation and any C/C++ (
llama.cpp) consumer.
- Implement
ChainMinerusing the existingBitNetDataLoadershard streams. - Add sliding-window n-gram extraction (n = 2–8) to
ChainMiner. - Add frequency counter and conditional probability scorer inside
ChainMiner. - Implement greedy prefix-tree packing to compress top candidates into exactly 256 buckets.
- Wire
ChainBucketTableintoInferenceEngine: after each generated token, attempt bucket lookup against the last 1–3 context tokens. - On a hit, use the single-byte
ChainIDto look up and expand the corresponding candidate token chain (noChainIDis emitted as part of the model output). - Run one BitNet forward pass across the full candidate chain for parallel verification.
- Accept tokens sequentially from the chain while
p_model(token_k) ≥ AcceptanceThreshold; stop at the first token whose probability falls below the threshold. - On mismatch or no-hit, fall back to single-token greedy/top-p decoding.
- Update the KV-cache by appending the entire accepted chain in one operation.
- Add
ChainBucketOptionsconfiguration:Enabled,MaxChainLength(default 8),AcceptanceThreshold(default 0.85). - Add
--enable-chainsCLI flag wired toChainBucketOptions. - Log per-step metrics: acceptance rate, average accepted tokens per step, effective speedup multiplier.
- Export
chain-buckets.binalongside the model checkpoint for llama.cpp compatibility. - Validate: Tier 2 (SlimPajama / FineWeb-Edu) acceptance rate ≥ 65%; tokens/sec ≥ 2× baseline.
flowchart TD
A[Last 1-3 Tokens] --> B[Lookup in 256-Bucket Table<br/>1-byte ChainID]
B --> C{Bucket Hit?}
C -->|Yes| D[Expand Candidate Chain<br/>up to 8 tokens]
C -->|No| G[Single-Token Greedy / Top-p]
D --> E[Parallel BitNet Forward Pass<br/>+ Verification]
E --> F{p_model(token_k) ≥ AcceptanceThreshold?}
F -->|Yes| H[Append Entire Chain to KV-Cache<br/>1 Update]
F -->|No| G
H --> I[Continue Generation]
G --> I
sequenceDiagram
participant DatasetShards
participant ChainMiner
participant FrequencyCounter
participant Clusterer
DatasetShards->>ChainMiner: Tier 1/2/3 shard streams
ChainMiner->>FrequencyCounter: Extract n-grams (n=2–8)
FrequencyCounter-->>ChainMiner: Counts + conditional probs
ChainMiner->>Clusterer: Top-scored candidates
Clusterer-->>ChainMiner: 256 packed bucket entries
ChainMiner->>ChainMiner: Save chain-buckets.bin
Goal: Wire the chain-bucket layer into the full application surface, achieve complete test coverage, and verify the 2×+ speedup claim.
Detailed Steps:
- Wire
ChainBucketTableintoInferenceEngineconstructor; make it optional (null = disabled). - Add
--enable-chains [path]CLI flag; default path =chain-buckets.binbeside the model file. - Update
HostedAgentModelFactoryto passChainBucketOptionsthrough toInferenceEngine. - Add
ChainBucketTableserialization round-trip test (write → read → assert identical entries). - Add unit tests for
ChainMiner: frequency counter, prefix-tree packer, 256-bucket constraint. - Add unit tests for
InferenceEnginewith chain acceptance: mock bucket hit → assert chain appended. - Add unit tests for fallback path: mock bucket miss + mismatch → assert single-token greedy.
- Add integration test: end-to-end generation with and without chains; assert identical token outputs.
- Add BenchmarkDotNet benchmark: tokens/sec with chains enabled vs. disabled on Tier 2 validation (target ≥ 2× uplift).
- Add chain-length histogram logging: distribution of accepted chain lengths (1–8) per generation.
- Add E2E tests: chat, code completion, and agentic task prompts with chains enabled.
- Add chain-bucket visualization: ASCII histogram of acceptance rate by chain length in
--visualizemode. - Update
docs/architecture.mdwithChainBucketTablecomponent and integration diagram. - Add chain-bucket speedup metrics to
paper-alignment.md(new checklist item). - Create a conversion helper: generates
chain-buckets.binfrom an existing trained checkpoint + Tier 1 shards. - Release checklist: pre-built nano model file +
chain-buckets.binpackaged together. - Update
docs/benchmarking.mdwith chain-enabled benchmark instructions and expected throughput range.
(All diagrams from v2.0 are retained. The four new diagrams below are specific to v3.0.)
(See Phase 6 above.)
(See Phase 6 above.)
graph LR
InferenceEngine --> ChainBucketTable
InferenceEngine --> BitNetTransformer
ChainBucketTable --> chain-buckets.bin[(chain-buckets.bin)]
BitNetTransformer --> model-checkpoint[(model checkpoint)]
flowchart LR
V[Verification logits for chain token k] --> P{p_model ≥ AcceptanceThreshold?}
P -->|Yes| ACC[Accept token k<br/>advance k]
P -->|No| REJ[Reject token k<br/>fall back to greedy for position k]
ACC --> NXT{k = chain length?}
NXT -->|No| V
NXT -->|Yes| DONE[Full chain accepted]
(All rows from v2.0 are retained. Three new rows are added for Phase 6.)
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | RedPajama download size (1T+ tokens) | High | High | Start with Tier 1 TinyStories; gate Tier 3 on cluster availability |
| 2 | STE gradient instability / loss divergence | Medium | High | Validate on Tier 1 first; monitor ternary ratio and weight histograms |
| 3 | GGUF format drift from bitnet.cpp | Medium | Medium | Pin bitnet.cpp commit; add round-trip test in CI |
| 4 | .NET 10 TorchSharp API changes | Low | Medium | Abstract behind IOptimizer interface; pin package version |
| 5 | Perplexity gap > 5% from paper | Medium | High | Audit quantization constants ( |
| 6 | Tokenizer mismatch | Low | High | Use Microsoft.ML.Tokenizers with exact LLaMA-3 vocab file; test round-trip |
| 7 | Low chain-bucket acceptance rate on creative text | Medium | Medium | Fallback always active; mine Tier 3 for domain diversity |
| 8 | Mining compute time | Low | Low | Start with Tier 1 (~1 day); parallelize with existing DataLoader
|
| 9 | Chain verification overhead negating speedup | Low | Medium | Limit MaxChainLength=8; early-exit on first mismatch; profile on Tier 2 |
| Week | Phase | Milestone | Deliverable |
|---|---|---|---|
| 1 | Phase 0 + Tier 1 data prep | Data Ready | TinyStories shards on disk |
| 3 | Phase 1–2 complete | Nano Model Runs | BitLinear + transformer forward pass |
| 6 | Phase 3 Tier 2 training | Paper-Like Training | 10B-token checkpoint |
| 7–8 | Phase 6 Chain-Bucket | Chain Decoding Live | Mining complete + InferenceEngine integration |
| 9 | Phase 3 Tier 3 + Phase 4–5 + Phase 7 | v3.0 Release | 100B-token model + GGUF + 2–3× speedup verified |
Total estimated effort: 60–75 days (parallelize mining with Phase 4).
New Milestone: "Chain-Bucket v1.0" – 2×+ speedup verified on chat and code workloads.
Scale targets:
- Nano (start): 4 layers, dim=256, heads=8, ~30M params
- Small: 700M params
- Medium: 3B params
- Dynamic bucket updating during fine-tuning (online mining).
- Multi-byte chain IDs (512+ buckets) for broader coverage.
- Integration with Microsoft Agent Framework for tool-call chains.
- GPU-accelerated chain verification kernels (ComputeSharp).
- 2T-token StableLM recipe extension.
- Full evaluation suite (ARC, HellaSwag, WinoGrande, PIQA, StoryCloze).
- ONNX export.
- Sparse ternary packing for further memory reduction.
- Distributed training sharding.