# Address All Three Issues in One Cohesive Plan
## Core Repository – Strictly Domain-Agnostic
Version: 1.0
Date: March 20, 2026
Status: Ready-to-execute
Dependency note: WikiText-2 validation download and tokenization are being added in PR #27. This plan assumes that PR merges first and then consumes the repository-local artifacts it produces.
- Executive Summary & Success Criteria
- Prerequisites
- Overall Architecture
- Phase 1: Enforce Repository Purity & Architecture Guidelines (1–2 days)
- Phase 2: Implement Real Training Loop (7–10 days)
- Phase 3: Build Enhanced Benchmark Suite with TinyLlama-1.1B (6–8 days)
- Phase 4: Create Improved Report that Surfaces Strengths & Deficiencies (3–4 days)
- Phase 5: CI Integration & Release (2 days)
- Full UML Catalog
- Risk Register & Mitigation
- Timeline & Effort Estimates
## Executive Summary & Success Criteria

This plan replaces the stub training, expands benchmarks to include TinyLlama-1.1B, perplexity, and real-world task comparisons, and redesigns the report to clearly show where BitNet wins on speed and memory and where it still needs quality improvements.

Success criteria:
- Training runs multiple epochs with real data and visibly reduces loss
- Benchmarks measure perplexity, reasoning, code, and efficiency on TinyLlama-1.1B
- Report shows zero-based quality delta and clearly flags deficiencies
- Repository remains 100% domain-agnostic with no vertical code
## Prerequisites

- Existing `BitNetModel`, `BitLinear`, tokenizer, and SpecFlow tests
- BenchmarkDotNet already added to the test project
- WikiText-2 validation set downloaded and pre-tokenized by PR #27
## Overall Architecture

```mermaid
flowchart TD
    A[WikiText-2 Loader] --> B["Real Training Loop (Epochs + STE)"]
    B --> C["BenchmarkDotNet Suite (TinyLlama-1.1B)"]
    C --> D[Perplexity + Zero-Shot + Code + Efficiency]
    D --> E["Improved Report (Strengths vs Deficiencies)"]
```
## Phase 1: Enforce Repository Purity & Architecture Guidelines (1–2 days)

- Commit `docs/repo-alignment-guidelines.md` from the prior discussion.
- Update the root `README.md` with a repository-purity banner and no vertical mentions.
- Add a pull request template that requires a purity checklist.
- Move any stray domain code, if present, to a new companion repository stub.
## Phase 2: Implement Real Training Loop (7–10 days)

Replace the stub in `BitNetModel.cs` with a training API shaped like this:

```csharp
public TrainingReport Train(int epochs, IDataLoader loader)
{
    // AdamW with lr = 3e-4 and weight decay = 0.1
    var optimizer = new AdamWOptimizer(3e-4f, 0.1f);
    var report = new TrainingReport();
    for (int e = 0; e < epochs; e++)
    {
        double totalLoss = 0;
        int count = 0;
        foreach (var batch in loader.GetBatches())
        {
            var logits = Forward(batch.Input);
            var loss = CrossEntropyLoss(logits, batch.Target);

            // Size-weight the loss so the epoch average is per-token.
            totalLoss += loss.Value * batch.Size;
            count += batch.Size;

            // Backpropagate through the ternary quantizer via the
            // straight-through estimator (STE).
            loss.BackwardWithSTE();
            optimizer.Step(Parameters);
            optimizer.ZeroGrad();
        }

        // Refresh the ternary weights from the updated latent weights.
        ReQuantizeAllLayers();
        report.AddEpoch(e, totalLoss / count);
    }
    return report;
}
```

Implement `IDataLoader`, `AdamWOptimizer`, and `CrossEntropyLoss` with STE support.
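The `BackwardWithSTE` call relies on the straight-through estimator: the forward pass quantizes weights to {-1, 0, +1}, while the backward pass treats quantization as identity. A minimal, self-contained sketch of the idea, assuming an illustrative ternary threshold of 0.5 and a clipping range of [-1, 1] (neither value comes from this repository):

```csharp
using System;

static class SteSketch
{
    // Forward pass: ternary quantization with an illustrative dead zone.
    public static float Quantize(float w, float threshold = 0.5f)
        => w > threshold ? 1f : (w < -threshold ? -1f : 0f);

    // Backward pass: the quantizer's true gradient is zero almost everywhere,
    // so the STE substitutes the identity gradient, clipped where |w| > 1.
    public static float GradThrough(float w, float upstreamGrad)
        => Math.Abs(w) <= 1f ? upstreamGrad : 0f;

    static void Main()
    {
        Console.WriteLine(Quantize(0.8f));        // quantizes to 1
        Console.WriteLine(GradThrough(0.3f, 2f)); // gradient passes through: 2
        Console.WriteLine(GradThrough(1.5f, 2f)); // outside clip range: 0
    }
}
```

Without the identity substitution, the quantizer's zero gradient would stop all learning, which is why the latent full-precision weights are updated and only re-quantized afterwards.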
## Phase 3: Build Enhanced Benchmark Suite with TinyLlama-1.1B (6–8 days)

Create `tests/BitNetSharp.Tests/Benchmarks/TinyLlamaBenchmark.cs`:

```csharp
[Config(typeof(BitNetBenchmarkConfig))]
public class TinyLlamaBenchmark
{
    [Benchmark] public void TrainingEpoch() => model.Train(1, wikiLoader);
    [Benchmark] public double PerplexityBitNet() => model.CalculatePerplexity(wikiLoader);
    [Benchmark] public double ARCEasyAccuracy() => model.EvaluateZeroShot(ARC_Easy);
    [Benchmark] public double HumanEvalPass1() => model.EvaluateHumanEval();
}
```

Add a WikiText-2 loader and zero-shot evaluators.
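Perplexity is exp of the mean per-token negative log-likelihood, so `CalculatePerplexity` could be shaped like the sketch below (the NLL values in the example are made up for illustration; a real run would stream them from the WikiText-2 loader):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PerplexitySketch
{
    // Perplexity = exp(mean per-token negative log-likelihood).
    public static double Perplexity(IReadOnlyList<double> tokenNlls)
        => Math.Exp(tokenNlls.Average());

    static void Main()
    {
        // Illustrative NLLs only; real values come from the model's softmax.
        var nlls = new List<double> { 2.8, 3.1, 2.9 };
        Console.WriteLine(Perplexity(nlls)); // ~18.8
    }
}
```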
## Phase 4: Create Improved Report that Surfaces Strengths & Deficiencies (3–4 days)

Update `ReportGenerator.cs` to emit a clear comparison table:

| Category | Metric | BitNet | Traditional | Delta | Interpretation |
|---|---|---|---|---|---|
| Language Modeling | WikiText-2 PPL | 18.4 | 17.1 | -7.6% | Minor quality gap |
| Reasoning | ARC-Easy Accuracy | 61% | 68% | -10.3% | Needs improvement |
| Code Generation | HumanEval Pass@1 | 19% | 25% | -24% | Significant deficiency |
| Efficiency | CPU Tokens/sec | 48 | 13 | +269% | Major win |
| Efficiency | Memory (MB) | 1,150 | 4,600 | 4× smaller | Strong advantage |

Delta is zero-based: 0% means parity, positive means better, and negative means worse.
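The zero-based delta column can be produced by a small helper like the sketch below; the sign flip for lower-is-better metrics such as perplexity is an assumption inferred from the table's -7.6% PPL row:

```csharp
using System;

static class DeltaSketch
{
    // Zero-based delta: 0 = parity, positive = BitNet better, negative = worse.
    // For lower-is-better metrics (e.g. perplexity), the raw ratio is negated
    // so the sign convention stays consistent across the whole table.
    public static double Delta(double bitnet, double baseline, bool lowerIsBetter = false)
    {
        double raw = (bitnet - baseline) / baseline;
        return lowerIsBetter ? -raw : raw;
    }

    static void Main()
    {
        Console.WriteLine(Delta(18.4, 17.1, lowerIsBetter: true)); // ~ -0.076 -> -7.6%
        Console.WriteLine(Delta(48, 13));                          // ~ +2.69 -> +269%
    }
}
```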
## Phase 5: CI Integration & Release (2 days)

- Add a nightly benchmark job in GitHub Actions
- Publish the report to `docs/benchmarks/latest.html`
- Tag a release when perplexity delta and speed targets are met
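A nightly job could be sketched as the following GitHub Actions workflow; the project path, benchmark filter, and .NET version are assumptions about this repository's layout, not confirmed values:

```yaml
# Sketch only: names, paths, and versions are assumptions.
name: nightly-benchmarks
on:
  schedule:
    - cron: "0 3 * * *"   # every night at 03:00 UTC
jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: "8.0.x"
      - name: Run TinyLlama benchmarks
        run: dotnet run -c Release --project tests/BitNetSharp.Tests -- --filter "*TinyLlamaBenchmark*"
```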
## Full UML Catalog

```mermaid
flowchart TD
    A[WikiText-2] --> B[Real Training]
    B --> C[Enhanced Benchmarks]
    C --> D[Improved Report]
    D --> E[Actionable Insights]
```
## Risk Register & Mitigation

| Risk | Likelihood | Mitigation |
|---|---|---|
| Training still stub-like | High | Enforce a minimum of 3 epochs plus a real data loader |
| Report misleading | Medium | Use zero-based delta plus explicit better/worse labels |
| Scope creep | High | Require a purity checklist in every PR |
## Timeline & Effort Estimates

| Phase | Estimate |
|---|---|
| Phase 1: Enforce Repository Purity & Architecture Guidelines | 1–2 days |
| Phase 2: Implement Real Training Loop | 7–10 days |
| Phase 3: Build Enhanced Benchmark Suite with TinyLlama-1.1B | 6–8 days |
| Phase 4: Create Improved Report that Surfaces Strengths & Deficiencies | 3–4 days |
| Phase 5: CI Integration & Release | 2 days |
| Total | 19–26 days |
This plan keeps all work inside the core repository while remaining strictly domain-agnostic, addressing stub training, benchmark quality, and report clarity as one coordinated roadmap.