Snapshot of project completion status. Updated 2026-04-17 — P2
soft-expired-alive counter landed; P3 script hygiene fully landed;
--shard-prefix flag + TelemetryPruneService reconciled out of the
further-horizon list; §3 table reconciled with Further-horizon footer
(Real-ASR + Multi-turn rows closed).
This is the "where we are right now" doc: what works end-to-end, what
is partially wired, and what is still on the backlog. It cross-references
the detailed plans in distributed-training.md,
training-corpus-scope.md, and
full-implementation-plan-real-training-benchmarks-purity-v1.0.md.
Status legend
✅ Done — shipped, exercised in production/dev, no known gaps.
Landed in commit 9b604b7; fallback TargetTaskDurationSeconds*2 covers first-task case
| Seed-size feedback loop | ✅ | seed-real-tasks auto reads fleet-wide gradient_events tps and sizes tokensPerTask to fit TargetTaskDurationSeconds; falls back to 16,384 when the telemetry table is empty |
| "Soft-expired but alive" UI counter | ✅ | Dashboard card joins tasks.deadline_at < now with fresh worker heartbeat |
| "Stuck (dead)" UI counter | ✅ | Dashboard card: Assigned + deadline past + worker missing/heartbeat stale |
Dedicated MeasuredTokensPerSecond field on GradientSubmitRequest; persisted in gradient_events.measured_tps
7. Fleet nodes
| Node | Role | Status |
| --- | --- | --- |
| PAYTON-DESKTOP | Coordinator (Windows Service) | ✅ online |
| LEGION2 | Worker (Docker) | ✅ running real training |
| DESKTOP | Worker (Docker) | ✅ running real training |
8. Source control & CI
| Item | Status | Notes |
| --- | --- | --- |
| Azure DevOps (azure) as primary | ✅ | All pushes target azure |
| GitHub (origin) downstream mirror | ✅ | On-demand sync after azure |
| Azure DevOps pipelines doc | ✅ | docs/azure-devops-pipelines.md |
| GHCR worker image publish | ✅ | ghcr-push-worker.ps1 |
| Signed-commit / PR-only bypass | 🟡 | GitHub reports bypass warnings; intentional for mirror |
9. Documentation
| Doc | Status |
| --- | --- |
| README / SUMMARY | ✅ |
| Architecture | ✅ |
| Bucketing guide + impl plan | ✅ |
| DataGen guide | ✅ |
| Implementation plan v3 | ✅ |
| Full implementation plan (real training + benchmarks + purity) | ✅ |
| Real training implementation plan v1.0 | ✅ |
| Benchmarking | ✅ |
| Azure DevOps pipelines | ✅ |
| Releases and packaging | ✅ |
| Repo alignment guidelines | ✅ |
| Usage | ✅ |
| Training and visualization | ✅ |
| Distributed training | ✅ |
| Training corpus scope (new) | ✅ |
| State of completion (this doc) | ✅ |
10. Remaining work — ordered by priority
P0 — Real-throughput deadline calibration
Status: ✅ landed in 9b604b7. Commit 3c06953 adds the companion
KLocalSteps: 1 default and the purge-pending CLI, so mis-seeded
40-min K=4 tasks don't outrun the lease before the first telemetry
lands.
Files: SqliteTelemetryStore.GetMeasuredTokensPerSecond,
SqliteWorkQueueStore.TryClaimNextPending(Func<long, TimeSpan>),
ClaimNextTaskCommandHandler.LeaseFor, SeedRealTasksCommandLine.
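This doc names ClaimNextTaskCommandHandler.LeaseFor and a Func<long, TimeSpan> callback but does not show the lease formula. The Python sketch below is one plausible reading, not the project's C# code: it scales the lease with task size over measured throughput and uses the TargetTaskDurationSeconds*2 fallback when no telemetry exists yet. The `headroom` multiplier and the 600-second target are hypothetical values chosen for illustration.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical value; this doc does not state the configured target.
TARGET_TASK_DURATION_SECONDS = 600


def lease_for(tokens: int,
              measured_tps: Optional[float],
              target_seconds: int = TARGET_TASK_DURATION_SECONDS,
              headroom: float = 1.5) -> timedelta:
    """Return a lease sized to the throughput a worker has actually shown.

    `headroom` is an assumed safety multiplier, not a documented setting.
    """
    if not measured_tps or measured_tps <= 0:
        # First-task case: no telemetry yet, so fall back to twice the target.
        return timedelta(seconds=target_seconds * 2)
    # Expected run time at the measured rate, padded by the headroom factor.
    return timedelta(seconds=headroom * tokens / measured_tps)
```

Under these assumptions a worker with no recorded throughput gets a 20-minute lease, while a worker measured at 10 tokens/s on a 2,000-token task gets 5 minutes.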
P1 — Seed-size feedback loop
Status: ✅ shipped. seed-real-tasks auto reads the fleet-wide
measured tps from gradient_events (30-min window) and picks
tokensPerTask = tps × TargetTaskDurationSeconds, rounded to a multiple
of 512. It falls back to 16,384 when no recent events exist, so a fresh
coordinator DB still seeds sanely.
Files: SqliteTelemetryStore.GetGlobalMeasuredTokensPerSecond,
SeedRealTasksCommandLine auto branch.
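The sizing rule above can be sketched in a few lines of Python. This is an illustration, not the project's C# code. The 512 multiple and the 16,384 fallback come from this doc; rounding to the nearest multiple (rather than up or down) and the floor of one multiple are assumptions.

```python
from typing import Optional

FALLBACK_TOKENS_PER_TASK = 16_384  # used when no recent gradient_events exist
ROUND_TO = 512                     # task sizes are kept to multiples of 512


def tokens_per_task(measured_tps: Optional[float], target_seconds: int) -> int:
    """Pick a task size that should take about target_seconds at measured_tps."""
    if not measured_tps or measured_tps <= 0:
        # Fresh coordinator DB: no telemetry to calibrate against.
        return FALLBACK_TOKENS_PER_TASK
    raw = measured_tps * target_seconds
    # Assumption: round to the nearest multiple, never below one multiple.
    multiples = max(1, round(raw / ROUND_TO))
    return multiples * ROUND_TO
```

For example, `tokens_per_task(10.0, 600)` targets 6,000 tokens and lands on 6,144, the nearest multiple of 512.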
P2 — "Soft-expired but alive" UI counter
Status: ✅ shipped. SqliteWorkQueueStore.CountSoftExpiredButAlive
joins tasks with workers: Assigned rows whose deadline_at < now but
whose owning worker's last_heartbeat >= now - StaleWorkerThresholdSeconds.
GetDashboardSnapshotQueryHandler now takes IOptionsMonitor<CoordinatorOptions>
and passes StaleWorkerThresholdSeconds through as the heartbeat window.
TaskCounts gets a new SoftExpiredButAlive field; DashboardPage.razor
renders a new card (warn-tinted when > 0) between Assigned and Done so
the operator can distinguish a stuck worker from a slow-but-alive one
at a glance.
Files: SqliteWorkQueueStore.CountSoftExpiredButAlive,
GetDashboardSnapshotQuery.cs (ctor + TaskCounts), DashboardPage.razor.
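The join described above, together with the complementary "Stuck (dead)" counter, can be sketched against an in-memory SQLite database. The table names, the Assigned state, deadline_at, and last_heartbeat come from this doc. The id, state, and worker_id column names and the use of integer Unix seconds are assumptions, and the SQL is illustrative rather than the store's actual query.

```python
import sqlite3

# Assumed minimal schema; the real tables carry more columns.
SCHEMA = """
CREATE TABLE workers (id TEXT PRIMARY KEY, last_heartbeat INTEGER);
CREATE TABLE tasks (id TEXT PRIMARY KEY, state TEXT,
                    worker_id TEXT, deadline_at INTEGER);
"""

# Past the deadline, but the owning worker has a fresh heartbeat.
SOFT_EXPIRED_ALIVE = """
SELECT COUNT(*) FROM tasks t
JOIN workers w ON w.id = t.worker_id
WHERE t.state = 'Assigned' AND t.deadline_at < :now
  AND w.last_heartbeat >= :now - :stale
"""

# Past the deadline, and the worker row is missing or its heartbeat is stale.
STUCK_DEAD = """
SELECT COUNT(*) FROM tasks t
LEFT JOIN workers w ON w.id = t.worker_id
WHERE t.state = 'Assigned' AND t.deadline_at < :now
  AND (w.id IS NULL OR w.last_heartbeat < :now - :stale)
"""


def count(conn, sql, now, stale_seconds):
    return conn.execute(sql, {"now": now, "stale": stale_seconds}).fetchone()[0]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    now = 10_000
    conn.executemany("INSERT INTO workers VALUES (?, ?)",
                     [("slow", now - 10), ("dead", now - 900)])
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?)", [
        ("t1", "Assigned", "slow", now - 60),  # late, worker alive
        ("t2", "Assigned", "dead", now - 60),  # late, heartbeat stale
        ("t3", "Assigned", "gone", now - 60),  # late, worker row missing
        ("t4", "Assigned", "slow", now + 60),  # still inside its lease
    ])
    return (count(conn, SOFT_EXPIRED_ALIVE, now, 120),
            count(conn, STUCK_DEAD, now, 120))
```

With a 120-second stale threshold, `demo()` counts t1 as soft-expired but alive and t2 and t3 as stuck. The two predicates are mutually exclusive, so a late Assigned task lands on exactly one card.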
P3 — Script hygiene
Status: ✅ shipped. The surviving one-off helpers were promoted out
of .claude/scripts/tmp-*.ps1 into scripts/ under names that drop the
tmp- prefix: deploy-coord.ps1, dump-events.ps1, purge-and-reseed.ps1,
check-telemetry.ps1, purge-telemetry.ps1, set-coord-env.ps1,
purge-v1-shards.ps1. Caller references in
scripts/Generate-TruckMateCorpusV2.ps1 and this doc follow the new
paths.
Acceptance: no tmp-* probes in .claude/scripts/; git status
clean. ✅
P4 — Legacy task-seed-* rows
Status: ✅ tooling shipped — product picks which knob to pull.
Options now available:
purge-legacy-seed-rows CLI: dry-run by default; --yes hard-deletes all task-seed-* Done rows. Irreversible.
mark-legacy CLI: tags matching rows with legacy=1. Dashboard progress counter filters them out via CountByState(.., excludeLegacy: true); rows survive for audit. mark-legacy --unmark recovers. Reversible.
Acceptance: dashboard progress bar tracks real-corpus training
signal once operator runs either CLI.
Files: SqliteWorkQueueStore.CountByTaskIdPrefixAndState,
DeleteByTaskIdPrefixAndState, MarkLegacyByTaskIdPrefix,
UnmarkLegacyByTaskIdPrefix, CountByState(state, excludeLegacy);
Program.cs (purge-legacy-seed-rows, mark-legacy subcommands).
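To make the trade-off between the two CLIs concrete, here is a minimal Python sketch of both behaviors over an in-memory SQLite table. It illustrates the semantics described above and is not the project's C# store. The id and state column names, the `make_demo_db` helper, and the function names are hypothetical; the task-seed- prefix, the Done state, and the legacy flag come from this doc.

```python
import sqlite3


def make_demo_db():
    """Build a tiny tasks table (assumed shape) with two legacy seed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, state TEXT, "
                 "legacy INTEGER NOT NULL DEFAULT 0)")
    conn.executemany("INSERT INTO tasks (id, state) VALUES (?, ?)", [
        ("task-seed-1", "Done"), ("task-seed-2", "Done"),
        ("real-1", "Done"), ("real-2", "Pending"),
    ])
    return conn


def purge_legacy_seed_rows(conn, prefix="task-seed-", yes=False):
    """Dry-run by default: report the match count, delete only when yes=True."""
    where = "id LIKE ? AND state = 'Done'"
    pattern = prefix + "%"
    matched = conn.execute(
        f"SELECT COUNT(*) FROM tasks WHERE {where}", (pattern,)).fetchone()[0]
    if yes:
        conn.execute(f"DELETE FROM tasks WHERE {where}", (pattern,))
    return matched


def mark_legacy(conn, prefix="task-seed-", unmark=False):
    """Reversible alternative: flag rows instead of deleting them."""
    cur = conn.execute("UPDATE tasks SET legacy = ? WHERE id LIKE ?",
                       (0 if unmark else 1, prefix + "%"))
    return cur.rowcount


def count_by_state(conn, state, exclude_legacy=False):
    """Progress counter; optionally hides rows flagged as legacy."""
    sql = "SELECT COUNT(*) FROM tasks WHERE state = ?"
    if exclude_legacy:
        sql += " AND legacy = 0"
    return conn.execute(sql, (state,)).fetchone()[0]
```

Marking keeps the rows available for audit and only changes what the filtered counter sees, whereas the purge with `yes=True` removes them for good.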
Further horizon (not sequenced)
Corpus-v2 scale (200K+ examples) to unlock truckmate-large. ✅ Done 2026-04-17.
seed-real-tasks --shard-prefix truckmate-v2 flag to target v2 shards in seeding. ✅ Landed in commit 12da06f; scripts/Generate-TruckMateCorpusV2.ps1 wires it for auto-seed.
Automated nightly telemetry prune (D-5 in the full impl plan). ✅ TelemetryPruneService runs hourly, deletes gradient_events + worker_logs older than TelemetryRetentionDays / LogRetentionDays.
Real-ASR-trace ingestion with PII scrubbing. Superseded 2026-04-17 by SonnetAsrCorpusGenerator (commit 07f39f1): Sonnet synthesizes ASR-noisy [USER]/[INTENT] lines into asr-v1- shards via generate-asr-corpus CLI. No real audio or PII to scrub.
Multi-turn corpus generator. ✅ Landed 2026-04-17 in commit a7453b9 (Option Z: reuse [USER]/[INTENT] as turn separators; MultiTurnCorpusGenerator + generate-multiturn-corpus CLI; vocab-pin 5174 preserved).
Per-worker GPU support. Dropped 2026-04-17 — BitNet ternary hot path is already sbyte × sbyte → int32 CPU SIMD; no off-the-shelf ternary CUDA kernel and roundtrip cost > win.
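The retention rule in the TelemetryPruneService item above amounts to two age-based deletes. The Python sketch below shows that rule under stated assumptions and is not the service's C# code. The table names gradient_events and worker_logs and the two retention settings come from this doc; the created_at column and its integer Unix-seconds encoding are hypothetical.

```python
import sqlite3

SECONDS_PER_DAY = 86_400


def prune_telemetry(conn, now, telemetry_retention_days, log_retention_days):
    """Delete rows older than each table's retention window.

    Returns (gradient_events_deleted, worker_logs_deleted).
    Assumes both tables have an integer created_at column (hypothetical).
    """
    events = conn.execute(
        "DELETE FROM gradient_events WHERE created_at < ?",
        (now - telemetry_retention_days * SECONDS_PER_DAY,)).rowcount
    logs = conn.execute(
        "DELETE FROM worker_logs WHERE created_at < ?",
        (now - log_retention_days * SECONDS_PER_DAY,)).rowcount
    return events, logs


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gradient_events (id INTEGER, created_at INTEGER)")
    conn.execute("CREATE TABLE worker_logs (id INTEGER, created_at INTEGER)")
    now = 100 * SECONDS_PER_DAY
    # Rows aged 1, 10, and 40 days in each table.
    ages = [(i, now - d * SECONDS_PER_DAY) for i, d in enumerate((1, 10, 40))]
    conn.executemany("INSERT INTO gradient_events VALUES (?, ?)", ages)
    conn.executemany("INSERT INTO worker_logs VALUES (?, ?)", ages)
    return prune_telemetry(conn, now, telemetry_retention_days=30,
                           log_retention_days=7)
```

In the demo, a 30-day telemetry window removes only the 40-day-old event, while a 7-day log window removes both the 10-day and 40-day log rows.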