<picture>
<img alt="TrainCheck logo" width="55%" src="./docs/assets/images/traincheck_logo.png">
</picture>
<h1>TrainCheck: Invariant Checking & Observability for AI Training</h1>

[![Pre-commit Checks](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
[![Correctness Checks](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml)

</div>

**Stop flying blind.** TrainCheck gives you deep visibility into your training runs, continuously checking their correctness where standard metrics fall short. It proactively catches **silent errors**, such as code bugs and faulty hardware, and pinpoints their root cause.

TrainCheck has detected silent errors in real-world scenarios ranging from large-scale LLM pretraining (such as BLOOM-176B) to small tutorial runs by deep learning beginners. 📌 For the full list, see our [Success Stories](./docs/successful-stories.md).

---

### Why TrainCheck?

✅ **Continuous Invariant Checking**
TrainCheck validates **training invariants**, semantic rules that describe expected behavior during training, such as relationships across API calls, tensor states, and training steps. Violations are flagged as they happen, catching silent corruption before it wastes GPU hours.

🚀 **Holistic Observability**
Metrics like loss curves often reveal only *that* a run went wrong, long after the fact. TrainCheck surfaces *why* a run is degrading by analyzing internal state dynamics that those curves miss.

🧠 **Zero-Config Validation**
No manual assertions or curated inputs required. TrainCheck learns invariants automatically from any correct run, including official examples and tutorials, and flags deviations from them.

⚡ **Universal Compatibility**
Drop-in support for existing scripts (such as [pytorch/examples](https://github.com/pytorch/examples) and [transformers](https://github.com/huggingface/transformers/tree/main/examples)), from plain PyTorch and Hugging Face code to large-scale workloads built on DeepSpeed or Megatron.

---

### How It Works

1. **Instrument**: TrainCheck wraps your training script with lightweight tracing probes; only minimal code changes are needed.
2. **Learn**: It analyzes traces from known-correct runs to infer *invariants*, semantic rules that healthy training obeys.
3. **Check**: It monitors new or modified runs, verifying each recorded event against the learned invariants to catch silent errors, such as missing gradient clipping, weight desynchronization, or broken mixed precision, right when they occur.

*(Figure: the TrainCheck workflow, from instrumentation to invariant inference to checking.)*

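The learn-then-check cycle above can be illustrated with a toy sketch. This is **not** TrainCheck's actual API: `infer_invariants`, `check`, and the simple range-based invariant are hypothetical simplifications of what the real inference engine does with trace logs.

```python
# Toy illustration of invariant learning and checking (not TrainCheck's API).
# An "invariant" here is a predicate over per-step trace records that held
# on every step of a known-correct run.

def infer_invariants(correct_trace):
    """Infer simple range invariants from a correct run's per-step records."""
    invariants = {}
    for key in correct_trace[0]:
        values = [step[key] for step in correct_trace]
        invariants[key] = (min(values), max(values))  # observed healthy range
    return invariants

def check(trace, invariants):
    """Return (step_index, key, value) for every out-of-range observation."""
    violations = []
    for i, step in enumerate(trace):
        for key, (lo, hi) in invariants.items():
            if not (lo <= step[key] <= hi):
                violations.append((i, key, step[key]))
    return violations

# A healthy run: loss decreases, gradient norms stay bounded.
correct = [{"loss": 2.3, "grad_norm": 1.1},
           {"loss": 1.8, "grad_norm": 0.9},
           {"loss": 1.4, "grad_norm": 0.8}]

invs = infer_invariants(correct)

# A buggy run: gradients explode at step 1 (e.g., missing gradient clipping).
buggy = [{"loss": 2.2, "grad_norm": 1.0},
         {"loss": 9.7, "grad_norm": 250.0}]

print(check(buggy, invs))  # flags the exploding loss and grad_norm at step 1
```

The real tool infers far richer relations than value ranges, but the shape of the workflow is the same: invariants come from correct traces, and checking is a pass over new events.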
Under the hood, TrainCheck decomposes into three CLI tools:

- **Instrumentor** (`traincheck-collect`)
  Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
- **Inference Engine** (`traincheck-infer`)
  Consumes one or more trace logs from successful runs to infer training invariants.
- **Checker** (`traincheck-check`)
  Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.

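As a mental model of what the Instrumentor does, the sketch below wraps a function so each call is logged while the call itself is delegated unchanged. This is an illustrative toy, not TrainCheck's implementation: `traced`, `TRACE`, and `clip_grad_norm` are made-up names, and the real tool instruments framework APIs automatically.

```python
# Toy illustration of call tracing via wrapping; NOT TrainCheck's actual
# implementation. It shows the instrumentor's key property: an event is
# recorded per API call without changing what the call computes.
import functools

TRACE = []  # stand-in for the trace log a checker/inference step would consume

def traced(fn):
    """Wrap fn so every call appends an event to TRACE, then delegates."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE.append({"api": fn.__name__, "args": args, "kwargs": kwargs})
        return result
    return wrapper

@traced
def clip_grad_norm(grads, max_norm):
    # Hypothetical stand-in for an API an instrumentor might wrap,
    # e.g. torch.nn.utils.clip_grad_norm_.
    total = sum(g * g for g in grads) ** 0.5
    scale = min(1.0, max_norm / total)
    return [g * scale for g in grads]

clipped = clip_grad_norm([3.0, 4.0], max_norm=1.0)  # result unchanged by tracing
print(TRACE[0]["api"])  # the call was recorded: prints "clip_grad_norm"
```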
## 🔥 Try TrainCheck

Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to: