Commit 5c688c2

updated homepage and readme
1 parent 43b7984 commit 5c688c2

6 files changed

Lines changed: 190 additions & 123 deletions

.github/workflows/deploy-docs.yml

Lines changed: 1 addition & 1 deletion
@@ -20,6 +20,6 @@ jobs:
           python-version: 3.x
       - name: Install dependencies
         run: |
-          pip install mkdocs
+          pip install mkdocs-material
       - name: Deploy docs
         run: mkdocs gh-deploy --force

README.md

Lines changed: 17 additions & 22 deletions
@@ -2,7 +2,7 @@
 <picture>
 <img alt="TrainCheck logo" width="55%" src="./docs/assets/images/traincheck_logo.png">
 </picture>
-<h1>TrainCheck: Training with Confidence</h1>
+<h1>TrainCheck: Invariant Checking & Observability for AI Training</h1>
 
 [![format and types](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
 [![format and types](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml)
@@ -12,39 +12,34 @@
 </div>
 
 
-**TrainCheck** is a lightweight tool for proactively catching **silent errors** in deep learning training runs. It detects correctness issues, such as code bugs and faulty hardware, early and pinpoints their root cause.
+**Stop flying blind.** TrainCheck gives you deep visibility into your training dynamics, continuously validating correctness and stability where standard metrics fail.
 
-TrainCheck has detected silent errors in a wide range of real-world training scenarios, from large-scale LLM pretraining (such as BLOOM-176B) to small-scale tutorial runs by deep learning beginners.
+---
 
-📌 For a list of successful cases, see our [Success Stories](./docs/successful-stories.md).
+### Why TrainCheck?
 
-## What It Does
+**Continuous Invariant Checking**
+TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants—such as gradient norms, tensor shapes, and update magnitudes—effectively catching silent corruption before it wastes GPU hours.
 
-TrainCheck uses **training invariants**, which are semantic rules that describe expected behavior during training, to detect bugs as they happen. These invariants can be extracted from any correct run, including those produced by official examples and tutorials. There is no need to curate inputs or write manual assertions.
+🚀 **Holistic Observability**
+Traditional tools only show you *if* your model crashed. TrainCheck shows you *why* it's degrading, analyzing internal state dynamics that loss curves miss.
 
-TrainCheck performs three core functions:
+🧠 **Zero-Config Validation**
+No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
 
-1. **Instruments your training code**
-Inserts lightweight tracing into existing scripts (such as [pytorch/examples](https://github.com/pytorch/examples) or [transformers](https://github.com/huggingface/transformers/tree/main/examples)) with minimal code changes.
+**Universal Compatibility**
+Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more.
 
-2. **Learns invariants from correct runs**
-Discovers expected relationships across APIs, tensors, and training steps to build a model of normal behavior.
+---
 
-3. **Checks new or modified runs**
-Validates behavior against the learned invariants and flags silent errors, such as missing gradient clipping, weight desynchronization, or broken mixed precision, right when they occur.
+### How It Works
 
-This picture illustrates the TrainCheck workflow:
+1. **Instrument**: We wrap your training loop with lightweight probes—no code changes needed.
+2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training).
+3. **Check**: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
 
 ![Workflow](docs/assets/images/workflow.png)
 
-Under the hood, TrainCheck decomposes into three CLI tools:
-- **Instrumentor** (`traincheck-collect`)
-Wraps target training programs with lightweight tracing logic. It produces an instrumented version of the target program that logs API calls and model states without altering training semantics.
-- **Inference Engine** (`traincheck-infer`)
-Consumes one or more trace logs from successful runs to infer training invariants.
-- **Checker** (`traincheck-check`)
-Runs alongside or after new training jobs to verify that each recorded event satisfies the inferred invariants.
-
 ## 🔥 Try TrainCheck
 
 Work through [5‑Minute Experience with TrainCheck](./docs/5-min-tutorial.md). You’ll learn how to:

docs/README.md

Lines changed: 0 additions & 92 deletions
This file was deleted.

docs/index.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+<div align="center">
+<picture>
+<img alt="TrainCheck logo" width="55%" src="assets/images/traincheck_logo.png">
+</picture>
+</div>
+
+# TrainCheck: Invariant Checking & Observability for AI Training
+
+[![format and types](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/pre-commit-checks.yml)
+[![format and types](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml/badge.svg)](https://github.com/OrderLab/traincheck/actions/workflows/correctness_checks.yml)
+[![Chat on Discord](https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white)](https://discord.gg/ZvYewjsQ9D)
+[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/OrderLab/TrainCheck)
+
+**Stop flying blind.** TrainCheck gives you deep visibility into your training dynamics, continuously validating correctness and stability where standard metrics fail.
+
+---
+
+### Why TrainCheck?
+
+**Continuous Invariant Checking**
+
+TrainCheck validates the "physics" of your training process in real-time. It ensures your model adheres to learned invariants—such as gradient norms, tensor shapes, and update magnitudes—effectively catching silent corruption before it wastes GPU hours.
+
+🚀 **Holistic Observability**
+
+Traditional tools only show you *if* your model crashed. TrainCheck shows you *why* it's degrading, analyzing internal state dynamics that loss curves miss.
+
+🧠 **Zero-Config Validation**
+
+No manual tests required. TrainCheck automatically learns the invariants of your specific model from healthy runs and flags deviations instantly.
+
+**Universal Compatibility**
+
+Drop-in support for PyTorch, Hugging Face, and industry-class workloads using DeepSpeed/Megatron and more.
+
+---
+
+### How It Works
+
+1. **Instrument**: We wrap your training loop with lightweight probes—no code changes needed.
+2. **Learn**: We analyze correct runs to infer *invariants* (mathematical rules of healthy training).
+3. **Check**: We monitor new runs in real-time, verifying every step against learned invariants to catch silent logic bugs and hardware faults.
+
+![Workflow](assets/images/workflow.png)
+
+## 🔥 Try TrainCheck
+
+Work through [5‑Minute Experience with TrainCheck](5-min-tutorial.md). You’ll learn how to:
+- Instrument a training script and collect a trace
+- Automatically infer invariants
+- Uncover silent bugs in the training script
+
+## Documentation
+
+- **[Installation Guide](installation-guide.md)**
+- **[Usage Guide: Scenarios and Limitations](usage-guide.md)**
+- **[TrainCheck Technical Doc](technical-doc.md)**
+- **[TrainCheck Dev RoadMap](https://github.com/OrderLab/traincheck/blob/main/ROADMAP.md)**
+
+## Status
+
+TrainCheck is under active development. Please join our 💬 [Discord server](https://discord.gg/VwxpJDvB) or file a GitHub issue for support.
+We welcome feedback and contributions from early adopters.
+
+## Contributing
+
+We welcome and value any contributions and collaborations. Please check out [Contributing to TrainCheck](https://github.com/OrderLab/traincheck/blob/main/CONTRIBUTING.md) for how to get involved.
+
+## License
+
+TrainCheck is licensed under the [Apache License 2.0](https://github.com/OrderLab/traincheck/blob/main/LICENSE).
+
+## Citation
+
+If TrainCheck is relevant to your work, please cite our paper:
+```bib
+@inproceedings{TrainCheckOSDI2025,
+author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
+title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
+booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
+series = {OSDI '25},
+month = {July},
+year = {2025},
+address = {Boston, MA, USA},
+publisher = {USENIX Association},
+}
+```
+
+## Artifact Evaluation
+
+🕵️‍♀️ OSDI AE members, please see [TrainCheck AE Guide](ae.md).
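
Reviewer's illustration for the "How It Works" steps in the new homepage text: the Learn and Check steps can be pictured with a toy sketch like the one below. It is not TrainCheck's inference algorithm or API. The function names `learn_invariants` and `check_trace`, the trace layout (one dict per training step), and the two rule kinds (observed numeric range, observed value set) are all assumptions made up for this example.

```python
from typing import Any, Dict, List, Tuple

Trace = List[Dict[str, Any]]  # hypothetical layout: one dict of recorded values per step


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def learn_invariants(healthy_traces: List[Trace]) -> Dict[str, Dict[str, Any]]:
    """Infer one rule per field from traces of runs assumed to be correct.

    Numeric fields get the observed (min, max) range; every other field
    gets the set of values seen. Both rule kinds are toy stand-ins.
    """
    numeric: Dict[str, Tuple[float, float]] = {}
    categorical: Dict[str, set] = {}
    for trace in healthy_traces:
        for record in trace:
            for field, value in record.items():
                if _is_number(value):
                    lo, hi = numeric.get(field, (value, value))
                    numeric[field] = (min(lo, value), max(hi, value))
                else:
                    categorical.setdefault(field, set()).add(value)
    return {"numeric": numeric, "categorical": categorical}


def check_trace(trace: Trace, invariants: Dict[str, Dict[str, Any]]) -> List[Tuple[int, str, Any]]:
    """Return (step, field, value) for every value that breaks a learned rule."""
    violations = []
    for step, record in enumerate(trace):
        for field, value in record.items():
            if field in invariants["numeric"]:
                lo, hi = invariants["numeric"][field]
                # A NaN or a non-numeric value fails this test and is flagged.
                if not (_is_number(value) and lo <= value <= hi):
                    violations.append((step, field, value))
            elif field in invariants["categorical"]:
                if value not in invariants["categorical"][field]:
                    violations.append((step, field, value))
    return violations


# Two made-up healthy runs and one suspect run.
healthy = [
    [{"grad_norm": 0.9, "weight_shape": (4, 4)}, {"grad_norm": 0.7, "weight_shape": (4, 4)}],
    [{"grad_norm": 1.0, "weight_shape": (4, 4)}, {"grad_norm": 0.6, "weight_shape": (4, 4)}],
]
invariants = learn_invariants(healthy)

suspect = [
    {"grad_norm": 0.8, "weight_shape": (4, 4)},
    {"grad_norm": 37.5, "weight_shape": (4, 4)},  # gradient norm far outside the learned range
    {"grad_norm": 0.7, "weight_shape": (4, 2)},   # shape never seen in a healthy run
]
violations = check_trace(suspect, invariants)
```

The sketch reports the step and field of each violation, which mirrors the homepage's claim that deviations are flagged at the step where they occur. Real invariants would presumably relate API calls, tensors, and steps to each other (as the removed README text described) rather than bound single fields.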

docs/style.css

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+
+body {
+font-family: Arial, sans-serif;
+line-height: 1.6;
+margin: 0;
+padding: 20px;
+background: #f4f4f4;
+color: #333;
+}
+.container {
+max-width: 900px;
+margin: auto;
+background: #fff;
+padding: 30px;
+border-radius: 8px;
+box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
+}
+h1, h2, h3, h4, h5, h6 {
+color: #0056b3;
+}
+pre {
+background: #eee;
+padding: 15px;
+border-radius: 5px;
+overflow-x: auto;
+}
+code {
+font-family: "Courier New", Courier, monospace;
+background: #e9e9e9;
+padding: 2px 4px;
+border-radius: 3px;
+}
+a {
+color: #007bff;
+text-decoration: none;
+}
+a:hover {
+text-decoration: underline;
+}

mkdocs.yml

Lines changed: 42 additions & 8 deletions
@@ -1,11 +1,45 @@
 site_name: TrainCheck
+site_url: https://orderlab.github.io/traincheck/
+repo_url: https://github.com/OrderLab/traincheck
+repo_name: OrderLab/traincheck
+
 theme:
-  name: readthedocs
+  name: material
+  features:
+    - navigation.tabs
+    - navigation.indexes
+    - content.code.copy
+  palette:
+    # Palette toggle for light mode
+    - scheme: default
+      primary: teal
+      accent: purple
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+
+    # Palette toggle for dark mode
+    - scheme: slate
+      primary: teal
+      accent: lime
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+
 nav:
-  - Home: README.md
-  - "Installation Guide": ./installation-guide.md
-  - "5 Minute Quick Start": ./5-min-tutorial.md
-  - "Success Stories": ./successful-stories.md
-  - "Technical Documentation": ./technical-doc.md
-  - "Usage Tips": usage-guide.md
-  - "Performance Benchmarks": ./benchmarks.md
+  - Home: index.md
+  - Paper: https://www.usenix.org/conference/osdi25/presentation/jiang
+  - Documentation:
+    - "Installation Guide": installation-guide.md
+    - "5 Minute Quick Start": 5-min-tutorial.md
+    - "Success Stories": successful-stories.md
+    - "Technical Documentation": technical-doc.md
+    - "Usage Tips": usage-guide.md
+    - "Performance Benchmarks": benchmarks.md
+
+markdown_extensions:
+  - pymdownx.highlight:
+      anchor_linenums: true
+  - pymdownx.inlinehilite
+  - pymdownx.snippets
+  - pymdownx.superfences
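
Reviewer's note on the `nav` change: this commit deletes `docs/README.md` and points `Home` at `docs/index.md`, so a stale or mistyped nav entry would only surface at build time (MkDocs warns about nav entries it cannot find, and `mkdocs build --strict` should turn that warning into a failure). A minimal stdlib-only sketch of the same check follows. The `NAV` literal is hand-copied from the hunk above rather than parsed from `mkdocs.yml`, and the helper names `nav_targets` and `missing_targets` are hypothetical.

```python
import os
import tempfile
from typing import Any, List

# Hand-copied mirror of the `nav` block added in this commit (not parsed from YAML).
NAV = [
    {"Home": "index.md"},
    {"Paper": "https://www.usenix.org/conference/osdi25/presentation/jiang"},
    {"Documentation": [
        {"Installation Guide": "installation-guide.md"},
        {"5 Minute Quick Start": "5-min-tutorial.md"},
        {"Success Stories": "successful-stories.md"},
        {"Technical Documentation": "technical-doc.md"},
        {"Usage Tips": "usage-guide.md"},
        {"Performance Benchmarks": "benchmarks.md"},
    ]},
]


def nav_targets(nav: Any) -> List[str]:
    """Collect local file targets from a nested nav structure, skipping external URLs."""
    if isinstance(nav, str):
        return [] if "://" in nav else [nav]
    if isinstance(nav, dict):
        nav = list(nav.values())
    targets: List[str] = []
    for item in nav:
        targets.extend(nav_targets(item))
    return targets


def missing_targets(nav: Any, docs_dir: str) -> List[str]:
    """Return nav targets that do not exist as files under docs_dir."""
    return [t for t in nav_targets(nav) if not os.path.isfile(os.path.join(docs_dir, t))]


# Demo against a throwaway directory that contains only the new homepage.
with tempfile.TemporaryDirectory() as docs_dir:
    open(os.path.join(docs_dir, "index.md"), "w").close()
    missing = missing_targets(NAV, docs_dir)
```

Against the real repository one would call `missing_targets(NAV, "docs")` from the repo root and expect an empty list. The `Paper` entry is skipped because it is an external URL.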
