<span class="home-news-list__content"><a href="https://osdi.dev/" target="_blank" rel="noopener">Yuzhuo</a> successfully defended his PhD thesis titled <b><i>"Operating System Support for Reliable Software"</i></b> and will join Google after graduation. Congratulations, Dr. Jing!</span>
</li>
<li>
<span class="home-news-list__date">Jul 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/verify-llm/TrainVerify" target="_blank" rel="noopener">TrainVerify</a></b> is accepted to <a href="https://sigops.org/s/conferences/sosp/2025/">SOSP '25</a>! TrainVerify uses equivalence-based verification to provide strong correctness guarantees for the parallelization logic of distributed LLM training.</span>
</li>
<li>
<span class="home-news-list__date">Jul 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/OrderLab/phoenix" target="_blank" rel="noopener">Phoenix</a></b> is accepted to <a href="https://sigops.org/s/conferences/sosp/2025/">SOSP '25</a>! Phoenix provides OS-level support for optimistic recovery and partial state preservation for high-availability software.</span>
</li>
<li>
<span class="home-news-list__date">Jul 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/OrderLab/Atropos" target="_blank" rel="noopener">Atropos</a></b> is accepted to <a href="https://sigops.org/s/conferences/sosp/2025/">SOSP '25</a>! Atropos is an application overload control framework that uses targeted cancellation to maintain tight SLOs.</span>
</li>
<li>
<span class="home-news-list__date">Mar 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/OrderLab/TrainCheck" target="_blank" rel="noopener">TrainCheck</a></b> is accepted to <a href="https://www.usenix.org/conference/osdi25">OSDI '25</a>! TrainCheck automatically infers invariants tailored for DL training and uses these invariants to proactively detect silent training errors.</span>
</li>
</ul>
<div class="text-center mt-4">
<a class="button button-light" href="{{ '/news/' | relative_url }}">View All News</a>
</div>
</div>
</section>
<!--================ End news preview section =================-->
<a href="https://github.com/OrderLab/TrainCheck">TrainCheck</a> is accepted to appear at <a href="https://www.usenix.org/conference/osdi25">OSDI '25</a>
<span class="home-news-list__date">Dec 2025</span>
<span class="home-news-list__content"><a href="https://osdi.dev/" target="_blank" rel="noopener">Yuzhuo</a> successfully defended his PhD thesis titled <i>"Operating System Support for Reliable Software"</i> and will join Google after graduation. Congratulations, Dr. Jing!</span>
</li>
<li>
<span class="home-news-list__date">Jul 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/verify-llm/TrainVerify" target="_blank" rel="noopener">TrainVerify</a></b> is accepted to <a href="https://sigops.org/s/conferences/sosp/2025/">SOSP '25</a>! TrainVerify uses equivalence-based verification to provide strong correctness guarantees for the parallelization logic of distributed LLM training.</span>
</li>
<li>
<span class="home-news-list__date">Jul 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/OrderLab/phoenix" target="_blank" rel="noopener">Phoenix</a></b> is accepted to <a href="https://sigops.org/s/conferences/sosp/2025/">SOSP '25</a>! Phoenix provides OS-level support for optimistic recovery and partial state preservation for high-availability software.</span>
</li>
<li>
<span class="home-news-list__date">Jul 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/OrderLab/Atropos" target="_blank" rel="noopener">Atropos</a></b> is accepted to <a href="https://sigops.org/s/conferences/sosp/2025/">SOSP '25</a>! Atropos is an application overload control framework that uses targeted cancellation to maintain tight SLOs.</span>
</li>
<li>
<span class="home-news-list__date">Mar 2025</span>
<span class="home-news-list__content"><b><a href="https://github.com/OrderLab/TrainCheck" target="_blank" rel="noopener">TrainCheck</a></b> is accepted to <a href="https://www.usenix.org/conference/osdi25">OSDI '25</a>! TrainCheck automatically infers invariants tailored for DL training and uses these invariants to proactively detect silent training errors.</span>
<details>
<summary>[...]</summary>
Training deep learning (DL) models is a complex task involving multiple steps and various libraries, making DL training pipelines prone to silent bugs that lead to suboptimal or incorrect models. These issues are challenging to detect and diagnose. TrainCheck is the first framework that takes a proactive checking approach to systematically address silent issues. TrainCheck automatically infers invariants tailored for DL training. It uses these invariants to enhance a training task and proactively detect silent issues while providing debugging help.