Commit ab662e3

deploy: 35c076d
1 parent c3fb08b commit ab662e3

9 files changed

Lines changed: 56 additions & 7 deletions

715 KB

assets/img/team/yuxuan.jpg

160 KB

feed.xml

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-05-09T17:38:57-04:00</updated><id>/feed.xml</id><title type="html">Ordered Systems Lab at U-M</title><subtitle>This is the website for the Ordered Systems Lab (a.k.a Order Lab) at University of Michigan, led by Prof. Ryan Huang.</subtitle></feed>
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.2.2">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-06-11T11:39:14-04:00</updated><id>/feed.xml</id><title type="html">Ordered Systems Lab at U-M</title><subtitle>This is the website for the Ordered Systems Lab (a.k.a Order Lab) at University of Michigan, led by Prof. Ryan Huang.</subtitle></feed>

index.html

Lines changed: 23 additions & 1 deletion
@@ -135,6 +135,28 @@ <h2 class="mb-3">Recent Projects</h2>
 <!-- single course -->
 <div class="col-lg-12">
 <div class="owl-theme owl-carousel active_course">
+<div class="single_recent_project">
+<div class="recent_project_head">
+<img class="img-fluid" src="/assets/img/project/traincheck_logo.png" alt="TrainCheck" />
+</div>
+<div class="recent_project_content">
+<h4 class="mb-3">
+<a href="#">Catching Silent Errors in Deep Learning Training</a>
+</h4>
+<p>
+Silent errors in deep learning training can waste
+thousands of GPU hours and produce low-quality models. We
+introduce TrainCheck, a proactive checking framework that learns
+semantic invariants from correct training runs and enforces them
+at runtime to catch failures early, before they silently
+accumulate cost and damage model reliability.
+</p>
+<div class="recent_project_meta d-flex justify-content-lg-between align-items-lg-center flex-lg-row flex-column mt-4">
+<a class="button button-light" href="paper/traincheck-osdi25-preprint.pdf" target="_blank">Read More</a>
+</div>
+</div>
+</div>
+
 <div class="single_recent_project">
 <div class="recent_project_head">
 <img class="img-fluid" src="/assets/img/project/watchdog.jpg" alt="" />
@@ -238,7 +260,7 @@ <h2 class="section-intro__title">Sponsors</h2>
 <p class="footer-text m-0 col-lg-8 col-md-12">
 Copyright &copy; OrderLab 2017-<script>
 document.write(new Date().getFullYear());
-</script> All rights reserved. | Last updated 2025-04-30 22:19:44 -0400.
+</script> All rights reserved. | Last updated 2025-06-11 04:35:11 -0400.
 </p>
 </div>
 </div>

news.html

Lines changed: 9 additions & 1 deletion
@@ -63,6 +63,14 @@ <h1>News</h1>
 <section class="section-margin">
 <div class="container">
 <ul class="newslist">
+<li>
+<span class="newsicon"><i class="flaticon-document"></i></span><span class="newsdate">Mar 2025</span>
+<a href="https://github.com/OrderLab/TrainCheck"> TrainCheck</a> is accepted to appear at <a href="https://www.usenix.org/conference/osdi25">OSDI '25</a>
+<details>
+<summary>[...]</summary>
+Training deep learning (DL) models is a complex task involving multiple steps and various libraries, making DL training pipelines prone to silent bugs that lead to suboptimal or incorrect models. These issues are challenging to detect and diagnose. TrainCheck is the first framework that takes a proactive checking approach to systematically address silent issues. TrainCheck automatically infers invariants tailored for DL training. It uses these invariants to enhance a training task and proactively detect silent issues while providing debugging help.
+</details>
+</li>
 <li>
 <span class="newsicon"><i class="flaticon-distance"></i></span><span class="newsdate">May 2024</span>
 <span class="text-danger">Yigong will join Boston University as an Assistant Professor!</span>
@@ -188,7 +196,7 @@ <h1>News</h1>
 <p class="footer-text m-0 col-lg-8 col-md-12">
 Copyright &copy; OrderLab 2017-<script>
 document.write(new Date().getFullYear());
-</script> All rights reserved. | Last updated 2025-04-30 22:19:44 -0400.
+</script> All rights reserved. | Last updated 2025-06-11 04:35:11 -0400.
 </p>
 </div>
 </div>
622 KB
Binary file not shown.

paper/traincheck-osdi25.bib

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+@inproceedings{TrainCheckOSDI2025,
+author = {Jiang, Yuxuan and Zhou, Ziming and Xu, Boyu and Liu, Beijie and Xu, Runhui and Huang, Peng},
+title = {Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks},
+booktitle = {Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation},
+series = {OSDI '25},
+month = {July},
+year = {2025},
+address = {Boston, MA, USA},
+publisher = {USENIX Association},
+}

pubs.html

Lines changed: 4 additions & 3 deletions
@@ -65,9 +65,10 @@ <h1>Publications</h1>
 <h2 id="publications">2025</h2>
 <ul class="publications">
 <li>
-<a target="_blank" href="#">Training with Confidence: Catching Silent DL Training Bugs with Automated Proactive Checks</a><br>
+<a target="_blank" href="paper/traincheck-osdi25-preprint.pdf">Training with Confidence: Catching Silent DL Training Bugs with Automated Proactive Checks</a><br>
 <span class="authorlist"><i><a href="https://essoz.github.io" class="nodec">Yuxuan Jiang</a>, </i><i>Ziming Zhou, </i><i>Boyu Xu, </i><i>Beijie Liu, </i><i>Runhui Xu, </i><i><a href="https://web.eecs.umich.edu/~ryanph" class="nodec">Peng Huang</a><br></i></span>
-<a target="_blank" href="https://www.usenix.org/conference/osdi25" class="conf"><b>OSDI 2025</b></a>&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="https://github.com/OrderLab/TrainCheck">Software</a>
+<a target="_blank" href="https://www.usenix.org/conference/osdi25" class="conf"><b>OSDI 2025</b></a>&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="paper/traincheck-osdi25.bib">BibTeX</a>
+&nbsp;&nbsp;<a target="_blank" class="btn btn-outline-primary publinkitem" href="https://github.com/OrderLab/TrainCheck">Software</a>
 </li>
 <li>
 <a target="_blank" href="#">Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems</a><br>
@@ -347,7 +348,7 @@ <h2 id="publications">2010</h2>
 <p class="footer-text m-0 col-lg-8 col-md-12">
 Copyright &copy; OrderLab 2017-<script>
 document.write(new Date().getFullYear());
-</script> All rights reserved. | Last updated 2025-05-09 17:37:34 -0400.
+</script> All rights reserved. | Last updated 2025-06-11 04:35:11 -0400.
 </p>
 </div>
 </div>

software.html

Lines changed: 9 additions & 1 deletion
@@ -68,6 +68,14 @@ <h3 class="section-intro__title">Group GitHub Repository</h3>
 </div>
 </div>
 </section>
+<section class="section-padding bg-magnolia">
+<div class="container">
+<div class="section-intro pb-85px text-center">
+<h3 class="section-intro__title">TrainCheck [<a href="/paper/traincheck-osdi25-preprint.pdf">OSDI '25</a>]</h3>
+<p class="section-intro__subtitle">TrainCheck is an innovative tool for detecting silent errors in deep learning training. We are excited to open-source TrainCheck; explore the project and get involved on <a href="https://github.com/OrderLab/TrainCheck">GitHub</a>!</p>
+</div>
+</div>
+</section>
 <section class="section-padding bg-magnolia">
 <div class="container">
 <div class="section-intro pb-85px text-center">
@@ -108,7 +116,7 @@ <h3 class="section-intro__title">Panorama [<a href="/paper/panorama-osdi18.pdf">
 <p class="footer-text m-0 col-lg-8 col-md-12">
 Copyright &copy; OrderLab 2017-<script>
 document.write(new Date().getFullYear());
-</script> All rights reserved. | Last updated 2025-04-30 22:19:44 -0400.
+</script> All rights reserved. | Last updated 2025-06-11 04:35:11 -0400.
 </p>
 </div>
 </div>
