Commit 1055cd1
Harden worker disconnect (TraceMachina#1972)
Several issues with worker disconnect prevent jobs from being scheduled or leave them stuck permanently in Executing.
The first issue is that killing a process does not wait for the process to die, which causes cleanup of the work tree and its associated file system store to fail when the system is heavily loaded or the file system is slow.
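A minimal sketch of the kill-then-wait ordering described above, using only the Rust standard library rather than whatever process API the worker actually uses. The names `kill_and_wait` and `stop_and_clean` are hypothetical and do not come from the commit.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::process::{Child, ExitStatus};

/// Hypothetical helper: kill the child if it is still running, then block
/// until the OS reports that it has exited.
fn kill_and_wait(child: &mut Child) -> io::Result<ExitStatus> {
    // An already-exited child is not an error, so return its status quietly.
    if let Some(status) = child.try_wait()? {
        return Ok(status);
    }
    child.kill()?;
    // kill() only requests termination; wait() blocks until the process is gone.
    child.wait()
}

/// Hypothetical helper: remove the work directory only after the process
/// has been confirmed dead, so nothing is still writing into it.
fn stop_and_clean(child: &mut Child, work_dir: &Path) -> io::Result<ExitStatus> {
    let status = kill_and_wait(child)?;
    fs::remove_dir_all(work_dir)?;
    Ok(status)
}
```

Checking `try_wait` first also avoids reporting a problem for a process that is already dead, which is the kind of log noise the commit message mentions.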
The second issue is that cleanup is required even if prepare_action hasn't been called yet. This commit splits cleanup into filesystem cleanup and manager cleanup, tracks each with its own AtomicBool, and handles the two cases separately.
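The split into two flags could look roughly like the sketch below. The type `RunningAction`, its fields, and `CleanupReport` are hypothetical names for illustration. The sketch assumes manager cleanup is needed from the moment the action is created, while filesystem cleanup is needed only once a prepare step has created the work tree.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Which cleanup steps actually ran (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
struct CleanupReport {
    filesystem: bool,
    manager: bool,
}

/// Hypothetical stand-in for a running action on the worker.
struct RunningAction {
    needs_fs_cleanup: AtomicBool,
    needs_manager_cleanup: AtomicBool,
}

impl RunningAction {
    fn new() -> Self {
        Self {
            // No work tree exists until prepare() runs.
            needs_fs_cleanup: AtomicBool::new(false),
            // Assumption: the action is registered with its manager on creation.
            needs_manager_cleanup: AtomicBool::new(true),
        }
    }

    /// Stand-in for the step that creates the work tree.
    fn prepare(&self) {
        self.needs_fs_cleanup.store(true, Ordering::Release);
    }

    /// Each flag is swapped to false so its step runs at most once,
    /// even if cleanup() is called again or from another thread.
    fn cleanup(&self) -> CleanupReport {
        let filesystem = self.needs_fs_cleanup.swap(false, Ordering::AcqRel);
        let manager = self.needs_manager_cleanup.swap(false, Ordering::AcqRel);
        // Real code would delete the work tree / deregister here when true.
        CleanupReport { filesystem, manager }
    }
}
```

With independent flags, an action that disconnects before the prepare step still gets its manager cleanup, and neither step runs twice.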
The third issue is that apply_filter_predicate doesn't handle update_awaited_action failing due to a version mismatch (e.g. from a client keep-alive), which can cause the worker state to be skipped.
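One common way to handle a version mismatch is to re-read the current version and retry the update. The sketch below shows that pattern in isolation. It is not the code from this commit, and `UpdateError`, `update_with_retry`, and the closure parameters are hypothetical.

```rust
/// Hypothetical error type for a versioned update.
#[derive(Debug, PartialEq, Eq)]
enum UpdateError {
    /// Someone else (e.g. a keep-alive) changed the record first.
    VersionMismatch,
    /// Any other failure; not retried.
    Other(String),
}

/// Re-reads the version and retries the update while it fails with a
/// version mismatch. Returns the attempt number that succeeded.
fn update_with_retry<R, U>(
    mut read_version: R,
    mut try_update: U,
    max_attempts: usize,
) -> Result<usize, UpdateError>
where
    R: FnMut() -> u64,
    U: FnMut(u64) -> Result<(), UpdateError>,
{
    for attempt in 1..=max_attempts {
        // Always fetch a fresh version before each attempt.
        let version = read_version();
        match try_update(version) {
            Ok(()) => return Ok(attempt),
            // A concurrent writer won the race; loop and try again.
            Err(UpdateError::VersionMismatch) => continue,
            // Other errors are surfaced immediately.
            Err(other) => return Err(other),
        }
    }
    // Every attempt lost the race (or max_attempts was zero).
    Err(UpdateError::VersionMismatch)
}
```

Bounding the number of attempts keeps a persistently contended record from stalling the caller indefinitely.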
The fourth and most important issue is that if a worker doesn't exist when worker_notify_run_action is called, the action remains in the Executing state forever because nothing updates it. This is resolved by calling update_operation in this case, which would otherwise have been called by immediate_evict_worker had the worker existed.
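The shape of this fix can be illustrated with a toy scheduler. Everything in the sketch is hypothetical: the `Scheduler` type, the `Stage` enum, and in particular the assumption that the update moves the operation back to a queued state. The commit message does not say which state the real update_operation call produces.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified operation stages (hypothetical; the real scheduler has more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Queued,
    Executing,
}

/// Toy scheduler state, keyed by plain string ids for illustration.
struct Scheduler {
    workers: HashSet<String>,
    operations: HashMap<String, Stage>,
}

impl Scheduler {
    /// Called after an operation has been matched to a worker and marked
    /// Executing. If the worker has vanished in the meantime, the operation
    /// must be updated explicitly, since no eviction path will do it.
    fn notify_run_action(&mut self, worker_id: &str, operation_id: &str) -> Result<(), String> {
        if self.workers.contains(worker_id) {
            // Normal path: the worker will report progress itself.
            return Ok(());
        }
        // Assumption: put the operation back in the queue so it can be
        // rescheduled instead of staying in Executing forever.
        self.operations.insert(operation_id.to_string(), Stage::Queued);
        Err(format!("worker {worker_id} not found"))
    }
}
```

Without the explicit update in the missing-worker branch, the operation in this toy model would stay in `Stage::Executing` with no component left to move it.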
Finally, a couple of places log errors that are just noise, notably updating an already-completed action on worker disconnect or keep-alive, and a warning about killing a process that's already dead.
1 parent 9353508 commit 1055cd1
3 files changed
Lines changed: 135 additions & 38 deletions
File tree
- nativelink-scheduler/src
- nativelink-worker/src
Lines changed: 101 additions & 31 deletions