
temporary test fix for gpt-oss nan loss #3648

Merged
copybara-service[bot] merged 1 commit into main from shuningjin-gpt-oss-nan
Apr 13, 2026

Conversation

@shuningjin (Collaborator) commented Apr 12, 2026

Description

Temporary fix for gpt-oss test: b/497864549

  • Issue: Fine-tuning gpt-oss-20b produces NaN loss, while pretraining is mostly fine.
  • Set abort_on_nan_loss=false for fine-tuning/SFT as a temporary fix; the check stays enabled for pretraining. The root cause is still being investigated in the bug.
  • Add gcs_metrics=true to make it easier to monitor metrics (e.g., loss, grad norm) and debug.
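
For readers unfamiliar with this kind of flag, the following is a minimal sketch of what a setting like abort_on_nan_loss conceptually gates in a training loop. It is not the project's actual implementation: the functions `check_loss` and `run` and the choice of `FloatingPointError` are hypothetical, chosen only for illustration.

```python
import math


def check_loss(step, loss, abort_on_nan_loss):
    """Return True if the loss is finite.

    Hypothetical helper: on a NaN or infinite loss, either raise
    (abort_on_nan_loss=True) or return False so the caller can log
    the step and keep going (abort_on_nan_loss=False).
    """
    if math.isfinite(loss):
        return True
    if abort_on_nan_loss:
        raise FloatingPointError(f"non-finite loss {loss} at step {step}")
    return False


def run(losses, abort_on_nan_loss):
    """Walk a sequence of per-step losses and collect the steps that went bad."""
    bad_steps = []
    for step, loss in enumerate(losses):
        if not check_loss(step, loss, abort_on_nan_loss):
            bad_steps.append(step)
    return bad_steps
```

With the check disabled, the run continues past a bad step and the recorded step indices, together with exported metrics such as loss and grad norm, can be inspected afterwards; with it enabled, the first non-finite loss stops the run.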

Tests

N/A

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@copybara-service bot merged commit 3385aa0 into main on Apr 13, 2026
49 of 50 checks passed
@copybara-service bot deleted the shuningjin-gpt-oss-nan branch on April 13, 2026 21:56

3 participants