Mark test_rloo[fsdp2] as xfail for transformers 5.4.0#5387

Merged
qgallouedec merged 1 commit into huggingface:main from albertvillanova:fix-5386
Mar 27, 2026

@albertvillanova albertvillanova commented Mar 27, 2026

Fix CI failure for distributed test_rloo[fsdp2]:

  • Mark test_rloo[fsdp2] as xfail for transformers 5.4.0

Fix #5386.

See upstream fix:

This PR marks the "fsdp2" configuration of the distributed test_rloo test as an expected failure (xfail) when transformers 5.4.0 is installed, so that CI reports the known upstream issue explicitly instead of failing on it.

Testing improvements:

  • Added a new pytest.param for the "fsdp2" configuration in test_rloo, marking it as an expected failure (xfail) when transformers version is exactly 5.4.0, with a detailed reason and strict enforcement. This documents and accounts for an upstream issue causing NaN weights on non-rank-0 FSDP processes.
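A minimal sketch of what such a version-gated parametrization can look like, for readers unfamiliar with the pattern. This is not the actual diff from this PR: the `"ddp"` entry, the hard-coded `INSTALLED_TRANSFORMERS` value (a stand-in for reading `transformers.__version__`), and the simulated failure in the test body are assumptions made for illustration only.

```python
import pytest
from packaging.version import Version

# Assumption: stand-in for Version(transformers.__version__), so that the
# sketch runs without transformers installed.
INSTALLED_TRANSFORMERS = Version("5.4.0")

# The xfail applies only when the installed version is exactly 5.4.0.
IS_AFFECTED = INSTALLED_TRANSFORMERS == Version("5.4.0")

FSDP2_XFAIL = pytest.mark.xfail(
    IS_AFFECTED,
    reason="Upstream issue in transformers 5.4.0: NaN weights on non-rank-0 FSDP processes",
    strict=True,  # an unexpected pass is reported as a failure
)

CONFIGS = [
    "ddp",  # hypothetical second configuration, for illustration only
    pytest.param("fsdp2", marks=FSDP2_XFAIL),
]


@pytest.mark.parametrize("config", CONFIGS)
def test_rloo(config):
    # Placeholder body: the real test launches a distributed training run.
    # The known failure is simulated here so the sketch behaves like the CI case.
    if config == "fsdp2" and IS_AFFECTED:
        pytest.fail("simulated upstream failure")
```

With `strict=True`, the "fsdp2" case is reported as XFAIL while it keeps failing on the affected version. If it starts passing on that version, pytest reports a failure, which signals that the marker can be removed. On any other version the condition is false and the test runs normally.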

Note

Low Risk
Low risk: test-only change that marks a known upstream transformers==5.4.0 FSDP issue as an expected failure, without affecting runtime code paths.

Overview
Marks the distributed test_rloo case for the fsdp2 configuration as xfail specifically on transformers==5.4.0, with a documented reason and strict=True to prevent silent passes.

This stabilizes CI by treating the known upstream NaN-weight failure on non-rank-0 FSDP processes as an expected, version-gated test outcome.

Written by Cursor Bugbot for commit a502f50.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova albertvillanova changed the title from "Fix CI failure for distributed test_rloo[fsdp2]" to "Mark test_rloo[fsdp2] as xfail for transformers 5.4.0" Mar 27, 2026

Development

Successfully merging this pull request may close these issues.

CI fails for distributed test_rloo[fsdp2]: CUDA error: device-side assert triggered: probability tensor contains either inf, nan or element < 0