
Fix flaky CI test_rloo[fsdp2]: Replace non-deterministic xfail with skipif for transformers 5.4.0 #5403

Merged
albertvillanova merged 2 commits into huggingface:main from albertvillanova:fix-5387
Mar 31, 2026

@albertvillanova commented Mar 30, 2026

Replace non-deterministic xfail with skipif for test_rloo[fsdp2]

Fix to:

The test_rloo[fsdp2] test was marked xfail(strict=True) for transformers 5.4.0 because an upstream bug causes NaN weights on non-rank-0 FSDP processes (see #5386 and transformers#45050). However, the NaN generation is non-deterministic, so the test passes on some runs and fails on others. That makes strict=True incorrect: a passing run is reported as XPASS, which strict mode treats as a failure, and CI becomes flaky. Example run: https://github.com/huggingface/trl/actions/runs/23733750480/job/69133242696?pr=5402

FAILED tests/distributed/test_distributed.py::TestDistributed::test_rloo[fsdp2] - [XPASS(strict)] Upstream issue: NaN weights on non-rank-0 FSDP processes (see #5386 and transformers#45050)

xfail means "this test is expected to fail", which doesn't apply when the failure is random. skipif is the correct marker: it signals that the test cannot run reliably on this version, keeps CI deterministic (always SKIPPED for 5.4.0), and avoids noise for maintainers. Tracking of the upstream fix remains via the issue links in the reason string.
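The reasoning above can be illustrated with a small pure-Python model of how each marker maps a test run to a reported outcome. This is a simplified sketch, not pytest's implementation: the function name `reported_outcome` and the marker labels are hypothetical, and it assumes the skipif condition is true (transformers 5.4.0 installed).

```python
# Simplified model of the reporting rules described above; not pytest internals.
def reported_outcome(marker: str, test_passed: bool) -> str:
    """Return the CI-visible outcome for one run of a marked test."""
    if marker == "skipif":
        # Assumes the skip condition holds, so the test body never runs
        # and its (random) result cannot influence the report.
        return "SKIPPED"
    if marker == "xfail_strict":
        # A failing body is the expected case; a passing body is an error.
        return "XFAIL" if not test_passed else "FAILED [XPASS(strict)]"
    raise ValueError(f"unknown marker: {marker}")


# The upstream bug makes the body fail on some runs and pass on others.
for passed in (False, True):
    print(
        f"body passed={passed}: "
        f"xfail(strict=True) -> {reported_outcome('xfail_strict', passed)}, "
        f"skipif -> {reported_outcome('skipif', passed)}"
    )
```

Under this model the strict xfail column changes with the random result of the test body, while the skipif column does not, which is the determinism argument made above.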

Changes

Testing logic update:

  • Changed the FSDP2 case of test_rloo to use pytest.mark.skipif for transformers version 5.4.0, so the test is skipped when the upstream NaN weights issue is present instead of being marked as an expected failure.
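A version-gated skipif on a single parametrized case might look roughly like the sketch below. This is not the actual diff from this PR: the helper `installed_version`, the `"ddp"` config name, and the test body are assumptions made for illustration, and the sketch uses `packaging` for version comparison. Only the reason string is taken from the CI log above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

import pytest
from packaging.version import Version


def installed_version(package: str) -> Optional[Version]:
    """Hypothetical helper: parsed version of a package, or None if absent."""
    try:
        return Version(version(package))
    except PackageNotFoundError:
        return None


AFFECTED = Version("5.4.0")
REASON = (
    "Upstream issue: NaN weights on non-rank-0 FSDP processes "
    "(see #5386 and transformers#45050)"
)

# Only the fsdp2 case is gated; it is always SKIPPED on the affected version.
fsdp2_param = pytest.param(
    "fsdp2",
    marks=pytest.mark.skipif(
        installed_version("transformers") == AFFECTED, reason=REASON
    ),
)


# "ddp" is a placeholder config name, not necessarily one used in the repo.
@pytest.mark.parametrize("config", ["ddp", fsdp2_param])
def test_rloo(config):
    assert config in {"ddp", "fsdp2"}  # stand-in for the real distributed run
```

Keeping the issue references in the reason string means the skip shows up in the pytest summary with a pointer to the upstream fix, so the marker can be removed once that fix lands.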

Note

Low risk: changes only a pytest marker to make CI deterministic by skipping a known-flaky upstream combination; no production code paths affected.

Overview
Makes test_rloo[fsdp2] deterministic on transformers==5.4.0 by replacing an xfail(strict=True) with skipif for the known upstream NaN-weights issue, preventing intermittent XPASS failures in CI.

Written by Cursor Bugbot for commit 18df790.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova merged commit 2f67d93 into huggingface:main on Mar 31, 2026
13 checks passed