Fix tree-graph search failures on sm80 (A100) #1952
Pull Request: Fix tree-graph search fallback on A100 (sm80)
Title
Fix tree-graph BALANCED_TREE to TREE fallback for A100 (sm80)
Description
NCCL 2.25+ tree-graph searching was failing on A100 GPUs due to architecture checks that excluded sm80 from the BALANCED_TREE to TREE pattern fallback logic.
This caused NCCL to get stuck with the BALANCED_TREE pattern, which on certain A100 configurations (particularly with AMD CPUs or specific interconnect topologies) fails to find optimal paths, resulting in only 1 channel being used globally and severely limiting performance in multi-node training scenarios.
Root Cause
File: src/graph/search.cc, line 1117
Problem: Architecture check ccMin >= 90 excluded sm80 from the fallback mechanism
Impact: sm90+ (Hopper) could fall back to the simpler TREE pattern when BALANCED_TREE fails, but sm80 (Ampere/A100) could not
This meant that when BALANCED_TREE graph search failed on A100 systems, NCCL had no fallback option and would use the failed graph with nChannels=1.
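To make the ccMin check concrete, here is a minimal sketch of how a minimum compute capability can be derived for a set of GPUs. It assumes the common encoding major*10 + minor (so 8.0 becomes 80 and 9.0 becomes 90). The helper names encodeCompCap and minCompCap are hypothetical and are not taken from the NCCL source.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper (not an NCCL function): encode a CUDA compute
// capability as major*10 + minor, e.g. 8.0 (A100) -> 80, 9.0 (H100) -> 90.
inline int encodeCompCap(int major, int minor) {
  return major * 10 + minor;
}

// Hypothetical helper: smallest encoded capability across all GPUs in a job.
// Precondition: caps must not be empty.
inline int minCompCap(const std::vector<int>& caps) {
  assert(!caps.empty());
  return *std::min_element(caps.begin(), caps.end());
}
```

Under this encoding, a job made up only of A100 GPUs has a minimum of 80, which does not satisfy a >= 90 check. A job mixing A100 and H100 GPUs would also be limited by the A100 value.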
Solution
Changed the condition from ccMin >= 90 to ccMin >= 80 to include the Ampere architecture.
This is a minimal one-line change that extends the existing, well-tested fallback logic to A100 GPUs.
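The effect of the threshold change can be modelled with a small standalone sketch. This is not the code from src/graph/search.cc. The Pattern enum, the function names, and the minFallbackCc parameter are illustrative assumptions that let the old (90) and new (80) thresholds be compared side by side.

```cpp
#include <cassert>

// Stand-ins for NCCL's topology pattern constants (names are assumptions).
enum class Pattern { BalancedTree, Tree };

// Returns true when a failed BALANCED_TREE search may retry with TREE.
// minFallbackCc models the threshold: 90 before the fix, 80 after it.
inline bool canFallBackToTree(int ccMin, Pattern current, int minFallbackCc) {
  return ccMin >= minFallbackCc && current == Pattern::BalancedTree;
}

// Pattern to try next after a failed search: TREE if the fallback is
// allowed, otherwise the pattern is left unchanged.
inline Pattern nextPattern(int ccMin, Pattern current, int minFallbackCc) {
  return canFallBackToTree(ccMin, current, minFallbackCc) ? Pattern::Tree
                                                          : current;
}
```

With minFallbackCc set to 90, nextPattern(80, Pattern::BalancedTree, 90) leaves the pattern at BalancedTree, which matches the stuck behaviour described for A100. With the threshold at 80, the same call returns Pattern::Tree, while sm90 behaviour is identical under both thresholds.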
Related Issues
Fixes #1946
Changes & Impact
Code Changes
src/graph/search.cc (1 line changed)
Technical Impact
Why This Works
The fallback mechanism already exists and is proven on sm90+ (Hopper). This change simply allows sm80 (Ampere) to use the same battle-tested logic. The TREE pattern is simpler and more robust than BALANCED_TREE, making it a safe fallback option.
Performance Impact
Before Fix
After Fix
Benchmark Results
Single Node (8x A100-80GB)
Testing Performed
Functional Testing
Regression Testing
Configuration Testing
Debug Logs
Before Fix - Graph Search Failure
Rationale for Minimal Change
This is a conservative, low-risk fix because:
Why Not a Larger Refactor?
While a larger refactoring of the graph search logic could be beneficial, this minimal fix:
A more comprehensive graph search optimization could be considered for future releases.
Additional Context
System Configurations Affected
This bug primarily affects A100 systems with:
On standard DGX A100 systems with optimal topology, BALANCED_TREE typically succeeds and this fallback isn't needed. However, on the configurations listed above, the fallback is critical.
Architecture Background
The fallback pattern follows NVIDIA's general principle: newer architectures get more sophisticated algorithms, but should have robust fallbacks for edge cases.
Documentation
A detailed analysis document (ISSUE_1946_ANALYSIS.md) is available in the repository with:
Checklist
Reviewers
Suggested reviewers:
Questions for Reviewers
Summary: One-line fix extends proven BALANCED_TREE→TREE fallback logic from sm90+ to include sm80 (A100), resolving critical performance regression in certain A100 configurations. Minimal risk, significant impact.