Add flat-collective fallback to hierarchical_communicator for small site counts #7193
Conversation
Can one of the admins verify this patch?
Could you please collect benchmark data when running with more than one thread per locality? Using one thread per locality is not a representative use case.
Thanks for the review. You are right, one thread per locality is not representative. I am setting up a multi-thread sweep covering 16 to 256 processes at 1, 4, and 8 threads per locality, with 4B and 1MiB message sizes and 100-iteration averaging (matching the methodology from the IPTW 2025 talk). My UPES DGX queue is heavily loaded at the moment, with a previous job pending for 70+ hours, so turnaround may take a few days. I have also reached out to @constracktor in case a run on medusa is feasible in parallel. Keeping the PR as draft until the data is in. I will post results as a comment here as soon as I have them.
BTW, is the test failure for all_reduce_sync (on macOS) related to your recent changes?
No, unrelated. all_reduce_sync wraps the flat all_reduce overload, which goes through create_communicator. This PR only modifies create_hierarchical_communicator and adds an optional argument with a default; all_reduce.hpp is untouched. The git diff against master shows the changes are confined to create_communicator.{hpp,cpp}, argument_types.hpp, the new test, and the benchmark. The macOS failure is a timeout (not an assertion) on debug TCP, and the other failing check is transpose_smp_block on Windows debug, unrelated to collectives. Both look like the usual debug CI flakes. Happy to re-run the failed jobs if you would like a clean board.
Exercises site counts that are not clean multiples of the arity,
including configurations where tree recursion produces size-1
subgroups. Covers arity=2 with site counts {3, 5, 6, 7, 9, 10, 11, 15}
and arity=4 with {5, 6, 7, 9, 10, 11, 13, 15}.
These paths were previously uncovered. The hierarchical tree
construction in create_communicator.cpp handles them through the
division_steps and remainder logic, but no existing test verified
the behaviour. Companion coverage to the adaptive flat-fallback
work in TheHPXProject#7193.
Below a configurable site-count threshold, hierarchical collectives perform worse than flat because tree-walking overhead exceeds the synchronization-depth benefit. This change adds a fallback in the create_hierarchical_communicator factory: when num_sites < threshold, it produces a hierarchical_communicator whose underlying vector holds a single flat communicator spanning all N sites.

The tree-walking loops in the hierarchical overloads of reduce, broadcast, gather, scatter, all_reduce, and all_gather collapse to a single flat call when size() == 1. This is also the code path that non-leader sites already take in normal tree mode (a leaf-only site has exactly one communicator in its vector), so no collective code changes are required.

The threshold is exposed as a new flat_fallback_threshold_arg argument to create_hierarchical_communicator with a default of 16, matching the suggestion from PR TheHPXProject#7160. Pass 0 to disable the fallback and force tree construction (useful for tests exercising the tree path).

New test hierarchical_flat_fallback verifies the structural invariant (size() == 1 when num_sites < threshold) and correctness of the resulting all_reduce, at both 1 and 2 localities and across site counts 2, 4, 8.

Addresses follow-up comment by @hkaiser on TheHPXProject#7160.
Adds benchmark coverage for the hierarchical flat fallback:
- New test functions test_one_shot_use_all_gather,
test_multiple_use_with_generation_all_gather, and
test_all_gather_hierarchical mirroring the existing all_reduce
benchmark functions.
- New --fallback_threshold CLI flag on benchmark_collectives_test
plumbed through to the five hierarchical test functions. Default
value -1 preserves existing behavior (library default threshold of
16). Pass 0 to force tree construction, or any other value to test
an arbitrary threshold.
- When an explicit threshold is supplied, results are written with
module name 'hierarchical_t{N}' instead of 'hierarchical' so flat
fallback and forced tree runs can be distinguished in the CSV
output.
This enables the measurement sweep requested by @hkaiser on the
parent PR: multiple threads per locality, across both all_reduce and
all_gather, with the fallback explicitly enabled and disabled for
direct comparison.
Force-pushed 187d863 to 66b9fc2.
@iemAnshuman Can we merge this now (I un-marked it as draft)?
Yes, please merge. Thanks for the review. @constracktor is running the full HPC benchmark sweep on a larger cluster; I will post the results as a follow-up comment once they land, and open a tuning PR if the threshold needs adjustment.
Summary
Adds a flat fallback to the `hierarchical_communicator` factory for site counts below a configurable threshold. Addresses the follow-up question from @hkaiser on #7160.

Benchmark context
Benchmark data from the parent PR #7160, measured on a DGX H100 compute node with `benchmark_collectives_test`, single int, 100 iterations, one HPX thread per locality:
Hierarchical arity=2 overtakes flat at 32 processes (1.34x speedup).
Arity=4 regresses at P≥16 and is tracked separately; this PR uses
arity=2 as the recommended default for ≥16 ranks. Default threshold
set to 16 per the suggestion in #7160. The exact optimal value likely
depends on node topology and interconnect, so the threshold is kept
configurable and flagged in "open items" below.
Design
The fallback lives entirely in `create_hierarchical_communicator`. When `num_sites < threshold`, the factory returns a `hierarchical_communicator` whose underlying `std::vector<tuple<communicator, this_site_arg>>` contains a single entry: a flat communicator spanning all N sites.

The tree-walking loops in the existing hierarchical overloads of `reduce_here`/`reduce_there`, `broadcast_to`/`broadcast_from`, `gather_here`/`gather_there` all collapse correctly when `size() == 1`. This is the same code path non-leader sites already take in normal tree mode (a leaf-only site has exactly one communicator in its vector), so this fallback is not a new state for the collective code; it is a state the collectives already handle in production.

This means zero changes to `all_reduce.hpp`, `all_gather.hpp`, `reduce.hpp`, `broadcast.hpp`, `gather.hpp`, or `scatter.hpp`. All correctness follows from the factory change.

API change
A new `flat_fallback_threshold_arg` (default `16`) is added as the last parameter of `create_hierarchical_communicator`. Pass `flat_fallback_threshold_arg(0)` to disable the fallback and always build a tree (useful for testing the tree path directly).

All existing callers use default arguments, so this is source-compatible.
Testing
New test `hierarchical_flat_fallback` verifies:

- when `num_sites < threshold`, the returned `hierarchical_communicator` has `size() == 1`
- `all_reduce` on the fallback communicator produces the correct result
- passing `flat_fallback_threshold_arg(0)` produces the same result

Tested locally at 1 and 2 localities over the TCP parcelport on macOS. All existing hierarchical tests continue to pass without modification.
Open items for discussion
The existing hierarchical tests (`all_reduce_hierarchical`, etc.) were not updated to pass `flat_fallback_threshold_arg(0)`. With the default threshold of 16, those tests now exercise the fallback path for site counts 2–8 and the tree path only for 16+. If preserving tree-path coverage at small site counts is important, I can update those tests in a follow-up commit; happy to do either.

Follow-ups
After this merges, the same pattern generalizes naturally to `all_to_all` once that collective is implemented hierarchically.