
Add flat-collective fallback to hierarchical_communicator for small site counts #7193

Merged
hkaiser merged 2 commits into TheHPXProject:master from iemAnshuman:feature/hierarchical-flat-fallback
Apr 19, 2026

Conversation

@iemAnshuman
Contributor

Summary

Adds a flat fallback to the hierarchical_communicator factory for site counts below a configurable threshold. Addresses the follow-up question from @hkaiser on #7160:

> Should we fall back to the flat collectives if the number of connected sites is less than 16? This would require more measurements, but might be worth investigating.

Benchmark context

Benchmark data from the parent PR #7160, measured on a DGX H100 compute
node with benchmark_collectives_test, single int, 100 iterations, one
HPX thread per locality:

| Procs | Flat multi_use (μs) | Hier arity=2 (μs) | Hier arity=4 (μs) |
|------:|--------------------:|------------------:|------------------:|
| 4     | 50                  | 101               |                   |
| 8     | 98                  | 139               | 141               |
| 16    | 166                 | 198               | 849               |
| 32    | 465                 | 346               | 1,120             |

Hierarchical arity=2 overtakes flat at 32 processes (1.34x speedup).
Arity=4 regresses at P≥16 and is tracked separately; this PR uses
arity=2 as the recommended default for ≥16 ranks. Default threshold
set to 16 per the suggestion in #7160. The exact optimal value likely
depends on node topology and interconnect, so the threshold is kept
configurable and flagged in "open items" below.

Design

The fallback lives entirely in create_hierarchical_communicator. When num_sites < threshold, the factory returns a hierarchical_communicator whose underlying std::vector<tuple<communicator, this_site_arg>> contains a single entry: a flat communicator spanning all N sites.

The tree-walking loops in the existing hierarchical overloads of reduce_here/reduce_there, broadcast_to/broadcast_from, gather_here/gather_there all collapse correctly when size() == 1. This is the same code path non-leader sites already take in normal tree mode (a leaf-only site has exactly one communicator in its vector), so this fallback is not a new state for the collective code — it's a state the collectives already handle in production.

This means zero changes to all_reduce.hpp, all_gather.hpp, reduce.hpp, broadcast.hpp, gather.hpp, or scatter.hpp. All correctness follows from the factory change.

API change

A new flat_fallback_threshold_arg (default 16) is added as the last parameter of create_hierarchical_communicator:

```cpp
auto comm = create_hierarchical_communicator(
    basename, num_sites, this_site, arity, generation, root_site,
    flat_fallback_threshold_arg(16));  // default
```

Pass flat_fallback_threshold_arg(0) to disable the fallback and always build a tree (useful for testing the tree path directly).

All existing callers use default arguments, so this is source-compatible.

Testing

New test hierarchical_flat_fallback verifies:

  1. Structural invariant: when num_sites < threshold, the returned hierarchical_communicator has size() == 1
  2. Correctness under fallback: all_reduce on the fallback communicator produces the correct result
  3. Correctness under forced tree mode: passing flat_fallback_threshold_arg(0) produces the same result
  4. Local-mode coverage: the fallback works when multiple sites run as threads within a single process

Tested locally at 1 and 2 localities over the TCP parcelport on macOS. All existing hierarchical tests continue to pass without modification.

Open items for discussion

  • Default threshold value. I set it to 16 to match the suggestion in Add hierarchical all_reduce and all_gather via reduce+broadcast composition #7160, but the benchmark data above suggests a value of 24–32 may be closer to optimal on the DGX node tested. I'd appreciate input on whether to change the default or leave it at 16 pending multi-node measurements.
  • Preserving coverage of the tree path in existing tests. I did not modify the existing hierarchical unit tests (all_reduce_hierarchical, etc.) to pass flat_fallback_threshold_arg(0). With the default threshold of 16, those tests now exercise the fallback path for site counts 2–8 and the tree path only for 16+. If preserving tree-path coverage at small site counts is important, I can update those tests in a follow-up commit — happy to do either.
  • Multi-node validation. I don't have multi-node cluster access at the moment. Once I do, I'd like to rerun the benchmark at 2+ nodes over InfiniBand to confirm the threshold holds.

Follow-ups

After this merges, the same pattern generalizes naturally to all_to_all once that collective is implemented hierarchically.

@iemAnshuman iemAnshuman requested a review from hkaiser as a code owner April 17, 2026 00:00
@StellarBot

Can one of the admins verify this patch?

@codacy-production

codacy-production Bot commented Apr 17, 2026

Up to standards ✅

🟢 Issues: 0

Results:
0 new issues


@hkaiser
Contributor

hkaiser commented Apr 17, 2026

> Benchmark data from the parent PR #7160, measured on a DGX H100 compute node with benchmark_collectives_test, single int, 100 iterations, one HPX thread per locality:

Could you please collect benchmark data when running with more than one thread per locality? Using one thread per locality is not a representative use case.

@iemAnshuman
Contributor Author

> > Benchmark data from the parent PR #7160, measured on a DGX H100 compute node with benchmark_collectives_test, single int, 100 iterations, one HPX thread per locality:
>
> Could you please collect benchmark data when running with more than one thread per locality? Using one thread per locality is not a representative use case.

Thanks for the review. You are right that one thread per locality is not representative.

I am setting up a multi-thread sweep covering 16 to 256 processes at 1, 4, and 8 threads per locality, with 4 B and 1 MiB message sizes and 100-iteration averaging (matching the methodology from the IPTW 2025 talk). My UPES DGX queue is heavily loaded at the moment, with a previous job pending for 70+ hours, so turnaround may take a few days. I have also reached out to @constracktor in case a run on medusa is feasible in parallel.

Keeping the PR as draft until the data is in. I will post results as a comment here as soon as I have them.

@iemAnshuman iemAnshuman marked this pull request as draft April 17, 2026 08:37
@hkaiser
Contributor

hkaiser commented Apr 17, 2026

BTW, is the test failure for all_reduce_sync (on MacOS) related to your recent changes?

@iemAnshuman
Contributor Author

> BTW, is the test failure for all_reduce_sync (on MacOS) related to your recent changes?

No, unrelated. all_reduce_sync wraps the flat all_reduce overload, which goes through create_communicator. This PR only modifies create_hierarchical_communicator and adds an optional argument with a default; all_reduce.hpp is untouched. The git diff against master shows the changes are confined to create_communicator.{hpp,cpp}, argument_types.hpp, the new test, and the benchmark.

The macOS failure is a timeout (not an assertion) on debug TCP, and the other failing check is transpose_smp_block on Windows debug, unrelated to collectives. Both look like the usual debug CI flakes. Happy to re-run the failed jobs if you would like a clean board.

iemAnshuman added a commit to iemAnshuman/hpx that referenced this pull request Apr 18, 2026
Exercises site counts that are not clean multiples of the arity,
including configurations where tree recursion produces size-1
subgroups. Covers arity=2 with site counts {3, 5, 6, 7, 9, 10, 11, 15}
and arity=4 with {5, 6, 7, 9, 10, 11, 13, 15}.

These paths were previously uncovered. The hierarchical tree
construction in create_communicator.cpp handles them through the
division_steps and remainder logic, but no existing test verified
the behaviour. Companion coverage to the adaptive flat-fallback
work in TheHPXProject#7193.
@iemAnshuman
Contributor Author

@hkaiser Opened #7198 with companion non-power-of-arity tests for hierarchical all_reduce and all_gather. Test only, no production changes. Happy to fold into this PR instead if you prefer a single commit rather than two PRs.

…shold

Below a configurable site-count threshold, hierarchical collectives
perform worse than flat because tree-walking overhead exceeds the
synchronization-depth benefit. This change adds a fallback in the
create_hierarchical_communicator factory: when num_sites < threshold,
it produces a hierarchical_communicator whose underlying vector holds
a single flat communicator spanning all N sites.

The tree-walking loops in the hierarchical overloads of reduce,
broadcast, gather, scatter, all_reduce, and all_gather collapse to a
single flat call when size() == 1. This is also the code path that
non-leader sites already take in normal tree mode (a leaf-only site
has exactly one communicator in its vector), so no collective code
changes are required.

The threshold is exposed as a new flat_fallback_threshold_arg argument
to create_hierarchical_communicator with a default of 16, matching the
suggestion from PR TheHPXProject#7160. Pass 0 to disable the fallback and force
tree construction (useful for tests exercising the tree path).

New test hierarchical_flat_fallback verifies the structural invariant
(size() == 1 when num_sites < threshold) and correctness of the
resulting all_reduce, at both 1 and 2 localities and across site
counts 2, 4, 8.

Addresses follow-up comment by @hkaiser on TheHPXProject#7160.
Adds benchmark coverage for the hierarchical flat fallback:

- New test functions test_one_shot_use_all_gather,
  test_multiple_use_with_generation_all_gather, and
  test_all_gather_hierarchical mirroring the existing all_reduce
  benchmark functions.

- New --fallback_threshold CLI flag on benchmark_collectives_test
  plumbed through to the five hierarchical test functions. Default
  value -1 preserves existing behavior (library default threshold of
  16). Pass 0 to force tree construction, or any other value to test
  an arbitrary threshold.

- When an explicit threshold is supplied, results are written with
  module name 'hierarchical_t{N}' instead of 'hierarchical' so flat
  fallback and forced tree runs can be distinguished in the CSV
  output.

This enables the measurement sweep requested by @hkaiser on the
parent PR: multiple threads per locality, across both all_reduce and
all_gather, with the fallback explicitly enabled and disabled for
direct comparison.
@iemAnshuman iemAnshuman force-pushed the feature/hierarchical-flat-fallback branch from 187d863 to 66b9fc2 on April 19, 2026 11:05
Contributor

@hkaiser hkaiser left a comment


LGTM, thanks!

@hkaiser hkaiser added this to the 2.0.0 milestone Apr 19, 2026
@hkaiser hkaiser marked this pull request as ready for review April 19, 2026 15:08
@hkaiser
Contributor

hkaiser commented Apr 19, 2026

@iemAnshuman Can we merge this now (I un-marked it as draft)?

@iemAnshuman
Contributor Author

> @iemAnshuman Can we merge this now (I un-marked it as draft)?

Yes, please merge. Thanks for the review. @constracktor is running the full HPC benchmark sweep on a larger cluster; I will post the results as a follow up comment once they land, and open a tuning PR if the threshold needs adjustment.

@hkaiser hkaiser merged commit 4db60c8 into TheHPXProject:master Apr 19, 2026
66 of 67 checks passed

3 participants