
Add flat-collective fallback to hierarchical_communicator for small site counts #7193

Merged
hkaiser merged 2 commits into TheHPXProject:master from iemAnshuman:feature/hierarchical-flat-fallback
Apr 19, 2026

Conversation

@iemAnshuman
Contributor

Summary

Adds a flat fallback to the hierarchical_communicator factory for site counts below a configurable threshold. Addresses the follow-up question from @hkaiser on #7160:

> Should we fall back to the flat collectives if the number of connected sites is less than 16? This would require more measurements, but might be worth investigating.

Benchmark context

Benchmark data from the parent PR #7160, measured on a DGX H100 compute
node with benchmark_collectives_test, single int, 100 iterations, one
HPX thread per locality:

| Procs | Flat multi_use (μs) | Hier arity=2 (μs) | Hier arity=4 (μs) |
|------:|--------------------:|------------------:|------------------:|
| 4     | 50                  | 101               |                   |
| 8     | 98                  | 139               | 141               |
| 16    | 166                 | 198               | 849               |
| 32    | 465                 | 346               | 1,120             |

Hierarchical arity=2 overtakes flat at 32 processes (1.34x speedup).
Arity=4 regresses at P≥16 and is tracked separately; this PR uses
arity=2 as the recommended default for ≥16 ranks. Default threshold
set to 16 per the suggestion in #7160. The exact optimal value likely
depends on node topology and interconnect, so the threshold is kept
configurable and flagged in "open items" below.

Design

The fallback lives entirely in create_hierarchical_communicator. When num_sites < threshold, the factory returns a hierarchical_communicator whose underlying std::vector<tuple<communicator, this_site_arg>> contains a single entry: a flat communicator spanning all N sites.

The tree-walking loops in the existing hierarchical overloads of reduce_here/reduce_there, broadcast_to/broadcast_from, gather_here/gather_there all collapse correctly when size() == 1. This is the same code path non-leader sites already take in normal tree mode (a leaf-only site has exactly one communicator in its vector), so this fallback is not a new state for the collective code — it's a state the collectives already handle in production.

This means zero changes to all_reduce.hpp, all_gather.hpp, reduce.hpp, broadcast.hpp, gather.hpp, or scatter.hpp. All correctness follows from the factory change.

API change

A new flat_fallback_threshold_arg (default 16) is added as the last parameter of create_hierarchical_communicator:

```cpp
auto comm = create_hierarchical_communicator(
    basename, num_sites, this_site, arity, generation, root_site,
    flat_fallback_threshold_arg(16));  // default
```

Pass flat_fallback_threshold_arg(0) to disable the fallback and always build a tree (useful for testing the tree path directly).

All existing callers use default arguments, so this is source-compatible.

Testing

New test hierarchical_flat_fallback verifies:

  1. Structural invariant: when num_sites < threshold, the returned hierarchical_communicator has size() == 1
  2. Correctness under fallback: all_reduce on the fallback communicator produces the correct result
  3. Correctness under forced tree mode: passing flat_fallback_threshold_arg(0) produces the same result
  4. Local-mode coverage: the fallback works when multiple sites run as threads within a single process

Tested locally at 1 and 2 localities over the TCP parcelport on macOS. All existing hierarchical tests continue to pass without modification.

Open items for discussion

  • Default threshold value. I set it to 16 to match the suggestion in Add hierarchical all_reduce and all_gather via reduce+broadcast composition #7160, but the benchmark data above suggests a value of 24–32 may be closer to optimal on the DGX node tested. I'd appreciate input on whether to change the default or leave it at 16 pending multi-node measurements.
  • Preserving coverage of the tree path in existing tests. I did not modify the existing hierarchical unit tests (all_reduce_hierarchical, etc.) to pass flat_fallback_threshold_arg(0). With the default threshold of 16, those tests now exercise the fallback path for site counts 2–8 and the tree path only for 16+. If preserving tree-path coverage at small site counts is important, I can update those tests in a follow-up commit — happy to do either.
  • Multi-node validation. I don't have multi-node cluster access at the moment. Once I do, I'd like to rerun the benchmark at 2+ nodes over InfiniBand to confirm the threshold holds.

Follow-ups

After this merges, the same pattern generalizes naturally to all_to_all once that collective is implemented hierarchically.

@iemAnshuman iemAnshuman requested a review from hkaiser as a code owner April 17, 2026 00:00
@StellarBot

Can one of the admins verify this patch?

@codacy-production

codacy-production Bot commented Apr 17, 2026

Up to standards ✅

🟢 Issues: 0

Results:
0 new issues


@hkaiser
Contributor

hkaiser commented Apr 17, 2026

> Benchmark data from the parent PR #7160, measured on a DGX H100 compute node with benchmark_collectives_test, single int, 100 iterations, one HPX thread per locality:

Could you please collect benchmark data when running with more than one thread per locality? Using one thread per locality is not a representative use case.

@iemAnshuman
Contributor Author

> > Benchmark data from the parent PR #7160, measured on a DGX H100 compute node with benchmark_collectives_test, single int, 100 iterations, one HPX thread per locality:
>
> Could you please collect benchmark data when running with more than one thread per locality? Using one thread per locality is not a representative use case.

Thanks for the review. You are right that one thread per locality is not representative.

I am setting up a multi-thread sweep covering 16 to 256 processes at 1, 4, and 8 threads per locality, with 4 B and 1 MiB message sizes and 100-iteration averaging (matching the methodology from the IPTW 2025 talk). My UPES DGX queue is heavily loaded at the moment, with a previous job pending for 70+ hours, so turnaround may take a few days. I have also reached out to @constracktor in case a run on medusa is feasible in parallel.

Keeping the PR as draft until the data is in. I will post results as a comment here as soon as I have them.

@iemAnshuman iemAnshuman marked this pull request as draft April 17, 2026 08:37
@hkaiser
Contributor

hkaiser commented Apr 17, 2026

BTW, is the test failure for all_reduce_sync (on MacOS) related to your recent changes?

@iemAnshuman
Contributor Author

> BTW, is the test failure for all_reduce_sync (on MacOS) related to your recent changes?

No, unrelated. all_reduce_sync wraps the flat all_reduce overload, which goes through create_communicator. This PR only modifies create_hierarchical_communicator and adds an optional argument with a default; all_reduce.hpp is untouched. The git diff against master shows the changes are confined to create_communicator.{hpp,cpp}, argument_types.hpp, the new test, and the benchmark.

The macOS failure is a timeout (not an assertion) on debug TCP, and the other failing check is transpose_smp_block on Windows debug, unrelated to collectives. Both look like the usual debug CI flakes. Happy to re-run the failed jobs if you would like a clean board.

iemAnshuman added a commit to iemAnshuman/hpx that referenced this pull request Apr 18, 2026
Exercises site counts that are not clean multiples of the arity,
including configurations where tree recursion produces size-1
subgroups. Covers arity=2 with site counts {3, 5, 6, 7, 9, 10, 11, 15}
and arity=4 with {5, 6, 7, 9, 10, 11, 13, 15}.

These paths were previously uncovered. The hierarchical tree
construction in create_communicator.cpp handles them through the
division_steps and remainder logic, but no existing test verified
the behaviour. Companion coverage to the adaptive flat-fallback
work in TheHPXProject#7193.
@iemAnshuman
Contributor Author

@hkaiser Opened #7198 with companion non-power-of-arity tests for hierarchical all_reduce and all_gather. Test only, no production changes. Happy to fold into this PR instead if you prefer a single commit rather than two PRs.

…shold

Below a configurable site-count threshold, hierarchical collectives
perform worse than flat because tree-walking overhead exceeds the
synchronization-depth benefit. This change adds a fallback in the
create_hierarchical_communicator factory: when num_sites < threshold,
it produces a hierarchical_communicator whose underlying vector holds
a single flat communicator spanning all N sites.

The tree-walking loops in the hierarchical overloads of reduce,
broadcast, gather, scatter, all_reduce, and all_gather collapse to a
single flat call when size() == 1. This is also the code path that
non-leader sites already take in normal tree mode (a leaf-only site
has exactly one communicator in its vector), so no collective code
changes are required.

The threshold is exposed as a new flat_fallback_threshold_arg argument
to create_hierarchical_communicator with a default of 16, matching the
suggestion from PR TheHPXProject#7160. Pass 0 to disable the fallback and force
tree construction (useful for tests exercising the tree path).

New test hierarchical_flat_fallback verifies the structural invariant
(size() == 1 when num_sites < threshold) and correctness of the
resulting all_reduce, at both 1 and 2 localities and across site
counts 2, 4, 8.

Addresses follow-up comment by @hkaiser on TheHPXProject#7160.
Adds benchmark coverage for the hierarchical flat fallback:

- New test functions test_one_shot_use_all_gather,
  test_multiple_use_with_generation_all_gather, and
  test_all_gather_hierarchical mirroring the existing all_reduce
  benchmark functions.

- New --fallback_threshold CLI flag on benchmark_collectives_test
  plumbed through to the five hierarchical test functions. Default
  value -1 preserves existing behavior (library default threshold of
  16). Pass 0 to force tree construction, or any other value to test
  an arbitrary threshold.

- When an explicit threshold is supplied, results are written with
  module name 'hierarchical_t{N}' instead of 'hierarchical' so flat
  fallback and forced tree runs can be distinguished in the CSV
  output.

This enables the measurement sweep requested by @hkaiser on the
parent PR: multiple threads per locality, across both all_reduce and
all_gather, with the fallback explicitly enabled and disabled for
direct comparison.
@iemAnshuman iemAnshuman force-pushed the feature/hierarchical-flat-fallback branch from 187d863 to 66b9fc2 on April 19, 2026 11:05
Contributor

@hkaiser hkaiser left a comment


LGTM, thanks!

@hkaiser hkaiser added this to the 2.0.0 milestone Apr 19, 2026
@hkaiser hkaiser marked this pull request as ready for review April 19, 2026 15:08
@hkaiser
Contributor

hkaiser commented Apr 19, 2026

@iemAnshuman Can we merge this now (I un-marked it as draft)?

@iemAnshuman
Contributor Author

> @iemAnshuman Can we merge this now (I un-marked it as draft)?

Yes, please merge. Thanks for the review. @constracktor is running the full HPC benchmark sweep on a larger cluster; I will post the results as a follow up comment once they land, and open a tuning PR if the threshold needs adjustment.

@hkaiser hkaiser merged commit 4db60c8 into TheHPXProject:master Apr 19, 2026
66 of 67 checks passed

3 participants