executors: fix performance regressions and optimize parallel_executor caching#7209
Conversation
This PR resolves critical performance regressions in `fork_join_executor` and `parallel_executor`:

1. **fork_join_executor**: Implemented the missing `tag_invoke` overloads for `processing_units_count_t` and `get_first_core_t`. This ensures the executor correctly reports the available cores, preventing fallbacks to sequential execution. Added validation to reject `thread_stacksize::nostack`.
2. **parallel_executor**: Implemented a dual-path `pu_mask` caching strategy to eliminate expensive recomputation. Unified the flat and hierarchical `parallel_policy_executor` specializations into a single template to reduce code duplication and complexity.
3. **Traits & Serialization**: Restored the `is_bulk_one_way_executor` trait and preserved the serialization order for backward compatibility.

Closes TheHPXProject#6919
Pull request overview
This PR targets performance regressions in HPX executors by ensuring fork_join_executor reports correct processing unit traits (avoiding sequential fallbacks) and by optimizing parallel_executor PU-mask handling while reducing template duplication.
Changes:
- Add missing executor trait customizations to `fork_join_executor` and validate the unsupported `nostack` stacksize.
- Add lazy caching of `pu_mask` in `parallel_policy_executor_base` to avoid repeated recomputation in hot paths.
- Unify flat vs. hierarchical `parallel_policy_executor` specializations into a single template with `if constexpr` and preserve the serialization format.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| libs/core/executors/include/hpx/executors/parallel_executor.hpp | Adds PU-mask caching and merges hierarchical/flat executor implementations using if constexpr. |
| libs/core/executors/include/hpx/executors/fork_join_executor.hpp | Implements missing trait CPOs and rejects thread_stacksize::nostack for correctness. |
```cpp
// `mask_` is conceptually mutable: use const_cast to cache the
// computed mask on first access, avoiding repeated recomputation.
*const_cast<hpx::threads::mask_type**>(&mask_) =
    new hpx::threads::mask_type(mask);
return *mask_;
```
`pu_mask()` now allocates and caches `mask_`, but `parallel_policy_executor_base::operator=` currently sets `mask_ = nullptr` without deleting the previously cached mask. With caching enabled this becomes a real memory leak when assigning an executor that has already computed its mask. Consider deleting the existing `mask_` before clearing it (and also handle the case where `pu_mask()` might overwrite an existing pointer).
```cpp
// `mask_` is conceptually mutable: use const_cast to cache the
// computed mask on first access, avoiding repeated recomputation.
*const_cast<hpx::threads::mask_type**>(&mask_) =
    new hpx::threads::mask_type(mask);
return *mask_;
```
The lazy initialization of `mask_` in `pu_mask()` is not thread-safe: concurrent calls can race on `mask_` (a data race, and potentially multiple allocations/leaks). Mutating `mask_` via `const_cast` is also fragile if an executor instance is ever truly const. Consider making the cache explicitly thread-safe (e.g., `std::atomic<mask_type*>` with `compare_exchange`, or `std::once_flag` with a `mutable` member).
```cpp
template <typename Parameters>
    requires(hpx::traits::is_executor_parameters_v<Parameters>)
friend std::size_t tag_invoke(
    hpx::execution::experimental::processing_units_count_t,
    Parameters&&, fork_join_executor const& exec,
    hpx::chrono::steady_duration const& = hpx::chrono::null_duration,
    std::size_t = 0)
{
    return exec.shared_data_->num_threads_;
}

friend std::size_t tag_invoke(
    hpx::execution::experimental::get_first_core_t,
    fork_join_executor const& exec) noexcept
{
    return hpx::threads::find_first(exec.shared_data_->pu_mask_);
}
```
This adds new trait customizations (`processing_units_count_t`, `get_first_core_t`) that directly impact algorithm scheduling decisions. There are extensive `fork_join_executor` unit/regression tests, but none appear to assert the reported processing unit count or first core; adding a regression test would help prevent the sequential-fallback performance issue from reappearing.
```diff
@@ -376,14 +376,11 @@ namespace hpx::execution {
         }
     }

-    // `mask_` is conceptually mutable, however in order to constexpr
-    // construct this object we need to use a `const_cast` to cache
-    // the mask.
-    //*const_cast<hpx::threads::mask_type**>(&mask_) =
-    //    new hpx::threads::mask_type(mask);
-    return mask;
+    // `mask_` is conceptually mutable: use const_cast to cache the
+    // computed mask on first access, avoiding repeated recomputation.
+    *const_cast<hpx::threads::mask_type**>(&mask_) =
+        new hpx::threads::mask_type(mask);
+    return *mask_;
```
Given the new caching behavior, it would be useful to add or extend a unit test that exercises repeated `get_processing_units_mask` calls (ensuring a stable result) and at least one copy/assignment scenario after the first call, to guard against regressions in the cache lifetime/ownership logic.
Description
This Pull Request addresses critical performance regressions in `fork_join_executor` and `parallel_executor` by implementing missing traits, optimizing caching strategies, and refactoring the executor template structure.

Key Changes

- Implemented `tag_invoke` overloads for `processing_units_count_t` and `get_first_core_t`. This prevents the executor from falling back to a default (often 1) core count, which was causing algorithms to run sequentially.
- Unified the `parallel_policy_executor` flat and hierarchical specializations into a single template using `if constexpr`, reducing code duplication and maintenance burden.
- Restored the `is_bulk_one_way_executor` specializations.

📊 Performance Results
The following results were obtained using the `foreach_report_test` benchmark with 8 threads in the standard Docker build environment:

These results confirm that the sequential fallback regression in `fork_join_executor` is fully resolved and the `parallel_executor` optimizations are performing as intended.

Closes
This PR supersedes and Closes #6919.