executors: fix performance regressions and optimize parallel_executor caching#7209
Conversation
This PR resolves critical performance regressions in `fork_join_executor` and `parallel_executor`:

1. **fork_join_executor**: Implemented the missing `tag_invoke` overloads for `processing_units_count_t` and `get_first_core_t`. This ensures the executor correctly reports the available cores, preventing fallbacks to sequential execution. Added validation to reject `thread_stacksize::nostack`.
2. **parallel_executor**: Implemented a dual-path `pu_mask` caching strategy to eliminate expensive recomputation. Unified the flat and hierarchical `parallel_policy_executor` specializations into a single template to reduce code duplication and complexity.
3. **Traits & Serialization**: Restored the `is_bulk_one_way_executor` trait and preserved the serialization order for backward compatibility.

Closes TheHPXProject#6919
Pull request overview
This PR targets performance regressions in HPX executors by ensuring fork_join_executor reports correct processing unit traits (avoiding sequential fallbacks) and by optimizing parallel_executor PU-mask handling while reducing template duplication.
Changes:
- Add missing executor trait customizations to `fork_join_executor` and validate the unsupported `nostack` stacksize.
- Add lazy caching of `pu_mask` in `parallel_policy_executor_base` to avoid repeated recomputation in hot paths.
- Unify flat vs. hierarchical `parallel_policy_executor` specializations into a single template with `if constexpr` and preserve the serialization format.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| libs/core/executors/include/hpx/executors/parallel_executor.hpp | Adds PU-mask caching and merges hierarchical/flat executor implementations using if constexpr. |
| libs/core/executors/include/hpx/executors/fork_join_executor.hpp | Implements missing trait CPOs and rejects thread_stacksize::nostack for correctness. |
```cpp
// `mask_` is conceptually mutable: use const_cast to cache the
// computed mask on first access, avoiding repeated recomputation.
*const_cast<hpx::threads::mask_type**>(&mask_) =
    new hpx::threads::mask_type(mask);
return *mask_;
```
`pu_mask()` now allocates and caches `mask_`, but `parallel_policy_executor_base::operator=` currently sets `mask_ = nullptr` without deleting the previously cached mask. With caching enabled this becomes a real memory leak when assigning an executor that has already computed its mask. Consider deleting the existing `mask_` before clearing it (and also handle the case where `pu_mask()` might overwrite an existing pointer).
```cpp
// `mask_` is conceptually mutable: use const_cast to cache the
// computed mask on first access, avoiding repeated recomputation.
*const_cast<hpx::threads::mask_type**>(&mask_) =
    new hpx::threads::mask_type(mask);
return *mask_;
```
The lazy initialization of `mask_` in `pu_mask()` is not thread-safe: concurrent calls can race on `mask_` (a data race, and potentially multiple allocations/leaks). Mutating `mask_` via `const_cast` is also fragile if an executor instance is ever truly const. Consider making the cache explicitly thread-safe (e.g., `std::atomic<mask_type*>` with `compare_exchange`, or `std::once_flag` with a `mutable` member).
```cpp
template <typename Parameters>
    requires(hpx::traits::is_executor_parameters_v<Parameters>)
friend std::size_t tag_invoke(
    hpx::execution::experimental::processing_units_count_t,
    Parameters&&, fork_join_executor const& exec,
    hpx::chrono::steady_duration const& = hpx::chrono::null_duration,
    std::size_t = 0)
{
    return exec.shared_data_->num_threads_;
}

friend std::size_t tag_invoke(
    hpx::execution::experimental::get_first_core_t,
    fork_join_executor const& exec) noexcept
{
    return hpx::threads::find_first(exec.shared_data_->pu_mask_);
}
```
This adds new trait customizations (`processing_units_count_t`, `get_first_core_t`) that directly impact algorithm scheduling decisions. There are extensive `fork_join_executor` unit/regression tests, but none appear to assert the reported processing unit count or first core; adding a regression test would help prevent the sequential-fallback performance issue from reappearing.
```diff
@@ -376,14 +376,11 @@ namespace hpx::execution {
         }
     }

-    // `mask_` is conceptually mutable, however in order to constexpr
-    // construct this object we need to use a `const_cast` to cache
-    // the mask.
-    //*const_cast<hpx::threads::mask_type**>(&mask_) =
-    //    new hpx::threads::mask_type(mask);
-    return mask;
+    // `mask_` is conceptually mutable: use const_cast to cache the
+    // computed mask on first access, avoiding repeated recomputation.
+    *const_cast<hpx::threads::mask_type**>(&mask_) =
+        new hpx::threads::mask_type(mask);
+    return *mask_;
```
Given the new caching behavior, it would be useful to add or extend a unit test that exercises repeated `get_processing_units_mask` calls (ensuring a stable result) and at least one copy/assignment scenario after the first call, to guard against regressions in the cache lifetime/ownership logic.
Description
This Pull Request addresses critical performance regressions in `fork_join_executor` and `parallel_executor` by implementing missing traits, optimizing caching strategies, and refactoring the executor template structure.

Key Changes

- Implemented `tag_invoke` overloads for `processing_units_count_t` and `get_first_core_t`. This prevents the executor from falling back to a default (often 1) core count, which was causing algorithms to run sequentially.
- Unified the `parallel_policy_executor` flat and hierarchical specializations into a single template using `if constexpr`, reducing code duplication and maintenance burden.
- Restored the `is_bulk_one_way_executor` specializations.

📊 Performance Results
The following results were obtained using the `foreach_report_test` benchmark with 8 threads in the standard Docker build environment:

These results confirm that the sequential fallback regression in `fork_join_executor` is fully resolved and the `parallel_executor` optimizations are performing as intended.

Closes
This PR supersedes and Closes #6919.