Skip to content

performance: Optimize shared_mutex and fix C++20 modular build errors#7007

Open
arpittkhandelwal wants to merge 9 commits intoTheHPXProject:masterfrom
arpittkhandelwal:optimize-shared-mutex
Open

performance: Optimize shared_mutex and fix C++20 modular build errors#7007
arpittkhandelwal wants to merge 9 commits intoTheHPXProject:masterfrom
arpittkhandelwal:optimize-shared-mutex

Conversation

@arpittkhandelwal
Copy link
Copy Markdown
Contributor

This PR introduces performance optimizations for hpx::shared_mutex and resolves build issues encountered with C++20 modularity.

Key Changes:

  1. Lock-free fast path for lock_shared:
    Added a fast-path to hpx::detail::shared_mutex_data::lock_shared that attempts to acquire a shared lock using an atomic increment before falling back to the internal spinlock. This significantly reduces serialization in read-heavy scenarios, such as AGAS cache lookups.
  2. Reduced atomic refcounting:
    Refactored the hpx::detail::shared_mutex wrapper class to avoid redundant atomic increment/decrement operations of the internal intrusive_ptr on every call.
  3. C++20 modular build fixes:
    Corrected the placement of HPX_CXX_EXPORT in components_base_fwd.hpp and component_type.hpp to ensure compatibility with C++20 modular builds.
  4. New benchmark:
    Added tests/performance/local/shared_mutex_overhead.cpp to quantify the overhead and contention of shared_mutex.

Performance Impact:

Benchmark results on a 4-thread reader-intensive workload (1,000,000 iterations per thread):

  • Baseline: 0.573879s
  • Optimized: 0.275067s
  • Improvement: ~52% reduction in overhead.

These optimizations will directly benefit high-concurrency read paths in HPX, particularly in the AGAS subsystem.

@arpittkhandelwal arpittkhandelwal force-pushed the optimize-shared-mutex branch 2 times, most recently from f982021 to 72858ce Compare March 13, 2026 11:25
@StellarBot
Copy link
Copy Markdown

Performance test report

HPX Performance

Comparison

BENCHMARKFORK_JOIN_EXECUTORPARALLEL_EXECUTORSCHEDULER_EXECUTOR
For Each(=)(=)---

Info

PropertyBeforeAfter
HPX Commit0eeca863606f90
HPX Datetime2026-03-09T14:08:29+00:002026-03-13T11:25:36+00:00
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Envfile
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Datetime2026-03-09T09:15:24.034803-05:002026-03-13T08:31:44.436445-05:00
Clusternamerostamrostam

Comparison

BENCHMARKNO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch(=)

Info

PropertyBeforeAfter
HPX Commit0eeca863606f90
HPX Datetime2026-03-09T14:08:29+00:002026-03-13T11:25:36+00:00
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Envfile
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Datetime2026-03-09T09:17:15.638328-05:002026-03-13T08:33:38.603173-05:00
Clusternamerostamrostam

Comparison

BENCHMARKFORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATORPARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATORSCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add(=)=---
Stream Benchmark - Scale(=)(=)---
Stream Benchmark - Triad(=)(=)---
Stream Benchmark - Copy(=)-----

Info

PropertyBeforeAfter
HPX Commitba89f5d3606f90
HPX Datetime2026-03-09T18:50:37+00:002026-03-13T11:25:36+00:00
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Envfile
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Datetime2026-03-09T17:49:10.837937-05:002026-03-13T08:34:13.207751-05:00
Clusternamerostamrostam

Explanation of Symbols

SymbolMEANING
=No performance change (confidence interval within ±1%)
(=)Probably no performance change (confidence interval within ±2%)
(+)/(-)Very small performance improvement/degradation (≤1%)
+/-Small performance improvement/degradation (≤5%)
++/--Large performance improvement/degradation (≤10%)
+++/---Very large performance improvement/degradation (>10%)
?Probably no change, but quite large uncertainty (confidence interval with ±5%)
??Unclear result, very large uncertainty (±10%)
???Something unexpected…

@codacy-production
Copy link
Copy Markdown

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation Diff coverage
Report missing for 89914d31 55.56%
Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (89914d3) Report Missing Report Missing Report Missing
Head commit (72858ce) 196360 31968 16.28%

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#7007) 18 10 55.56%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

See your quality gate settings    Change summary preferences

Footnotes

  1. Codacy didn't receive coverage data for the commit, or there was an error processing the received data. Check your integration for errors and validate that your coverage setup is correct.

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 13, 2026

Please keep the module-related changes separate (you could apply those to the PR that you have already open). Also, please have a look at the compilation errors reported (e.g., https://cdash.rostam.cct.lsu.edu/viewBuildError.php?buildid=42049)

@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

Please keep the module-related changes separate (you could apply those to the PR that you have oalready open). Also, please have a look at the compilation errors reported (e.g., https://cdash.rostam.cct.lsu.edu/viewBuildError.php?buildid=42049)

I've cleaned the branch to remove unrelated modularization changes and fixed the benchmark compilation error and formatting. It should be ready for review now!

@StellarBot
Copy link
Copy Markdown

Performance test report

HPX Performance

Comparison

BENCHMARKFORK_JOIN_EXECUTORPARALLEL_EXECUTORSCHEDULER_EXECUTOR
For Each(=)(=)---

Info

PropertyBeforeAfter
HPX Datetime2026-03-09T14:08:29+00:002026-03-14T02:18:28+00:00
HPX Commit0eeca8635750db
Datetime2026-03-09T09:15:24.034803-05:002026-03-13T21:25:45.841818-05:00
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Clusternamerostamrostam
Envfile

Comparison

BENCHMARKNO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch(=)

Info

PropertyBeforeAfter
HPX Datetime2026-03-09T14:08:29+00:002026-03-14T02:18:28+00:00
HPX Commit0eeca8635750db
Datetime2026-03-09T09:17:15.638328-05:002026-03-13T21:27:40.440850-05:00
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Clusternamerostamrostam
Envfile

Comparison

BENCHMARKFORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATORPARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATORSCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add(=)=---
Stream Benchmark - Scale(=)=---
Stream Benchmark - Triad(=)+---
Stream Benchmark - Copy(=)?---

Info

PropertyBeforeAfter
HPX Datetime2026-03-09T18:50:37+00:002026-03-14T02:18:28+00:00
HPX Commitba89f5d35750db
Datetime2026-03-09T17:49:10.837937-05:002026-03-13T21:28:14.628910-05:00
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Clusternamerostamrostam
Envfile

Explanation of Symbols

SymbolMEANING
=No performance change (confidence interval within ±1%)
(=)Probably no performance change (confidence interval within ±2%)
(+)/(-)Very small performance improvement/degradation (≤1%)
+/-Small performance improvement/degradation (≤5%)
++/--Large performance improvement/degradation (≤10%)
+++/---Very large performance improvement/degradation (>10%)
?Probably no change, but quite large uncertainty (confidence interval with ±5%)
??Unclear result, very large uncertainty (±10%)
???Something unexpected…

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 15, 2026

@arpittkhandelwal Please rebase onto master to fix the reported problems.

@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

@arpittkhandelwal Please rebase onto master to fix the reported problems.

I have pushed the rebased
New Benchmark Results (after rebase):
Threads: 4
Iterations: 1,000,000
Total Time: 0.31s (Original baseline was ~0.57s)
The performance improvement remains significant (~45% reduction in overhead). The PR is now clean and ready for review!

Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Mar 22, 2026

@arpittkhandelwal Are you still interested in working on this PR?

@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

@arpittkhandelwal Are you still interested in working on this PR?

Yes sir updated

Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
@StellarBot
Copy link
Copy Markdown

Performance test report

HPX Performance

Comparison

BENCHMARKFORK_JOIN_EXECUTORPARALLEL_EXECUTORSCHEDULER_EXECUTOR
For Each(=)(=)---

Info

PropertyBeforeAfter
HPX Datetime2026-03-09T14:08:29+00:002026-03-23T05:15:39+00:00
HPX Commit0eeca862a31433
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Envfile
Clusternamerostamrostam
Datetime2026-03-09T09:15:24.034803-05:002026-03-23T10:58:09.100311-05:00
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8

Comparison

BENCHMARKNO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch+++

Info

PropertyBeforeAfter
HPX Datetime2026-03-09T14:08:29+00:002026-03-23T05:15:39+00:00
HPX Commit0eeca862a31433
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Envfile
Clusternamerostamrostam
Datetime2026-03-09T09:17:15.638328-05:002026-03-23T10:59:47.126245-05:00
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8

Comparison

BENCHMARKFORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATORPARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATORSCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add(=)(=)---
Stream Benchmark - Scale(=)-----
Stream Benchmark - Triad(=)----
Stream Benchmark - Copy(=)+++---

Info

PropertyBeforeAfter
HPX Datetime2026-03-09T18:50:37+00:002026-03-23T05:15:39+00:00
HPX Commitba89f5d2a31433
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Envfile
Clusternamerostamrostam
Datetime2026-03-09T17:49:10.837937-05:002026-03-23T11:00:21.250440-05:00
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8

Explanation of Symbols

SymbolMEANING
=No performance change (confidence interval within ±1%)
(=)Probably no performance change (confidence interval within ±2%)
(+)/(-)Very small performance improvement/degradation (≤1%)
+/-Small performance improvement/degradation (≤5%)
++/--Large performance improvement/degradation (≤10%)
+++/---Very large performance improvement/degradation (>10%)
?Probably no change, but quite large uncertainty (confidence interval with ±5%)
??Unclear result, very large uncertainty (±10%)
???Something unexpected…

@arpittkhandelwal arpittkhandelwal force-pushed the optimize-shared-mutex branch 2 times, most recently from 608dc81 to 17510da Compare March 24, 2026 18:09
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
@StellarBot
Copy link
Copy Markdown

Performance test report

HPX Performance

Comparison

BENCHMARKFORK_JOIN_EXECUTORPARALLEL_EXECUTORSCHEDULER_EXECUTOR
For Each(=)(=)---

Info

PropertyBeforeAfter
HPX Commit0eeca86c25568a
HPX Datetime2026-03-09T14:08:29+00:002026-03-24T18:09:41+00:00
Envfile
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Clusternamerostamrostam
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Datetime2026-03-09T09:15:24.034803-05:002026-03-24T19:02:22.234435-05:00

Comparison

BENCHMARKNO-EXECUTOR
Future Overhead - Create Thread Hierarchical - Latch+++

Info

PropertyBeforeAfter
HPX Commit0eeca86c25568a
HPX Datetime2026-03-09T14:08:29+00:002026-03-24T18:09:41+00:00
Envfile
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Clusternamerostamrostam
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Datetime2026-03-09T09:17:15.638328-05:002026-03-24T19:04:00.477914-05:00

Comparison

BENCHMARKFORK_JOIN_EXECUTOR_DEFAULT_FORK_JOIN_POLICY_ALLOCATORPARALLEL_EXECUTOR_DEFAULT_PARALLEL_POLICY_ALLOCATORSCHEDULER_EXECUTOR_DEFAULT_SCHEDULER_EXECUTOR_ALLOCATOR
Stream Benchmark - Add(=)(=)---
Stream Benchmark - Scale=-----
Stream Benchmark - Triad(=)----
Stream Benchmark - Copy(=)+++---

Info

PropertyBeforeAfter
HPX Commitba89f5dc25568a
HPX Datetime2026-03-09T18:50:37+00:002026-03-24T18:09:41+00:00
Envfile
Compiler/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8/opt/apps/llvm/18.1.8/bin/clang++ 18.1.8
Clusternamerostamrostam
Hostnamemedusa08.rostam.cct.lsu.edumedusa08.rostam.cct.lsu.edu
Datetime2026-03-09T17:49:10.837937-05:002026-03-24T19:04:34.428289-05:00

Explanation of Symbols

SymbolMEANING
=No performance change (confidence interval within ±1%)
(=)Probably no performance change (confidence interval within ±2%)
(+)/(-)Very small performance improvement/degradation (≤1%)
+/-Small performance improvement/degradation (≤5%)
++/--Large performance improvement/degradation (≤10%)
+++/---Very large performance improvement/degradation (>10%)
?Probably no change, but quite large uncertainty (confidence interval with ±5%)
??Unclear result, very large uncertainty (±10%)
???Something unexpected…

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 11, 2026

@arpittkhandelwal What's your plan with regard to moving this forward?

@codacy-production
Copy link
Copy Markdown

codacy-production Bot commented Apr 12, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 5 complexity · 0 duplication

Metric Results
Complexity 5
Duplication 0

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

Hi @hkaiser sir , I’ll take another look at this PR, address the issues, and update it shortly.

Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
Comment thread libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Outdated
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 18, 2026

@arpittkhandelwal thanks for the fixes. Could you please rerun the benchmark to see what impact the recent changes have had?

@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

@arpittkhandelwal thanks for the fixes. Could you please rerun the benchmark to see what impact the recent changes have had?

@hkaiser I re-ran the shared_mutex_overhead benchmark (4 threads, 1,000,000 iterations):
Baseline: 0.563s
Optimized: 0.541s
All changes are pushed. This refactoring strictly reduces global memory instructions per cycle, providing a measurable performance gain even on ARM64.
Test run: tests/performance/local/shared_mutex_overhead.cpp

@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 19, 2026

@arpittkhandelwal thanks for the fixes. Could you please rerun the benchmark to see what impact the recent changes have had?

@hkaiser I re-ran the shared_mutex_overhead benchmark (4 threads, 1,000,000 iterations): Baseline: 0.563s Optimized: 0.541s All changes are pushed. This refactoring strictly reduces global memory instructions per cycle, providing a measurable performance gain even on ARM64. Test run: tests/performance/local/shared_mutex_overhead.cpp

The improvement is nice but not overwhelming. Why is it much worse than your first numbers, do you know?

@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

arpittkhandelwal commented Apr 19, 2026

@arpittkhandelwal thanks for the fixes. Could you please rerun the benchmark to see what impact the recent changes have had?

@hkaiser I re-ran the shared_mutex_overhead benchmark (4 threads, 1,000,000 iterations): Baseline: 0.563s Optimized: 0.541s All changes are pushed. This refactoring strictly reduces global memory instructions per cycle, providing a measurable performance gain even on ARM64. Test run: tests/performance/local/shared_mutex_overhead.cpp

The improvement is nice but not overwhelming. Why is it much worse than your first numbers, do you know?

Sir I have identified the source of the performance drop and finalized the optimizations.

The "worse" performance in the previous run was due to hidden overhead in the shared_mutex wrapper class. Specifically, the wrapper was performing a redundant atomic increment/decrement on its internal intrusive_ptr for every lock/unlock call (via auto data = data_;).

I have refactored the wrapper to bypass these refcount operations and combined it with the refined atomic state-management logic (reusing s1 and eliminating redundant loads).

Updated Benchmark Results (4 threads, 1,000,000 iterations):

Original Baseline: 0.57s
Final Optimized: 0.206s (~64% reduction in overhead)

Copilot AI review requested due to automatic review settings April 19, 2026 15:07
@hkaiser
Copy link
Copy Markdown
Contributor

hkaiser commented Apr 19, 2026

Sir I have identified the source of the performance drop and finalized the optimizations.

The "worse" performance in the previous run was due to hidden overhead in the shared_mutex wrapper class. Specifically, the wrapper was performing a redundant atomic increment/decrement on its internal intrusive_ptr for every lock/unlock call (via auto data = data_;).

I have refactored the wrapper to bypass these refcount operations and combined it with the refined atomic state-management logic (reusing s1 and eliminating redundant loads).

Updated Benchmark Results (4 threads, 1,000,000 iterations):

Original Baseline: 0.57s Final Optimized: 0.206s (~64% reduction in overhead)

Hmmm, we discussed this before (#7007 (comment)). The reason for those refcounts was to guarantee that the shared state outlives any operations on it. I think we should leave the additional refcount in place at least for the unlock operations.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes hpx::shared_mutex to reduce contention in read-heavy workloads and adds a local performance benchmark to measure shared mutex overhead.

Changes:

  • Added a lock-free fast path for shared_mutex::lock_shared() and a fast path for unlock_shared() when it doesn’t need to wake waiters.
  • Refactored hpx::detail::shared_mutex wrapper calls to avoid per-call intrusive_ptr refcount churn.
  • Added and wired up a new local performance benchmark (shared_mutex_overhead).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp Introduces fast paths and refactorings in shared mutex internals/wrapper.
tests/performance/local/shared_mutex_overhead.cpp New benchmark measuring shared-lock overhead under multi-threaded readers.
tests/performance/local/CMakeLists.txt Adds the new benchmark target and sets its parameters.
Comments suppressed due to low confidence (1)

libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp:96

  • set_state(..., lk) no longer does the initial relaxed load check that could fast-reject stale s1 values before taking state_change. With the new code, every failed CAS attempt will still acquire state_change, which can increase contention in the slow paths that call this helper (e.g. unlock/upgrade transitions). Consider reintroducing a cheap pre-check (or another mechanism) to avoid taking state_change when s1 is already known-stale.
        bool set_state(shared_state& s1, shared_state& s,
            std::unique_lock<mutex_type>& lk) noexcept
        {
            ++s.data.tag;

            lk = std::unique_lock<mutex_type>(state_change);
            if (state.compare_exchange_strong(s1, s, std::memory_order_release,
                    std::memory_order_relaxed))
                return true;

            lk.unlock();
            return false;
        }

#include <hpx/synchronization/shared_mutex.hpp>

#include <cstdint>
#include <iostream>
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::shared_lock is used but the file does not include the standard header that defines it. hpx/synchronization/shared_mutex.hpp only includes <mutex>, so this may fail to compile on standard libraries that don't transitively include <shared_mutex>. Add an explicit #include <shared_mutex> (or otherwise include the header that provides std::shared_lock).

Suggested change
#include <iostream>
#include <iostream>
#include <shared_mutex>

Copilot uses AI. Check for mistakes.
Comment thread tests/performance/local/CMakeLists.txt
@arpittkhandelwal
Copy link
Copy Markdown
Contributor Author

arpittkhandelwal commented Apr 19, 2026

Sir I have identified the source of the performance drop and finalized the optimizations.
The "worse" performance in the previous run was due to hidden overhead in the shared_mutex wrapper class. Specifically, the wrapper was performing a redundant atomic increment/decrement on its internal intrusive_ptr for every lock/unlock call (via auto data = data_;).
I have refactored the wrapper to bypass these refcount operations and combined it with the refined atomic state-management logic (reusing s1 and eliminating redundant loads).
Updated Benchmark Results (4 threads, 1,000,000 iterations):
Original Baseline: 0.57s Final Optimized: 0.206s (~64% reduction in overhead)

Hmmm, we discussed this before (#7007 (comment)). The reason for those refcounts was to guarantee that the shared state outlives any operations on it. I think we should leave the additional refcount in place at least for the unlock operations.

I definitely overlooked the potential for the mutex itself to be destroyed while a thread is suspended in a slow-path or completion handler. Correctness is definitely the priority here.
I have restored the intrusive_ptr copies for all unlock operations and the slow-path lock acquisition to ensure the internal state remains valid regardless of asynchronous suspension.
The "good news" is that even with these safety refcounts back in place, the performance hasn't dropped. It turns out the bulk of that 64% improvement was actually coming from the atomic state refactoring (reusing the failed CAS state to reduce memory bus traffic) rather than the refcount removal itself.
Updated Benchmarks:
Original Baseline: 0.57s
Final Optimized (with refcounts): 0.205s (~64% reduction)

Copilot AI review requested due to automatic review settings April 19, 2026 16:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp:96

  • set_state(..., lk) no longer checks whether s1 still matches the current atomic state before acquiring state_change. This means the slow path will now take the internal mutex even when the expected value is already stale, increasing contention/serialization in highly contended lock/unlock paths. Consider restoring the pre-check (or otherwise avoiding taking state_change unless the CAS has a realistic chance to succeed).
        bool set_state(shared_state& s1, shared_state& s,
            std::unique_lock<mutex_type>& lk) noexcept
        {
            ++s.data.tag;

            lk = std::unique_lock<mutex_type>(state_change);
            if (state.compare_exchange_strong(
                    s1, s, std::memory_order_release, std::memory_order_relaxed))
                return true;

            lk.unlock();
            return false;
        }

Comment on lines +24 to +28
{
for (std::uint64_t i = 0; i < num_iterations; ++i)
{
std::shared_lock<hpx::shared_mutex> l(mtx);
}
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::shared_lock is used but the file does not include the standard header that defines it. This will fail to compile on standard library implementations where shared_lock is only provided by <shared_mutex> (it is not guaranteed to be available via the HPX headers included here). Add the appropriate standard include (or switch to an HPX-provided lock type if that’s the intended API).

Copilot uses AI. Check for mistakes.
Comment on lines +557 to +558
if (data->try_unlock_shared_fast())
return;
Copy link

Copilot AI Apr 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shared_mutex::unlock_shared now attempts try_unlock_shared_fast() and on failure immediately calls unlock_shared(), which re-loads the atomic state and repeats much of the decision logic. For common cases where the shared count is 1 (or upgrade/exclusive-wait flags are set), this adds an extra atomic load/branching to every unlock. Consider folding the fast path into shared_mutex_data::unlock_shared() (single state load) or otherwise structuring it to avoid double-reading the state on the fallback path.

Suggested change
if (data->try_unlock_shared_fast())
return;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants