performance: Optimize shared_mutex and fix C++20 modular build errors #7007
arpittkhandelwal wants to merge 9 commits into TheHPXProject:master
Conversation
Force-pushed f982021 to 72858ce
Performance test report — HPX Performance Comparison
Coverage summary from Codacy — see diff coverage on Codacy
Please keep the module-related changes separate (you could apply those to the PR that you have already open). Also, please have a look at the compilation errors reported (e.g., https://cdash.rostam.cct.lsu.edu/viewBuildError.php?buildid=42049)
Force-pushed 72858ce to c16cfa8
I've cleaned the branch to remove unrelated modularization changes and fixed the benchmark compilation error and formatting. It should be ready for review now!
Performance test report — HPX Performance Comparison
@arpittkhandelwal Please rebase onto master to fix the reported problems.
Force-pushed c16cfa8 to 9ffd2f2
I have pushed the rebased branch.

@arpittkhandelwal Are you still interested in working on this PR?
Force-pushed 9ffd2f2 to af1f93a
Yes sir, updated.
Force-pushed 4ea0f5f to 714ce34
Performance test report — HPX Performance Comparison
Force-pushed 608dc81 to 17510da
Performance test report — HPX Performance Comparison
@arpittkhandelwal What's your plan with regard to moving this forward?
Up to standards ✅

| Metric | Results |
|---|---|
| Issues | 🟢 |
| Complexity | 5 |
| Duplication | 0 |
Hi @hkaiser sir, I'll take another look at this PR, address the issues, and update it shortly.
…ctural state management
…nd update copyright
Force-pushed fdb37da to 6acdd90
@arpittkhandelwal thanks for the fixes. Could you please rerun the benchmark to see what impact the recent changes have had?
@hkaiser I re-ran the shared_mutex_overhead benchmark (4 threads, 1,000,000 iterations):
The improvement is nice but not overwhelming. Why is it much worse than your first numbers, do you know?
Sir, I have identified the source of the performance drop and finalized the optimizations. The "worse" performance in the previous run was due to hidden overhead in the `shared_mutex` wrapper class. Specifically, the wrapper was performing a redundant atomic increment/decrement on its internal `intrusive_ptr` for every lock/unlock call (via `auto data = data_;`). I have refactored the wrapper to bypass these refcount operations and combined it with the refined atomic state-management logic (reusing `s1` and eliminating redundant loads).

Updated benchmark results (4 threads, 1,000,000 iterations):
- Original baseline: 0.57s
Hmmm, we discussed this before (#7007 (comment)). The reason for those refcounts was to guarantee that the shared state outlives any operations on it. I think we should leave the additional refcount in place at least for the unlock operations.
Pull request overview
This PR optimizes hpx::shared_mutex to reduce contention in read-heavy workloads and adds a local performance benchmark to measure shared mutex overhead.
Changes:
- Added a lock-free fast path for `shared_mutex::lock_shared()` and a fast path for `unlock_shared()` when it doesn't need to wake waiters.
- Refactored `hpx::detail::shared_mutex` wrapper calls to avoid per-call `intrusive_ptr` refcount churn.
- Added and wired up a new local performance benchmark (`shared_mutex_overhead`).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp | Introduces fast paths and refactorings in shared mutex internals/wrapper. |
| tests/performance/local/shared_mutex_overhead.cpp | New benchmark measuring shared-lock overhead under multi-threaded readers. |
| tests/performance/local/CMakeLists.txt | Adds the new benchmark target and sets its parameters. |
Comments suppressed due to low confidence (1)
libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp:96
`set_state(..., lk)` no longer does the initial relaxed load check that could fast-reject stale `s1` values before taking `state_change`. With the new code, every failed CAS attempt will still acquire `state_change`, which can increase contention in the slow paths that call this helper (e.g. unlock/upgrade transitions). Consider reintroducing a cheap pre-check (or another mechanism) to avoid taking `state_change` when `s1` is already known-stale.
```cpp
bool set_state(shared_state& s1, shared_state& s,
    std::unique_lock<mutex_type>& lk) noexcept
{
    ++s.data.tag;
    lk = std::unique_lock<mutex_type>(state_change);
    if (state.compare_exchange_strong(s1, s, std::memory_order_release,
            std::memory_order_relaxed))
        return true;
    lk.unlock();
    return false;
}
```
```cpp
#include <hpx/synchronization/shared_mutex.hpp>

#include <cstdint>
#include <iostream>
```
`std::shared_lock` is used but the file does not include the standard header that defines it. `hpx/synchronization/shared_mutex.hpp` only includes `<mutex>`, so this may fail to compile on standard libraries that don't transitively include `<shared_mutex>`. Add an explicit `#include <shared_mutex>` (or otherwise include the header that provides `std::shared_lock`).
Suggested change:
```cpp
#include <iostream>
#include <shared_mutex>
```
I definitely overlooked the potential for the mutex itself to be destroyed while a thread is suspended in a slow-path or completion handler. Correctness is definitely the priority here.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
libs/core/synchronization/include/hpx/synchronization/shared_mutex.hpp:96
`set_state(..., lk)` no longer checks whether `s1` still matches the current atomic state before acquiring `state_change`. This means the slow path will now take the internal mutex even when the expected value is already stale, increasing contention/serialization in highly contended lock/unlock paths. Consider restoring the pre-check (or otherwise avoiding taking `state_change` unless the CAS has a realistic chance to succeed).
```cpp
bool set_state(shared_state& s1, shared_state& s,
    std::unique_lock<mutex_type>& lk) noexcept
{
    ++s.data.tag;
    lk = std::unique_lock<mutex_type>(state_change);
    if (state.compare_exchange_strong(
            s1, s, std::memory_order_release, std::memory_order_relaxed))
        return true;
    lk.unlock();
    return false;
}
```
```cpp
{
    for (std::uint64_t i = 0; i < num_iterations; ++i)
    {
        std::shared_lock<hpx::shared_mutex> l(mtx);
    }
```
`std::shared_lock` is used but the file does not include the standard header that defines it. This will fail to compile on standard library implementations where `shared_lock` is only provided by `<shared_mutex>` (it is not guaranteed to be available via the HPX headers included here). Add the appropriate standard include (or switch to an HPX-provided lock type if that's the intended API).
```cpp
if (data->try_unlock_shared_fast())
    return;
```
`shared_mutex::unlock_shared` now attempts `try_unlock_shared_fast()` and on failure immediately calls `unlock_shared()`, which re-loads the atomic state and repeats much of the decision logic. For common cases where the shared count is 1 (or upgrade/exclusive-wait flags are set), this adds an extra atomic load/branching to every unlock. Consider folding the fast path into `shared_mutex_data::unlock_shared()` (single state load) or otherwise structuring it to avoid double-reading the state on the fallback path.
This PR introduces performance optimizations for `hpx::shared_mutex` and resolves build issues encountered with C++20 modularity.

Key Changes:
- `lock_shared`: Added a fast path to `hpx::detail::shared_mutex_data::lock_shared` that attempts to acquire a shared lock using an atomic increment before falling back to the internal spinlock. This significantly reduces serialization in read-heavy scenarios, such as AGAS cache lookups.
- Refactored the `hpx::detail::shared_mutex` wrapper class to avoid redundant atomic increment/decrement operations on the internal `intrusive_ptr` on every call.
- Corrected the placement of `HPX_CXX_EXPORT` in `components_base_fwd.hpp` and `component_type.hpp` to ensure compatibility with C++20 modular builds.
- Added `tests/performance/local/shared_mutex_overhead.cpp` to quantify the overhead and contention of `shared_mutex`.

Performance Impact:
Benchmark results on a 4-thread reader-intensive workload (1,000,000 iterations per thread):

These optimizations will directly benefit high-concurrency read paths in HPX, particularly in the AGAS subsystem.