Optimize atomic operations on UP (single-VCPU)

The run time of mutex_lock() and mutex_unlock() is dominated by a single instruction, "lock xadd", which is generated by std::atomic::fetch_add().

On a single VCPU, the "lock" prefix isn't needed. Because the host is SMP it cannot ignore this prefix, but when the guest has a single VCPU, we know this prefix is not necessary. If we drop the "lock" prefix and use the ordinary increment instruction, the mutex becomes much faster - an uncontended lock/unlock pair drops from 22ns to just 9ns.  When mutexes are heavily used (e.g., in memcached they take as much as 20% of the run time), this can bring a noticable improvement.

What we should do is to remember where in the code we have the "lock" prefix (the single byte 0xf0), and when booting on a single vcpu, replace them by "nop" (0x90). Linux also has such a mechanism (see asm/alternative.h) - "LOCK_PREFIX" generates the "lock" instruction but also saves in a ".smp_locks" section the address of this lock, and any time the number of cpus grows beyond 1 or shrinks to 1, the code iterates over these locates and changes them to 0x90 or 0xf0.

Doing the above is easy if we implemented our own "fetch_add" and "compare_exchange" operation. However, currently we use C++11's std::atomic and it will be a shame to lose its advantages (like working on any processors, not just x86). Perhaps there's a solution, though: <atomic> uses the GCC builtins __atomic_fetch_add and friends (see http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). So if we re-implement those, it can be enough. I tried to redefine this function and got some strange compilation errors, but maybe by re-"#define"-ing it before including <atomic>, or some other ugly trick, we can force our own implementation.

A different approach we can consider (though it will probably be more complex) is to remove the lock prefix from all code in a certain function or section. This will be hard and risky, though - we need to understand where instructions begin and end, and what is code and what is not code. It will be safer if we can limit this transformation to single functions (such as lockfree_mutex_lock()) which are known not to be problematic in this regard.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Optimize atomic operations on UP (single-VCPU) #28

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Optimize atomic operations on UP (single-VCPU) #28

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions