
Optimize atomic operations on UP (single-VCPU) #28

@nyh

Description


The run time of mutex_lock() and mutex_unlock() is dominated by a single instruction, "lock xadd", which is generated by std::atomic::fetch_add().

On a single VCPU, the "lock" prefix isn't needed. Because the host is SMP, the processor cannot simply ignore this prefix, but when the guest has a single VCPU we know the prefix is unnecessary. If we drop the "lock" prefix and execute the plain, unlocked instruction, the mutex becomes much faster: an uncontended lock/unlock pair drops from 22ns to just 9ns. When mutexes are heavily used (e.g., in memcached they account for as much as 20% of the run time), this can bring a noticeable improvement.

What we should do is remember where in the code we have the "lock" prefix (the single byte 0xf0) and, when booting on a single VCPU, replace each occurrence with "nop" (0x90). Linux has a similar mechanism (see asm/alternative.h): "LOCK_PREFIX" emits the "lock" prefix and also records the address of that prefix in a ".smp_locks" section, and whenever the number of CPUs grows beyond 1 or shrinks back to 1, the kernel iterates over these locations and patches them, restoring the lock prefix or disabling it, respectively.

Doing the above would be easy if we implemented our own "fetch_add" and "compare_exchange" operations. However, we currently use C++11's std::atomic, and it would be a shame to lose its advantages (like working on any processor, not just x86). Perhaps there is a solution, though: std::atomic uses the GCC builtins __atomic_fetch_add and friends (see http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html), so re-implementing those may be enough. I tried to redefine these functions and got some strange compilation errors, but maybe by re-"#define"-ing them before including the header, or some other ugly trick, we can force our own implementation.

A different approach we could consider (though it will probably be more complex) is to remove the lock prefix from all code in a certain function or section. This would be hard and risky, though: we need to understand where instructions begin and end, and what is code and what is not. It would be safer to limit this transformation to individual functions (such as lockfree_mutex_lock()) that are known not to be problematic in this regard.
