We currently use inefficient schoolbook multiplication for mul_wide(). It would be much better to use an optimized algorithm, e.g. Karatsuba; there is already a TODO in the source code to do so.
This is important for our purposes because optimizing mul_wide() will speed up mul_montgomery_form() in DynResidue and Residue, which is a bottleneck for pow(), which is, in turn, a bottleneck for our system. In fact, mul_montgomery_form() is just a mul_wide() followed by a montgomery_reduction().
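To make the Karatsuba idea concrete, here is a minimal sketch of a single Karatsuba step, shown on a 64x64 -> 128-bit product split into 32-bit halves rather than on full Uint limb arrays. The function name and the standalone form are purely illustrative, not crypto-bigint API; the real mul_wide() would apply the same recursion to limb slices.

```rust
/// One Karatsuba step: x = x1*2^32 + x0, y = y1*2^32 + y0,
/// x*y = z2*2^64 + z1*2^32 + z0 with z1 = (x1+x0)*(y1+y0) - z2 - z0,
/// i.e. three half-size multiplications instead of four.
fn karatsuba_64(x: u64, y: u64) -> u128 {
    let (x1, x0) = (x >> 32, x & 0xffff_ffff);
    let (y1, y0) = (y >> 32, y & 0xffff_ffff);

    let z2 = (x1 as u128) * (y1 as u128); // high * high
    let z0 = (x0 as u128) * (y0 as u128); // low * low
    // x1+x0 and y1+y0 fit in 33 bits, so their product fits comfortably in u128.
    let z1 = ((x1 + x0) as u128) * ((y1 + y0) as u128) - z2 - z0;

    (z2 << 64) + (z1 << 32) + z0
}

fn main() {
    let (x, y) = (0xdead_beef_0123_4567_u64, 0xfeed_face_89ab_cdef_u64);
    assert_eq!(karatsuba_64(x, y), (x as u128) * (y as u128));
}
```

The gain comes from applying this split recursively: multiplying n-limb numbers then costs O(n^log2(3)) ≈ O(n^1.585) limb multiplications instead of the O(n^2) of schoolbook multiplication.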
I evaluated this experimentally, and the benchmark results show that the running times of mul_wide() and montgomery_reduction() are roughly equal, at about 4µs each:
```
wrapping ops/mul, U4096*U4096
    time: [4.2064 µs 4.2192 µs 4.2324 µs]
wrapping ops/rem, U4096/U4096
    time: [555.09 ns 570.83 ns 589.77 ns]
wrapping ops/rem, U8192/U4096, full size
    time: [1.7831 ms 1.7876 ms 1.7934 ms]
Montgomery arithmetic/DynResidue retrieve
    time: [4.1380 µs 4.1579 µs 4.1820 µs]
Montgomery arithmetic/multiplication, U4096*U4096
    time: [8.4631 µs 8.5012 µs 8.5433 µs]
Montgomery arithmetic/modpow, U4096^U4096
    time: [35.566 ms 36.803 ms 38.838 ms]
```
From our calculation, Karatsuba improves mul_wide() by a factor of 12.62x. That would leave montgomery_reduction() as the main bottleneck of mul_montgomery_form(), giving roughly a 2x improvement. However, it also seems it would actually be cheaper to perform a mul_wide() followed by a rem over the integers - rem being about 8x cheaper than montgomery_reduction() - which could leave rem as the bottleneck and achieve roughly an 8x improvement in mul_montgomery_form()! This should translate directly into a similar improvement in pow_montgomery_form(), which is very significant. (In fact, from my benchmarking, a 4096-bit mod-pow currently costs 35ms in crypto-bigint and 13ms in rust-gmp, so either I'm missing something or we can outperform them by roughly 3x.)
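As a sanity check on those ratios, here is the same arithmetic spelled out against the median figures from the benchmark output quoted above (no new measurements), assuming mul_montgomery_form() is mul_wide() plus montgomery_reduction() as described earlier:

```rust
// Back-of-the-envelope speedup estimates from the median benchmark
// figures quoted in this issue (all times in microseconds).
fn main() {
    let mul_wide = 4.2192_f64; // wrapping ops/mul, U4096*U4096
    let rem = 0.57083_f64;     // wrapping ops/rem, U4096/U4096
    let mont_mul = 8.5012_f64; // Montgomery multiplication, U4096*U4096

    // mul_montgomery_form() = mul_wide() + montgomery_reduction(),
    // so the reduction accounts for the rest of the measured time.
    let mont_reduction = mont_mul - mul_wide; // ~4.28 µs

    // If a Karatsuba mul_wide() makes the multiplication itself negligible,
    // the reduction dominates: roughly a 2x speedup.
    println!("fast mul + montgomery_reduction: ~{:.1}x", mont_mul / mont_reduction);

    // rem is roughly 7-8x cheaper than montgomery_reduction(), which is
    // where the "roughly 8x" figure for mul_wide()-then-rem comes from.
    println!("rem vs montgomery_reduction: ~{:.1}x", mont_reduction / rem);
}
```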
Another issue is that Karatsuba is not constant-time, but it can be made constant-time using the lazy interpolation method. My colleague Yuval Spiizer (@ylkylk1) in fact did his thesis on this subject, and in his paper (submitted and under review, not yet published) he improved upon the state of the art asymptotically by about 10% by modifying the Toom-Cook algorithm (a generalization of Karatsuba) - and his algorithm is constant-time as well.
We can therefore help with this process.