Skip to content

Optimize mul_wide() (i.e. Karatsuba and Friends) #249

@ycscaly

Description

@ycscaly

We currently use the inefficient schoolbook multiplication for mul_wide. It would be much better to use an optimized algorithm, e.g. Karatsuba. There's already a TODO left in the source code to do so.

This is important for our purposes as optimizing mul_wide() will optimize mul_montgomery_form() in DynResidue and Residue, which is a bottleneck for pow() which is, in turn, a bottleneck for our system. In fact, mul_montgomery_form() is just a mul_wide() followed by a montgomery_reduction().

I evaluated experimentally and benchmark results showed that running time for mul_wide() and montgomery_reduction() are equal at roughly 4µs, :

wrapping ops/mul, U4096*U4096
                        time:   [4.2064 µs 4.2192 µs 4.2324 µs]                   
wrapping ops/rem, U4096/U4096
                        time:   [555.09 ns 570.83 ns 589.77 ns]
wrapping ops/rem, U8192/U4096, full size
                        time:   [1.7831 ms 1.7876 ms 1.7934 ms]
Montgomery arithmetic/DynResidue retrieve
                        time:   [4.1380 µs 4.1579 µs 4.1820 µs]
Montgomery arithmetic/multiplication, U4096*U4096
                        time:   [8.4631 µs 8.5012 µs 8.5433 µs]
Montgomery arithmetic/modpow, U4096^U4096
                        time:   [35.566 ms 36.803 ms 38.838 ms]

From our calculation, Karatsuba improves by a factor of 12.62x. This would leave the main bottleneck for mul_montgomery_form() to be montgomery_reduction(), achieving a 2x improvement. However, it also seems that it would actually be cheaper to perform a mul_wide() followed by an rem - which is 8x cheaper than montgomery_reduction() - over the integers, which could actually leave rem as the bottleneck, achieving roughly 8x improvement of mul_montgomery_form()! This should translate directly to a similar improvement in pow_montgomery_form(), which is very significant. (Actually, from my benchmarking, a 4096-bit mod-pow costs 35ms in crypto-bigint now, and 13ms in rust-gmp, so either I'm missing something or we can out-achieve them by 3x.)

Another issue is that Karatsuba is not constant time, but it can be made constant time by using the lazy interpolation method. My colleague Yuval Spiizer (@ylkylk1) has in fact did his thesis on this matter, and in his paper (submitted and under review, not yet published) he improved upon the state-of-the-art by a 10% change asymptomatically by modifying Toom Cook's algorithm (which is a generalization of Karatsuba) - and his algorithm is constant time as well.

We can therefore help with this process.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions