Optimize `mul_wide()` (i.e. Karatsuba and Friends)

We currently use the inefficient schoolbook multiplication for `mul_wide`. It would be much better to use an optimized algorithm, e.g. Karatsuba. There's already a [TODO left in the source code](https://github.com/RustCrypto/crypto-bigint/blob/9729c3e0140b68f649c80c5d954c68531c429dfb/src/uint/mul.rs#L26) to do so.

This is important for our purposes as optimizing `mul_wide()` will optimize `mul_montgomery_form()` in `DynResidue` and `Residue`, which is a bottleneck for `pow()` which is, in turn, a bottleneck for our system. In fact, `mul_montgomery_form()` is just a `mul_wide()` followed by a `montgomery_reduction()`.

I evaluated experimentally and benchmark results showed that running time for `mul_wide()` and `montgomery_reduction()` are equal at roughly `4µs`, :
```
wrapping ops/mul, U4096*U4096
                        time:   [4.2064 µs 4.2192 µs 4.2324 µs]                   
wrapping ops/rem, U4096/U4096
                        time:   [555.09 ns 570.83 ns 589.77 ns]
wrapping ops/rem, U8192/U4096, full size
                        time:   [1.7831 ms 1.7876 ms 1.7934 ms]
Montgomery arithmetic/DynResidue retrieve
                        time:   [4.1380 µs 4.1579 µs 4.1820 µs]
Montgomery arithmetic/multiplication, U4096*U4096
                        time:   [8.4631 µs 8.5012 µs 8.5433 µs]
Montgomery arithmetic/modpow, U4096^U4096
                        time:   [35.566 ms 36.803 ms 38.838 ms]
```

From our calculation, Karatsuba improves by a factor of `12.62x`. This would leave the main bottleneck for `mul_montgomery_form()` to be `montgomery_reduction()`, achieving a `2x` improvement. However, it also seems that it would actually be cheaper to perform a `mul_wide()` followed by an `rem` - which is `8x` cheaper than `montgomery_reduction()` - over the integers, which could actually leave `rem` as the bottleneck, achieving roughly `8x` improvement of `mul_montgomery_form()`! This should translate directly to a similar improvement in `pow_montgomery_form()`, which is very significant. (Actually, from my benchmarking, a 4096-bit mod-pow costs `35ms` in crypto-bigint now, and `13ms` in `rust-gmp`, so either I'm missing something or we can out-achieve them by `3x`.)

Another issue is that Karatsuba is not constant time, but it can be made constant time by using the lazy interpolation method. My colleague Yuval Spiizer (@ylkylk1) has in fact did his thesis on this matter, and in his paper (submitted and under review, not yet published) he improved upon the state-of-the-art by a 10% change asymptomatically by modifying Toom Cook's algorithm (which is a generalization of Karatsuba) - and his algorithm is constant time as well.

We can therefore help with this process.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `mul_wide()` (i.e. Karatsuba and Friends) #249

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimize mul_wide() (i.e. Karatsuba and Friends) #249

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Optimize `mul_wide()` (i.e. Karatsuba and Friends) #249