WebAssembly · alexcrichton · Sep 23, 2024 · Sep 20, 2024 · Sep 23, 2024 · Sep 23, 2024
diff --git a/proposals/128-bit-arithmetic/Overview.md b/proposals/128-bit-arithmetic/Overview.md
@@ -279,39 +279,39 @@ plaininstr_l ::= ...
 
 ## Alternatives
 
-### Alternative: Overflow Flags
+### Alternative: Overflow Flags as a value
 
 > **Note**: this alternative is the subject of [#6] and this section is intended
 > to summarize investigations and results of that issue. See [#6] for more
 > in-depth discussion too.
 
 [#6]: https://github.com/WebAssembly/128-bit-arithmetic/issues/6
 
-A major alternative to this proposal is to expose the lower-level primitives
-that 128-bit addition/subtraction are themselves built on for the underlying
-platforms. This hypothetically would remove the need for `i64.{add,sub}128`. The
-basic idea is that platforms such as x86\_64 and aarch64 expose overflow flags
-for arithmetic operations. These platforms additionally have instructions that
-consume the overflow flag with an arithmetic operation as well. In WebAssembly
-these might look like:
-
-* `i64.add_overflow_{u,s} : [i64 i64] -> [i64 i32]`
-* `i64.add_with_carry_{u,s} : [i64 i64 i32] -> [i64 i32]`
-
-Both instructions would produce a 64-bit result plus an overflow flag, an `i32`.
-The `i32` result would be defined as either 0 or 1 indicating whether an
-overflow happened during the operation. The `*_add_with_carry_*` variant would
-additionally take a third parameter which is an overflow flag from a previous
-instruction. To match what hardware has this would need to be defined as either
-0 or nonzero (note that this is subtly different from the result of
-`*_add_overflow_*`).
+No current native platform has a single instruction for 128-bit addition or
+subtraction. On x86\_64 and aarch64, for example, each operation is implemented
+as a sequence of two instructions. This gives rise to an alternative to this
+proposal: expose those two instructions individually rather than the
+combined 128-bit operation.
+
+Many native platforms have an "overflow flag" in their processor state which
+instructions can read and write. In WebAssembly, instructions modeling this
+for addition might look like:
+
+* `i64.add_overflow_{u,s} : [i64 i64] -> [i64 $t]`
+* `i64.add_with_carry_{u,s} : [i64 i64 $t] -> [i64 $t]`
+
+Both instructions would produce a 64-bit result plus an overflow flag, here
+labeled as `$t`. The exact choice of type here has consequences for the
+implementation, and some possibilities are discussed below. Semantically,
+though, a `$t` result is "truthy" if the operation overflowed, and a "truthy"
+`$t` input to `add_with_carry_u` means "add one more".
 
 An example of using these instructions to implement 128-bit addition would be:
 
 ```wasm
 (module
   (func $add128 (param i64 i64 i64 i64) (result i64 i64)
-    (local $oflow i32)
+    (local $oflow $t)
     (i64.add_overflow_u (local.get 0) (local.get 2))
     local.set $oflow
     (i64.add_with_carry_u (local.get 1) (local.get 3) (local.get $oflow))
@@ -320,23 +320,31 @@ An example of using these instructions to implement 128-bit addition would be:
 )
 ```
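 
 The same semantics can be sketched in Rust as a rough model. This assumes `$t`
 is represented as `bool`; the function names simply mirror the hypothetical
 instructions above and are not an existing API:
 
 ```rust
 /// Hypothetical model of `i64.add_overflow_u`: wrapping sum plus carry-out.
 fn add_overflow_u(a: u64, b: u64) -> (u64, bool) {
     a.overflowing_add(b)
 }
 
 /// Hypothetical model of `i64.add_with_carry_u`: `a + b + carry_in`,
 /// returning the wrapping sum plus carry-out.
 fn add_with_carry_u(a: u64, b: u64, carry_in: bool) -> (u64, bool) {
     let (sum, c1) = a.overflowing_add(b);
     let (sum, c2) = sum.overflowing_add(carry_in as u64);
     // At most one of the two partial additions can overflow.
     (sum, c1 || c2)
 }
 
 /// 128-bit addition of (a_lo, a_hi) + (b_lo, b_hi), mirroring `$add128`.
 fn add128(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
     let (lo, carry) = add_overflow_u(a_lo, b_lo);
     // The carry-out of the high half is discarded, as in wrapping addition.
     let (hi, _) = add_with_carry_u(a_hi, b_hi, carry);
     (lo, hi)
 }
 ```
 
 For instance, `add128(u64::MAX, 0, 1, 0)` carries out of the low half and
 yields `(0, 1)`, matching a wrapping `u128` addition.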
 
-This is quite close to what x86\_64 would produce for an equivalent function for
-example:
+This is quite close to [what x86\_64 would produce][godbolt-add-i128] for an
+equivalent native function:
+
+[godbolt-add-i128]: https://godbolt.org/z/1x54aneoW
 
 ```
 0000000000000000 <add_i128>:
    0:	48 89 f8             	mov    %rdi,%rax
-   3:	48 01 d0             	add    %rdx,%rax
-   6:	48 11 ce             	adc    %rcx,%rsi
+   3:	48 01 d0             	add    %rdx,%rax    ;; i64.add_overflow_u
+   6:	48 11 ce             	adc    %rcx,%rsi    ;; i64.add_with_carry_u
    9:	48 89 f2             	mov    %rsi,%rdx
    c:	c3                   	ret
 ```
 
-The primary downside of this approach, when considering in 128-bit arithmetic,
-is that the performance of these instructions relies on "fusing" these
-instructions together. For example backend-based peephole optimization passes
-would be required. A naive lowering of the above WebAssembly done in an initial
-implementation of Wasmtime looks like (annotated):
+#### Overflow flag: `$t = i32`
+
+An implementation has been prototyped where `$t` is `i32`. Overflow-flag
+producing operations always generate 0 or 1, and a "truthy" input is defined as
+any nonzero value. Using this prototype, an initial benchmark of "calculate the
+10\_000th fibonacci number" with a bignum library showed that, with these two
+alternate instructions (instead of `i64.add128`), **the generated code was
+slower than WebAssembly was before this proposal**.
+
+To understand why it's slower than before, here is the code generated for the
+128-bit addition function outlined above, with annotated assembly:
 
 ```
 0000000000000000 <wasm[0]::function[0]::add128>:
@@ -354,28 +362,109 @@ implementation of Wasmtime looks like (annotated):
   ret
 ```
 
-This is much less efficient than the native output on x86\_64 for a number of
-reasons:
-
-* The `setb + movzbl + add $0xfff.., ..` is all unnecessary. A peephole pass can
-  in theory remove this.
-* The final `setb %dl` is unnecessary because the result is dead code. A
-  peephole pass or otherwise can in theory remove this.
-
-An initial benchmark of "calculate the 10\_000th fibonacci number" with a bignum
-library showed that with these two alternate instructions (instead of
-`i64.add128`) that the generated code was **slower** than WebAssembly was before
-this proposal. This result indicates that if the motivation for this proposal is
-faster 128-bit arithmetic then runtimes will be required to implement the above
-optimizations (which Wasmtime, for example, does not already). Other runtime
-have not been surveyed yet to see if they already implement such optimizations.
-
-The conclusion so far is that overflow flags are not the best means to achieve
-good performance of 128-bit arithmetic at this time. Overflow flags might be
-useful to other use cases in their own right (unrelated to 128-bit arithmetic),
-but for 128-bit arithmetic focused cases the `i64.{add,sub}128` instructions are
-seen as simpler alternatives for compilers to implement in addition to
-toolchains to generate.
+This is quite far from the optimal x86\_64 code above and reveals some drawbacks
+of the "overflow flag as a value" model:
+
+* **On native architectures the overflow flag is not a value**; it's a single
+  bit in a single fixed register. It's not subject to register allocation, and
+  on platforms like x86\_64 it can be clobbered by many instructions.
+* Native architectures generally don't like moving bits in and out of the flags
+  register (e.g. `setb` extracting above and `add $-1, ...` putting it back in).
+* Compilers like Cranelift in Wasmtime do not have preexisting support for
+  optimizing use of the flags register due to its unique nature.
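 
 The round trip through a value mentioned in the list above (`setb` to extract
 the flag, `add $-1, ...` to put it back) can be modeled roughly as follows.
 This is only an illustrative sketch; the helper names are hypothetical and
 `overflowing_add` stands in for the hardware carry flag:
 
 ```rust
 /// Models `setb` + `movzbl`: materialize the carry flag as a 0/1 integer.
 fn flag_to_value(carry: bool) -> u32 {
     carry as u32
 }
 
 /// Models `add $0xffffffff, %reg`: adding -1 produces a carry-out exactly
 /// when the register was nonzero; the sum itself is discarded.
 fn value_to_flag(value: u32) -> bool {
     let (_discarded, carry) = value.overflowing_add(u32::MAX);
     carry
 }
 ```
 
 Any nonzero input restores the flag as set, which matches the
 "zero-or-nonzero" definition of "truthy", but both directions cost extra
 instructions that the native `add`/`adc` pair never needs.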
+
+Improving the code generation of these instructions in Cranelift/Wasmtime would
+require significant investment into specialized optimizations just for these
+instructions (e.g. a peephole pass after lowering or significantly different IR
+constructs). It's predicted that similar significant investments would be
+required to optimize other compilers as well.
+
+#### Overflow flag: `$t = i1`
+
+Another possibility for `$t` in the above instructions is to introduce a brand
+new type to WebAssembly, `i1` (or `flags` or similar). That more accurately
+models what native architectures have in this regard. **The problem with this
+alternative, though, is that it fundamentally has the same problem** as the
+previous alternative: WebAssembly would be modeling the overflow flag as a
+*value* whereas in native architectures it's a piece of *state* on the processor
+that instructions can use.
+
+For example, in the above native instructions that Cranelift/Wasmtime generated,
+if the type were known to be `i1` then the `movzbl %dl,%r10d` instruction would
+not be necessary and the `add $0xffffffff,%r10d` could be shrunk to
+`add $0xff,%r10b`. Otherwise, though, there are still the same problems of
+moving the flag out of the flags register and then back in, which is a
+significant slowdown compared to the optimal lowering.
+
+This alternative not only suffers from the same problems as before, being
+difficult to optimize, but it additionally has the significant downside of
+adding a brand new type to WebAssembly's type system, which is not a small
+undertaking. Given that this doesn't actually make the optimization problem
+easier, this is not seen as a favorable alternative.
+
+#### Overflow flag: `$t = []`
+
+A third possibility for `$t` is to define it as "nothing". These instructions,
+for example, could be:
+
+* `i64.add_overflow_{u,s} : [i64 i64] -> [i64]`
+* `i64.add_with_carry_{u,s} : [i64 i64] -> [i64]`
+
+This would require the definition of new state in the wasm abstract machine
+where a single bit would live (an overflow flag). These instructions would
+implicitly operate on this state and would relieve the compiler from having to
+figure out how to schedule instructions by moving the burden to the producer.
+For example LLVM already supports native platforms with implicit overflow flag
+state so this would be another instance of that.
+
+Purely from the perspective of a WebAssembly compiler, however, this approach
+still has its drawbacks. On x86\_64, for example, many instructions clobber
+flags, which means the compiler would have to meticulously save and restore the
+flags around instructions because there is no guarantee that
+`i64.add_with_carry_u` is adjacent to `i64.add_overflow_u`. Platforms like
+aarch64 might be easier, since instructions opt in to modifying flags, but
+platforms like riscv64, which don't have a flags register at all, would still
+be equally inconvenienced as before.
+
+It's worth noting that this alternative would additionally require new
+instructions to move in and out of this state. For example, if there are two
+overflow flags live at the same time, a WebAssembly compiler would need to
+modify and update this flag appropriately. This addition would also mean that
+`i64.add_with_carry_*` would be one of the first instructions that would operate
+on implicit state rather than explicit operands.
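 
 As a rough illustration of the implicit-state model and why overlapping flag
 lifetimes are a problem, consider the following Rust sketch. The `Machine`
 type and its methods are hypothetical stand-ins for the wasm abstract machine
 with one extra overflow bit, not part of any existing runtime:
 
 ```rust
 /// Hypothetical abstract machine carrying one implicit overflow bit.
 #[derive(Default)]
 struct Machine {
     overflow: bool,
 }
 
 impl Machine {
     /// `i64.add_overflow_u : [i64 i64] -> [i64]`, writing the implicit flag.
     fn add_overflow_u(&mut self, a: u64, b: u64) -> u64 {
         let (sum, carry) = a.overflowing_add(b);
         self.overflow = carry;
         sum
     }
 
     /// `i64.add_with_carry_u : [i64 i64] -> [i64]`, reading then writing
     /// the implicit flag.
     fn add_with_carry_u(&mut self, a: u64, b: u64) -> u64 {
         let (sum, c1) = a.overflowing_add(b);
         let (sum, c2) = sum.overflowing_add(self.overflow as u64);
         self.overflow = c1 || c2;
         sum
     }
 }
 
 /// 128-bit addition where the two instructions are adjacent.
 fn add128_implicit(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
     let mut m = Machine::default();
     let lo = m.add_overflow_u(a_lo, b_lo);
     let hi = m.add_with_carry_u(a_hi, b_hi);
     (lo, hi)
 }
 
 /// Same computation, but an unrelated flag-writing addition sits in between
 /// and overwrites the pending carry.
 fn add128_clobbered(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
     let mut m = Machine::default();
     let lo = m.add_overflow_u(a_lo, b_lo);
     let _unrelated = m.add_overflow_u(1, 2); // clears the flag
     let hi = m.add_with_carry_u(a_hi, b_hi);
     (lo, hi)
 }
 ```
 
 With inputs `(u64::MAX, 0, 1, 0)` the adjacent version produces `(0, 1)` while
 the interleaved version loses the carry and produces `(0, 0)`, which is why a
 second live flag would need explicit save and restore instructions.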
+
+#### Overflow flags: Summary
+
+Modeling a native platform's overflow flag as a value, for example `i32`, is not
+an accurate reflection of how native architectures work. Efficiently bridging
+this gap in expressivity is extremely difficult for existing compilers as this
+is unlike any other compilation problem that WebAssembly compilers deal with
+today. The chosen type representation, be it `i1` or `i32`, does not
+significantly reduce the complexity of this problem either.
+
+Attempting to model an overflow flag as implicit machine state in WebAssembly
+itself is significantly hindered due to native platform differences in how this
+state is managed. Implicit state alone still requires a significant increase
+in the complexity of existing compilers to bridge these differences. Reducing
+this complexity cost would require further changes to be made to this
+alternative.
+
+This proposal's instructions, `i64.{add,sub}128`, [have been
+benchmarked][overflow-flags-numbers] to show that `fib_10000` on x86\_64 goes
+from 120% slower-than-native before this proposal to 9% slower-than-native
+after. On aarch64 the numbers are 72% slower-than-native originally and 2%
+faster-than-native afterwards. The implementation of `i64.add128` required very
+little optimization work, and what was implemented was similar to other
+optimization work already implemented in Cranelift for WebAssembly.
+
+Overall `i64.add128` is expected to be a small addition to WebAssembly which is
+not significantly difficult for runtimes to implement. It's additionally
+expected, in the case of wide arithmetic, to reap the lion's share of the
+performance benefits and close the gap with native platforms. This contrasts
+with `*.add_with_carry_*` which, while more general, carries significant
+complexity to close the performance gap with native.
+
+[overflow-flags-numbers]: https://github.com/WebAssembly/128-bit-arithmetic/issues/2#issuecomment-2307646174
 
 ### Alternative: 128-bit multiplication