runtime: new implementations for nearest lib calls #2171
sunfishcode merged 1 commit into bytecodealliance:main
Thanks for this! Do you have some wall-clock benchmarks as well which show the improvement? |
|
No, I just compared metrics from llvm-mca. Btw, it would be great to have some online benchmark tools like |
|
Here are benchmark results on a MacBook Pro (15-inch 2019, 2.3 GHz 8-core i9):
So it seems the newly proposed approach is 3 times faster. And btw, all of this was expected from the llvm-mca metrics. I choose |
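For readers who want a rough wall-clock check of the kind requested above, the following is a minimal sketch only. It is not the benchmark used in this PR; the helper name `time_it` and its parameters are hypothetical, and a dedicated benchmarking harness would give more reliable numbers.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: time `rounds` passes of `f` over `inputs`.
/// `black_box` keeps the optimizer from removing the calls.
fn time_it(f: fn(f64) -> f64, inputs: &[f64], rounds: u32) -> Duration {
    let start = Instant::now();
    let mut acc = 0.0;
    for _ in 0..rounds {
        for &x in inputs {
            acc += f(black_box(x));
        }
    }
    // Consume the accumulated result so the loop has an observable effect.
    black_box(acc);
    start.elapsed()
}
```

Calling `time_it` once per candidate implementation with the same inputs, and comparing the returned durations, gives a first-order comparison.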
|
The
wasmtime/cranelift/codegen/meta/src/isa/x86/encodings.rs, lines 1341 to 1345 in 5c5a30f
wasmtime/cranelift/codegen/src/isa/x64/lower.rs, line 1749 in 8ac4bd1 |
|
@bjorn3 Good to know. Thanks! |
|
Also added an SSE 4.1 intrinsic to the gist. Upd
|
sunfishcode left a comment
This looks good to me; just two possible optimization ideas:
833c27f to 356dd1e
|
Squashed commits |
use approach with copysign for handling negative zero format refactor for better branch prediction move copysign back to internal branch format fix use abs instead branches better comments switch arms for better branch prediction
a7512f0 to 9e58da4
|
Great, thanks! |
As @MaxGraey pointed out (thanks!) in bytecodealliance#4397, `round` behaves differently from `nearest`, and it looks like the native Rust implementation is still pending stabilization. Right now we duplicate the wasmtime implementation merged in bytecodealliance#2171. However, we should definitely switch to the native Rust version when it is available.
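To illustrate the difference between `round` and `nearest` mentioned here: Rust's `f64::round` rounds halfway cases away from zero, while WebAssembly's `nearest` rounds them to the even neighbour. The sketch below is for illustration only; `ties_even` is a hypothetical helper, not the implementation from either project, and it makes no attempt to be fast.

```rust
/// Hypothetical reference helper: round to nearest, ties to even.
fn ties_even(x: f64) -> f64 {
    if (x - x.trunc()).abs() == 0.5 {
        // Exactly halfway between two integers: pick the even one.
        // Halving, rounding, and doubling always lands on an even integer.
        2.0 * (x / 2.0).round()
    } else {
        // Not a tie (or already integral, infinite, or NaN):
        // `round` already gives the nearest integer.
        x.round()
    }
}
```

For example, `2.5f64.round()` is `3.0`, whereas `ties_even(2.5)` is `2.0`; both agree on `3.5`, which rounds to `4.0`.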
More efficient implementations for `wasmtime_f32_nearest` and `wasmtime_f64_nearest` based on musl's `rint` and `rintf` implementations.
new / old comparison: https://godbolt.org/z/Gxz3bP
Also, instruction metrics for the new approach with an if / else branch for handling `-0.0`:
and for the new approach but using `copysign` at the end for handling `-0.0`:
Benchmark results
Upd: So I chose the second approach. It is also branchless on ARM32.
Upd 2
Another possible approach:
But this approach has lower IPC
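For context, the general idea behind a musl-style `rint` with `copysign` for `-0.0` can be sketched as follows. This is a simplified illustration under stated assumptions, not the code merged in this PR: the names `nearest_f64` and `TOINT_64` are hypothetical, NaN payload handling is omitted, and it assumes the default round-half-to-even floating-point mode with no excess precision.

```rust
/// 2^52: adding this to a non-negative f64 below 2^52 pushes all
/// fraction bits out of the significand, so the hardware rounds the
/// sum to an integer (ties to even in the default rounding mode).
const TOINT_64: f64 = 1.0 / f64::EPSILON;

/// Hypothetical sketch of round-to-nearest, ties-to-even.
fn nearest_f64(x: f64) -> f64 {
    // Biased exponent of x.
    let e = (x.to_bits() >> 52) & 0x7ff;
    if e >= 0x3ff + 52 {
        // |x| >= 2^52, infinity, or NaN: there is no fraction to round.
        x
    } else {
        // Round the magnitude, then restore the sign, so that -0.0 and
        // small negative inputs yield -0.0 without an extra branch.
        (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}
```

Working on `x.abs()` and applying `copysign` at the end is what keeps the sign of zero correct without a dedicated branch for `-0.0`.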