Conversation

@radekdoulik
Member

No description provided.

@ghost

ghost commented Jan 26, 2023

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost assigned radekdoulik Jan 26, 2023
@radekdoulik
Member Author

radekdoulik commented Jan 26, 2023

The emitted code:

> wa-info -d -f Vector.*Normalize.*Run dotnet.wasm
(func Wasm_Browser_Bench_Sample_Sample_VectorTask_Normalize_RunStep(param $0 i32, $1 i32))
 local $2 v128
 local.get $0
 i32.eqz
 if
  call mini_llvmonly_throw_nullref_exception
  unreachable
 end

 local.get $0
 local.get $0
 v128.load offset:32 align:2    [SIMD]
 local.tee $2
 local.get $2
 local.get $2
 f32x4.mul    [SIMD]
 local.tee $2
 local.get $2
 v128.const 0x0b0a09080f0e0d0c0302010007060504    [SIMD]
 i8x16.swizzle    [SIMD]
 f32x4.add    [SIMD]
 local.tee $2
 local.get $2
 v128.const 0x07060504030201000f0e0d0c0b0a0908    [SIMD]
 i8x16.swizzle    [SIMD]
 f32x4.add    [SIMD]
 f32x4.extract.lane 0    [SIMD]
 f32.sqrt
 f32x4.splat    [SIMD]
 f32x4.div    [SIMD]
 v128.store offset:16    [SIMD]
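Read back as scalar code, the listing above computes the following (a Python sketch of my reading of the lane order; the input values are hypothetical):

```python
import math
import struct

def f32(x):
    """Round a Python float (f64) to float32, the way wasm f32 ops do."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Hypothetical input: the Vector4 fields loaded by v128.load (lanes x, y, z, w).
v = [3.0, 4.0, 0.0, 0.0]

# f32x4.mul: square each lane
sq = [f32(a * a) for a in v]

# i8x16.swizzle #1 rearranges lanes (x, y, z, w) -> (y, x, w, z), then f32x4.add
t = [f32(a + b) for a, b in zip(sq, [sq[1], sq[0], sq[3], sq[2]])]

# i8x16.swizzle #2 rearranges lanes (x, y, z, w) -> (z, w, x, y), then f32x4.add
t = [f32(a + b) for a, b in zip(t, [t[2], t[3], t[0], t[1]])]

# f32x4.extract.lane 0, f32.sqrt, f32x4.splat
length = f32(math.sqrt(t[0]))
splat = [length] * 4

# f32x4.div; v128.store would then write the result back
result = [f32(a / b) for a, b in zip(v, splat)]
print(result)  # ≈ [0.6, 0.8, 0.0, 0.0] after float32 rounding
```

After the two swizzle+add rounds every lane holds the full dot product, so lane 0 is the squared length.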

@tannergooding
Copy link
Member

tannergooding commented Jan 26, 2023

What's the difference between the browser-bench and the tests we have in dotnet/performance?

For the codegen this doesn't look "quite" like I'd expect, from a few perspectives: namely, this code isn't "deterministic" like it should be and may differ in behavior based on platform/architecture:

  • Vector128.Dot needs to add pairs, that is (x + y) + (z + w)
  • The managed code is doing (float)Math.Sqrt whereas the codegen is directly emitting a 32-bit sqrt

I'd expect something that is roughly (using unique locals for each step, SSA style):

* v0 = load fields (x, y, z, w) as 1x 128-bit vector
* v1 = v0 * v0
* v2 = swizzle v1, going from `(x, y, z, w)` to `(y, x, w, z)`
* v3 = v1 + v2
* v4 = swizzle v3, going from `(x, y, z, w)` to `(z, w, x, y)`
* v5 = v3 + v4
* s1 = v5.ToScalar()
* d1 = upcast s1 to double
* d2 = sqrt d1
* s2 = downcast d2 to float
* v8 = splat s2
* return v0 / v8

The code is "mostly" doing this, but there are a couple of edge cases where it looks like it's not quite right.
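The determinism point is easy to demonstrate: in floating point, the pairwise order (x + y) + (z + w) is a different operation from a left-to-right sum, so the reduction order has to be fixed. A minimal sketch (the input values are chosen just to expose the difference):

```python
x, y, z, w = 1e16, 1.0, 1.0, 1.0

pairwise = (x + y) + (z + w)      # the pairwise order described above
sequential = ((x + y) + z) + w    # naive left-to-right reduction

# 1e16 + 1.0 rounds back to 1e16, but 1e16 + 2.0 is exactly representable,
# so the two reduction orders give different answers:
print(pairwise)    # 1.0000000000000002e+16
print(sequential)  # 1e+16
```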

@lewing lewing merged commit 8509432 into dotnet:main Jan 26, 2023
@radekdoulik
Member Author

radekdoulik commented Jan 27, 2023

> What's the difference between the browser-bench and the tests we have in dotnet/performance?

The browser-bench is our "litmus paper" benchmark. It has less coverage than dotnet/performance, but it measures more flavors, like interp, aot, simd, wasm eh, chrome, firefox. It also lives in the runtime repo, so it works most of the time and has saved us many times :-) We have visualization of the data here: https://radekdoulik.github.io/WasmPerformanceMeasurements. Currently it runs twice a day.

@radekdoulik
Member Author

> For the codegen this doesn't look "quite" like I'd expect, from a few perspectives: namely, this code isn't "deterministic" like it should be and may differ in behavior based on platform/architecture:
>
> * `Vector128.Dot` needs to add pairs, that is `(x + y) + (z + w)`

And that is exactly what it does. It does it the way you described here; the SSA-introduced variables are just already gone by this point.

> * v0 = load fields (x, y, z, w) as 1x 128-bit vector
> * v1 = v0 * v0
> * v2 = swizzle v1, going from (x, y, z, w) to (y, x, w, z)
> * v3 = v1 + v2
> * v4 = swizzle v3, going from (x, y, z, w) to (z, w, x, y)
> * v5 = v3 + v4
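The two `v128.const` byte masks from the disassembly encode exactly these two lane orders. A sketch decoding them, assuming wa-info prints the constant as a single little-endian 128-bit hex value:

```python
def lanes(mask_const):
    """Map an i8x16.swizzle byte mask to the f32 source lane each destination lane reads."""
    mask = mask_const.to_bytes(16, 'little')
    # each f32 lane is 4 bytes; the first byte index of each group identifies the source lane
    return [mask[4 * i] // 4 for i in range(4)]

print(lanes(0x0b0a09080f0e0d0c0302010007060504))  # [1, 0, 3, 2] -> (y, x, w, z)
print(lanes(0x07060504030201000f0e0d0c0b0a0908))  # [2, 3, 0, 1] -> (z, w, x, y)
```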

@radekdoulik
Member Author

radekdoulik commented Jan 27, 2023

> * The managed code is doing `(float)Math.Sqrt` whereas the codegen is directly emitting a 32-bit sqrt

This looks like a clang optimization; our LLVM IR looks like this:

  %extract = extractelement <4 x float> %17, i32 0
  %t23 = fpext float %extract to double
  %t24 = call double @llvm.sqrt.f64(double %t23)
  %18 = fptrunc double %t24 to float
  %19 = insertelement <4 x float> undef, float %18, i32 0
  %v2s = extractelement <4 x float> %19, i32 0
  %20 = insertelement <1 x float> undef, float %v2s, i32 0
  %broadcast = shufflevector <1 x float> %20, <1 x float> undef, <4 x i32> zeroinitializer
  %21 = fdiv <4 x float> %t15, %broadcast
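A scalar sketch of what that IR tail does, with float32 rounding emulated via struct (the input values are hypothetical). Folding the fpext / llvm.sqrt.f64 / fptrunc sequence into a direct 32-bit sqrt happens to be safe here: sqrt is correctly rounded, and double carries more than 2×24+2 significant bits, so the double-rounded result always matches a direct float32 sqrt — presumably why clang emits the latter:

```python
import math
import struct

def f32(x):
    """Round a Python float (f64) to float32."""
    return struct.unpack('f', struct.pack('f', x))[0]

t15 = [f32(v) for v in (2.0, 0.0, 0.0, 0.0)]  # hypothetical %t15 vector

extract = t15[0]        # extractelement <4 x float> ..., i32 0
t23 = float(extract)    # fpext float -> double (a no-op here: Python floats are f64)
t24 = math.sqrt(t23)    # call double @llvm.sqrt.f64
t18 = f32(t24)          # fptrunc double -> float
broadcast = [t18] * 4   # insertelement + shufflevector splat

result = [f32(a / b) for a, b in zip(t15, broadcast)]  # fdiv <4 x float>
print(t18)  # 1.4142135381698608, i.e. the float32 sqrt of 2
```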

@radical @vargaz @lewing do you know what floating point behavior we use by default for wasm? Should we switch to a more precise one, or at least document it somewhere? https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior

@vargaz
Contributor

vargaz commented Jan 27, 2023

We don't change the fp behavior so whatever the llvm default is, we use that.

@pavelsavara
Member

> do you know what floating point behavior we use by default for wasm? Should we switch to a more precise one, or at least document it somewhere? https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior

https://github.com/emscripten-core/emscripten/blob/c5245e7c6c7b97d3f6e11de754dac64a3424fdfe/ChangeLog.md?plain=1#L3385-L3388

Probably more here https://emscripten.org/docs/porting/simd.html

@lewing
Member

lewing commented Jan 28, 2023

As @radekdoulik mentioned, browser-bench is our attempt to measure a few key scenarios without the complications of the dotnet project system breaking dotnet/performance results for extended periods. We'd love it if you included dotnet/performance results for wasm in your performance-related PRs, but as it is we have to be reactive to regressions, and browser-bench helps us do that using personal hardware (we aren't currently running dotnet/performance with simd enabled on wasm, for cost reasons).

Wasm as specified is generally much more deterministic than the host platform, so anything that compiles down to wasm intrinsics should be compared to the spec. On top of that, none of the operations here are particularly wasm-specific, and we depend on LLVM as a platform in these cases. The goal of this PR is to make sure we are measuring a few simple paths after substantial regressions that were not identified in other tooling (cf. #81201 (comment)).

@ghost ghost locked as resolved and limited conversation to collaborators Feb 27, 2023