
Conversation

@hansl (Contributor) commented Dec 20, 2024

This adds a new InnerValue that uses NaN-boxing so that JsValue ultimately takes only 64 bits (8 bytes). This lets a JsValue fit into a single x86_64 register, which should greatly improve performance.

For more details on NaN-boxing (and its alternative, pointer tagging), see https://piotrduperas.com/posts/nan-boxing, which is a great tutorial (in C) on how to get there. A proper explanation of each tag and range of values is given in the header of the inner.rs module.
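
For readers unfamiliar with the trick, here is a minimal, self-contained Rust sketch of the core idea. It is illustrative only: the tag constant, type name, and 32-bit integer payload are placeholders of mine, not the actual encoding used in inner.rs.

```rust
/// Minimal NaN-boxing sketch: one 64-bit slot that holds either a real f64
/// or a 32-bit integer hidden inside the quiet-NaN payload space.
#[derive(Clone, Copy)]
struct Boxed(u64);

/// A quiet NaN with an extra payload bit set, so it cannot collide with the
/// canonical NaN produced by arithmetic (0x7FF8_0000_0000_0000).
const INT_TAG: u64 = 0x7FFC_0000_0000_0000;
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;

impl Boxed {
    fn from_f64(x: f64) -> Self {
        // Real doubles are stored verbatim; this sketch only supports the
        // canonical NaN so tagged values can never be mistaken for a float.
        debug_assert!(!x.is_nan() || x.to_bits() == f64::NAN.to_bits());
        Boxed(x.to_bits())
    }

    fn from_i32(i: i32) -> Self {
        // Park the integer in the low 32 bits of the NaN payload.
        Boxed(INT_TAG | u64::from(i as u32))
    }

    fn as_i32(self) -> Option<i32> {
        (self.0 & TAG_MASK == INT_TAG).then(|| self.0 as u32 as i32)
    }

    fn as_f64(self) -> Option<f64> {
        (self.0 & TAG_MASK != INT_TAG).then(|| f64::from_bits(self.0))
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<Boxed>(), 8);
    assert_eq!(Boxed::from_f64(3.25).as_f64(), Some(3.25));
    assert_eq!(Boxed::from_i32(-7).as_i32(), Some(-7));
    assert_eq!(Boxed::from_i32(-7).as_f64(), None);
}
```

A real engine extends the same tag space to cover pointers, which is what the inner.rs header in this PR describes.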

Fixes #1373.

@hansl (Contributor, Author) commented Dec 20, 2024

All tests in inner.rs are passing. Unfortunately, running cargo test -p boa_engine currently results in segfaults (invalid memory reference). This will remain a draft until at least all tests pass.

@hansl hansl marked this pull request as ready for review December 20, 2024 23:38
@hansl hansl requested a review from a team December 20, 2024 23:38
@raskad (Member) commented Dec 21, 2024

Still failing some tests:

| Test result | main count | PR count | difference |
| --- | --- | --- | --- |
| Total | 48,625 | 48,625 | 0 |
| Passed | 43,616 | 43,606 | -10 |
| Ignored | 1,471 | 1,471 | 0 |
| Failed | 3,538 | 3,548 | +10 |
| Panics | 0 | 0 | 0 |
| Conformance | 89.70% | 89.68% | -0.02% |
Broken tests (10):
test/built-ins/TypedArray/prototype/map/return-new-typedarray-conversion-operation.js (previously Passed)
test/built-ins/TypedArray/prototype/set/typedarray-arg-set-values-diff-buffer-other-type-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/set/array-arg-src-tonumber-value-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/fill/fill-values-conversion-operations.js (previously Passed)
test/built-ins/Map/valid-keys.js (previously Passed)
test/built-ins/DataView/prototype/setFloat64/set-values-return-undefined.js (previously Passed)
test/built-ins/DataView/prototype/setFloat32/set-values-return-undefined.js (previously Passed)
test/built-ins/TypedArrayConstructors/ctors/object-arg/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/DefineOwnProperty/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/Set/conversion-operation.js (previously Passed)

But the bigger issue that I see is performance. I ran the benchmarks locally and performance was worse across the board, in some benchmarks by almost 40%. I thought a performance tradeoff in exchange for lower memory usage seemed reasonable and expected for this change, but this seems way too extreme.

Benchmarks before:

RESULT Richards 78.5
RESULT DeltaBlue 71.8
RESULT Crypto 80.3
RESULT RayTrace 182
RESULT EarleyBoyer 227
RESULT RegExp 43.7
RESULT Splay 326
RESULT NavierStokes 187
SCORE 122

After:

RESULT Richards 49.9
RESULT DeltaBlue 43.7
RESULT Crypto 77.7
RESULT RayTrace 138
RESULT EarleyBoyer 161
RESULT RegExp 38.9
RESULT Splay 214
RESULT NavierStokes 163
SCORE 92.0

@jedel1043 (Member) commented Dec 21, 2024

@raskad @hansl As you probably already know, not all memory optimizations will yield (or even preserve) performance gains. It may work out that way sometimes, but there's a balance between memory and speed that big memory optimizations (such as this one) have to deal with.

The first step is to make every test pass; then we'd have to optimize iteratively to recover the lost performance, and hopefully we'll end up with an engine that's as fast as before but uses a fraction of the memory.

Having said that, we should measure what the overall memory gains are, because this needs to save a considerable amount of memory to be worth putting more work into.

@hansl (Contributor, Author) commented Dec 21, 2024

There's no rush for this, since the actual breaking change already happened :)

I'll run benchmarks on my own and try to check a memory profiler, in the next 2 weeks or so. It's also possible that I'm missing a few easy inlines, and adding some bit magic might make things faster down the line.

First things first, I'll make sure test262 has no regressions.

@hansl (Contributor, Author) commented Dec 21, 2024

@raskad BTW this should cover both; it should take less memory AND be more performant. But I guess we pass JsValue more by reference than by value. Now that we can move JsValue for free, maybe we should reconsider some other parts of the code as well. Later work.

This article shows a 28% increase in performance for a tagged union of pointers in JavaScriptCore (WebKit's engine). Though, to be fair, we weren't using pointers.

codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 73.76426% with 138 lines in your changes missing coverage. Please review.

Project coverage is 52.97%. Comparing base (6ddc2b4) to head (c39fbbf).
Report is 370 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/engine/src/value/operations.rs | 51.56% | 62 Missing ⚠️ |
| core/engine/src/value/conversions/try_from_js.rs | 16.66% | 25 Missing ⚠️ |
| core/engine/src/value/inner/nan_boxed.rs | 87.75% | 24 Missing ⚠️ |
| core/engine/src/value/mod.rs | 83.33% | 16 Missing ⚠️ |
| core/engine/src/value/equality.rs | 89.47% | 4 Missing ⚠️ |
| core/engine/src/value/hash.rs | 60.00% | 4 Missing ⚠️ |
| core/engine/src/value/conversions/serde_json.rs | 66.66% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4091      +/-   ##
==========================================
+ Coverage   47.24%   52.97%   +5.72%     
==========================================
  Files         476      488      +12     
  Lines       46892    52465    +5573     
==========================================
+ Hits        22154    27793    +5639     
+ Misses      24738    24672      -66     


@raskad (Member) commented Dec 21, 2024

@hansl I did a quick perf run on the release-dbg profile and noticed that there are a lot of operations relating to the Box that JsObject is now wrapped in. For example, in JsObject::try_get we call self.clone().into() on the JsObject to turn it into a JsValue. In my baseline the into call only makes up 0.5% of the function runtime, but now, with the alloc for the Box, it takes 6.2%. Similarly, dropping that JsValue later went from 2.3% of the runtime to 8.4%.
I'm totally unsure if that would explain everything, since I can't really pin it down in the inverted call stack, but it might be a starting point.
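
As a rough, standalone illustration of why that extra Box hurts (not Boa code; the type name is made up): cloning a reference-counted handle is only a counter increment, while cloning a Box has to go through the allocator every time.

```rust
use std::rc::Rc;

// Stand-in for a heavyweight engine object; the real JsObject is more involved.
#[derive(Clone)]
struct BigObject {
    data: [u64; 16],
}

fn main() {
    let shared: Rc<BigObject> = Rc::new(BigObject { data: [0; 16] });
    let boxed: Box<BigObject> = Box::new(BigObject { data: [0; 16] });

    // Cheap: Rc::clone only bumps a reference count, no allocation.
    let _shared2 = Rc::clone(&shared);

    // Expensive: Box::clone allocates a fresh heap block and copies the
    // 128-byte payload, which is roughly the extra cost every
    // self.clone().into() conversion now pays in the profile above.
    let _boxed2 = boxed.clone();
}
```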

@jedel1043 (Member) replied:

That makes a lot of sense, actually, because clones of JsValues containing JsObjects are no longer as cheap as just increasing a ref count.

The "proper" solution would be to make JsObject 8 bytes again by using manual vtables instead of dyn objects (something very similar to what we do on GcBox).

@hansl (Contributor, Author) commented Dec 22, 2024

@raskad If we NaN-boxed Gc and GcBox instead of Box-ing, this might go away, similar to what @jedel1043 is saying, since cloning would be essentially free. I need to put the value on the heap, though, so just receiving and taking a pointer to a JsObject is not an option.

Let me make sure we don't regress on test262 then I'll look into the performance.

@hansl (Contributor, Author) commented Dec 23, 2024

Okay this one is fascinating...

If I only create one value in an expression, e.g. -0 or [-0] or [[[[[-0]]]]], it keeps the sign just fine.

If I create multiple negative zeroes in an expression, only the first one keeps its sign. [-0, -0, -0, -0] becomes [-0, 0, 0, 0]... I'm looking for a place where we would cache these, but I can't seem to find one. I'm adding tests everywhere, but I'm flailing a bit.

Cloning the negative zero keeps the sign, copying is not supported, ... It seems the bug is not in my code, but that's the only thing I changed... Any ideas?
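
For context on why this is subtle, a tiny standalone sketch (not Boa code): the only difference between 0.0 and -0.0 is the sign bit, and == cannot see it, so any canonicalization on the boxing path has to look at the raw bits.

```rust
fn main() {
    let pos = 0.0_f64;
    let neg = -0.0_f64;

    // IEEE-754 equality says they are the same number...
    assert!(pos == neg);

    // ...but the bit patterns differ in exactly one bit: the sign.
    assert_eq!(pos.to_bits(), 0x0000_0000_0000_0000);
    assert_eq!(neg.to_bits(), 0x8000_0000_0000_0000);

    // So preserving negative zero through a NaN-boxed representation means
    // comparing or canonicalizing on raw bits (or checking the sign explicitly).
    assert!(neg.is_sign_negative());
    assert!(!pos.is_sign_negative());
}
```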

@hansl (Contributor, Author) commented Dec 26, 2024

There's something funny with the parser and/or optimizer, but in any case [-0, -0] results in the two values taking different code paths. Found a bug anyway, so that's great. Fixed the bug; ready for reviews.

I'm still planning on running the benchmark before the end of the year.

@hansl (Contributor, Author) commented Jan 1, 2025

Benchmarks (using taskpolicy -b on macOS restricts the process to the efficiency cores, which explains why it's so slow, but it makes the comparison more fair):

Enum-based

$ cargo build  --release --bin boa
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 19.7
PROGRESS DeltaBlue
RESULT DeltaBlue 18.5
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 21.2
PROGRESS RayTrace
RESULT RayTrace 48.8
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 58.8
PROGRESS RegExp
RESULT RegExp 10.9
PROGRESS Splay
RESULT Splay 82.3
PROGRESS NavierStokes
RESULT NavierStokes 50.4
SCORE 31.6

NaN-Boxed

$ cargo build  --release --bin boa --features nan-box-jsvalue
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 12.8
PROGRESS DeltaBlue
RESULT DeltaBlue 12.2
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 22.0
PROGRESS RayTrace
RESULT RayTrace 37.5
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 45.2
PROGRESS RegExp
RESULT RegExp 9.97
PROGRESS Splay
RESULT Splay 56.7
PROGRESS NavierStokes
RESULT NavierStokes 44.5
SCORE 24.9

@hansl (Contributor, Author) commented Jan 5, 2025

On the latest commit, I changed the pointer tagging to use the top bits of the pointer. This means that we lose bits on the pointer value itself (after doing some research, 48 bits should be enough), but we gain performance by having non-overlapping tag ranges for pointers. It seems to speed things up a bit overall:

RESULT Richards 87.9
RESULT DeltaBlue 85.6
RESULT Crypto 142
RESULT RayTrace 250
RESULT EarleyBoyer 312
RESULT RegExp 47.4
RESULT Splay 437
RESULT NavierStokes 256
SCORE 161

The final score is getting really close to the pre-PR numbers on my machine.
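
To make the change concrete, here is a hedged sketch of the scheme (the tag values follow the inner.rs doc table quoted in the review thread below; the masks and function names are illustrative, not Boa's API). Because the tag ranges don't overlap, a type check is a single mask-and-compare, and recovering the 48-bit pointer is a single mask.

```rust
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;
const PTR_MASK: u64 = 0x0000_FFFF_FFFF_FFFF; // low 48 bits hold the pointer

// Tag values as listed in the inner.rs module docs quoted below.
const TAG_BIGINT: u64 = 0x7FF8_0000_0000_0000;
const TAG_OBJECT: u64 = 0x7FFA_0000_0000_0000;
const TAG_SYMBOL: u64 = 0x7FFC_0000_0000_0000;
const TAG_STRING: u64 = 0x7FFE_0000_0000_0000;

fn is_object(bits: u64) -> bool {
    // One mask and one compare; no decoding of the payload needed.
    bits & TAG_MASK == TAG_OBJECT
}

fn untag_pointer(bits: u64) -> u64 {
    // Current x86_64/aarch64 user-space pointers fit in the low 48 bits.
    bits & PTR_MASK
}

fn main() {
    let fake_ptr: u64 = 0x0000_7F12_3456_7890;
    assert!(is_object(TAG_OBJECT | fake_ptr));
    assert!(!is_object(TAG_BIGINT | fake_ptr));
    assert!(!is_object(TAG_SYMBOL | fake_ptr));
    assert!(!is_object(TAG_STRING | fake_ptr));
    assert_eq!(untag_pointer(TAG_OBJECT | fake_ptr), fake_ptr);
}
```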

Comment on lines 22 to 25
//! | `BigInt` | `7FF8:PPPP:PPPP:PPPP` | 49-bits pointer. Assumes non-null pointer. |
//! | `Object` | `7FFA:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `Symbol` | `7FFC:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `String` | `7FFE:PPPP:PPPP:PPPP` | 49-bits pointer. |
Member: So in the pointer space we only have one unused slot left, right? Which is 7FF8?

Contributor Author: 7FF8 (or 0b0111_1111_1111_1000) is used by BigInt if the pointer is non-null. The 16th MSB isn't used (yet?), so pointers are actually 48 bits (I'll correct the comment).

Contributor Author: Let me know if the commit helps clarify this.

Member: Yes, sorry, I meant to ask if 7FFF is unused and reserved.

Contributor Author: It is unused and reserved for now.

@hansl hansl requested review from jasonwilliams and raskad January 7, 2025 16:43
@hansl hansl requested a review from a team January 27, 2025 03:05
@jedel1043 jedel1043 mentioned this pull request Feb 4, 2025
@raskad (Member) left a comment

Great work. Also very nice docs in nan_boxed!

Ran the benchmarks locally, rebased on the current register-enabled main branch, and the difference looked OK to me.

@raskad raskad requested a review from a team February 10, 2025 23:35
@raskad raskad added this to the next-release milestone Feb 10, 2025
@raskad raskad added the enhancement New feature or request label Feb 10, 2025
@hansl (Contributor, Author) commented Feb 16, 2025

@jedel1043 PTAL.

@jedel1043 (Member) left a comment

Really nice implementation! Looks good

@jedel1043 jedel1043 enabled auto-merge February 16, 2025 03:16
@jedel1043 jedel1043 added this pull request to the merge queue Feb 16, 2025
Merged via the queue into boa-dev:main with commit 81ab11f Feb 16, 2025
15 checks passed