
Conversation

@hansl (Contributor) commented Dec 20, 2024

This adds a new InnerValue that uses NaN-boxing so that JsValue ultimately takes only 64 bits (8 bytes). This lets a JsValue fit into a single x86_64 register, which should greatly improve performance.

For more details on NaN-boxing (and its alternative, pointer tagging), see https://piotrduperas.com/posts/nan-boxing, which is a great tutorial (in C) on how to get there. A proper explanation of each tag and range of values is given in the header of the inner.rs module.
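
For readers unfamiliar with the trick, here is a minimal, self-contained Rust sketch of the core idea. It is illustrative only: the tag constant, type name, and 32-bit integer payload are placeholders of mine, not the actual encoding used in inner.rs.

```rust
/// Minimal NaN-boxing sketch: one 64-bit slot that holds either a real f64
/// or a 32-bit integer hidden inside the quiet-NaN payload space.
#[derive(Clone, Copy)]
struct Boxed(u64);

/// A quiet NaN with an extra payload bit set, so it cannot collide with the
/// canonical NaN produced by arithmetic (0x7FF8_0000_0000_0000).
const INT_TAG: u64 = 0x7FFC_0000_0000_0000;
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;

impl Boxed {
    fn from_f64(x: f64) -> Self {
        // Real doubles are stored verbatim; this sketch only supports the
        // canonical NaN so tagged values can never be mistaken for a float.
        debug_assert!(!x.is_nan() || x.to_bits() == f64::NAN.to_bits());
        Boxed(x.to_bits())
    }

    fn from_i32(i: i32) -> Self {
        // Park the integer in the low 32 bits of the NaN payload.
        Boxed(INT_TAG | u64::from(i as u32))
    }

    fn as_i32(self) -> Option<i32> {
        (self.0 & TAG_MASK == INT_TAG).then(|| self.0 as u32 as i32)
    }

    fn as_f64(self) -> Option<f64> {
        (self.0 & TAG_MASK != INT_TAG).then(|| f64::from_bits(self.0))
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<Boxed>(), 8);
    assert_eq!(Boxed::from_f64(3.25).as_f64(), Some(3.25));
    assert_eq!(Boxed::from_i32(-7).as_i32(), Some(-7));
    assert_eq!(Boxed::from_i32(-7).as_f64(), None);
}
```

A real engine extends the same tag space to cover pointers, which is what the inner.rs header in this PR describes.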

Fixes #1373.

@hansl (Contributor, Author) commented Dec 20, 2024

All tests in inner.rs are passing. Unfortunately, running cargo test -p boa_engine currently results in segfaults (invalid memory reference). This will remain a draft until at least all tests pass.

@hansl hansl marked this pull request as ready for review December 20, 2024 23:38
@hansl hansl requested a review from a team December 20, 2024 23:38
@raskad (Member) commented Dec 21, 2024

Still failing some tests:

| Test result | main count | PR count | difference |
| --- | --- | --- | --- |
| Total | 48,625 | 48,625 | 0 |
| Passed | 43,616 | 43,606 | -10 |
| Ignored | 1,471 | 1,471 | 0 |
| Failed | 3,538 | 3,548 | +10 |
| Panics | 0 | 0 | 0 |
| Conformance | 89.70% | 89.68% | -0.02% |
Broken tests (10):
test/built-ins/TypedArray/prototype/map/return-new-typedarray-conversion-operation.js (previously Passed)
test/built-ins/TypedArray/prototype/set/typedarray-arg-set-values-diff-buffer-other-type-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/set/array-arg-src-tonumber-value-conversions.js (previously Passed)
test/built-ins/TypedArray/prototype/fill/fill-values-conversion-operations.js (previously Passed)
test/built-ins/Map/valid-keys.js (previously Passed)
test/built-ins/DataView/prototype/setFloat64/set-values-return-undefined.js (previously Passed)
test/built-ins/DataView/prototype/setFloat32/set-values-return-undefined.js (previously Passed)
test/built-ins/TypedArrayConstructors/ctors/object-arg/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/DefineOwnProperty/conversion-operation.js (previously Passed)
test/built-ins/TypedArrayConstructors/internals/Set/conversion-operation.js (previously Passed)

But the bigger issue that I see is performance. I ran the benchmarks locally and performance was worse across the board, in some benchmarks by almost 40%. I thought a performance tradeoff in exchange for lower memory usage seemed reasonable and expected for this change, but this seems way too extreme.

Benchmarks before:

RESULT Richards 78.5
RESULT DeltaBlue 71.8
RESULT Crypto 80.3
RESULT RayTrace 182
RESULT EarleyBoyer 227
RESULT RegExp 43.7
RESULT Splay 326
RESULT NavierStokes 187
SCORE 122

After:

RESULT Richards 49.9
RESULT DeltaBlue 43.7
RESULT Crypto 77.7
RESULT RayTrace 138
RESULT EarleyBoyer 161
RESULT RegExp 38.9
RESULT Splay 214
RESULT NavierStokes 163
SCORE 92.0

@jedel1043 (Member) commented Dec 21, 2024

@raskad @hansl As you probably already know, not all memory optimizations will yield (or even preserve) performance gains. It may work out that way sometimes, but there's a balance between memory and speed that big memory optimizations (such as this one) have to deal with.

The first step is to make every test pass; then we'd have to optimize iteratively to recover the lost performance, and hopefully we'll end up with an engine that's as fast as before but uses a fraction of the memory.

Having said that, we should measure what the overall memory gains are, because this needs to save a considerable amount of memory to be worth putting more work into.

@hansl (Contributor, Author) commented Dec 21, 2024

There's no rush for this, since the actual breaking change already happened :)

I'll run benchmarks on my own and try to check a memory profiler, in the next 2 weeks or so. It's also possible that I'm missing a few easy inlines, and adding some bit magic might make things faster down the line.

First things first, I'll make sure test262 has no regressions.

@hansl (Contributor, Author) commented Dec 21, 2024

@raskad BTW this should cover both; it should take less memory AND be more performant. But I guess we pass JsValue more by reference than by value. Now that we can move JsValue for free, maybe we should reconsider some other parts of the code as well. Later work.

This article shows a 28% increase in performance for a tagged union of pointers in JavaScriptCore (WebKit's engine). Though, to be fair, we weren't using pointers.

codecov bot commented Dec 21, 2024

Codecov Report

Attention: Patch coverage is 73.76426% with 138 lines in your changes missing coverage. Please review.

Project coverage is 52.97%. Comparing base (6ddc2b4) to head (c39fbbf).
Report is 370 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/engine/src/value/operations.rs | 51.56% | 62 Missing ⚠️ |
| core/engine/src/value/conversions/try_from_js.rs | 16.66% | 25 Missing ⚠️ |
| core/engine/src/value/inner/nan_boxed.rs | 87.75% | 24 Missing ⚠️ |
| core/engine/src/value/mod.rs | 83.33% | 16 Missing ⚠️ |
| core/engine/src/value/equality.rs | 89.47% | 4 Missing ⚠️ |
| core/engine/src/value/hash.rs | 60.00% | 4 Missing ⚠️ |
| core/engine/src/value/conversions/serde_json.rs | 66.66% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4091      +/-   ##
==========================================
+ Coverage   47.24%   52.97%   +5.72%     
==========================================
  Files         476      488      +12     
  Lines       46892    52465    +5573     
==========================================
+ Hits        22154    27793    +5639     
+ Misses      24738    24672      -66     


@raskad (Member) commented Dec 21, 2024

@hansl I did a quick perf run on the release-dbg profile and noticed that there are a lot of operations relating to the Box that JsObject is now wrapped in. For example, in JsObject::try_get we call self.clone().into() on the JsObject to turn it into a JsValue. In my baseline the into call only makes up 0.5% of the function runtime, but now, with the alloc for the Box, it takes 6.2%. Similarly, dropping that JsValue later went from 2.3% of the runtime to 8.4%.
I'm totally unsure if that would explain everything, since I can't really pin it down in the inverted call stack, but it might be a starting point.
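
As a rough, standalone illustration of why that extra Box hurts (not Boa code; the type name is made up): cloning a reference-counted handle is only a counter increment, while cloning a Box has to go through the allocator every time.

```rust
use std::rc::Rc;

// Stand-in for a heavyweight engine object; the real JsObject is more involved.
#[derive(Clone)]
struct BigObject {
    data: [u64; 16],
}

fn main() {
    let shared: Rc<BigObject> = Rc::new(BigObject { data: [0; 16] });
    let boxed: Box<BigObject> = Box::new(BigObject { data: [0; 16] });

    // Cheap: Rc::clone only bumps a reference count, no allocation.
    let _shared2 = Rc::clone(&shared);

    // Expensive: Box::clone allocates a fresh heap block and copies the
    // 128-byte payload, which is roughly the extra cost every
    // self.clone().into() conversion now pays in the profile above.
    let _boxed2 = boxed.clone();
}
```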

@jedel1043 (Member) replied:

That makes a lot of sense, actually, because clones of JsValues containing JsObjects are no longer as cheap as just increasing a ref count.

The "proper" solution would be to make JsObject 8 bytes again by using manual vtables instead of dyn objects (something very similar to what we do on GcBox).

@hansl (Contributor, Author) commented Dec 22, 2024

@raskad If we NaN-boxed Gc and GcBox instead of Box-ing, this might go away, similar to what @jedel1043 is saying, since cloning would be essentially free. I need to put the value on the heap, though, so just receiving and taking a pointer to a JsObject is not an option.

Let me make sure we don't regress on test262 then I'll look into the performance.

@hansl (Contributor, Author) commented Dec 23, 2024

Okay this one is fascinating...

If I only create one value in an expression, e.g. -0 or [-0] or [[[[[-0]]]]], it keeps the sign just fine.

If I create multiple negative zeroes in an expression, only the first one keeps its sign. [-0, -0, -0, -0] becomes [-0, 0, 0, 0]... I'm looking for a place where we would cache these, but I can't seem to find one. I'm adding tests everywhere, but I'm flailing a bit.

Cloning the negative zero keeps the sign, copying is not supported, ... It seems the bug is not in my code, but that's the only thing I changed... Any ideas?
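
For context on why this is subtle, a tiny standalone sketch (not Boa code): the only difference between 0.0 and -0.0 is the sign bit, and == cannot see it, so any canonicalization on the boxing path has to look at the raw bits.

```rust
fn main() {
    let pos = 0.0_f64;
    let neg = -0.0_f64;

    // IEEE-754 equality says they are the same number...
    assert!(pos == neg);

    // ...but the bit patterns differ in exactly one bit: the sign.
    assert_eq!(pos.to_bits(), 0x0000_0000_0000_0000);
    assert_eq!(neg.to_bits(), 0x8000_0000_0000_0000);

    // So preserving negative zero through a NaN-boxed representation means
    // comparing or canonicalizing on raw bits (or checking the sign explicitly).
    assert!(neg.is_sign_negative());
    assert!(!pos.is_sign_negative());
}
```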

@hansl (Contributor, Author) commented Dec 26, 2024

There's something funny with the parser and/or optimizer, but in any case [-0, -0] results in the two values taking different code paths. Found a bug anyway, so that's great. Fixed the bug; ready for reviews.

I'm still planning on running the benchmark before the end of the year.

@hansl (Contributor, Author) commented Jan 1, 2025

Benchmarks (using taskpolicy -b on macOS restricts the process to the efficiency cores, which explains why it's so slow, but it makes the comparison more fair):

Enum-based

$ cargo build  --release --bin boa
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 19.7
PROGRESS DeltaBlue
RESULT DeltaBlue 18.5
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 21.2
PROGRESS RayTrace
RESULT RayTrace 48.8
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 58.8
PROGRESS RegExp
RESULT RegExp 10.9
PROGRESS Splay
RESULT Splay 82.3
PROGRESS NavierStokes
RESULT NavierStokes 50.4
SCORE 31.6

NaN-Boxed

$ cargo build  --release --bin boa --features nan-box-jsvalue
...
$ taskpolicy -b ./target/release/boa -O ~/Sources/boa-dev/data/bench/bench-v8/combined.js
PROGRESS Richards
RESULT Richards 12.8
PROGRESS DeltaBlue
RESULT DeltaBlue 12.2
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 22.0
PROGRESS RayTrace
RESULT RayTrace 37.5
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 45.2
PROGRESS RegExp
RESULT RegExp 9.97
PROGRESS Splay
RESULT Splay 56.7
PROGRESS NavierStokes
RESULT NavierStokes 44.5
SCORE 24.9

@hansl (Contributor, Author) commented Jan 5, 2025

On the latest commit, I changed the pointer tagging to use the top bits of the pointer. This means that we lose bits on the pointer value itself (after doing some research, 48 bits should be enough), but we gain performance by having non-overlapping tag ranges for pointers. It seems to speed things up a bit overall:

RESULT Richards 87.9
RESULT DeltaBlue 85.6
RESULT Crypto 142
RESULT RayTrace 250
RESULT EarleyBoyer 312
RESULT RegExp 47.4
RESULT Splay 437
RESULT NavierStokes 256
SCORE 161

The final score is getting really close to the pre-PR numbers on my machine.
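
To make the change concrete, here is a hedged sketch of the scheme (the tag values follow the inner.rs doc table quoted in the review thread below; the masks and function names are illustrative, not Boa's API). Because the tag ranges don't overlap, a type check is a single mask-and-compare, and recovering the 48-bit pointer is a single mask.

```rust
const TAG_MASK: u64 = 0xFFFF_0000_0000_0000;
const PTR_MASK: u64 = 0x0000_FFFF_FFFF_FFFF; // low 48 bits hold the pointer

// Tag values as listed in the inner.rs module docs quoted below.
const TAG_BIGINT: u64 = 0x7FF8_0000_0000_0000;
const TAG_OBJECT: u64 = 0x7FFA_0000_0000_0000;
const TAG_SYMBOL: u64 = 0x7FFC_0000_0000_0000;
const TAG_STRING: u64 = 0x7FFE_0000_0000_0000;

fn is_object(bits: u64) -> bool {
    // One mask and one compare; no decoding of the payload needed.
    bits & TAG_MASK == TAG_OBJECT
}

fn untag_pointer(bits: u64) -> u64 {
    // Current x86_64/aarch64 user-space pointers fit in the low 48 bits.
    bits & PTR_MASK
}

fn main() {
    let fake_ptr: u64 = 0x0000_7F12_3456_7890;
    assert!(is_object(TAG_OBJECT | fake_ptr));
    assert!(!is_object(TAG_BIGINT | fake_ptr));
    assert!(!is_object(TAG_SYMBOL | fake_ptr));
    assert!(!is_object(TAG_STRING | fake_ptr));
    assert_eq!(untag_pointer(TAG_OBJECT | fake_ptr), fake_ptr);
}
```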

Comment on lines 22 to 25
//! | `BigInt` | `7FF8:PPPP:PPPP:PPPP` | 49-bits pointer. Assumes non-null pointer. |
//! | `Object` | `7FFA:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `Symbol` | `7FFC:PPPP:PPPP:PPPP` | 49-bits pointer. |
//! | `String` | `7FFE:PPPP:PPPP:PPPP` | 49-bits pointer. |
Member: So in the pointer space we only have one unused slot left, right? Which is 7FF8?

Contributor Author: 7FF8 (or 0b0111_1111_1111_1000) is used by BigInt if the pointer is non-null. The 16th MSB isn't used (yet?), so pointers are actually 48 bits (I'll correct the comment).

Contributor Author: Let me know if the commit helps clarify this.

Member: Yes, sorry, I meant to ask if 7FFF is unused and reserved.

Contributor Author: It is unused and reserved for now.

@hansl hansl requested review from jasonwilliams and raskad January 7, 2025 16:43
@hansl hansl requested a review from a team January 27, 2025 03:05
@jedel1043 jedel1043 mentioned this pull request Feb 4, 2025
@raskad (Member) left a comment

Great work. Also very nice docs in nan_boxed!

Ran the benchmarks locally, rebased on the current register-enabled main branch, and the difference looked OK to me.

@raskad raskad requested a review from a team February 10, 2025 23:35
@raskad raskad added this to the next-release milestone Feb 10, 2025
@raskad raskad added the enhancement New feature or request label Feb 10, 2025
@hansl (Contributor, Author) commented Feb 16, 2025

@jedel1043 PTAL.

@jedel1043 (Member) left a comment

Really nice implementation! Looks good

@jedel1043 jedel1043 enabled auto-merge February 16, 2025 03:16
@jedel1043 jedel1043 added this pull request to the merge queue Feb 16, 2025
Merged via the queue into boa-dev:main with commit 81ab11f Feb 16, 2025
15 checks passed