@ahamlat ahamlat commented Nov 24, 2025

PR description

This PR reimplements four bitwise opcodes using the new UInt256 introduced in #9188.
It updates AND, OR, XOR and NOT. The changes deliver the following improvements:

Opcode   Baseline (ns/op)   Optimized (ns/op)   Improvement (%)
AND      92.406             73.915              20.01%
OR       94.804             72.537              23.49%
XOR      92.872             70.806              23.76%
NOT      55.527             42.931              22.68%
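With the 256-bit value stored as eight 32-bit limbs, each bitwise opcode reduces to one machine operation per limb. A minimal sketch of the idea (UInt256Sketch is a hypothetical stand-in, not the actual UInt256 class from #9188, which unrolls the loop):

```java
// Illustrative sketch of a limb-wise bitwise AND over a 256-bit value stored
// as eight 32-bit int limbs. UInt256Sketch, its field and its method names
// are assumptions, not the real UInt256 API.
public final class UInt256Sketch {
    final int[] limbs; // limbs[0] holds the most significant 32 bits

    UInt256Sketch(final int[] limbs) {
        this.limbs = limbs;
    }

    UInt256Sketch and(final UInt256Sketch other) {
        final int[] result = new int[8];
        for (int i = 0; i < 8; i++) {
            // one 32-bit AND per limb; the PR unrolls this loop
            result[i] = this.limbs[i] & other.limbs[i];
        }
        return new UInt256Sketch(result);
    }
}
```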

You can find the detailed benchmark results below.

AND Opcode

Benchmark                                        Mode  Cnt   Score   Error  Units
AndOperationBenchmark.executeOperation           avgt   15  92.406 ± 0.816  ns/op
AndOperationOptimizedBenchmark.executeOperation  avgt   15  73.915 ± 0.754  ns/op

OR Opcode

Benchmark                                       Mode  Cnt   Score   Error  Units
OrOperationBenchmark.executeOperation           avgt   15  94.804 ± 1.068  ns/op
OrOperationOptimizedBenchmark.executeOperation  avgt   15  72.537 ± 0.305  ns/op

XOR Opcode

Benchmark                                        Mode  Cnt   Score   Error  Units
XorOperationBenchmark.executeOperation           avgt   15  92.872 ± 3.150  ns/op
XorOperationOptimizedBenchmark.executeOperation  avgt   15  70.806 ± 0.277  ns/op

NOT Opcode

Benchmark                                        Mode  Cnt   Score   Error  Units
NotOperationBenchmark.executeOperation           avgt   15  55.527 ± 0.206  ns/op
NotOperationOptimizedBenchmark.executeOperation  avgt   15  42.931 ± 0.168  ns/op

This PR also adds JMH benchmarks for each of these opcodes to validate the performance improvements.
To run a benchmark for a specific opcode, use the following command (example for AND):

./gradlew clean :ethereum:core:jmh -Pf=5 -Pwi=10 -Pi=10 -Pincludes=AndOperation

It also adds property-based tests for each opcode to ensure that the new implementations behave as expected.
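As an illustration of the cross-check idea behind such tests, a byte-wise result can be validated against BigInteger. This is a hypothetical sketch; the helper names are illustrative and not the PR's actual test code:

```java
import java.math.BigInteger;

// Hypothetical sketch of a BigInteger cross-check for the AND opcode: a
// byte-wise AND of two 32-byte operands must match BigInteger.and on the
// same operands interpreted as unsigned big-endian magnitudes.
public final class AndCrossCheck {
    static byte[] and(final byte[] a, final byte[] b) {
        final byte[] out = new byte[32];
        for (int i = 0; i < 32; i++) {
            out[i] = (byte) (a[i] & b[i]);
        }
        return out;
    }

    static boolean matchesBigInteger(final byte[] a, final byte[] b) {
        // signum 1 treats each array as an unsigned big-endian magnitude
        final BigInteger expected = new BigInteger(1, a).and(new BigInteger(1, b));
        return new BigInteger(1, and(a, b)).equals(expected);
    }
}
```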

The implementation also includes an optimization to the fromBytesBE method, which accounts for roughly half of the overall improvement. The table below shows the improvement we get without changing fromBytesBE:

Opcode   Baseline (ns/op)   Optimized (ns/op)   Improvement (%)
AND      92.692             85.100              8.19%
XOR      94.557             84.303              10.84%
NOT      55.576             51.272              7.74%
OR       94.460             83.484              11.62%
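For context, the getIntBE helper signature shown later in this thread suggests a plain-shift big-endian read instead of going through ByteBuffer. A minimal sketch under that assumption (the exact body in the PR may differ):

```java
// Sketch of reading four bytes as a big-endian int with plain shifts instead
// of ByteBuffer. Only the getIntBE signature appears in the PR; this body and
// the BytesBE class name are assumptions about what such a helper looks like.
public final class BytesBE {
    static int getIntBE(final byte[] bytes, final int offset) {
        return ((bytes[offset] & 0xFF) << 24)
            | ((bytes[offset + 1] & 0xFF) << 16)
            | ((bytes[offset + 2] & 0xFF) << 8)
            | (bytes[offset + 3] & 0xFF);
    }
}
```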

This new implementation was also tested on MulMod and showed a significant improvement in cases where the fromBytesBE method accounts for a large share of execution time. You can find the numbers here.

Fixed Issue(s)

Thanks for sending a pull request! Have you done the following?

  • Checked out our contribution guidelines?
  • Considered documentation and added the doc-change-required label to this PR if updates are required.
  • Considered the changelog and included an update if required.
  • For database changes (e.g. KeyValueSegmentIdentifier) considered compatibility and performed forwards and backwards compatibility tests

Locally, you can run these tests to catch failures early:

  • spotless: ./gradlew spotlessApply
  • unit tests: ./gradlew build
  • acceptance tests: ./gradlew acceptanceTest
  • integration tests: ./gradlew integrationTest
  • reference tests: ./gradlew ethereum:referenceTests:referenceTests
  • hive tests: Engine or other RPCs modified?

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>

@thomas-quadratic thomas-quadratic left a comment


Thanks @ahamlat this is nice. Improvement on fromBytesBE is good. Very happy about the fast paths.

Did you try ByteBuffer.getInt instead of your helper getIntBE? It is the same code, but I was under the impression that ByteBuffer.getInt has some hardware acceleration. But I am not sure.

Doing bitwise ops sequentially for all limbs seems like the right approach to me right now.

Benchmarks show the average over all sizes. I think a possible improvement would be to parametrize sizes like for the mod ops, but that would probably only be interesting for fromBytesBE; bitwise ops are done on all limbs.

result[5] = this.limbs[5] & other.limbs[5];
result[6] = this.limbs[6] & other.limbs[6];
result[7] = this.limbs[7] & other.limbs[7];
int resultLength = nSetLimbs(result);
Contributor

In the current implementation, you don't necessarily have to do this operation; you could just set the length to N_LIMBS (or Math.min(this.length, other.length)) if that leads to a performance improvement.
But this is what we discussed the other time: do we want to optimise nSetLimbs with Arrays.mismatch and use it at little cost all the time, or do we keep this length interpretation?
Similarly for other bitwise ops.
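The Arrays.mismatch variant mentioned here could count significant limbs by comparing against a shared all-zero array. A sketch, assuming limbs are stored most-significant first and borrowing the nSetLimbs/N_LIMBS names from the discussion; the real code may differ:

```java
import java.util.Arrays;

// Sketch of counting significant limbs with Arrays.mismatch: locate the
// first non-zero limb against a shared all-zero array. Assumes limbs[0] is
// the most significant limb; nSetLimbs and N_LIMBS are names from the
// discussion, not confirmed API.
public final class LimbCount {
    static final int N_LIMBS = 8;
    private static final int[] ZERO = new int[N_LIMBS];

    static int nSetLimbs(final int[] limbs) {
        // Arrays.mismatch returns the first differing index, or -1 if equal
        final int first = Arrays.mismatch(limbs, ZERO);
        return first < 0 ? 0 : N_LIMBS - first;
    }
}
```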

Contributor Author

IIUC, with the existing implementation, I can just replace it with N_LIMBS.
I can make that change, but as for Arrays.mismatch, I think that was a proposal from @lu-pinto, so I will let him address it in another PR.

Contributor Author

}

// Helper method to read 4 bytes as big-endian int
private static int getIntBE(final byte[] bytes, final int offset) {
Contributor

I am not sure, but I think ByteBuffer.getInt in Java is the same code. However, the compiler can use intrinsics for it, whereas I am not sure it can with your code.
I can test it if you like.

Contributor Author

The implementation that is executing is from HeapByteBuffer, and it is quite different from the newly suggested one. I completely removed the use of ByteBuffer in fromBytesBE.

Contributor Author

This is the implementation that was executing before this PR:

public int getInt() {
    return SCOPED_MEMORY_ACCESS.getIntUnaligned(session(), hb, byteOffset(nextGetIndex(4)), bigEndian);
}


ahamlat commented Nov 25, 2025

Did you try with ByteBuffer getInt instead of your helper getIntBE ?

As I removed the ByteBuffer, I can't use that method anymore, and the new one showed better performance.

I think a possible improvement would be to parametrize sizes like for mod ops, but that would be probably only be interesting for fromBytesBE; bitwise ops are done on all limbs.

Bitwise opcodes are very simple and don't have complex execution paths. I think we should keep the benchmarks simple so we can evaluate performance quickly.
I executed the Mod benchmarks with the new implementation from this PR and you can find the results here.


Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>

ahamlat commented Nov 26, 2025

It does indeed perform better when setting the length of the UInt256 result to 8 limbs:

Benchmark                                        Mode  Cnt   Score   Error  Units
AndOperationBenchmark.executeOperation           avgt   15  94.534 ± 5.494  ns/op
AndOperationOptimizedBenchmark.executeOperation  avgt   15  69.519 ± 0.172  ns/op
Benchmark                                       Mode  Cnt   Score   Error  Units
OrOperationBenchmark.executeOperation           avgt   15  94.715 ± 1.157  ns/op
OrOperationOptimizedBenchmark.executeOperation  avgt   15  70.134 ± 1.206  ns/op
Benchmark                                        Mode  Cnt   Score   Error  Units
XorOperationBenchmark.executeOperation           avgt   15  94.613 ± 1.007  ns/op
XorOperationOptimizedBenchmark.executeOperation  avgt   15  69.575 ± 0.173  ns/op


ahamlat commented Dec 1, 2025

@thomas-quadratic @lu-pinto I addressed all the comments, could you take another look?

// Assert - compare with Bytes.and() (existing implementation)
final Bytes bytesA = Bytes32.leftPad(Bytes.wrap(a));
final Bytes bytesB = Bytes32.leftPad(Bytes.wrap(b));
final byte[] expected = bytesA.and(bytesB).toArrayUnsafe();

@lu-pinto lu-pinto Dec 1, 2025


I would be more at ease if you would compare it with BigInteger instead of tuweni

Member

Oh I see there's one with BigInteger next. Why compare with both then? Is it not overkill?

Contributor Author

I was trying to cover as much as possible, so I don't think it is overkill, but I can remove it if you think we should only compare with the existing implementation.

final byte[] expected = bytesA.not().toArrayUnsafe();
assertThat(resultBytes).containsExactly(expected);

System.out.println("✓ Test PASSED - matches Bytes.not()");
Member

why the printouts in tests? AI generated? That looks strange...

Contributor Author

Yes, that was for testing purposes and I forgot to remove it. Let me remove it.

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>

@lu-pinto lu-pinto left a comment


LGTM

@ahamlat ahamlat merged commit c844ea1 into hyperledger:main Dec 2, 2025
46 checks passed
AliZDev-v0 pushed a commit to AliZDev-v0/besu that referenced this pull request Dec 10, 2025
* Optimize AND, OR, XOR and NOT opcodes using new UInt256 implementation

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Co-authored-by: Luis Pinto <luis.pinto@consensys.net>
Signed-off-by: Ali Zhagparov <alijakparov.kz@gmail.com>
pinges pushed a commit to pinges/besu that referenced this pull request Dec 15, 2025
* Optimize AND, OR, XOR and NOT opcodes using new UInt256 implementation

Signed-off-by: Ameziane H. <ameziane.hamlat@consensys.net>
Co-authored-by: Luis Pinto <luis.pinto@consensys.net>
Signed-off-by: stefan.pingel@consensys.net <stefan.pingel@consensys.net>