
SHxADD / ADDUW Instruction Optimization on AArch64 #504

Merged
mohanson merged 3 commits into nervosnetwork:develop from yuqiliu617:contrib/shxadd
Mar 31, 2026

Conversation

@yuqiliu617 (Contributor) commented Mar 25, 2026

Summary

Eliminated redundant pre-shift mov and lsl instructions from 7 Zba extension handlers (ADDUW, SH1ADD, SH2ADD, SH3ADD, SH1ADDUW, SH2ADDUW, SH3ADDUW) by exploiting ARM64's extended-register and shifted-register add forms.

Background

The RISC-V Zba extension provides shift-and-add instructions in two families:

  • SHxADD (64-bit): rd = (rs1 << N) + rs2 for $N \in \{1, 2, 3\}$
  • SHxADDUW / ADDUW (32-bit): rd = (zero_extend(rs1[31:0]) << N) + rs2 for $N \in \{0, 1, 2, 3\}$

ARM64 add natively supports both patterns via operand modifiers:

  • Shifted register: add Xd, Xn, Xm, LSL #N — computes Xn + (Xm << N) in one instruction
  • Extended register: add Xd, Xn, Wm, UXTW #N — zero-extends Wm to 64 bits, shifts left by N, then adds to Xn
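The two mappings can be sanity-checked with a small model. The following is an illustrative Rust sketch (not project code; the function names are mine): it models the RISC-V semantics on one side and the ARM64 operand forms on the other, and asserts they agree for every shift amount used by Zba.

```rust
/// RISC-V SHxADD: rd = (rs1 << N) + rs2, 64-bit wrapping arithmetic.
fn shxadd(rs1: u64, rs2: u64, n: u32) -> u64 {
    (rs1 << n).wrapping_add(rs2)
}

/// RISC-V SHxADDUW / ADDUW: rd = (zero_extend(rs1[31:0]) << N) + rs2.
fn shxadduw(rs1: u64, rs2: u64, n: u32) -> u64 {
    (((rs1 as u32) as u64) << n).wrapping_add(rs2)
}

/// ARM64 `add Xd, Xn, Xm, LSL #N`: Xn + (Xm << N).
fn arm_add_lsl(xn: u64, xm: u64, n: u32) -> u64 {
    xn.wrapping_add(xm << n)
}

/// ARM64 `add Xd, Xn, Wm, UXTW #N`: Xn + (zero_extend(Wm) << N).
fn arm_add_uxtw(xn: u64, wm: u64, n: u32) -> u64 {
    xn.wrapping_add(((wm as u32) as u64) << n)
}

fn main() {
    let (rs1, rs2) = (0xdead_beef_cafe_f00d_u64, 0x0123_4567_89ab_cdef_u64);
    // SHxADD maps onto the shifted-register form (note the swapped operands:
    // the shifted source becomes the last operand of the ARM64 add).
    for n in 1..=3 {
        assert_eq!(shxadd(rs1, rs2, n), arm_add_lsl(rs2, rs1, n));
    }
    // ADDUW (N = 0) and SHxADDUW map onto the extended-register form.
    for n in 0..=3 {
        assert_eq!(shxadduw(rs1, rs2, n), arm_add_uxtw(rs2, rs1, n));
    }
    println!("ok");
}
```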

Before / After

Before (SH2ADDUW — representative):

  ldr RS1, REGISTER_ADDRESS(RS1)
  ldr RS2, REGISTER_ADDRESS(RS2)
  mov RS1w, RS1w              ← zero-extend to 64 bits
  lsl RS1, RS1, 2             ← shift left by 2
  add RS1, RS1, RS2           ← add RS2

After (SH2ADDUW):

  ldr RS1, REGISTER_ADDRESS(RS1)
  ldr RS2, REGISTER_ADDRESS(RS2)
  add RS1, RS2, RS1w, uxtw #2  ← zero-extend RS1w, shift by 2, add RS2 — all in one

Before (SH2ADD — representative):

  ldr RS1, REGISTER_ADDRESS(RS1)
  ldr RS2, REGISTER_ADDRESS(RS2)
  lsl RS1, RS1, 2             ← shift left by 2
  add RS1, RS1, RS2           ← add RS2

After (SH2ADD):

  ldr RS1, REGISTER_ADDRESS(RS1)
  ldr RS2, REGISTER_ADDRESS(RS2)
  add RS1, RS2, RS1, lsl #2   ← shift RS1 by 2, add RS2 — all in one

Instruction savings per handler

| Instruction | Before | After | Saved |
| --- | --- | --- | --- |
| ADDUW | `mov RS1w, RS1w` + `add` | `add RS1, RS2, RS1, uxtw` | −1 |
| SH1ADD | `lsl RS1, RS1, 1` + `add` | `add RS1, RS2, RS1, lsl #1` | −1 |
| SH2ADD | `lsl RS1, RS1, 2` + `add` | `add RS1, RS2, RS1, lsl #2` | −1 |
| SH3ADD | `lsl RS1, RS1, 3` + `add` | `add RS1, RS2, RS1, lsl #3` | −1 |
| SH1ADDUW | `mov` + `lsl RS1, RS1, 1` + `add` | `add RS1, RS2, RS1, uxtw #1` | −2 |
| SH2ADDUW | `mov` + `lsl RS1, RS1, 2` + `add` | `add RS1, RS2, RS1, uxtw #2` | −2 |
| SH3ADDUW | `mov` + `lsl RS1, RS1, 3` + `add` | `add RS1, RS2, RS1, uxtw #3` | −2 |

Why This Is Safe

  • add Xd, Xn, Wm, UXTW #N: ARM64 guarantees that Wm is zero-extended to 64 bits before shifting, matching RISC-V's zero_extend(rs1[31:0]) semantics exactly.
  • add Xd, Xn, Xm, LSL #N: ARM64 applies a 64-bit left shift, matching RISC-V's 64-bit SHxADD semantics.
  • The shift amounts 1, 2, 3 are immediate constants embedded in the instruction encoding; no runtime masking is needed.
  • The destination register (RS1) is the same as before, so WRITE_RD is unchanged.
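One subtlety worth spelling out: UXTW zero-extends before shifting, so bits shifted past bit 31 are preserved in the 64-bit result. A small Rust sketch (illustrative only, not project code) makes the ordering concrete by contrasting it with the wrong order, which would truncate:

```rust
/// Extend-then-shift, the UXTW #n order: zero_extend(rs1[31:0]) << n.
fn uxtw_shift(rs1: u64, n: u32) -> u64 {
    ((rs1 as u32) as u64) << n
}

/// Wrong order for comparison: shift in 32 bits, then extend.
/// Bits shifted past bit 31 are lost before the extension happens.
fn shift_then_truncate(rs1: u64, n: u32) -> u64 {
    ((rs1 as u32) << n) as u64
}

fn main() {
    // High word set, and bit 31 of the low word set.
    let rs1 = 0xffff_ffff_8000_0000_u64;
    // Extend first: bit 31 survives the shift, landing at bit 34.
    assert_eq!(uxtw_shift(rs1, 3), 0x4_0000_0000);
    // Shift in 32 bits first: bit 31 falls off the top.
    assert_eq!(shift_then_truncate(rs1, 3), 0);
    println!("ok");
}
```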

Benchmark

tests/programs/shadd_microbench.S — 125K iterations × 8 unrolled instructions, covering all 7 handlers (SH1ADD appears twice). Instructions are chained through t2 to serialize execution and measure handler latency.

benches/shadd_benchmark.rs — Criterion benchmark; reports mean, median, standard deviation, and 95% confidence intervals.

cargo bench --features=asm shadd_microbench

Results

Benchmark environment: Aliyun ecs.g8y.small, YiTian 710 (1 core), 4 GB RAM

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Mean | [2.622, 2.637, 2.655] ms | [2.426, 2.437, 2.451] ms | [−8.4%, −7.6%, −6.8%] |
| Median | [2.599, 2.601, 2.603] ms | [2.408, 2.409, 2.411] ms | [−7.5%, −7.4%, −7.3%] |
| Std Dev | 0.269 ms | 0.203 ms | −24.6% |
| MAD | 25.3 µs | 20.2 µs | −20.1% |

Interpretation

Mean and median both improve by approximately 7.5%. This is consistent with the instruction savings: the UW variants each lose 2 instructions (saving 33% of the non-load body), and the regular SHxADD variants each lose 1 instruction. The microbench is evenly split between the two families (4 UW + 4 non-UW), so the average saving is ~1.5 instructions per handler, or roughly 25% of the pre-optimization body.
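The averaging step can be checked directly (a trivial sketch; the per-slot savings come from the table above, with SH1ADD occupying two of the eight unrolled slots):

```rust
fn main() {
    // 8 unrolled microbench slots: 4 UW-family handlers save 2 instructions
    // each, 4 non-UW handlers (SH1ADD twice, SH2ADD, SH3ADD) save 1 each.
    let per_slot_savings = [2u32, 2, 2, 2, 1, 1, 1, 1];
    let total: u32 = per_slot_savings.iter().sum();
    let avg = total as f64 / per_slot_savings.len() as f64;
    // Matches the ~1.5 instructions saved per handler quoted in the text.
    assert_eq!(avg, 1.5);
    println!("{avg}");
}
```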

The standard deviation drops by 24.6% and MAD by 20.1%, indicating lower run-to-run variability alongside the throughput gain.

@yuqiliu617 force-pushed the contrib/shxadd branch 3 times, most recently from edf1572 to 969b8c8 (March 25, 2026 15:55)
@mohanson (Collaborator) commented:

Please rebase the develop branch, and then delete any unnecessary files.

Support uniformed RISCV toolchain binaries
mohanson previously approved these changes Mar 31, 2026
Copilot AI left a comment


Pull request overview

Optimizes AArch64 implementations of RISC-V Zba ADDUW and SHxADD/SHxADDUW by using ARM64’s shifted-register and extended-register add forms to remove redundant mov/lsl sequences, reducing instruction count in the hot handlers. Also refactors the native RISC-V test-program build script to support configurable toolchain prefixes/flags and to reduce duplication.

Changes:

  • Replace mov/lsl + add sequences in 7 Zba handlers with single-instruction add ... lsl #N / add ... uxtw #N forms on AArch64.
  • Refactor tests/programs/_build_all_native.sh to use helper functions and configurable toolchain prefix/flags.
  • Adjust several assembly build invocations to go through the shared asm_link helper.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/programs/_build_all_native.sh | Refactors build steps via helper functions and introduces toolchain configurability (but currently introduces shell-compatibility and output-name issues). |
| src/machine/asm/execute_aarch64.S | Collapses the Zba ADDUW/SHxADD* handlers into single `add` instructions using shift/extend modifiers. |


Add two shell functions, `asm_link` and `gcc_compile`, to simplify the compilation commands
Use aarch64's native shift/extend add patterns
mohanson merged commit 7ef2936 into nervosnetwork:develop on Mar 31, 2026
8 checks passed
yuqiliu617 deleted the contrib/shxadd branch on March 31, 2026 02:43
3 participants