SHxADD / ADDUW Instruction Optimization on AArch64 (#504)
Merged
mohanson merged 3 commits into nervosnetwork:develop on Mar 31, 2026
Conversation
Force-pushed from edf1572 to 969b8c8
mohanson reviewed Mar 26, 2026
mohanson reviewed Mar 31, 2026
mohanson (Collaborator): Please rebase onto the develop branch, then delete any unnecessary files.
Force-pushed from 969b8c8 to ebe7a24
Support uniformed RISCV toolchain binaries
Force-pushed from ebe7a24 to 1142997
mohanson previously approved these changes on Mar 31, 2026
Pull request overview
Optimizes AArch64 implementations of RISC-V Zba ADDUW and SHxADD/SHxADDUW by using ARM64’s shifted-register and extended-register add forms to remove redundant mov/lsl sequences, reducing instruction count in the hot handlers. Also refactors the native RISC-V test-program build script to support configurable toolchain prefixes/flags and to reduce duplication.
Changes:
- Replace `mov`/`lsl`+`add` sequences in 7 Zba handlers with single-instruction `add ... lsl #N` / `add ... uxtw #N` forms on AArch64.
- Refactor `tests/programs/_build_all_native.sh` to use helper functions and a configurable toolchain prefix/flags.
- Adjust several assembly build invocations to go through the shared `asm_link` helper.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/programs/_build_all_native.sh | Refactors build steps via helper functions and introduces toolchain configurability (but currently introduces shell-compatibility and output-name issues). |
| src/machine/asm/execute_aarch64.S | Collapses Zba ADDUW/SHxADD* handlers into single add instructions using shift/extend modifiers. |
Add two shell functions, `asm_link` and `gcc_compile`, to simplify the compilation commands
Use aarch64's native shift/extend add patterns
Force-pushed from 4663824 to bd42e4e
mohanson approved these changes on Mar 31, 2026
Summary
Eliminated redundant pre-shift `mov` and `lsl` instructions from 7 Zba extension handlers (`ADDUW`, `SH1ADD`, `SH2ADD`, `SH3ADD`, `SH1ADDUW`, `SH2ADDUW`, `SH3ADDUW`) by exploiting ARM64's extended-register and shifted-register `add` forms.

Background
The RISC-V Zba extension provides shift-and-add instructions in two families:
- SHxADD: `rd = (rs1 << N) + rs2`
- SHxADDUW: `rd = (zero_extend(rs1[31:0]) << N) + rs2`

ARM64's `add` natively supports both patterns via operand modifiers:
- `add Xd, Xn, Xm, LSL #N` — computes `Xn + (Xm << N)` in one instruction
- `add Xd, Xn, Wm, UXTW #N` — zero-extends `Wm` to 64 bits, shifts left by N, then adds to `Xn`

Before / After
Before (SH2ADDUW, representative): `mov` + `lsl RS1, RS1, 2` + `add`

After (SH2ADDUW): `add RS1, RS2, RS1, uxtw #2`

Before (SH2ADD, representative): `lsl RS1, RS1, 2` + `add`

After (SH2ADD): `add RS1, RS2, RS1, lsl #2`
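The rewrite can be sanity-checked against the RISC-V semantics with a small model; below is a sketch in Rust (the handler's `RS1`/`RS2` register macros are not reproduced, and the sample operand values are arbitrary):

```rust
// Model RISC-V SH2ADD / SH2ADD.UW next to the ARM64 add forms that now
// implement them, using wrapping arithmetic as the hardware does.

/// RISC-V SH2ADD.UW: rd = (zero_extend(rs1[31:0]) << 2) + rs2
fn sh2add_uw(rs1: u64, rs2: u64) -> u64 {
    ((rs1 as u32 as u64) << 2).wrapping_add(rs2)
}

/// ARM64 `add Xd, Xn, Wm, UXTW #2`: zero-extend Wm, shift left by 2, add to Xn
fn add_uxtw_2(xn: u64, wm: u64) -> u64 {
    xn.wrapping_add((wm & 0xffff_ffff) << 2)
}

/// RISC-V SH2ADD: rd = (rs1 << 2) + rs2 (full 64-bit shift)
fn sh2add(rs1: u64, rs2: u64) -> u64 {
    (rs1 << 2).wrapping_add(rs2)
}

/// ARM64 `add Xd, Xn, Xm, LSL #2`: 64-bit shift of Xm, then add to Xn
fn add_lsl_2(xn: u64, xm: u64) -> u64 {
    xn.wrapping_add(xm << 2)
}

fn main() {
    let samples = [(0u64, 0u64), (0xffff_ffff_ffff_ffff, 1), (0x1_2345_6789, 42)];
    for (rs1, rs2) in samples {
        // the ARM64 forms take rs2 as Xn and rs1 as the shifted/extended operand
        assert_eq!(sh2add_uw(rs1, rs2), add_uxtw_2(rs2, rs1));
        assert_eq!(sh2add(rs1, rs2), add_lsl_2(rs2, rs1));
    }
    println!("SH2ADD / SH2ADD.UW match the ARM64 forms");
}
```

The same check extends to shift amounts 1 and 3 and to plain ADDUW (shift 0), since only the immediate in the modifier changes.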
Instruction savings per handler
| Handler | Before | After |
|---|---|---|
| ADDUW | `mov RS1w, RS1w` + `add` | `add RS1, RS2, RS1, uxtw` |
| SH1ADD | `lsl RS1, RS1, 1` + `add` | `add RS1, RS2, RS1, lsl #1` |
| SH2ADD | `lsl RS1, RS1, 2` + `add` | `add RS1, RS2, RS1, lsl #2` |
| SH3ADD | `lsl RS1, RS1, 3` + `add` | `add RS1, RS2, RS1, lsl #3` |
| SH1ADDUW | `mov` + `lsl RS1, RS1, 1` + `add` | `add RS1, RS2, RS1, uxtw #1` |
| SH2ADDUW | `mov` + `lsl RS1, RS1, 2` + `add` | `add RS1, RS2, RS1, uxtw #2` |
| SH3ADDUW | `mov` + `lsl RS1, RS1, 3` + `add` | `add RS1, RS2, RS1, uxtw #3` |

Why This Is Safe
- `add Xd, Xn, Wm, UXTW #N`: ARM64 guarantees that `Wm` is zero-extended to 64 bits before shifting, matching RISC-V's `zero_extend(rs1[31:0])` semantics exactly.
- `add Xd, Xn, Xm, LSL #N`: ARM64 applies a 64-bit left shift, matching RISC-V's 64-bit SHxADD semantics.
- The destination register (`RS1`) is the same as before, so `WRITE_RD` is unchanged.

Benchmark
- `tests/programs/shadd_microbench.S` — 125K iterations × 8 unrolled instructions, covering all 7 handlers (SH1ADD appears twice). Instructions are chained through `t2` to serialize execution and measure handler latency.
- `benches/shadd_benchmark.rs` — Criterion benchmark; reports mean, median, standard deviation, and 95% confidence intervals.
Results
Benchmark environment: Aliyun `ecs.g8y.small`, YiTian 710 (1 core), 4 GB RAM

Interpretation
Mean and median both improve by approximately 7.5%. This is consistent with the instruction savings: the UW variants each lose 2 instructions (saving 33% of the non-load body), and the regular SHxADD variants each lose 1 instruction. The microbench is evenly split between the two families (4 UW + 4 non-UW), so the average saving is ~1.5 instructions per handler, or roughly 25% of the pre-optimization body.
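The ~1.5-instruction figure follows directly from the per-handler savings; a quick arithmetic check (the 6-instruction pre-optimization body length is inferred here from the stated 33% figure, not taken from the source):

```rust
fn main() {
    // 4 UW handlers in the microbench each save 2 instructions,
    // 4 non-UW handlers each save 1
    let total_saved = 4 * 2 + 4 * 1;
    let avg_saving = total_saved as f64 / 8.0;
    assert_eq!(avg_saving, 1.5);

    // saving 2 instructions = 33% of the non-load body implies a ~6-instruction
    // body, so 1.5 saved instructions is 1.5 / 6 = 25% of that body
    let body = 6.0;
    println!("avg saving: {avg_saving} instructions ({}%)", avg_saving / body * 100.0);
}
```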
The standard deviation drops by 24.6% and MAD by 20.1%, indicating lower run-to-run variability alongside the throughput gain.