
Division and Remainder Instruction Optimization on AArch64#506

Merged
mohanson merged 4 commits into nervosnetwork:develop from yuqiliu617:contrib/div
Apr 1, 2026

Conversation

@yuqiliu617 (Contributor)

Summary

This PR applies two independent optimizations to the AArch64 assembly handlers for the RISC-V division and remainder instructions.

Optimization 1: Branchless csel

Summary

Replaced conditional branches with ARM64 csel (Conditional SELect) in all 8 division/remainder instruction handlers (DIV, DIVU, DIVW, DIVUW, REM, REMU, REMW, REMUW). This eliminates branch misprediction penalties and improves instruction-level parallelism.

Before / After

Before (DIV, branched):

  cmp RS2, 0
  bne .div_branch2              ← branch taken on common path
  mov RS1, UINT64_MAX
  b .div_branch3                ← extra jump to rejoin
.div_branch2:
  sdiv RS1, RS1, RS2
.div_branch3:
  WRITE_RD(RS1)

After (DIV, branchless):

  sdiv TEMP1, RS1, RS2          ← always compute
  mov TEMP2, UINT64_MAX         ← always prepare fallback
  cmp RS2, 0
  csel RS1, TEMP2, TEMP1, eq    ← RS2==0 ? -1 : quotient
  WRITE_RD(RS1)

Remainder uses the same pattern: compute via sdiv + msub, then csel between the computed remainder and the original dividend (RISC-V spec: rem by zero returns the dividend).
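The remainder data flow can be sketched in Rust (an illustrative stand-in for the assembly, not project code; `sdiv_hw` and `msub` are hypothetical helpers modeling the AArch64 instructions):

```rust
// Rust model of the branchless REM handler (illustrative sketch).
// `sdiv_hw` mirrors AArch64 sdiv: divide-by-zero yields 0 (no trap),
// and INT64_MIN / -1 wraps to INT64_MIN.
fn sdiv_hw(a: i64, b: i64) -> i64 {
    if b == 0 { 0 } else { a.wrapping_div(b) }
}

// Mirrors AArch64 `msub Rd, Rn, Rm, Ra` => Ra - Rn * Rm.
fn msub(ra: i64, rn: i64, rm: i64) -> i64 {
    ra.wrapping_sub(rn.wrapping_mul(rm))
}

fn rem_branchless(rs1: i64, rs2: i64) -> i64 {
    let q = sdiv_hw(rs1, rs2);     // always compute the quotient
    let r = msub(rs1, q, rs2);     // remainder = rs1 - q * rs2
    if rs2 == 0 { rs1 } else { r } // csel: rem by zero returns the dividend
}

fn main() {
    assert_eq!(rem_branchless(7, 0), 7);         // rem by zero -> dividend
    assert_eq!(rem_branchless(i64::MIN, -1), 0); // overflow case -> 0
    assert_eq!(rem_branchless(-7, 3), -1);       // truncated-division remainder
}
```

Note that the overflow case needs no special handling: `q` wraps to INT64_MIN and the `msub` cancels it back to 0, which is the correct RISC-V REM result.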

Why This Is Safe

  • ARM64 sdiv/udiv with divisor=0 returns 0 without trapping, so speculatively executing the division is harmless.
  • ARM64 sdiv with INT64_MIN / -1 returns INT64_MIN, matching RISC-V spec. No fixup needed for the overflow case.
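The two edge cases above can be modeled in a few lines of Rust (an illustrative sketch with hypothetical names, not project code):

```rust
// Rust model of the csel-based DIV handler (illustrative sketch).
fn div_branchless(rs1: i64, rs2: i64) -> i64 {
    // AArch64 sdiv: divisor 0 yields 0 (no trap); INT64_MIN / -1 wraps.
    let q = if rs2 == 0 { 0 } else { rs1.wrapping_div(rs2) };
    // csel: divisor == 0 selects the -1 (UINT64_MAX) fallback, else the quotient.
    if rs2 == 0 { -1 } else { q }
}

fn main() {
    assert_eq!(div_branchless(7, 0), -1);               // div by zero -> all ones
    assert_eq!(div_branchless(i64::MIN, -1), i64::MIN); // overflow -> INT64_MIN, no fixup
    assert_eq!(div_branchless(-7, 2), -3);              // truncation toward zero
}
```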

Trade-offs

Gains:

  • Zero branches in division handlers
  • Straight-line execution with better ILP
  • No data-dependent control flow (though sdiv latency itself may still vary with operand values on some cores)

Costs:

  • Division always executes even when divisor is zero
  • Uses one extra temp register (TEMP2 for the fallback constant)

Optimization 2: 32-bit Division and Remainder

Summary

Replaced the sign/zero-extend-then-divide-64-bit pattern in DIVW, DIVUW, REMW and REMUW with direct 32-bit division instructions (e.g. sdiv Wd, Wn, Wm). This removes 2 instructions per handler by letting the 32-bit instruction forms handle operand masking natively.

Before / After

Before (DIVW):

  ldr RS1, REGISTER_ADDRESS(RS1)
  ldr RS2, REGISTER_ADDRESS(RS2)
  sxtw RS1, RS1w                ← sign-extend to make 64-bit sdiv correct
  sxtw RS2, RS2w                ← sign-extend to make 64-bit sdiv correct
  sdiv TEMP1, RS1, RS2          ← 64-bit signed division
  sxtw TEMP1, TEMP1w            ← sign-extend result
  ...

After (DIVW):

  ldr RS1, REGISTER_ADDRESS(RS1)
  ldr RS2, REGISTER_ADDRESS(RS2)
  sdiv TEMP1w, RS1w, RS2w       ← 32-bit signed division, upper bits ignored natively
  sxtw TEMP1, TEMP1w            ← sign-extend result
  ...

Why This Is Safe

  • ARM64 32-bit sdiv Wd, Wn, Wm reads only the low 32 bits of its source registers and writes a zero-extended 32-bit result to the destination, ignoring upper bits exactly as RISC-V requires.
  • ARM64 32-bit sdiv on INT32_MIN / -1 produces 0x80000000 without trapping. The subsequent sxtw sign-extends this to 0xFFFFFFFF80000000 = INT32_MIN, which is the correct RISC-V DIVW overflow result.
  • The cmp RS2w, 0 / csel pattern for divide-by-zero is unchanged.
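The full DIVW path, including both edge cases, can be modeled in Rust (an illustrative sketch, not project code):

```rust
// Rust model of the 32-bit DIVW path (illustrative sketch).
fn divw(rs1: u64, rs2: u64) -> u64 {
    let a = rs1 as i32; // like reading `Wn`: only the low 32 bits are used
    let b = rs2 as i32;
    // 32-bit sdiv: divide-by-zero yields 0; INT32_MIN / -1 wraps to INT32_MIN.
    let q = if b == 0 { 0 } else { a.wrapping_div(b) };
    let sel = if b == 0 { -1i32 } else { q }; // csel on the 32-bit compare
    sel as i64 as u64 // sxtw: sign-extend the 32-bit result to 64 bits
}

fn main() {
    // Overflow: 0x80000000 sign-extends to 0xFFFFFFFF80000000.
    assert_eq!(divw(i32::MIN as u32 as u64, -1i32 as u32 as u64),
               0xFFFF_FFFF_8000_0000);
    // Garbage in the upper 32 bits of either operand is ignored.
    assert_eq!(divw(0xDEAD_BEEF_0000_0007, 0xFFFF_FFFF_0000_0002), 3);
    // Divide by zero: -1 sign-extended to all ones.
    assert_eq!(divw(9, 0), u64::MAX);
}
```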

Benchmark

1M chained div and divw instructions (125K iterations × 8 unrolled). Chained dependency serializes divisions to measure handler latency rather than throughput.
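The chained-dependency pattern can be sketched as follows (a hypothetical harness, not the actual benchmark source): each division's dividend depends on the previous quotient, so the divider unit must execute the divisions serially.

```rust
// Sketch of a latency-bound division chain (hypothetical harness).
fn chained_divs(seed: u64, iters: usize) -> u64 {
    let mut x = seed;
    for _ in 0..iters {
        // The real microbench unrolls 8 chained divisions per iteration;
        // an inner loop stands in here. The add keeps x from collapsing to 0.
        for _ in 0..8 {
            x = x / 3 + 0x9E37_79B9; // next dividend depends on this quotient
        }
    }
    x
}

fn main() {
    assert_eq!(chained_divs(42, 0), 42); // zero iterations: seed unchanged
    assert!(chained_divs(42, 125_000) >= 0x9E37_79B9);
}
```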

Environment: Aliyun ecs.g8y.small, YiTian 710 (1 core), 4 GB RAM

div_microbench (measures Optimization 1: csel):

|         | Before                   | After                    | Change                |
|---------|--------------------------|--------------------------|-----------------------|
| Mean    | [3.544, 3.575, 3.613] ms | [3.722, 3.738, 3.758] ms | [+3.4%, +4.6%, +5.6%] |
| Median  | [3.485, 3.488, 3.491] ms | [3.689, 3.692, 3.694] ms | [+5.7%, +5.8%, +5.9%] |
| Std Dev | 0.557 ms                 | 0.285 ms                 | −48.9%                |
| MAD     | 38.3 µs                  | 34.3 µs                  | −10.3%                |

divw_microbench (measures Optimization 2: 32-bit division; before = post-csel, after = post-32-bit-div):

|         | Before                   | After                    | Change                |
|---------|--------------------------|--------------------------|-----------------------|
| Mean    | [3.835, 3.852, 3.874] ms | [3.586, 3.608, 3.634] ms | [−7.1%, −6.3%, −5.5%] |
| Median  | [3.801, 3.803, 3.806] ms | [3.547, 3.549, 3.552] ms | [−6.8%, −6.7%, −6.6%] |
| Std Dev | 0.314 ms                 | 0.394 ms                 | +25.6%                |
| MAD     | 35.5 µs                  | 34.5 µs                  | −2.8%                 |

Interpretation

Optimization 1: The 5–6% regression is expected: csel adds mov TEMP2, UINT64_MAX to the common (non-zero divisor) path, which the branched version reaches without that instruction. The standard deviation nearly halves (−48.9%) because csel removes the rare-but-expensive branch-misprediction tail. The benefit of csel materializes in production code where the divisor can be zero unpredictably.

Optimization 2: Median improves by 6.7% and mean by 6.3%. Replacing the two sxtw instructions with native 32-bit sdiv/udiv shortens the unconditional fast path, producing a clear speed improvement across all runs.

Utilize `csel` to eliminate branches
Use 32-bit registers for division directly instead of extending and dividing

Copilot AI left a comment


Pull request overview

This PR updates the AArch64 assembly implementation of RISC-V M-extension division and remainder instructions to reduce control-flow and instruction count in the handlers.

Changes:

  • Replaced divide-by-zero conditional branches with cmp + csel in DIV/DIVU/DIVW/DIVUW and REM/REMU/REMW/REMUW handlers.
  • Switched *W handlers to native 32-bit sdiv/udiv forms (and corresponding 32-bit msub) to avoid explicit operand extends.


Use `csinv` in the division instruction handlers to eliminate the redundant `mov TEMP2, UINT64_MAX`
@mohanson merged commit e456bb0 into nervosnetwork:develop on Apr 1, 2026
8 checks passed