In crypto/md5/md5block_arm64.s there are two similar rounds:
#define ROUND1(a, b, c, d, index, const, shift) \
	ADDW $const, a; \
	ADDW R8, a; \
	MOVW (index*4)(R1), R8; \
	EORW c, R9; \
	ANDW b, R9; \
	EORW d, R9; \
	ADDW R9, a; \
	RORW $(32-shift), a; \
	MOVW c, R9; \
	ADDW b, a
#define ROUND2(a, b, c, d, index, const, shift) \
	ADDW $const, a; \
	ADDW R8, a; \
	MOVW (index*4)(R1), R8; \
	ANDW b, R10; \
	BICW R9, c, R9; \
	ORRW R9, R10; \
	MOVW c, R9; \
	ADDW R10, a; \
	MOVW c, R10; \
	RORW $(32-shift), a; \
	ADDW b, a
The Go code for ROUND1 is: a = b + bits.RotateLeft32((((c^d)&b)^d)+a+x0+const, shift)
PS: ((c^d)&b)^d is equal to (b&c) | ((^b)&d).
The Go code for ROUND2 is: a = b + bits.RotateLeft32((((b^c)&d)^c)+a+x0+const, shift)
PS: ((b^c)&d)^c is equal to (b&d) | (c&(^d)).
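The two identities in the PS lines can be checked mechanically. Below is a minimal sketch, not code from crypto/md5: the helper names (roundOneMux, roundOneSpec, roundTwoMux, roundTwoSpec, identitiesHold) are made up for illustration. Because every operator involved is bitwise, each bit position is independent, so trying all 8 combinations of all-zeros and all-ones words is enough to cover every 32-bit input.

```go
package main

import "fmt"

// Hypothetical helper names, used only for this illustration.

// roundOneMux is the form used in the ROUND1 Go expression: ((c^d)&b)^d.
func roundOneMux(b, c, d uint32) uint32 { return ((c ^ d) & b) ^ d }

// roundOneSpec is the equivalent form: (b&c) | ((^b)&d).
func roundOneSpec(b, c, d uint32) uint32 { return (b & c) | (^b & d) }

// roundTwoMux is the form used in the ROUND2 Go expression: ((b^c)&d)^c.
func roundTwoMux(b, c, d uint32) uint32 { return ((b ^ c) & d) ^ c }

// roundTwoSpec is the equivalent form: (b&d) | (c&(^d)).
// Go's &^ operator is AND NOT.
func roundTwoSpec(b, c, d uint32) uint32 { return (b & d) | (c &^ d) }

// identitiesHold tries every combination of all-zeros / all-ones words.
// Bitwise operators act on each bit independently, so these 8 cases
// cover all possible 32-bit inputs.
func identitiesHold() bool {
	masks := [2]uint32{0x00000000, 0xFFFFFFFF}
	for _, b := range masks {
		for _, c := range masks {
			for _, d := range masks {
				if roundOneMux(b, c, d) != roundOneSpec(b, c, d) {
					return false
				}
				if roundTwoMux(b, c, d) != roundTwoSpec(b, c, d) {
					return false
				}
			}
		}
	}
	return true
}

func main() {
	fmt.Println("identities hold:", identitiesHold())
}
```

Read this way, the ROUND1 function selects c where b is set and d elsewhere, while the ROUND2 function selects b where d is set and c elsewhere.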
Why does ROUND1 use only one scratch register (R9) while ROUND2 uses two (R9 and R10)? Is each of these the fastest possible sequence?