Register naming in Capstone 5 has changed for ARM.

This isn't so much a bug report as a 'there's a change in behaviour... did you know?' report.

## The difference

I have an operating system which uses Capstone as its disassembly system (for reporting faults, etc). The output of the disassembly is used as expectations for the tests. This means that its test output (and, obviously, Capstone's output) must remain the same between runs to ensure that the expectations are met. They started failing once Capstone 5 was released, because the representation of registers has changed for ARM.

Specifically, register 13 on ARM, which used to be reported as `sp`, is now reported as `r13` when `CS_OPT_SYNTAX_NOREGNAME` is in force.

This isn't a problem for me per se. I would prefer to see `sp` as the name of the register, but we can accept `r13`, although it's not as nice. There isn't a way to rename registers from within the application, so I don't appear to be able to revert to the old behaviour. I can do a search and replace on the output, but that's a little more expensive.

To be clear about the problem, here is the behaviour of disassembling the instruction `LDR r1, [sp, #4]` with both capstone 4 and 5:

## Capstone 4

```
charles@laputa ~/projects/RO/pyromaniac (master)> pip install -U 'capstone<5'
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting capstone<5
Installing collected packages: capstone
  Found existing installation: capstone 5.0.0.post1
    Uninstalling capstone-5.0.0.post1:
      Successfully uninstalled capstone-5.0.0.post1
Successfully installed capstone-4.0.2
charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (4, 0, 1024)

0x1000:	ldr	r1, [sp, #4]
  op#0: type=1 (ARM_OP_REG)
        reg = 67 (R1)
  op#1: type=3 (ARM_OP_MEM)
        base = 12 (R13)
        index = 0 (Runknown)
        disp = 4
        lshift = 0 (Runknown)
```

## Capstone 5

```
charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (5, 0, 1280)

0x1000:	ldr	r1, [r13, #4]
  op#0: type=1 (ARM_OP_REG)
        reg = 67 (R1)
  op#1: type=3 (ARM_OP_MEM)
        base = 12 (R13)
        index = 0 (Runknown)
        disp = 4
        lshift = 0 (Runknown)
```

## Test program to generate the above output

This is my general disassembly tool for investigating the contents of the capstone output; it's a little wordy, but the important bit is the `md.syntax = CS_OPT_SYNTAX_NOREGNAME` and that the instruction being decoded is `b'\x04\x10\x9d\xe5',` (`LDR r1,[r13, #4]`).

```python
#!/usr/bin/env python

import sys

from capstone import *
import capstone.arm_const

reg_map = [
        capstone.arm_const.ARM_REG_R0,
        capstone.arm_const.ARM_REG_R1,
        capstone.arm_const.ARM_REG_R2,
        capstone.arm_const.ARM_REG_R3,
        capstone.arm_const.ARM_REG_R4,
        capstone.arm_const.ARM_REG_R5,
        capstone.arm_const.ARM_REG_R6,
        capstone.arm_const.ARM_REG_R7,
        capstone.arm_const.ARM_REG_R8,
        capstone.arm_const.ARM_REG_R9,
        capstone.arm_const.ARM_REG_R10,
        capstone.arm_const.ARM_REG_R11,
        capstone.arm_const.ARM_REG_R12,
        capstone.arm_const.ARM_REG_SP,
        capstone.arm_const.ARM_REG_LR,
        capstone.arm_const.ARM_REG_PC,
    ]
inv_reg_map = dict((regval, regnum) for regnum, regval in enumerate(reg_map))

shift_names = {
        capstone.arm_const.ARM_SFT_INVALID: None,
        capstone.arm_const.ARM_SFT_ASR: 'ASR',
        capstone.arm_const.ARM_SFT_ASR_REG: 'ASR',
        capstone.arm_const.ARM_SFT_LSL: 'LSL',
        capstone.arm_const.ARM_SFT_LSL_REG: 'LSL',
        capstone.arm_const.ARM_SFT_LSR: 'LSR',
        capstone.arm_const.ARM_SFT_LSR_REG: 'LSR',
        capstone.arm_const.ARM_SFT_ROR: 'ROR',
        capstone.arm_const.ARM_SFT_ROR_REG: 'ROR',
        capstone.arm_const.ARM_SFT_RRX: 'RRX',
        capstone.arm_const.ARM_SFT_RRX_REG: 'RRX'
    }

optype_names = dict((getattr(capstone.arm_const, optype), optype) for optype in dir(capstone.arm_const) if optype.startswith('ARM_OP_'))

md = Cs(CS_ARCH_ARM, CS_MODE_ARM)
md.detail = True
md.mnemonic_setup(capstone.arm_const.ARM_INS_SVC, "SWI")
# Turn off APCS register naming
md.syntax = capstone.CS_OPT_SYNTAX_NOREGNAME

last_i = None

def show_disasm(code):
    global last_i
    for i in md.disasm(code, 0x1000):
        last_i = i
        print("")
        print("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
        for index, operand in enumerate(i.operands):
            print("  op#%i: type=%i (%s)" % (index, operand.type, optype_names.get(operand.type, 'unknown')))
            if operand.type == capstone.arm_const.ARM_OP_IMM:
                print("        imm = %i" % (operand.imm,))
            if operand.type == capstone.arm_const.ARM_OP_REG:
                print("        reg = %i (R%s)" % (operand.reg, inv_reg_map[operand.reg]))
            if operand.type == capstone.arm_const.ARM_OP_MEM:
                print("        base = %i (R%s)" % (operand.mem.base, inv_reg_map.get(operand.mem.base, 'unknown')))
                print("        index = %i (R%s)" % (operand.mem.index, inv_reg_map.get(operand.mem.index, 'unknown')))
                print("        disp = %i" % (operand.mem.disp,))
                print("        lshift = %i (R%s)" % (operand.mem.lshift, inv_reg_map.get(operand.mem.lshift, 'unknown')))
            if operand.shift.type != capstone.arm_const.ARM_SFT_INVALID:
                if operand.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                                          capstone.arm_const.ARM_SFT_LSR,
                                          capstone.arm_const.ARM_SFT_ASR,
                                          capstone.arm_const.ARM_SFT_ROR):
                    sname = shift_names[operand.shift.type]
                    print("        shift = %s #%i" % (sname, operand.shift.value))
                elif operand.shift.type in (capstone.arm_const.ARM_SFT_LSL_REG,
                                            capstone.arm_const.ARM_SFT_LSR_REG,
                                            capstone.arm_const.ARM_SFT_ASR_REG,
                                            capstone.arm_const.ARM_SFT_ROR_REG):
                    sname = shift_names[operand.shift.type]
                    reg = inv_reg_map[operand.shift.value]
                    print("        shift = %s R%s" % (sname, reg))
                else:
                    print("        shift = type=%i value=%i" % (operand.shift.type, operand.shift.value))

def insn__repr__(self):
    # Show the instruction as a big-endian hex word; works on Python 2 and 3.
    word = ''.join('%02x' % b for b in reversed(bytearray(self.bytes)))
    return "<{}(word=0x{}, {} operands)>".format(self.__class__.__name__, word, len(self.operands))
capstone.CsInsn.__repr__ = insn__repr__

def armop__repr__(self):
    params = ['type={}'.format(optype_names.get(self.type, 'unknown'))]
    if self.type == capstone.arm_const.ARM_OP_IMM:
        params.append('imm={}'.format(self.imm))
    elif self.type == capstone.arm_const.ARM_OP_REG:
        params.append('reg={}'.format(inv_reg_map[self.reg]))
    elif self.type == capstone.arm_const.ARM_OP_MEM:
        params.append('basereg={}'.format(inv_reg_map.get(self.mem.base, 'unknown')))
        params.append('indexreg={}'.format(inv_reg_map.get(self.mem.index, 'unknown')))
        params.append('displacement={}'.format(self.mem.disp))
        params.append('lshift={}'.format(self.mem.lshift))
    if self.shift.type != capstone.arm_const.ARM_SFT_INVALID:
        if self.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                               capstone.arm_const.ARM_SFT_LSR,
                               capstone.arm_const.ARM_SFT_ASR,
                               capstone.arm_const.ARM_SFT_ROR):
            sname = shift_names[self.shift.type]
            params.append("shift={} #{}".format(sname, self.shift.value))
        else:
            params.append("shift=type{} #{}".format(self.shift.type, self.shift.value))
    return "<{}({})>".format(self.__class__.__name__, ', '.join(params))
capstone.arm.ArmOp.__repr__ = armop__repr__

print("cs_version() = %r" % (cs_version(),))

one_example = False
if len(sys.argv) == 2:
    try:
        one_example = int(sys.argv[1])
    except ValueError:
        sys.exit("Syntax: %s <example-number>" % (sys.argv[0],))

examples = [
        b'\x05\x00\x00\xef', # SWI 5
        b'\x20\x00\x50\xe3', # CMP r0, #&20
        b'\x40\x00\x9f\x05', # LDREQ   r0,[pc,#64]
        b'\x05\x00\x00\x2f', # SWI 5
        b'\x08\x00\x00\xeb', # BL pc+8*4
        b'\xba\x50\x8f\xb2', # ADDLT r5, pc, #186
        b'\x6C\x43\x9f\xE5', # LDR r4, [pc, #&36c]
        b'\x0b\xb0\x97\xe7', # LDR     r11, [r7, r11]
        b'\x04\x00\x5f\xe5', # LDRB r0, [pc, #4]
        b'\x03\x00\x92\xe8', # LDMIA   r2, {r0, r1}
        b'\x03\x00\x92\xd8', # LDMLEIA r2, {r0, r1}
        b'\x00\x18\xa0\xe1', # LSL r1, r0, #&10 => MOV r1, r0, LSL #16
        b'\x21\x18\xa0\xe1', # LSR r1, r1, #&10 => MOV r1, r1, LSR #16
        b'\x26\xc4\xb0\xe1', # LSRS r12, r6, #8 => MOVS r12, r6, LSR #8
        b'\x12\x13\xa0\xe1', # LSL r1, r2, r3   => MOV r1, r2, LSL r3
        b'\x52\x13\xa0\xe1', # ASR r1, r2, r3   => MOV r1, r2, ASR r3
        b'\x62\x10\xa0\xe1', # RRX r1, r2       => MOV r1, r2, RRX
        b'\x53\x30\xeb\xe7', # ?
        b'\x01\x0f\x81\xe2', # ADD r0, r1, #1, #30  => ADD r0, r1, #2
        b'\x1e\x10\x81\x11', # ORRNE r1, r1, r14, LSL r0
        b'\x06\x10\xe0\xe3', # MVN r1,#&6
        b'\x02\x10\x9f\xe7', # LDR r1,[pc,r2]
        b'\x04\x10\x9d\xe5', # LDR r1,[r13, #4]
    ]
if one_example is False:
    for code in examples:
        show_disasm(code)
else:
    code = examples[one_example]
    show_disasm(code)
```
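
As a sanity check on the test input, the instruction word can be decoded by hand to confirm that the base register field really is 13, whatever name the disassembler prints for it. This is a minimal sketch for this one encoding (ARM single data transfer with an immediate offset), not a general decoder, and `decode_ldr_imm` is a name made up for illustration.

```python
import struct

def decode_ldr_imm(code):
    """Pull Rd, Rn and the 12-bit offset out of a little-endian ARM word.

    Hypothetical helper: it assumes the word is an LDR/STR with an
    immediate offset and does not validate the other bits.
    """
    word, = struct.unpack('<I', code)
    rn = (word >> 16) & 0xF     # base register, bits 19:16
    rd = (word >> 12) & 0xF     # destination register, bits 15:12
    imm12 = word & 0xFFF        # unsigned offset, bits 11:0
    return rd, rn, imm12

rd, rn, imm12 = decode_ldr_imm(b'\x04\x10\x9d\xe5')
print("LDR r%d, [r%d, #%d]" % (rd, rn, imm12))
```

So the encoding itself only ever says "register 13"; the difference between the two Capstone versions is purely in which string is chosen for it.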

## Cause of the change

In v4, with `CS_OPT_SYNTAX_NOREGNAME` in force, the register name comes from the `getRegisterName2` function in `ARMGenAsmWriter.inc`. For register id 12 (the value of the base register in the output above), it returns the string `sp`:

https://github.com/capstone-engine/capstone/blob/v4/arch/ARM/ARMGenAsmWriter.inc#L8634C1-L8634C26

In v5, the name comes from the `getRegisterName_digit` function in `ARMGenRegisterName_digit.inc`. The register id is still 12, but the string for it is now `r13`:

https://github.com/capstone-engine/capstone/blob/v5/arch/ARM/ARMGenRegisterName_digit.inc#L77

Obviously these two files are automatically generated, and arguably the use of `r13` when you're not using the register naming schemes is more accurate. However, except for `APCS_U`, register 13 has always been the stack pointer - I believe under APCS_U the stack pointer was in `r12`, and unless you're using RISCiX you're not going to care about APCS_U. In all other cases, I believe r13 has the convention of being the stack pointer - and if you're interworking with Thumb, it must be a stack pointer.

## Expected behaviour

I expected the behaviour of the output to not change between versions, but it's not a strong expectation, as this is a major version update. It would have been nice if the change in register names had been included in the 5.0 change notes in https://github.com/capstone-engine/capstone/releases - just to be clear that it had updated.

What would be nice would be if it were possible to rename registers dynamically, but I suspect that's not going to be easy.

I intend to include a special case to rename `r13` to `sp` when disassembling, to retain the old behaviour, if capstone 5 is detected, although I'm not convinced myself that this is a good idea in the long term - that's my problem, not yours.
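
A minimal sketch of what that special case might look like, assuming it is applied to the operand string after disassembly. The name `restore_sp_name` is made up for illustration, and the version is passed in as a plain integer so that the helper itself has no dependency on Capstone.

```python
import re

# Match r13 only as a whole token, so immediates and labels are left alone.
_R13 = re.compile(r'\br13\b')

def restore_sp_name(op_str, cs_major):
    """Rewrite r13 as sp when the output came from Capstone 5 or later.

    Hypothetical helper: output from older versions is returned unchanged,
    since they already print sp.
    """
    if cs_major < 5:
        return op_str
    return _R13.sub('sp', op_str)

print(restore_sp_name('r1, [r13, #4]', 5))
```

In the test program above this would be called as `restore_sp_name(i.op_str, cs_version()[0])`, since `cs_version()` returns the major version as its first element. It is still a search and replace on every instruction, so it carries the extra cost mentioned earlier.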

I just wanted to highlight that there is a change in behaviour and that it was unexpected. It's not necessarily a bug unless you are guaranteeing the output format is unchanging between major releases.
