Register naming in Capstone 5 has changed for ARM. #2078

@gerph

This isn't so much a bug report as a 'there's a change in behaviour... did you know?' report.

The difference

I have an operating system which uses Capstone as its disassembly system (for reporting faults, etc). The output of the disassembly is used as expectations for the tests. This means that its test output (and, obviously, Capstone's output) must remain the same between runs to ensure that the expectations are met. They started failing once Capstone 5 was released, because the representation of registers has changed for ARM.

Specifically, register 13 in ARM, which was reported as sp, is now represented as r13 when CS_OPT_SYNTAX_NOREGNAME is in force.

This isn't a problem for me per se. I would prefer to see sp as the name of the register, but we can accept r13 even though it's not as nice. There isn't a way to rename registers from within the application, so I don't appear to be able to revert to the old behaviour. I can do a search and replace on the output, but that's a little more expensive.

To be clear about the problem, here is the behaviour of disassembling the instruction LDR r1, [sp, #4] with both capstone 4 and 5:

Capstone 4

charles@laputa ~/projects/RO/pyromaniac (master)> pip install -U 'capstone<5'
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting capstone<5
Installing collected packages: capstone
  Found existing installation: capstone 5.0.0.post1
    Uninstalling capstone-5.0.0.post1:
      Successfully uninstalled capstone-5.0.0.post1
Successfully installed capstone-4.0.2
charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (4, 0, 1024)

0x1000:	ldr	r1, [sp, #4]
  op#0: type=1 (ARM_OP_REG)
        reg = 67 (R1)
  op#1: type=3 (ARM_OP_MEM)
        base = 12 (R13)
        index = 0 (Runknown)
        disp = 4
        lshift = 0 (Runknown)

Capstone 5

charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (5, 0, 1280)

0x1000:	ldr	r1, [r13, #4]
  op#0: type=1 (ARM_OP_REG)
        reg = 67 (R1)
  op#1: type=3 (ARM_OP_MEM)
        base = 12 (R13)
        index = 0 (Runknown)
        disp = 4
        lshift = 0 (Runknown)

Test program to generate the above output

This is my general disassembly tool for investigating the contents of the capstone output; it's a little wordy, but the important bits are the md.syntax = CS_OPT_SYNTAX_NOREGNAME setting and that the instruction being decoded is b'\x04\x10\x9d\xe5' (LDR r1, [r13, #4]).

#!/usr/bin/env python

import sys

from capstone import *
import capstone.arm_const

reg_map = [
        capstone.arm_const.ARM_REG_R0,
        capstone.arm_const.ARM_REG_R1,
        capstone.arm_const.ARM_REG_R2,
        capstone.arm_const.ARM_REG_R3,
        capstone.arm_const.ARM_REG_R4,
        capstone.arm_const.ARM_REG_R5,
        capstone.arm_const.ARM_REG_R6,
        capstone.arm_const.ARM_REG_R7,
        capstone.arm_const.ARM_REG_R8,
        capstone.arm_const.ARM_REG_R9,
        capstone.arm_const.ARM_REG_R10,
        capstone.arm_const.ARM_REG_R11,
        capstone.arm_const.ARM_REG_R12,
        capstone.arm_const.ARM_REG_SP,
        capstone.arm_const.ARM_REG_LR,
        capstone.arm_const.ARM_REG_PC,
    ]
inv_reg_map = dict((regval, regnum) for regnum, regval in enumerate(reg_map))

shift_names = {
        capstone.arm_const.ARM_SFT_INVALID: None,
        capstone.arm_const.ARM_SFT_ASR: 'ASR',
        capstone.arm_const.ARM_SFT_ASR_REG: 'ASR',
        capstone.arm_const.ARM_SFT_LSL: 'LSL',
        capstone.arm_const.ARM_SFT_LSL_REG: 'LSL',
        capstone.arm_const.ARM_SFT_LSR: 'LSR',
        capstone.arm_const.ARM_SFT_LSR_REG: 'LSR',
        capstone.arm_const.ARM_SFT_ROR: 'ROR',
        capstone.arm_const.ARM_SFT_ROR_REG: 'ROR',
        capstone.arm_const.ARM_SFT_RRX: 'RRX',
        capstone.arm_const.ARM_SFT_RRX_REG: 'RRX'
    }

optype_names = dict((getattr(capstone.arm_const, optype), optype) for optype in dir(capstone.arm_const) if optype.startswith('ARM_OP_'))

md = Cs(CS_ARCH_ARM, CS_MODE_ARM)
md.detail = True
md.mnemonic_setup(capstone.arm_const.ARM_INS_SVC, "SWI")
# Turn off APCS register naming
md.syntax = capstone.CS_OPT_SYNTAX_NOREGNAME

last_i = None

def show_disasm(code):
    global last_i
    for i in md.disasm(code, 0x1000):
        last_i = i
        print("")
        print("0x%x:\t%s\t%s" %(i.address, i.mnemonic, i.op_str))
        for index, operand in enumerate(i.operands):
            print("  op#%i: type=%i (%s)" % (index, operand.type, optype_names.get(operand.type, 'unknown')))
            if operand.type == capstone.arm_const.ARM_OP_IMM:
                print("        imm = %i" % (operand.imm,))
            if operand.type == capstone.arm_const.ARM_OP_REG:
                print("        reg = %i (R%s)" % (operand.reg, inv_reg_map[operand.reg]))
            if operand.type == capstone.arm_const.ARM_OP_MEM:
                print("        base = %i (R%s)" % (operand.mem.base, inv_reg_map.get(operand.mem.base, 'unknown')))
                print("        index = %i (R%s)" % (operand.mem.index, inv_reg_map.get(operand.mem.index, 'unknown')))
                print("        disp = %i" % (operand.mem.disp,))
                print("        lshift = %i (R%s)" % (operand.mem.lshift, inv_reg_map.get(operand.mem.lshift, 'unknown')))
            if operand.shift.type != capstone.arm_const.ARM_SFT_INVALID:
                if operand.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                                          capstone.arm_const.ARM_SFT_LSR,
                                          capstone.arm_const.ARM_SFT_ASR,
                                          capstone.arm_const.ARM_SFT_ROR):
                    sname = shift_names[operand.shift.type]
                    print("        shift = %s #%i" % (sname, operand.shift.value))
                elif operand.shift.type in (capstone.arm_const.ARM_SFT_LSL_REG,
                                            capstone.arm_const.ARM_SFT_LSR_REG,
                                            capstone.arm_const.ARM_SFT_ASR_REG,
                                            capstone.arm_const.ARM_SFT_ROR_REG):
                    sname = shift_names[operand.shift.type]
                    reg = inv_reg_map[operand.shift.value]
                    print("        shift = %s R%s" % (sname, reg))
                else:
                    print("        shift = type=%i value=%i" % (operand.shift.type, operand.shift.value))

def insn__repr__(self):
    # Hex-format the bytes in reverse order; works on both Python 2 and 3
    word = ''.join('%02x' % b for b in reversed(bytearray(self.bytes)))
    return "<{}(word=0x{}, {} operands)>".format(self.__class__.__name__, word, len(self.operands))
capstone.CsInsn.__repr__ = insn__repr__

def armop__repr__(self):
    params = ['type={}'.format(optype_names.get(self.type, 'unknown'))]
    if self.type == capstone.arm_const.ARM_OP_IMM:
        params.append('imm={}'.format(self.imm))
    elif self.type == capstone.arm_const.ARM_OP_REG:
        params.append('reg={}'.format(inv_reg_map[self.reg]))
    elif self.type == capstone.arm_const.ARM_OP_MEM:
        params.append('basereg={}'.format(inv_reg_map.get(self.mem.base, 'unknown')))
        params.append('indexreg={}'.format(inv_reg_map.get(self.mem.index, 'unknown')))
        params.append('displacement={}'.format(self.mem.disp))
        params.append('lshift={}'.format(self.mem.lshift))
    if self.shift.type != capstone.arm_const.ARM_SFT_INVALID:
        if self.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                               capstone.arm_const.ARM_SFT_LSR,
                               capstone.arm_const.ARM_SFT_ASR,
                               capstone.arm_const.ARM_SFT_ROR):
            sname = shift_names[self.shift.type]
            params.append("shift={} #{}".format(sname, self.shift.value))
        else:
            params.append("shift=type{} #{}".format(self.shift.type, self.shift.value))
    return "<{}({})>".format(self.__class__.__name__, ', '.join(params))
capstone.arm.ArmOp.__repr__ = armop__repr__

print("cs_version() = %r" % (cs_version(),))

one_example = False
if len(sys.argv) == 2:
    try:
        one_example = int(sys.argv[1])
    except ValueError:
        sys.exit("Syntax: %s <example-number>" % (sys.argv[0],))

examples = [
        b'\x05\x00\x00\xef', # SWI 5
        b'\x20\x00\x50\xe3', # CMP r0, #&20
        b'\x40\x00\x9f\x05', # LDREQ   r0,[pc,#64]
        b'\x05\x00\x00\x2f', # SWI 5
        b'\x08\x00\x00\xeb', # BL pc+8*4
        b'\xba\x50\x8f\xb2', # ADDLT r5, pc, #186
        b'\x6C\x43\x9f\xE5', # LDR r4, [pc, #&36c]
        b'\x0b\xb0\x97\xe7', # LDR     r11, [r7, r11]
        b'\x04\x00\x5f\xe5', # LDRB r0, [pc, #4]
        b'\x03\x00\x92\xe8', # LDMIA   r2, {r0, r1}
        b'\x03\x00\x92\xd8', # LDMLEIA r2, {r0, r1}
        b'\x00\x18\xa0\xe1', # LSL r1, r0, #&10 => MOV r1, r0, LSL #16
        b'\x21\x18\xa0\xe1', # LSR r1, r1, #&10 => MOV r1, r1, LSR #16
        b'\x26\xc4\xb0\xe1', # LSRS r12, r6, #8 => MOVS r12, r6, LSR #8
        b'\x12\x13\xa0\xe1', # LSL r1, r2, r3   => MOV r1, r2, LSL r3
        b'\x52\x13\xa0\xe1', # ASR r1, r2, r3   => MOV r1, r2, ASR r3
        b'\x62\x10\xa0\xe1', # RRX r1, r2       => MOV r1, r2, RRX
        b'\x53\x30\xeb\xe7', # ?
        b'\x01\x0f\x81\xe2', # ADD r0, r1, #1, #30  => ADD r0, r1, #2
        b'\x1e\x10\x81\x11', # ORRNE r1, r1, r14, LSL r0
        b'\x06\x10\xe0\xe3', # MVN r1,#&6
        b'\x02\x10\x9f\xe7', # LDR r1,[pc,r2]
        b'\x04\x10\x9d\xe5', # LDR r1,[r13, #4]
    ]
if one_example is False:
    for code in examples:
        show_disasm(code)
else:
    code = examples[one_example]
    show_disasm(code)
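
As a cross-check that doesn't depend on Capstone at all, the last example word can be decoded by hand. This is only a sketch: it assumes the standard ARM single data transfer (LDR/STR with immediate offset) field layout, and the helper name decode_ldr_fields is hypothetical, not part of the script above.

```python
import struct

def decode_ldr_fields(code):
    """Split a little-endian ARM LDR/STR (immediate offset) word into fields.

    Hypothetical helper for illustration; it does not validate that the
    word really is a single data transfer instruction.
    """
    (word,) = struct.unpack('<I', code)
    imm12 = word & 0xFFF
    up = (word >> 23) & 1          # U bit: 1 = add offset, 0 = subtract
    return {
        'cond': (word >> 28) & 0xF,
        'load': (word >> 20) & 1,  # L bit: 1 = LDR, 0 = STR
        'rn': (word >> 16) & 0xF,  # base register number
        'rd': (word >> 12) & 0xF,  # destination register number
        'offset': imm12 if up else -imm12,
    }

fields = decode_ldr_fields(b'\x04\x10\x9d\xe5')
print("ldr r%(rd)d, [r%(rn)d, #%(offset)d]" % fields)
```

The base register field comes out as 13, so the encoding itself only says "register 13"; whether that prints as sp or r13 is purely a naming choice in the disassembler.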

Cause of the change

In v4, with CS_OPT_SYNTAX_NOREGNAME in force, the register name comes from the getRegisterName2 function in ARMGenAsmWriter.inc, which for register id 12 (the value of the base register in the output above) returns the string sp:

https://github.com/capstone-engine/capstone/blob/v4/arch/ARM/ARMGenAsmWriter.inc#L8634C1-L8634C26

In the v5 code, the name comes from getRegisterName_digit in ARMGenRegisterName_digit.inc, where the same register id 12 (again, the base register number above) maps to the string r13:

https://github.com/capstone-engine/capstone/blob/v5/arch/ARM/ARMGenRegisterName_digit.inc#L77

Obviously these two files are automatically generated, and arguably the use of r13 when you're not using the register naming schemes is more accurate. However, except for APCS_U, register 13 has always been the stack pointer - I believe under APCS_U the stack pointer was in r12, and unless you're using RISCiX you're not going to care about APCS_U. In all other cases, I believe r13 has the convention of being the stack pointer - and if you're interworking with Thumb, it must be a stack pointer.

Expected behaviour

I expected the behaviour of the output to not change between versions, but it's not a strong expectation, as this is a major version update. It would have been nice if the change in register names had been included in the 5.0 change notes in https://github.com/capstone-engine/capstone/releases - just to be clear that it had updated.

What would be nice would be if it were possible to rename registers dynamically, but I suspect that's not going to be easy.

I intend to include a special case to rename r13 to sp when disassembling, to retain the old behaviour, if capstone 5 is detected, although I'm not convinced myself that this is a good idea in the long term - that's my problem, not yours.
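
A minimal sketch of what that special case might look like, assuming the only difference to paper over is r13 versus sp in the operand string. The names needs_sp_rename and restore_sp are hypothetical, and the version tuple is assumed to have the shape that cs_version() printed above, (major, minor, combined).

```python
import re

# Match 'r13' only as a whole token, so something like 'r130' is left alone.
_R13 = re.compile(r'\br13\b')

def needs_sp_rename(version):
    """True when the major version is exactly 5 (assumption: only
    Capstone 5 is known to print r13 here; other versions are untouched)."""
    return version[0] == 5

def restore_sp(op_str, version):
    """Rewrite r13 to sp in an operand string when Capstone 5 is detected."""
    if needs_sp_rename(version):
        return _R13.sub('sp', op_str)
    return op_str

print(restore_sp('r1, [r13, #4]', (5, 0, 1280)))
print(restore_sp('r1, [sp, #4]', (4, 0, 1024)))
```

In the test program this would be applied to i.op_str with the result of cs_version(). It is still a search and replace on every instruction, so the extra cost mentioned earlier remains.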

I just wanted to highlight that there is a change in behaviour and that it was unexpected. It's not necessarily a bug unless you are guaranteeing the output format is unchanging between major releases.
