This isn't so much a bug report as a 'there's a change in behaviour... did you know?' report.
The difference
I have an operating system which uses Capstone as its disassembly system (for reporting faults, etc). The output of the disassembly is used as expectations for the tests. This means that its test output (and, obviously, Capstone's output) must remain the same between runs to ensure that the expectations are met. They started failing once Capstone 5 was released, because the representation of registers has changed for ARM.
Specifically, I'm seeing that register 13 in ARM, which was reported as sp, is now being represented as r13 (when CS_OPT_SYNTAX_NOREGNAME is in force).
This isn't a problem for me per se. I would prefer to see sp as the name of the register, but we can accept r13 although it's not as nice. There isn't a way to rename registers from within the application, so I do not appear to be able to revert the behaviour to what it was before. I can do a search and replace on the output, but that's a little more expensive.
To be clear about the problem, here is the behaviour of disassembling the instruction LDR r1, [sp, #4] with both capstone 4 and 5:
Capstone 4
charles@laputa ~/projects/RO/pyromaniac (master)> pip install -U 'capstone<5'
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.
Collecting capstone<5
Installing collected packages: capstone
Found existing installation: capstone 5.0.0.post1
Uninstalling capstone-5.0.0.post1:
Successfully uninstalled capstone-5.0.0.post1
Successfully installed capstone-4.0.2
charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (4, 0, 1024)
0x1000: ldr r1, [sp, #4]
op#0: type=1 (ARM_OP_REG)
reg = 67 (R1)
op#1: type=3 (ARM_OP_MEM)
base = 12 (R13)
index = 0 (Runknown)
disp = 4
lshift = 0 (Runknown)
Capstone 5
charles@laputa ~/projects/RO/pyromaniac (master)> ./diss.py -1
cs_version() = (5, 0, 1280)
0x1000: ldr r1, [r13, #4]
op#0: type=1 (ARM_OP_REG)
reg = 67 (R1)
op#1: type=3 (ARM_OP_MEM)
base = 12 (R13)
index = 0 (Runknown)
disp = 4
lshift = 0 (Runknown)
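One thing the two runs above do show is that the numeric operand values are identical (67 for r1, 12 for the base register); only the text in op_str differs. As a rough sketch, and not something I have adopted, the operand text could be rebuilt from those ids using the names I prefer. The PREFERRED_NAMES table and the render_mem_operand helper are hypothetical, and the ids are hard-coded from the output above rather than taken from capstone.arm_const.

```python
# Hypothetical table: register ids as printed in the output above
# (67 is r1, 12 is the stack pointer). A real tool would build this
# from capstone.arm_const instead of hard-coding the numbers.
PREFERRED_NAMES = {67: "r1", 12: "sp"}


def render_mem_operand(base_id, disp, names=PREFERRED_NAMES):
    """Format a simple [base, #disp] memory operand from numeric ids."""
    base = names.get(base_id, "reg%d" % base_id)
    if disp:
        return "[%s, #%d]" % (base, disp)
    return "[%s]" % base


print(render_mem_operand(12, 4))
```

This only covers the base-plus-displacement form used in the example; index registers and shifts would need the same treatment.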
Test program to generate the above output
This is my general disassembly tool for investigating the contents of the capstone output. It's a little wordy, but the important bits are the md.syntax = CS_OPT_SYNTAX_NOREGNAME setting and that the instruction being decoded is b'\x04\x10\x9d\xe5' (LDR r1,[r13, #4]).
#!/usr/bin/env python
import sys
from capstone import *
import capstone.arm_const
reg_map = [
    capstone.arm_const.ARM_REG_R0,
    capstone.arm_const.ARM_REG_R1,
    capstone.arm_const.ARM_REG_R2,
    capstone.arm_const.ARM_REG_R3,
    capstone.arm_const.ARM_REG_R4,
    capstone.arm_const.ARM_REG_R5,
    capstone.arm_const.ARM_REG_R6,
    capstone.arm_const.ARM_REG_R7,
    capstone.arm_const.ARM_REG_R8,
    capstone.arm_const.ARM_REG_R9,
    capstone.arm_const.ARM_REG_R10,
    capstone.arm_const.ARM_REG_R11,
    capstone.arm_const.ARM_REG_R12,
    capstone.arm_const.ARM_REG_SP,
    capstone.arm_const.ARM_REG_LR,
    capstone.arm_const.ARM_REG_PC,
]
inv_reg_map = dict((regval, regnum) for regnum, regval in enumerate(reg_map))
shift_names = {
    capstone.arm_const.ARM_SFT_INVALID: None,
    capstone.arm_const.ARM_SFT_ASR: 'ASR',
    capstone.arm_const.ARM_SFT_ASR_REG: 'ASR',
    capstone.arm_const.ARM_SFT_LSL: 'LSL',
    capstone.arm_const.ARM_SFT_LSL_REG: 'LSL',
    capstone.arm_const.ARM_SFT_LSR: 'LSR',
    capstone.arm_const.ARM_SFT_LSR_REG: 'LSR',
    capstone.arm_const.ARM_SFT_ROR: 'ROR',
    capstone.arm_const.ARM_SFT_ROR_REG: 'ROR',
    capstone.arm_const.ARM_SFT_RRX: 'RRX',
    capstone.arm_const.ARM_SFT_RRX_REG: 'RRX'
}
optype_names = dict((getattr(capstone.arm_const, optype), optype) for optype in dir(capstone.arm_const) if optype.startswith('ARM_OP_'))
md = Cs(CS_ARCH_ARM, CS_MODE_ARM)
md.detail = True
md.mnemonic_setup(capstone.arm_const.ARM_INS_SVC, "SWI")
# Turn off APCS register naming
md.syntax = capstone.CS_OPT_SYNTAX_NOREGNAME
last_i = None
def show_disasm(code):
    global last_i
    for i in md.disasm(code, 0x1000):
        last_i = i
        print("")
        print("0x%x:\t%s\t%s" % (i.address, i.mnemonic, i.op_str))
        for index, operand in enumerate(i.operands):
            print(" op#%i: type=%i (%s)" % (index, operand.type, optype_names.get(operand.type, 'unknown')))
            if operand.type == capstone.arm_const.ARM_OP_IMM:
                print(" imm = %i" % (operand.imm,))
            if operand.type == capstone.arm_const.ARM_OP_REG:
                print(" reg = %i (R%s)" % (operand.reg, inv_reg_map[operand.reg]))
            if operand.type == capstone.arm_const.ARM_OP_MEM:
                print(" base = %i (R%s)" % (operand.mem.base, inv_reg_map.get(operand.mem.base, 'unknown')))
                print(" index = %i (R%s)" % (operand.mem.index, inv_reg_map.get(operand.mem.index, 'unknown')))
                print(" disp = %i" % (operand.mem.disp,))
                print(" lshift = %i (R%s)" % (operand.mem.lshift, inv_reg_map.get(operand.mem.lshift, 'unknown')))
            if operand.shift.type != capstone.arm_const.ARM_SFT_INVALID:
                if operand.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                                          capstone.arm_const.ARM_SFT_LSR,
                                          capstone.arm_const.ARM_SFT_ASR,
                                          capstone.arm_const.ARM_SFT_ROR):
                    sname = shift_names[operand.shift.type]
                    print(" shift = %s #%i" % (sname, operand.shift.value))
                elif operand.shift.type in (capstone.arm_const.ARM_SFT_LSL_REG,
                                            capstone.arm_const.ARM_SFT_LSR_REG,
                                            capstone.arm_const.ARM_SFT_ASR_REG,
                                            capstone.arm_const.ARM_SFT_ROR_REG):
                    sname = shift_names[operand.shift.type]
                    reg = inv_reg_map[operand.shift.value]
                    print(" shift = %s R%s" % (sname, reg))
                else:
                    print(" shift = type=%i value=%i" % (operand.shift.type, operand.shift.value))
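def insn__repr__(self):
    # Show the instruction word most-significant byte first.
    # str.encode('hex') only exists on Python 2; this form works on Python 2 and 3.
    word = ''.join('%02x' % (b,) for b in reversed(bytearray(self.bytes)))
    return "<{}(word=0x{}, {} operands)>".format(self.__class__.__name__, word, len(self.operands))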
capstone.CsInsn.__repr__ = insn__repr__
def armop__repr__(self):
    params = ['type={}'.format(optype_names.get(self.type, 'unknown'))]
    if self.type == capstone.arm_const.ARM_OP_IMM:
        params.append('imm={}'.format(self.imm))
    elif self.type == capstone.arm_const.ARM_OP_REG:
        params.append('reg={}'.format(inv_reg_map[self.reg]))
    elif self.type == capstone.arm_const.ARM_OP_MEM:
        params.append('basereg={}'.format(inv_reg_map.get(self.mem.base, 'unknown')))
        params.append('indexreg={}'.format(inv_reg_map.get(self.mem.index, 'unknown')))
        params.append('displacement={}'.format(self.mem.disp))
        params.append('lshift={}'.format(self.mem.lshift))
    if self.shift.type != capstone.arm_const.ARM_SFT_INVALID:
        if self.shift.type in (capstone.arm_const.ARM_SFT_LSL,
                               capstone.arm_const.ARM_SFT_LSR,
                               capstone.arm_const.ARM_SFT_ASR,
                               capstone.arm_const.ARM_SFT_ROR):
            sname = shift_names[self.shift.type]
            params.append("shift={} #{}".format(sname, self.shift.value))
        else:
            params.append("shift=type{} #{}".format(self.shift.type, self.shift.value))
    return "<{}({})>".format(self.__class__.__name__, ', '.join(params))
capstone.arm.ArmOp.__repr__ = armop__repr__
print("cs_version() = %r" % (cs_version(),))
one_example = False
if len(sys.argv) == 2:
    try:
        one_example = int(sys.argv[1])
    except ValueError:
        sys.exit("Syntax: %s <example-number>" % (sys.argv[0],))
examples = [
b'\x05\x00\x00\xef', # SWI 5
b'\x20\x00\x50\xe3', # CMP r0, #&20
b'\x40\x00\x9f\x05', # LDREQ r0,[pc,#64]
b'\x05\x00\x00\x2f', # SWI 5
b'\x08\x00\x00\xeb', # BL pc+8*4
b'\xba\x50\x8f\xb2', # ADDLT r5, pc, #186
b'\x6C\x43\x9f\xE5', # LDR r4, [pc, #&36c]
b'\x0b\xb0\x97\xe7', # LDR r11, [r7, r11]
b'\x04\x00\x5f\xe5', # LDRB r0, [pc, #4]
b'\x03\x00\x92\xe8', # LDMIA r2, {r0, r1}
b'\x03\x00\x92\xd8', # LDMLEIA r2, {r0, r1}
b'\x00\x18\xa0\xe1', # LSL r1, r0, #&10 => MOV r1, r0, LSL #16
b'\x21\x18\xa0\xe1', # LSR r1, r1, #&10 => MOV r1, r1, LSR #16
b'\x26\xc4\xb0\xe1', # LSRS r12, r6, #8 => MOVS r12, r6, LSR #8
b'\x12\x13\xa0\xe1', # LSL r1, r2, r3 => MOV r1, r2, LSL r3
b'\x52\x13\xa0\xe1', # ASR r1, r2, r3 => MOV r1, r2, ASR r3
b'\x62\x10\xa0\xe1', # RRX r1, r2 => MOV r1, r2, RRX
b'\x53\x30\xeb\xe7', # ?
b'\x01\x0f\x81\xe2', # ADD r0, r1, #1, #30 => ADD r0, r1, #2
b'\x1e\x10\x81\x11', # ORRNE r1, r1, r14, LSL r0
b'\x06\x10\xe0\xe3', # MVN r1,#&6
b'\x02\x10\x9f\xe7', # LDR r1,[pc,r2]
b'\x04\x10\x9d\xe5', # LDR r1,[r13, #4]
]
if one_example is False:
    for code in examples:
        show_disasm(code)
else:
    code = examples[one_example]
    show_disasm(code)
Cause of the change
In v4, with CS_OPT_SYNTAX_NOREGNAME in force, the register name was looked up by the getRegisterName2 function in ARMGenAsmWriter.inc. For register id 12 (the value shown for the base register in the output above), it returns the string sp:
https://github.com/capstone-engine/capstone/blob/v4/arch/ARM/ARMGenAsmWriter.inc#L8634C1-L8634C26
In the v5 code, the name is looked up by getRegisterName_digit in ARMGenRegisterName_digit.inc. For the same register id 12, it returns the string r13:
https://github.com/capstone-engine/capstone/blob/v5/arch/ARM/ARMGenRegisterName_digit.inc#L77
Obviously these two files are automatically generated, and arguably the use of r13 when you're not using the register naming schemes is more accurate. However, except for APCS_U, register 13 has always been the stack pointer - I believe under APCS_U the stack pointer was in r12, and unless you're using RISCiX you're not going to care about APCS_U. In all other cases, I believe r13 has the convention of being the stack pointer - and if you're interworking with Thumb, it must be a stack pointer.
Expected behaviour
I expected the output not to change between versions, but it's not a strong expectation, as this is a major version update. It would have been nice if the change in register names had been included in the 5.0 change notes in https://github.com/capstone-engine/capstone/releases, so that it was clear the output had changed.
What would be nice would be if it were possible to rename registers dynamically, but I suspect that's not going to be easy.
I intend to include a special case to rename r13 to sp when disassembling, to retain the old behaviour, if capstone 5 is detected, although I'm not convinced myself that this is a good idea in the long term - that's my problem, not yours.
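Something along these lines is what I have in mind, as a minimal sketch only. The helper names needs_sp_rename and restore_sp are made up for illustration, and the version check assumes only that the first element of the cs_version() tuple is the major version, as in the (4, 0, 1024) and (5, 0, 1280) outputs above.

```python
import re

# Match r13 only as a whole register name, so that r1 or r3 are left alone.
_R13 = re.compile(r"\br13\b")


def needs_sp_rename(version):
    """version is the tuple from cs_version(), e.g. (5, 0, 1280)."""
    return version[0] >= 5


def restore_sp(op_str, version):
    """Rewrite r13 back to sp in an operand string, but only on capstone 5+."""
    if not needs_sp_rename(version):
        return op_str
    return _R13.sub("sp", op_str)


print(restore_sp("r1, [r13, #4]", (5, 0, 1280)))
print(restore_sp("r1, [sp, #4]", (4, 0, 1024)))
```

In the test program above this would be applied to i.op_str before printing. It is the search and replace I mentioned earlier, so it carries the same extra cost per instruction.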
I just wanted to highlight that there is a change in behaviour and that it was unexpected. It's not necessarily a bug unless you are guaranteeing the output format is unchanging between major releases.