Add cross-assembler and linker for FLUX bytecode #23
SuperInstance wants to merge 1 commit into main from
Conversation
- Full cross-assembler with 100+ opcodes, label resolution (`@Label` and `name:`), two-pass assembly
- Macro preprocessor: `#define`, `#ifdef`/`#ifndef`/`#else`/`#endif`, `.set`, `.include`
- Multiple output formats: binary, hex, JSON, Intel HEX, Python list
- Linker with object file serialization, symbol resolution, relocation table
- BinaryPatcher for post-assembly binary patching
- ELF-like header generation
- Branch aliases: BEQ/JE, BNE/JNE, BLT/JL, BGE/JGE, BGT/JG, BLE/JLE
- Arithmetic aliases: ADD, SUB, MUL, DIV, MOD, NEG, NOT, AND, OR, XOR, SHL, SHR
- `#` comment support (non-preprocessor lines)
- Data directives: `.byte`, `.word`, `.dword`, `.ascii`, `.asciz`, `.fill`, `.align`, `.org`
- SIMD vector ops, A2A protocol ops, trust/capability ops, float ops
- 96 tests covering all features
```python
# Build record: :LLAAAATT[DD...]CC
# LL = byte count, AAAA = address, TT = record type (00=data)
checksum = chunk_size + (addr >> 8) & 0xFF + addr & 0xFF + 0x00
```
🔴 Intel HEX checksum calculation incorrect due to operator precedence
The checksum calculation at src/flux/asm/cross_assembler.py:582 uses `chunk_size + (addr >> 8) & 0xFF + addr & 0xFF + 0x00`. In Python, `+` has higher precedence than `&`, so this evaluates as `((chunk_size + (addr >> 8)) & (0xFF + addr)) & (0xFF + 0x00)` instead of the intended `chunk_size + ((addr >> 8) & 0xFF) + (addr & 0xFF)`. Verified: for addr=16, chunk_size=2, the buggy expression yields 2 while the correct value is 18. This produces corrupt Intel HEX output for any data larger than 16 bytes.
Suggested change:

```diff
- checksum = chunk_size + (addr >> 8) & 0xFF + addr & 0xFF + 0x00
+ checksum = chunk_size + ((addr >> 8) & 0xFF) + (addr & 0xFF) + 0x00
```
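The precedence difference is easy to reproduce in isolation (a standalone sketch using the values from the example above, not code from the PR):

```python
# '+' binds more tightly than '&' in Python, so the unparenthesized form
# collapses to ((chunk_size + (addr >> 8)) & (0xFF + addr)) & (0xFF + 0x00).
addr, chunk_size = 16, 2
buggy = chunk_size + (addr >> 8) & 0xFF + addr & 0xFF + 0x00
fixed = chunk_size + ((addr >> 8) & 0xFF) + (addr & 0xFF) + 0x00
print(buggy, fixed)  # 2 18
```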
```python
                 shdr_offset: int, timestamp: float) -> bytes:
    """Build the ELF64-like file header."""
    ident = bytearray(16)
    ident[0:4] = FLUX_MAGIC
```
🔴 FLUX_MAGIC is 5 bytes but assigned to 4-byte slice, causing 65-byte header
FLUX_MAGIC = b"\x7fFLUX" is 5 bytes, but ident[0:4] = FLUX_MAGIC at src/flux/asm/elf_header.py:243 assigns it into a 4-byte slice. Python's bytearray slice assignment resizes the array, making ident 17 bytes instead of 16. This cascades: header[0:16] = ident makes header 65 bytes instead of 64. The rest of generate() assumes a 64-byte header for offset calculations (phdr_offset = header_size = 64), so all program headers and section data are misaligned by 1 byte in the output binary.
Suggested change:

```diff
- ident[0:4] = FLUX_MAGIC
+ ident[0:5] = FLUX_MAGIC
```
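The resizing behavior can be demonstrated in isolation (only `FLUX_MAGIC` is taken from the review; the rest is a sketch):

```python
FLUX_MAGIC = b"\x7fFLUX"  # 5 bytes

ident = bytearray(16)
ident[0:4] = FLUX_MAGIC   # slice shorter than the value: the bytearray grows
print(len(ident))         # 17

ident = bytearray(16)
ident[0:5] = FLUX_MAGIC   # slice length matches the value: size preserved
print(len(ident))         # 16
```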
```python
elif line.startswith(".set"):
    self._handle_set(line, loc)
elif line.startswith(".include"):
    self._handle_include(line, loc, filename)
```
🔴 .include directive discards preprocessed content — included files have no effect
_handle_include (line 231) returns the preprocessed included file content as a str, but _handle_directive (line 128) calls it without capturing the return value: self._handle_include(line, loc, filename). Since _handle_directive returns None and the caller in preprocess() does continue after calling it, the included file's assembly lines are completely lost. Only side effects on self.macros persist — no code from the included file is emitted.
Prompt for agents
The _handle_include method returns the preprocessed included file content as a string, but _handle_directive discards the return value at line 128. The output_lines list in the preprocess() method never receives the included content.
To fix this, the architecture needs rethinking. One approach: instead of returning the content, _handle_include should directly append to output_lines. But output_lines is local to preprocess(). Options:
1. Make output_lines an instance variable so _handle_include's recursive preprocess() call can contribute to it.
2. Have _handle_directive return the included content, and have preprocess() check and append it to output_lines.
3. Change _handle_include to not recursively call preprocess() but instead inline the include content into the current preprocess()'s line list.
The simplest fix might be option 2: have _handle_directive return Optional[str], and in preprocess(), capture the return and split/extend output_lines when non-None.
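A minimal sketch of option 2 with stubbed bodies; the names (`preprocess`, `_handle_directive`, `_handle_include`, `output_lines`) follow the review, while everything else — including the stubbed include content — is assumed for illustration:

```python
from typing import Optional

class Preprocessor:
    """Sketch only: _handle_directive now returns Optional[str],
    and preprocess() splices any returned include content into output_lines."""

    def _handle_include(self, line: str) -> str:
        # Stub: a real implementation would read the named file and
        # recursively preprocess it, returning the resulting text.
        return "NOP\nNOP"

    def _handle_directive(self, line: str) -> Optional[str]:
        if line.startswith(".include"):
            return self._handle_include(line)  # propagate included content
        # Other directives (#define, .set, ...) only mutate state.
        return None

    def preprocess(self, source: str) -> list[str]:
        output_lines: list[str] = []
        for line in source.splitlines():
            stripped = line.strip()
            if stripped.startswith((".set ", ".include ", "#")):
                included = self._handle_directive(stripped)
                if included is not None:
                    output_lines.extend(included.splitlines())
                continue
            output_lines.append(line)
        return output_lines
```

With this shape, `.include "foo.asm"` followed by `HALT` yields the included lines before `HALT` instead of silently dropping them.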
```python
if stripped.startswith("#") or stripped.startswith(".set ") or stripped.startswith(".include "):
    self._handle_directive(stripped, loc, filename)
    continue
```
🔴 #define, #undef, .set, .include processed inside inactive #ifdef blocks
In preprocess() (lines 85-87), all preprocessor directives are dispatched to _handle_directive without checking self.is_active. The is_active guard at line 90 only protects non-directive lines. This means #define, #undef, .set, and .include inside inactive #ifdef blocks are still executed. Verified: #ifdef UNDEFINED\n#define LEAKED 42\n#endif results in LEAKED being defined even though the conditional block is inactive.
Prompt for agents
In _handle_directive (macros.py:111-129), the conditional directives (#ifdef, #ifndef, #else, #endif) must always be processed to maintain the conditional stack, but #define, #undef, .set, and .include should only be processed when self.is_active is True.
The fix should add a guard in _handle_directive after the conditional directives are handled. For example, after the #else elif block and before the #undef elif block, add:

```python
elif not self.is_active:
    return  # Skip non-conditional directives in inactive blocks
```

This ensures the conditional stack is always maintained correctly while preventing side effects from #define/#undef/.set/.include inside inactive blocks.
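A condensed sketch of the guard in context, assuming a conditional stack where `is_active` means every enclosing conditional is true; names follow the review, but the real class handles many more directives:

```python
class MacroProcessor:
    """Sketch: conditional directives always maintain the stack;
    everything else is skipped while inside an inactive block."""

    def __init__(self):
        self.macros = {}
        self.cond_stack = []  # one bool per open #ifdef/#ifndef

    @property
    def is_active(self):
        return all(self.cond_stack)

    def _handle_directive(self, line):
        if line.startswith("#ifdef"):
            self.cond_stack.append(line.split()[1] in self.macros)
        elif line.startswith("#endif"):
            self.cond_stack.pop()
        elif not self.is_active:
            return  # skip non-conditional directives in inactive blocks
        elif line.startswith("#define"):
            parts = line.split(maxsplit=2)
            self.macros[parts[1]] = parts[2] if len(parts) > 2 else ""

mp = MacroProcessor()
for line in ["#ifdef UNDEFINED", "#define LEAKED 42", "#endif"]:
    mp._handle_directive(line)
print("LEAKED" in mp.macros)  # False
```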
```python
elif directive == ".ascii":
    match = re.search(r'"([^"]*)"', line)
    return len(match.group(1)) if match else 0
```
🔴 .ascii/.asciz size estimate doesn't account for escape sequences, corrupting label addresses
_estimate_directive_size (line 322-324) returns the raw string length from the regex match for .ascii/.asciz directives, but _emit_directive processes escape sequences via _unescape_string, which converts two-character sequences like \t and \n into single bytes. This causes Pass 1 label addresses to be larger than the actual emitted byte count in Pass 2. Verified: .ascii "hello\tworld\n" estimates 14 bytes but emits 12, causing subsequent labels to have incorrect offsets (e.g., end label recorded at offset 14 but HALT actually emitted at offset 12).
Suggested change:

```diff
 elif directive == ".ascii":
     match = re.search(r'"([^"]*)"', line)
-    return len(match.group(1)) if match else 0
+    return len(self._unescape_string(match.group(1)).encode("utf-8")) if match else 0
```
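The discrepancy between the raw match length and the emitted byte count can be reproduced standalone; `unescape` here is a stand-in using `unicode_escape`, not the PR's actual `_unescape_string` helper:

```python
import re

def unescape(s: str) -> str:
    # Stand-in for the PR's _unescape_string; adequate for \t, \n, etc.
    return s.encode("utf-8").decode("unicode_escape")

line = r'.ascii "hello\tworld\n"'
text = re.search(r'"([^"]*)"', line).group(1)
estimated = len(text)                          # Pass 1 (buggy): '\' and 't' counted separately
emitted = len(unescape(text).encode("utf-8"))  # Pass 2: actual bytes written
print(estimated, emitted)  # 14 12
```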
```python
struct.pack_into("<H", header, 56, n_phdrs)
struct.pack_into("<H", header, 58, ELF64_SECTION_HEADER_SIZE)
struct.pack_into("<H", header, 60, n_sections)
struct.pack_into("<H", header, 62, 4)  # shndx of .shstrtab (index 4 in our layout)
```
🟡 shstrtab section index hardcoded as 4 but is actually at index 3
At src/flux/asm/elf_header.py:265, the section header string table index (e_shstrndx) is hardcoded as 4, but all_sections at line 151 is [null(0), code(1), data(2), strtab(3), symtab(4), ...]. The .shstrtab section is at index 3, not 4. This causes ELF loaders/readers to look at the .symtab section instead of .shstrtab for section name resolution.
Suggested change:

```diff
- struct.pack_into("<H", header, 62, 4)  # shndx of .shstrtab (index 4 in our layout)
+ struct.pack_into("<H", header, 62, 3)  # shndx of .shstrtab (index 3 in our layout)
```
| """Build a string table for section names.""" | ||
| table = bytearray(b'\x00') # Start with null byte | ||
| for name in names: | ||
| table.extend(name.encode("utf-8")) | ||
| table.append(0x00) | ||
| return bytes(table) |
🟡 String table has extra leading null byte causing all section name indices to be off by 1
_build_string_table (line 312) starts with a \x00 byte, then iterates through names appending each name + null. Since names starts with "" (the null section), this produces two consecutive null bytes at the start. The _build_section_header name index calculation (line 290-295) iterates through all_names accumulating len(name) + 1 without accounting for the extra initial null byte. Verified: .flux.code is at actual table offset 2, but the computed name_idx is 1 (pointing to an empty string).
| """Build a string table for section names.""" | |
| table = bytearray(b'\x00') # Start with null byte | |
| for name in names: | |
| table.extend(name.encode("utf-8")) | |
| table.append(0x00) | |
| return bytes(table) | |
| def _build_string_table(self, names: list[str]) -> bytes: | |
| """Build a string table for section names.""" | |
| table = bytearray() | |
| for name in names: | |
| table.extend(name.encode("utf-8")) | |
| table.append(0x00) | |
| return bytes(table) |
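With the corrected builder, the accumulated `len(name) + 1` name indices line up with the table, assuming the name list starts with `""` for the null section (a standalone sketch, not the PR's code):

```python
def build_string_table(names):
    # Corrected: no extra leading null; names[0] == "" already contributes one.
    table = bytearray()
    for name in names:
        table.extend(name.encode("utf-8"))
        table.append(0x00)
    return bytes(table)

names = ["", ".flux.code", ".flux.data"]
table = build_string_table(names)

offset = 0
for name in names:
    # name_idx scheme from the review: accumulate len(name) + 1
    end = table.index(0x00, offset)
    assert table[offset:end].decode("utf-8") == name
    offset += len(name) + 1
print(table)  # b'\x00.flux.code\x00.flux.data\x00'
```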
Cross-assembler with 100+ opcodes, label resolution (@Label and name: syntax), macros (#define, #ifdef, .include), multiple output formats (binary, hex, JSON, Intel HEX, Python list), linker with symbol resolution and relocation, binary patcher, and ELF header generation.
Features
- `name:` and `@label` syntax with forward references
- `#define`, `#ifdef`/`#ifndef`/`#else`/`#endif`, `#undef`, `.set`, `.include`
- `;`, `//`, and `#` (non-preprocessor) comment support
- `.byte`, `.word`, `.dword`, `.ascii`, `.asciz`, `.fill`, `.align`, `.org`

Tests

96 tests covering all features — errors, macros, assembler, patcher, linker, ELF headers, and integration.