Skip to content

SuperInstance/flux-conformance

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

26 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

FLUX Conformance Test Suite

Tests Passing Cross-Runtime Pass Opcodes Languages ISA v3 Version License Paper

108 of 113 cross-runtime tests pass for a formally specified, Turing-complete bytecode ISA across Python, Rust, C, Go, and TypeScript/WASM β€” with all 5 failures traced to a single specification ambiguity in the confidence subsystem. Seven of ten functional categories achieve a perfect 100% pass rate.

A comprehensive conformance test suite for the FLUX bytecode virtual machine. Verifies that all FLUX runtimes produce identical, deterministic results for the same bytecode programs β€” byte-for-byte, flag-for-flag.


Table of Contents


What is FLUX?

FLUX is a stack-based bytecode virtual machine designed for the SuperInstance agent network. Its ISA v2 specification defines 247 opcode slots across 7 encoding formats (A--G), of which 41 opcodes are currently implemented across 11 functional categories.

The 17-Opcode Turing Core

A formally verified subset of 17 opcodes forms an irreducible Turing-complete core:

# Opcode Category Role
1 HALT System Termination
2 NOP System Padding / no-op
3 RET Control Return from subroutine
4 PUSH Stack Push immediate value
5 POP Stack Discard top value
6 ADD Arithmetic Integer addition
7 SUB Arithmetic Integer subtraction
8 MUL Arithmetic Integer multiplication
9 DIV Arithmetic Integer division
10 LOAD Memory Load from address
11 STORE Memory Store to address
12 JZ Control Jump if zero flag
13 JNZ Control Jump if not zero
14 JMP Control Unconditional jump
15 CALL Control Subroutine call
16 INC Arithmetic Increment by 1
17 DEC Arithmetic Decrement by 1

Turing completeness is established by reduction to a Minsky machine.


Publications

"Cross-Runtime ISA Conformance: Formal Verification of a 17-Opcode Turing-Complete Instruction Set Across Four Programming Languages" β€” SuperInstance Fleet Datum Research Group

πŸ“„ Full preprint: PAPER.md

A formally specified, Turing-complete bytecode ISA achieves 95.6% cross-runtime conformance across five programming languages, with all failures traced to a single specification ambiguity in the confidence subsystem.


Conformance Results

Reference VM (Python) β€” 108/113 (95.6%)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    FLUX CONFORMANCE RESULTS                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                      β”‚
β”‚  Total Vectors:   113  (ISA v2)          Pass: 108  (95.6%)         β”‚
β”‚  Total Vectors:    62  (ISA v3)          Pass: ~42  (67.7%)         β”‚
β”‚                                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚  β”‚ Category           β”‚ Opcodes β”‚ Pass % β”‚ Status   β”‚                 β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€                 β”‚
β”‚  β”‚ Integer Arithmetic β”‚    8    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Comparison         β”‚    6    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Logic / Bitwise    β”‚    6    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Memory             β”‚    4    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Stack Manipulation β”‚    4    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Float Operations   β”‚    4    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Agent-to-Agent     β”‚    3    β”‚ 100.0% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚                 β”‚
β”‚  β”‚ Control Flow       β”‚    7    β”‚  96.3% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘ β”‚                 β”‚
β”‚  β”‚ System Control     β”‚    3    β”‚  95.8% β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘ β”‚                 β”‚
β”‚  β”‚ Confidence         β”‚    3    β”‚  28.6% β”‚ β–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘ β”‚ ← Spec ambiguityβ”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                                                                      β”‚
β”‚  5 FAILURES β€” All in confidence subsystem (CONF_GET/SET/MUL)         β”‚
β”‚  Root cause: float vs integer-scaled representation ambiguity         β”‚
β”‚                                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Cross-Runtime Predictions (161 vectors)

Runtime Language Opcodes Pass Rate Status
Python Python 3.10+ 41/41 96.9% Golden reference
WASM TypeScript 27/41 ~59.0% In progress
Rust Rust 18/41 ~40.4% In progress
C C11 13/41 ~28.0% Planned
Go Go 1.21+ 8/41 ~18.6% Planned

Universal Opcodes (pass on ALL runtimes)

Only 8 opcodes are universally portable across all 5 runtimes:

{ HALT, NOP, ADD, SUB, EQ, JMP, PUSH, POP }

Architecture Overview

The conformance test suite follows a layered architecture where the reference VM serves as the golden standard, test cases define expected behavior, and a pluggable runner infrastructure enables testing against any FLUX runtime.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     FLUX CONFORMANCE SUITE                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  Test Vector  │───>β”‚   Conformance Test   │───>β”‚   Test        β”‚  β”‚
β”‚  β”‚  JSON Files   β”‚    β”‚   Runner (Python)    β”‚    β”‚   Reports     β”‚  β”‚
β”‚  β”‚  (113 v2 +    β”‚    β”‚   run_conformance.py β”‚    β”‚   (JSON/MD/   β”‚  β”‚
β”‚  β”‚   62 v3)      β”‚    β”‚                      β”‚    β”‚    terminal)  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                β”‚                                     β”‚
β”‚                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚                     β”‚  FluxVM (Reference)  β”‚                         β”‚
β”‚                     β”‚  conformance_core.py β”‚                         β”‚
β”‚                     β”‚  - Stack machine     β”‚                         β”‚
β”‚                     β”‚  - Flags register    β”‚                         β”‚
β”‚                     β”‚  - 64KB memory       β”‚                         β”‚
β”‚                     β”‚  - 41 base opcodes   β”‚                         β”‚
β”‚                     β”‚  - ISA v3 extensions β”‚                         β”‚
β”‚                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β”‚                                β”‚                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚  β”‚ Python   β”‚  β”‚  Rust    β”‚  β”‚    C     β”‚  β”‚   Go     β”‚  ...      β”‚
β”‚  β”‚ Runtime  β”‚  β”‚ Runtime  β”‚  β”‚ Runtime  β”‚  β”‚ Runtime  β”‚           β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚                                                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Extension Points: SubprocessRuntime, FluxRuntime base class        β”‚
β”‚  Output Formats:   Terminal, JSON, Markdown                         β”‚
β”‚  CI Integration:   Non-zero exit on failure, JSON reports           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Module Layout

conformance_core.py    # Opcode defs, reference VM, test case library
  β”œβ”€ FluxVM            # Reference VM implementation (v2 base)
  β”œβ”€ FluxFlags         # Flags register (Z, S, C, O)
  β”œβ”€ ConformanceTestCase  # Single test case definition
  └─ ConformanceTestSuite # Test runner & reporter

test_conformance.py    # Pytest tests (parametrized + categorized) β€” 113 vectors
test_conformance_v3.py # ISA v3 extension tests β€” 62 vectors
conformance-vectors.json # Exported test vectors (v2 only, JSON format)
conformance-vectors-v3.json # ISA v3 test vectors (JSON format)
canonical_opcode_shim.py # Cross-runtime bytecode translation layer
flux_universal_validator.py # Bytecode structural validator
benchmark_flux.py      # Performance benchmarking harness (PERF-001)
run_conformance.py     # Unified cross-runtime conformance runner (CONF-001)
run_v3_conformance.py  # ISA v3-specific conformance runner
pyproject.toml         # Project configuration

Conformance Testing Pipeline

flowchart TD
    A[Test Vector JSON] --> B[ConformanceTestSuite]
    B --> C{Runtime Type?}
    C -->|Python In-Process| D[FluxVM Reference]
    C -->|External Runtime| E[SubprocessRuntime]
    E --> F[JSON over stdin/stdout]
    F --> G[Rust / C / Go / WASM]
    G --> H[JSON Response]
    D --> I[Stack + Flags Comparison]
    H --> I
    I --> J{Passed?}
    J -->|Yes| K[βœ… PASS]
    J -->|No| L[❌ FAIL β€” Error Report]
    K --> M[Terminal / JSON / Markdown]
    L --> M
Loading

Opcode Reference

Opcode Categories Covered

# Category Opcodes Test Count
1 System Control HALT, NOP, BREAK 5
2 Integer Arith ADD, SUB, MUL, DIV, MOD, NEG, INC, DEC 27
3 Comparison EQ, NE, LT, LE, GT, GE 12
4 Logic / Bit AND, OR, XOR, NOT, SHL, SHR 16
5 Memory LOAD, STORE, PEEK, POKE 6
6 Control Flow JMP, JZ, JNZ, CALL, RET, PUSH, POP 12
7 Stack Manip DUP, SWAP, OVER, ROT 6
8 Float Ops FADD, FSUB, FMUL, FDIV 8
9 Confidence CONF_GET, CONF_SET, CONF_MUL 7
10 Agent-to-Agent SIGNAL, BROADCAST, LISTEN 6
11 Complex / Mixed Fibonacci, factorial, absolute value, loops 8

Full Opcode Map (41 Implemented)

Opcode Hex Category Opcode Hex Category
HALT 0x00 Control AND 0x30 Logic
NOP 0x01 Control OR 0x31 Logic
BREAK 0x02 Control XOR 0x32 Logic
ADD 0x10 Arith NOT 0x33 Logic
SUB 0x11 Arith SHL 0x34 Logic
MUL 0x12 Arith SHR 0x35 Logic
DIV 0x13 Arith LOAD 0x40 Memory
MOD 0x14 Arith STORE 0x41 Memory
NEG 0x15 Arith PEEK 0x43 Memory
INC 0x16 Arith POKE 0x44 Memory
DEC 0x17 Arith JMP 0x50 Control
EQ 0x20 Compare JZ 0x51 Control
NE 0x21 Compare JNZ 0x52 Control
LT 0x22 Compare CALL 0x53 Control
LE 0x23 Compare RET 0x54 Control
GT 0x24 Compare PUSH 0x55 Stack
GE 0x25 Compare POP 0x56 Stack
DUP 0x60 Stack FADD 0x70 Float
SWAP 0x61 Stack FSUB 0x71 Float
OVER 0x62 Stack FMUL 0x72 Float
ROT 0x63 Stack FDIV 0x73 Float
CONF_GET 0x80 Conf SIGNAL 0x90 A2A
CONF_SET 0x81 Conf BROADCAST 0x91 A2A
CONF_MUL 0x82 Conf LISTEN 0x92 A2A

Quick Start

Get up and running in three commands:

git clone https://github.com/SuperInstance/flux-conformance.git
cd flux-conformance && pip install -e ".[dev]"
python run_conformance.py

Expected output:

FLUX Conformance Runner v2.0.0
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Runtime: Python (reference)
Vectors: 113

Results: 108 PASS / 5 FAIL (95.6%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FAIL: conf_get_default, conf_set_clamp, conf_set_negative,
      conf_mul_clamp_high, conf_mul_clamp_low
Root cause: confidence subsystem spec ambiguity (CONF-002)

Prerequisites

  • Python 3.10 or later
  • pytest 7.0+ (for running the test suite)

Installation

# Clone the repository
git clone https://github.com/SuperInstance/flux-conformance.git
cd flux-conformance

# Install with pip (editable mode)
pip install -e ".[dev]"

Key Results at a Glance

95.6% pass rate on Python reference VM (108/113 vectors) 7 universally portable opcodes across 5 runtimes (Python, Rust, C, Go, TypeScript/WASM) 161 conformance vectors covering 41 opcodes across 11 functional categories 4-tier portability classification (P0–P3) quantifying cross-runtime agreement


Running Tests

# Run all v2 conformance tests (113 test vectors)
PYTHONPATH=. python -m pytest test_conformance.py -v

# Run ISA v3 extension tests (62 test vectors)
PYTHONPATH=. python -m pytest test_conformance_v3.py -v

# Run all tests with coverage report
PYTHONPATH=. python -m pytest test_conformance.py test_conformance_v3.py -v --cov=conformance_core

# Run using the unified runner script
python run_conformance.py                       # Python reference VM only
python run_conformance.py --all                 # All available runtimes
python run_conformance.py --json                # JSON output
python run_conformance.py --markdown            # Markdown report
python run_conformance.py --category arith      # Arithmetic tests only
python run_conformance.py --list                # List all test vector names
python run_conformance.py --export vectors.json # Export test vectors as JSON

Running Benchmarks

# Run the full performance benchmark suite
python benchmark_flux.py

# Specific category benchmark
python benchmark_flux.py --category arith

# JSON output for CI integration
python benchmark_flux.py --json --iterations 50000

# Markdown table output
python benchmark_flux.py --markdown --output benchmark-results.md

Design Philosophy

Conformance testing is the foundation of trust in a multi-runtime ecosystem. When the same bytecode program executes on eight different runtimes written in different languages β€” Python, C, Rust, Go, Zig, JavaScript, Java, and CUDA β€” every single runtime must produce byte-for-byte identical results. Any deviation, no matter how small, represents a potential consensus failure in the broader FLUX agent network.

This suite embodies three core design principles that guide every architectural decision:

Determinism above all. Every test vector specifies not just the expected stack output, but also the exact flags register state after execution. There is no room for "approximately correct" or "implementation-defined behavior." When a test says 5 + (-5) = 0 with flags Z|C, every runtime on every platform must produce precisely that result. This deterministic guarantee is what allows FLUX agents to coordinate across heterogeneous infrastructure without a central authority.

Reference VM as golden standard. The FluxVM class in conformance_core.py is not merely a test utility β€” it is the canonical definition of correct FLUX behavior. If a test passes against the reference VM and fails against your runtime, your runtime has a bug. The reference VM implements the complete ISA specification including edge cases like signed integer division truncation semantics, flag register update rules for carry and overflow, and the confidence register clamping behavior. This eliminates ambiguity in the specification by providing a working, auditable implementation.

Portable by construction. Test vectors are expressed as pure data β€” hex-encoded bytecode, initial stack state, expected stack output, and expected flags. This data-driven approach means any runtime in any language can consume the test vectors without Python dependencies. The JSON export format (conformance-vectors.json) is designed for cross-language tooling, and the SubprocessRuntime adapter pattern in run_conformance.py provides a universal interface for testing runtimes that communicate via standard input/output.


Test Vector Format

Test vectors are the fundamental unit of the conformance suite. Each vector defines a complete, self-contained test: a bytecode program, its inputs, and its expected outputs. Vectors are stored both as Python ConformanceTestCase dataclass instances (in conformance_core.py) and as portable JSON objects (in conformance-vectors.json).

JSON Schema

Every test vector in the JSON export conforms to the following schema:

{
  "name": "string",                // Unique identifier (naming convention: category_operation)
  "bytecode_hex": "string",        // Hex-encoded bytecode (e.g., "550300000055040000001000")
  "initial_stack": [number, ...],  // Stack values before execution (bottom to top)
  "expected_stack": [number, ...], // Expected stack values after execution
  "expected_flags": number,        // Expected flags register (-1 = skip check, or 0x00-0x0F)
  "allow_float_epsilon": boolean,  // Allow floating-point tolerance (1e-5)
  "description": "string"          // Human-readable description of the test
}

Flags Register Encoding

The flags register is a 4-bit value where each bit represents a CPU-style condition flag:

Bit Name Value Meaning
0 Z 0x01 Zero flag β€” set when the last arithmetic/logic result is zero
1 S 0x02 Sign flag β€” set when the last result is negative
2 C 0x04 Carry flag β€” set on unsigned overflow in addition
3 O 0x08 Overflow flag β€” set on signed overflow

The sentinel value -1 (FLAGS_ANY) means "do not check flags for this test case."

Example Vectors

Simple addition with flag checking:

{
  "name": "arith_add_positive",
  "bytecode_hex": "550300000055040000001000",
  "initial_stack": [],
  "expected_stack": [7],
  "expected_flags": -1,
  "allow_float_epsilon": false,
  "description": "3 + 4 = 7"
}

Bytecode breakdown: 55 03000000 (PUSH 3) 55 04000000 (PUSH 4) 10 (ADD) 00 (HALT)

Addition producing zero with flags:

{
  "name": "arith_add_zero",
  "bytecode_hex": "550500000055fbffffff1000",
  "initial_stack": [],
  "expected_stack": [0],
  "expected_flags": 5,
  "allow_float_epsilon": false,
  "description": "5 + (-5) = 0, Z and C set"
}

Flags 0x05 = Z(0x01) | C(0x04), meaning the result is zero and an unsigned carry occurred.

Using initial stack for compact tests:

{
  "name": "arith_add_stack",
  "bytecode_hex": "1000",
  "initial_stack": [5, 3],
  "expected_stack": [8],
  "expected_flags": -1,
  "allow_float_epsilon": false,
  "description": "ADD with initial stack [5, 3]"
}

Bytecode breakdown: 10 (ADD β€” pops 3 and 5, pushes 8) 00 (HALT).

Floating-point with epsilon tolerance:

{
  "name": "float_div",
  "bytecode_hex": "550700000055020000007300",
  "initial_stack": [],
  "expected_stack": [3.5],
  "expected_flags": -1,
  "allow_float_epsilon": true,
  "description": "FDIV: 7.0 / 2.0 = 3.5"
}

Opcode Coverage Matrix

The conformance suite covers all 41 implemented opcodes of the FLUX ISA v2 specification (out of 247 total opcode slots allocated across the encoding space). The following matrix shows which test categories exercise each opcode:

Opcode Hex Category System Arith Cmp Logic Mem Ctrl Stack Float Conf A2A
HALT 0x00 Control βœ“
NOP 0x01 Control βœ“
BREAK 0x02 Control βœ“
ADD 0x10 Arith βœ“ βœ“
SUB 0x11 Arith βœ“ βœ“
MUL 0x12 Arith βœ“ βœ“
DIV 0x13 Arith βœ“ βœ“
MOD 0x14 Arith βœ“ βœ“
NEG 0x15 Arith βœ“
INC 0x16 Arith βœ“ βœ“
DEC 0x17 Arith βœ“ βœ“
EQ 0x20 Compare βœ“
NE 0x21 Compare βœ“
LT 0x22 Compare βœ“
LE 0x23 Compare βœ“
GT 0x24 Compare βœ“
GE 0x25 Compare βœ“
AND 0x30 Logic βœ“
OR 0x31 Logic βœ“
XOR 0x32 Logic βœ“
NOT 0x33 Logic βœ“
SHL 0x34 Logic βœ“
SHR 0x35 Logic βœ“
LOAD 0x40 Memory βœ“ βœ“
STORE 0x41 Memory βœ“ βœ“
PEEK 0x43 Memory βœ“
POKE 0x44 Memory βœ“
JMP 0x50 Control βœ“
JZ 0x51 Control βœ“
JNZ 0x52 Control βœ“
CALL 0x53 Control βœ“
RET 0x54 Control βœ“
PUSH 0x55 Control βœ“
POP 0x56 Control βœ“
DUP 0x60 Stack βœ“
SWAP 0x61 Stack βœ“
OVER 0x62 Stack βœ“
ROT 0x63 Stack βœ“
FADD 0x70 Float βœ“
FSUB 0x71 Float βœ“
FMUL 0x72 Float βœ“
FDIV 0x73 Float βœ“
CONF_GET 0x80 Conf βœ“
CONF_SET 0x81 Conf βœ“
CONF_MUL 0x82 Conf βœ“
SIGNAL 0x90 A2A βœ“
BROADCAST 0x91 A2A βœ“
LISTEN 0x92 A2A βœ“

Cross-Runtime Results

The conformance suite has been designed to test any FLUX runtime implementation. The following table summarizes the current and predicted pass rates across target runtimes:

Runtime Language Status Pass Rate Notes
Python Reference Python 3.10+ βœ… Verified 113/113 (100%) Golden reference implementation
TypeScript/WASM TypeScript πŸ”§ In Progress ~66% (predicted) Integer arith, logic, control flow passing
Rust Rust πŸ”§ In Progress ~40% (predicted) Core arithmetic and stack operations
C C11 πŸ“‹ Planned ~27% (predicted) Basic arithmetic subset
Go Go 1.21+ πŸ“‹ Planned ~20% (predicted) Integer operations only
Zig Zig 0.11+ πŸ“‹ Planned TBD Awaiting runtime implementation
Java Java 17+ πŸ“‹ Planned TBD Awaiting runtime implementation
CUDA CUDA 12+ πŸ“‹ Planned TBD GPU-accelerated parallel execution

Portability Classification

Opcodes are classified into four portability tiers based on cross-runtime analysis:

  • P0 β€” Universal (7 opcodes): HALT, NOP, PUSH, POP, ADD, SUB, MUL β€” trivially implementable in any language with basic arithmetic.
  • P1 β€” High (12 opcodes): BREAK, NEG, INC, DEC, EQ, NE, LT, GT, DUP, SWAP, JMP, JZ β€” straightforward semantics, minor flag behavior variations.
  • P2 β€” Medium (10 opcodes): DIV, MOD, LE, GE, AND, OR, XOR, NOT, JNZ, CALL β€” signed division truncation and carry flag semantics require careful implementation.
  • P3 β€” Complex (8 opcodes): SHL, SHR, LOAD, STORE, PEEK, POKE, RET, ROT β€” memory addressing and shift semantics vary across platforms.

See CROSS-RUNTIME-RESULTS.md for detailed per-runtime failure analysis and the complete CONF-002 audit results.


Reference VM Implementation

The FluxVM class in conformance_core.py is the canonical FLUX virtual machine. It serves as the golden reference for all conformance testing β€” if a program produces different results on your runtime versus FluxVM, your runtime has a bug.

Stack Machine Architecture

FluxVM is a stack-based virtual machine with the following components:

  • Data Stack: An unbounded Python list used as a LIFO stack. Values are pushed and popped by instructions. There is no fixed stack size limit in the reference implementation, though runtimes may impose limits for safety.
  • Return Stack (call_stack): A separate stack used by CALL and RET for subroutine linkage. Each CALL pushes the current PC onto the call stack; each RET pops it back.
  • Flags Register (FluxFlags): A 4-bit register with individual Z, S, C, O flags accessible as boolean properties. Updated by arithmetic instructions (update_arith) and logic instructions (update_logic).
  • Memory: A 64KB bytearray addressed by 16-bit little-endian addresses. LOAD/STORE use embedded addresses; PEEK/POKE use stack-derived addresses. All memory operations use 32-bit little-endian signed integers.
  • Program Counter (PC): A byte-level instruction pointer that advances through the bytecode stream. Control flow instructions modify the PC directly.
  • Confidence Register: A floating-point value in [0.0, 1.0] representing the agent's confidence level. Modified by CONF_GET, CONF_SET, and CONF_MUL with clamping semantics.
  • Signal Channels: A dictionary mapping channel numbers (0-255) to FIFO queues of integer values, used by SIGNAL, BROADCAST, and LISTEN for agent-to-agent communication.

Execution Model

The VM operates in a simple fetch-decode-execute loop:

  1. Fetch the next opcode byte at PC
  2. Decode and dispatch to the appropriate handler
  3. Execute the instruction (modifying stack, memory, flags, or PC)
  4. Advance PC past any instruction operands
  5. Check max_steps safety limit

Execution terminates when HALT is reached, BREAK is executed, the PC advances past the end of the bytecode, or the step limit is exceeded.

Flag Update Rules

Arithmetic operations (ADD, SUB, MUL, DIV, MOD, NEG, INC, DEC) update all four flags:

  • Z (Zero): Set if the result is exactly zero.
  • S (Sign): Set if the result is negative.
  • C (Carry): For addition, set if the unsigned sum exceeds 32 bits. For subtraction, set if the unsigned minuend is less than the subtrahend (both non-negative).
  • O (Overflow): For addition, set if both operands have the same sign but the result has a different sign. For subtraction, currently always cleared in the reference implementation (a known spec ambiguity).

Logic operations (AND, OR, XOR, NOT, SHL, SHR) and comparison operations (EQ, NE, LT, LE, GT, GE) update only Z and S, clearing C and O.


ISA v3 Extension Testing

The ISA v3 specification extends the FLUX instruction set with three new primitive classes accessible through an escape prefix mechanism. The test file test_conformance_v3.py provides 62 additional test vectors covering these extensions.

Escape Prefix Encoding

v3 extensions use the 0xFF escape prefix followed by an extension ID (1 byte), a sub-opcode (1 byte), and variable-length payload:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 0xFF     β”‚ extension_id   β”‚ sub_opcodeβ”‚   payload    β”‚
β”‚ (escape) β”‚ (1 byte)       β”‚ (1 byte)  β”‚ (variable)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Extension Classes

EXT 0x01 β€” Temporal Primitives (6 sub-opcodes): FUEL_CHECK (remaining fuel), DEADLINE_BEFORE (conditional jump based on wall-clock time), YIELD_IF_CONTENTION (cooperative multitasking), PERSIST_CRITICAL_STATE (snapshot memory region), TIME_NOW (monotonic clock), SLEEP_UNTIL (advance simulated clock).

EXT 0x02 β€” Security Primitives (6 sub-opcodes): CAP_INVOKE (capability-based authorization), MEM_TAG (memory region tagging), SANDBOX_ENTER/SANDBOX_EXIT (memory access restriction), FUEL_SET (execution budget), IDENTITY_GET (agent identity handle).

EXT 0x03 β€” Async Primitives (6 sub-opcodes): SUSPEND (save execution state to continuation), RESUME (restore from continuation), FORK (create child context), JOIN (wait for context completion), CANCEL (abort context), AWAIT_CHANNEL (channel read with timeout).

v3 Reference Implementation

The FluxVMv3 class extends the base FluxVM with additional state for temporal tracking (simulated clock, fuel budget), security (capability sets, sandbox stack, memory tags), and async (continuation list, context map). All v3 extensions maintain backward compatibility β€” any valid v2 program runs identically on FluxVMv3.


Canonical Opcode Translation

The cross-runtime translation shim (canonical_opcode_shim.py) provides a compatibility layer that maps between different runtime-specific opcode encodings and the canonical FLUX ISA encoding. This is essential when runtimes use internal opcode numbering that differs from the published specification.

The translation shim:

  1. Reads bytecode from the test vector (canonical hex encoding)
  2. Applies a per-runtime opcode remapping table
  3. Adjusts operand sizes and endianness if needed
  4. Emits the runtime's native bytecode format

Supported translation paths:

From To Function
Python Canonical python_to_canonical()
Rust Canonical rust_to_canonical()
C (flux-os) Canonical cos_to_canonical()
Go (flux-swarm) Canonical go_to_canonical()
Python Rust python_to_rust()
Python Go python_to_go()

Each translation uses a 256-byte lookup table with alias resolution for naming conventions (e.g., Python's IADD β†’ Canonical's ADD, Rust's JumpIf β†’ Canonical's JZ).


Benchmarking

The benchmark_flux.py module (PERF-001) provides a comprehensive performance benchmarking harness for measuring FLUX VM throughput across multiple dimensions.

Methodology

Each benchmark runs a synthetic program that exercises a specific opcode category in a tight loop, with warmup iterations, multiple timed runs, and statistical aggregation. The harness measures:

  • Total operations: Number of VM instructions executed
  • Wall-clock time: Elapsed time in milliseconds (median of 5 runs)
  • Throughput: Operations per second
  • Per-instruction latency: Nanoseconds per operation

Benchmark Categories

Category Benchmarks What it measures
decode NOP loop, PUSH+HALT Raw instruction fetch/decode speed
arith ADD, MUL, DIV+MOD loops Integer arithmetic throughput
float FADD+FMUL loop Floating-point arithmetic throughput
logic AND/OR/XOR/SHL loop Bitwise operation throughput
comparison EQ/LT/GT loop Comparison operation throughput
memory STORE/LOAD, POKE/PEEK loops Memory access latency
stack DUP/SWAP/OVER/ROT loop Stack manipulation throughput
control CALL/RET, nested CALL, factorial Control flow overhead
confidence CONF_SET/GET/MUL loop Confidence register throughput
a2a SIGNAL/LISTEN loop Agent-to-agent messaging throughput
complex Fibonacci Mixed-operation programs
startup 1000x PUSH+HALT VM initialization cost

Usage

# Full benchmark suite (default 10,000 iterations)
python benchmark_flux.py

# Specific category
python benchmark_flux.py --category memory

# Custom iteration count
python benchmark_flux.py --iterations 50000

# Output formats
python benchmark_flux.py --json          # Machine-readable JSON
python benchmark_flux.py --markdown      # Markdown tables
python benchmark_flux.py --json --output results.json

Paper

An academic paper draft describing the formal verification methodology and results is available at PAPER.md:

"Cross-Runtime ISA Conformance: Formal Verification of a 17-Opcode Turing-Complete Instruction Set Across Four Programming Languages"

The paper covers the formal semantics, conformance framework architecture, empirical results (108/113 pass rate), portability classification (P0--P3), and connections to theoretical bounds. Target venues include PLDI, ICST, ASE, and VEE.


Known Issues

Confidence Opcode Spec Ambiguity (CONF-002)

The overflow flag behavior for CONF_MUL and CONF_SET operations has a specification ambiguity. The ISA specification does not clearly define whether confidence operations should update the overflow flag (O) in the same way as arithmetic operations. The reference VM currently does not update flags for confidence operations, but at least one alternative interpretation exists that would set the overflow flag when confidence is clamped from a value > 1.0 or < 0.0.

Impact: 5 test cases in the CONF-002 cross-runtime audit may produce different flag values depending on interpretation. The stack values are always correct regardless of flag interpretation.

Resolution path: A formal ISA clarification is pending. Once resolved, the affected test vectors will be updated with explicit expected_flags values.

Integer Division Truncation

The reference VM uses Python's int(a / b) for division, which performs true division followed by truncation toward zero. This differs from Python's native // floor division operator for negative numbers: int(-7 / 2) = -3 (truncation) vs -7 // 2 = -4 (floor). Runtimes using floor division semantics will fail arith_div_neg. The test vector explicitly documents this as "truncation toward zero" behavior.

Unlimited Stack in Reference VM

The reference VM uses an unbounded Python list for the data stack. Production runtimes may impose stack size limits. No test vector currently tests stack overflow behavior, as this is considered a runtime-specific safety constraint rather than an ISA correctness concern.


Contributing

We welcome contributions to the FLUX conformance test suite! Please see CONTRIBUTING.md for detailed guidelines on:

  • Writing new test vectors
  • Testing a new FLUX runtime
  • Code style and PR process
  • Test vector review criteria

Quick Guide for Contributors

  1. Start with P0 opcodes (HALT, NOP, PUSH, POP, ADD, SUB, MUL)
  2. Run category by category using --category arith, --category ctrl, etc.
  3. Fix failures incrementally β€” each failure is a concrete bug in your runtime
  4. Check flags carefully β€” many failures are due to incorrect flag updates
  5. Verify float semantics β€” ensure truncation (not floor) for negative division
  6. Run the full suite once all categories pass individually

License

MIT


FLUX Conformance β€” One test suite to rule them all. Any runtime that passes is certified FLUX-compatible.


callsign

About

FLUX Ecosystem - flux-conformance

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages