diff --git a/Cargo.lock b/Cargo.lock
index 7b6ed3c62..001b4b841 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -1955,6 +1955,7 @@ dependencies = [
  "rayon",
  "serde",
  "stark",
+ "tiny-keccak",
 ]
 
 [[package]]
diff --git a/README.md b/README.md
index df751528d..60916e24d 100644
--- a/README.md
+++ b/README.md
@@ -11,6 +11,9 @@ Right now, this is a project under development and experimentation and must not
 ### Dependencies
 
 - Rust nightly with `rust-src` component
+- Clang with RISC-V target support and LLD linker (used by `make compile-programs-asm`)
+  - **macOS**: `brew install llvm` (the Homebrew LLVM includes `clang` and `lld` with RISC-V support)
+  - **Linux**: `apt install clang lld` (or equivalent for your distribution)
 
 ### Dev dependencies
 
@@ -26,11 +29,10 @@ Install Rust using [rustup](https://rustup.rs/):
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 ```
 
-Then install the nightly toolchain with the `rust-src` component (required for building `std` for the custom RISC-V target):
+Add the `rust-src` component to the pinned nightly toolchain used to build guest programs (required for building `std` for the custom RISC-V target — `make compile-programs-rust` will auto-fetch the toolchain itself):
 
 ```sh
-rustup toolchain install nightly
-rustup component add rust-src --toolchain nightly
+rustup component add rust-src --toolchain nightly-2026-02-01
 ```
 
 #### Compile sysroot
@@ -44,7 +46,7 @@ wget https://lambda.alignedlayer.com/lambda-vm-sysroot-rv64im.tar.gz
 sudo mkdir -p /opt && sudo tar -xzf lambda-vm-sysroot-rv64im.tar.gz -C /opt
 ```
 
-##### Compile it directly             
+##### Compile them directly
     
 ```sh                                                
    sudo apt-get install -y autoconf automake autotools-dev curl python3 \                                           
@@ -65,20 +67,62 @@ sudo mkdir -p /opt && sudo tar -xzf lambda-vm-sysroot-rv64im.tar.gz -C /opt
   cp -r /opt/riscv64-newlib/riscv64-unknown-elf/lib /opt/lambda-vm-sysroot/                
 ```
 
-#### Install the dependencies
+Then, you can check that the executor works by running:
 
 ```sh
-make deps
+make test-executor
 ```
 
-**Note:** At the moment, `make deps` only works on macOS.
+### Using the CLI
 
-Then, you can check that the executor works by running:
+The `cli` binary lets you execute, prove, and verify RISC-V ELF programs. Build it once with:
 
 ```sh
-make test-executor
+cargo build --release -p cli
 ```
 
+The binary will be available at `target/release/cli`.
+
+To get a sample program to work with, compile the bundled assembly tests:
+
+```sh
+make compile-programs-asm
+```
+
+This emits ELF files under `executor/program_artifacts/asm/`. With those in place, you can run the three core commands:
+
+#### Execute
+
+Run a program without generating a proof. Useful for sanity checks and debugging:
+
+```sh
+cargo run -p cli --release -- execute executor/program_artifacts/asm/add.elf
+```
+
+#### Prove
+
+Generate a STARK proof of the execution:
+
+```sh
+cargo run -p cli --release -- prove executor/program_artifacts/asm/add.elf -o /tmp/proof.bin
+```
+
+#### Verify
+
+Verify a proof against the ELF it was generated from. The command exits `0` on success and `1` on failure:
+
+```sh
+cargo run -p cli --release -- verify /tmp/proof.bin executor/program_artifacts/asm/add.elf
+```
+
+For the full CLI reference — including private inputs, blowup factor tuning, timing, and flamegraph profiling — see [`bin/cli/README.md`](./bin/cli/README.md).
+
+### Writing a guest program
+
+Guest programs are written in Rust (or RISC-V assembly) and cross-compiled to the custom RV64IM target. The guest SDK [`lambda-vm-syscalls`](./syscalls/README.md) provides the syscalls a program uses to read private input, commit public output, halt, and call precompiles like Keccak. The [`executor`](./executor/README.md) crate is what loads your compiled ELF and emits the per-instruction logs the prover consumes.
+
+To add a new Rust guest, drop a project under `executor/programs/rust/<name>/` and run `make compile-programs-rust`. See [`executor/programs/rust/`](./executor/programs/rust/) for examples (`fibonacci`, `keccak`, `hashmap`, …).
+
 ## Design choices
 
 - The Instruction Set Architecture is RISCV64IM
@@ -98,7 +142,18 @@ Following [ethrex](https://github.com/lambdaclass/ethrex):
 
 ## Documentation
 
-Full documentation can be found in [docs](./docs/). It is currently a work in progress, we expect that as more features and components become ready, they will be included in the docs.
+High-level documentation lives in [`docs/`](./docs/):
+
+- [Overview of VM flow](./docs/general_flow.md) — the pipeline from source code to proof
+- [Proof system overview](./docs/cryptography/proof_system.md) — design goals and primitives
+- [Lookup arguments](./docs/cryptography/lookup.md) — how tables are linked via LogUp
+- [Recommended reading](./docs/other_resources.md) — papers and tutorials
+
+### Specification
+
+A formal specification of the VM is written in [Typst](https://typst.app/) under [`spec/`](./spec/) and rendered as a browsable wiki (HTML) or PDF using [`shiroa`](https://myriad-dreamin.github.io/shiroa/). With both tools installed, run `shiroa serve` from `spec/` to host the wiki locally.
+
+See [`spec/README.md`](./spec/README.md) for full setup instructions.
 
 ## Testing
 
@@ -114,6 +169,7 @@ Full documentation can be found in [docs](./docs/). It is currently a work in pr
 | `make test-asm` | Compile and run ASM tests |
 | `make test-rust` | Compile and run Rust tests |
 | `make test-executor` | Compile all programs and run executor tests |
+| `make test-math-cuda` | math-cuda parity tests (requires NVIDIA GPU + nvcc) |
 | `make build` | Build all workspace crates |
 | `make check` | Check all crates (faster than build, no codegen) |
 | `make clippy` | Run clippy on all crates |
@@ -128,8 +184,8 @@ To run all tests across the project use
 
 ### ASM Tests
 
-In order to add a new asm test you should add the `.s` file under `programs/asm`
-Then add the corresponding test under `tests/asm.rs`
+In order to add a new asm test you should add the `.s` file under `executor/programs/asm`
+Then add the corresponding test under `executor/tests/asm.rs`
 
 To run them you can use
 
@@ -139,9 +195,9 @@ This will compile them and run the tests
 
 ### Rust Tests
 
-In order to add a new rust test you should add the cargo project under `programs/rust` as a new directory.
+In order to add a new rust test you should add the cargo project under `executor/programs/rust` as a new directory.
 The folder should have the same name as the `Cargo.toml` program name.
-Then add the corresponding test under `tests/rust.rs`
+Then add the corresponding test under `executor/tests/rust.rs`
 
 You can run it with
 
@@ -151,8 +207,20 @@ You can run it with
 
 You can create a flamegraph for proof generation using the following target:
 
+```sh
+make flamegraph-prover
+```
+
+This profiles the synthetic `fibonacci_multi_column` STARK example in `crypto/stark` (i.e. the STARK engine itself, not a real guest ELF). To profile the VM prover end-to-end on a real ELF, use the dedicated bench in the `prover` crate:
+
+```sh
+samply record cargo bench --bench profile_vm_prover --features parallel
 ```
-  make flamegraph-prover
+
+For a quick GPU microbench (requires an NVIDIA GPU + `nvcc`):
+
+```sh
+make bench-math-cuda
 ```
 
 ## Debug Checks
diff --git a/bin/cli/README.md b/bin/cli/README.md
index 9ce2c2674..da82620c0 100644
--- a/bin/cli/README.md
+++ b/bin/cli/README.md
@@ -4,46 +4,88 @@ A command-line interface for executing, proving, and verifying RISC-V ELF progra
 
 ## Installation
 
-```bash
-cargo build -p cli --release
+```sh
+cargo build --release -p cli
 ```
 
 The binary will be available at `target/release/cli`.
 
+## Producing an ELF
+
+The CLI consumes RISC-V ELF binaries. The repo ships ready-to-use guest programs that you can compile with:
+
+```sh
+# RISC-V assembly tests → executor/program_artifacts/asm/*.elf
+make compile-programs-asm
+
+# Rust guest programs → executor/program_artifacts/rust/*.elf  (needs the sysroot + nightly toolchain)
+make compile-programs-rust
+
+# Benchmark programs → executor/program_artifacts/bench/*.elf  (needs the sysroot + nightly toolchain)
+make compile-bench
+```
+
+See the root [`README.md`](../../README.md) for the toolchain setup.
+
 ## Commands
 
 ### Execute
 
 Run a RISC-V ELF program without generating a proof. Useful for testing and debugging.
 
-```bash
-cargo run -p cli --release -- execute <PROGRAM.elf>
+```sh
+cargo run -p cli --release -- execute <PROGRAM.elf> [--private-input <FILE>] [--flamegraph <FILE>]
 ```
 
-See [Guest Program Flamegraphs](#guest-program-flamegraphs) for profiling execution.
+| Flag | Description |
+|---|---|
+| `--private-input <FILE>` | Pass private input bytes to the guest (read via `get_private_input()`). |
+| `--flamegraph <FILE>` | Generate folded-stack flamegraph output. See [Guest Program Flamegraphs](#guest-program-flamegraphs). |
 
 ### Prove
 
 Generate a STARK proof for a RISC-V ELF program execution.
 
-```bash
-cargo run -p cli --release -- prove <PROGRAM.elf> -o proof.bin
+```sh
+cargo run -p cli --release -- prove <PROGRAM.elf> -o proof.bin [flags]
 ```
 
+| Flag | Description |
+|---|---|
+| `-o, --output <FILE>` | Output path for the serialized proof bundle. Required. |
+| `--private-input <FILE>` | Pass private input bytes to the guest. |
+| `--blowup <N>` | FRI blowup factor (power of 2). Higher = fewer queries, smaller proof, slower proving. [default: 2] |
+| `--time` | Print total proving time. |
+| `--cycles` | Run one extra pre-pass outside the timer and print the dynamic instruction count. |
+| `--elements` | Build traces and print main-trace and aux-trace field element counts. |
+
 ### Verify
 
-Verify a proof generated by the `prove` command.
+Verify a proof generated by `prove`.
 
-```bash
-cargo run -p cli --release -- verify proof.bin <PROGRAM.elf>
+```sh
+cargo run -p cli --release -- verify <PROOF> <PROGRAM.elf> [flags]
 ```
 
-Returns exit code 0 on successful verification, 1 on failure.
+| Flag | Description |
+|---|---|
+| `--blowup <N>` | FRI blowup factor used during proving. Must match. [default: 2] |
+| `--time` | Print verification time. |
+
+Returns exit code `0` on successful verification, `1` on failure.
+
+### Count Elements
+
+Build traces and print main-trace and aux-trace field element counts **without** running the proof step. Useful for sizing.
+
+```sh
+cargo run -p cli --release -- count-elements <PROGRAM.elf> [--private-input <FILE>]
+```
 
 ## Examples
 
-```bash
-# Compile test programs (if not already done)
+```sh
+# Compile the bundled assembly tests
 make compile-programs-asm
 
 # Execute a simple program
@@ -52,6 +94,9 @@ cargo run -p cli --release -- execute executor/program_artifacts/asm/add.elf
 # Generate and verify a proof
 cargo run -p cli --release -- prove executor/program_artifacts/asm/add.elf -o /tmp/proof.bin
 cargo run -p cli --release -- verify /tmp/proof.bin executor/program_artifacts/asm/add.elf
+
+# Prove with private input and print metrics
+cargo run -p cli --release -- prove program.elf -o /tmp/proof.bin --private-input input.bin --time --cycles
 ```
 
 ## Guest Program Flamegraphs
@@ -60,7 +105,7 @@ Generate flamegraphs showing where the guest RISC-V program spends its execution
 
 ### Generate Folded Stacks
 
-```bash
+```sh
 cargo run -p cli --release -- execute <PROGRAM.elf> --flamegraph folded.txt
 ```
 
@@ -68,7 +113,7 @@ cargo run -p cli --release -- execute <PROGRAM.elf> --flamegraph folded.txt
 
 Requires [inferno](https://github.com/jonhoo/inferno) or [flamegraph.pl](https://github.com/brendangregg/FlameGraph):
 
-```bash
+```sh
 # Install inferno (one-time)
 cargo install inferno
 
@@ -78,7 +123,7 @@ cat folded.txt | inferno-flamegraph > flamegraph.svg
 
 ### Example
 
-```bash
+```sh
 # Generate flamegraph for quicksort benchmark
 cargo run -p cli --release -- execute executor/program_artifacts/bench/quicksort.elf --flamegraph /tmp/quicksort.txt
 cat /tmp/quicksort.txt | inferno-flamegraph --title "quicksort" > quicksort_flamegraph.svg
diff --git a/bin/cli/src/main.rs b/bin/cli/src/main.rs
index aa633ef9a..f166e751d 100644
--- a/bin/cli/src/main.rs
+++ b/bin/cli/src/main.rs
@@ -125,7 +125,7 @@ enum Commands {
         private_input: Option<PathBuf>,
 
         /// Blowup factor (power of 2). Higher = fewer queries, smaller proof, slower proving.
-        #[arg(long)]
+        #[arg(long, default_value = "2")]
         blowup: Option<u8>,
 
         /// Print proving time
@@ -153,10 +153,10 @@ enum Commands {
         elf: PathBuf,
 
         /// Blowup factor used during proving (must match)
-        #[arg(long)]
+        #[arg(long, default_value = "2")]
         blowup: Option<u8>,
 
-        /// Print timing breakdown
+        /// Print verification time
         #[arg(long)]
         time: bool,
     },
diff --git a/docs/cryptography/lookup.md b/docs/cryptography/lookup.md
index 99c2bf803..efde8ecdc 100644
--- a/docs/cryptography/lookup.md
+++ b/docs/cryptography/lookup.md
@@ -1,111 +1,141 @@
-# Lookup Arguments
+# Lookup arguments
 
-Lookup arguments are a cryptographic technique that allows a prover to demonstrate that values in one table appear in another table, without revealing the actual values. They are essential for building efficient virtual machines where different components (CPU, memory, ALU) need to verify consistency with each other.
+Lambda VM uses **LogUp** lookup arguments to connect its trace tables. Each table generates and consumes "tokens" on one or more **named buses**; the system is sound when, across every bus, the total sender contribution equals the total receiver contribution.
 
-## Why Lookup Arguments?
+Lookups are how the prover proves cross-table relations without duplicating constraints. For example, the CPU table proves it dispatched a bitwise AND instruction by sending a token `(AndByte, x, y, x & y)` on the `AndByte` bus; the BITWISE table proves it has a row matching that token by sending a receiver token on the same bus. If both sides match, the bus balances. If a sender has no matching receiver (or vice versa), the bus does not balance and verification fails.
 
-In a virtual machine proof, different execution tables need to communicate:
+The implementation lives in [`crypto/stark/src/lookup.rs`](../../crypto/stark/src/lookup.rs).
 
-- The **CPU table** performs operations and accesses memory
-- The **Memory table** stores and retrieves values
-- The **ALU table** computes arithmetic operations
+## The `BusInteraction` struct
 
-Without lookups, verifying that "the CPU read value X from address Y" would require expensive polynomial constraints. Lookup arguments provide an efficient way to prove these cross-table relationships.
+A single lookup contribution is a `BusInteraction`:
 
-## The LogUp Protocol
+```rust
+pub struct BusInteraction {
+    pub bus_id: u64,
+    pub multiplicity: Multiplicity,
+    pub values: Vec<BusValue>,
+    pub is_sender: bool,
+}
+```
 
-We use the **LogUp** (Logarithmic Derivative Lookup) protocol, which is based on a key mathematical insight: two multisets are equal if and only if their logarithmic derivatives are equal.
+| Field | Role |
+|---|---|
+| `bus_id` | Names the bus. Senders and receivers must use the same `bus_id` for their tokens to match. Different buses use different IDs so that, e.g., an `And` token doesn't accidentally cancel an `Xor` token. |
+| `multiplicity` | How many times this row contributes (see below). |
+| `values` | The token payload — the data being looked up. |
+| `is_sender` | Carries the sign. Senders add to the bus sum, receivers subtract. The balance check is `Σ sender − Σ receiver = 0`. |
 
-### Fingerprints
+Build them with `BusInteraction::sender(bus_id, mul, values)` or `::receiver(bus_id, mul, values)`.
 
-Each row in a table is compressed into a single field element called a **fingerprint**:
+## Named buses (`BusId`)
 
-```
-fingerprint = 1 / (z - (v₀ + v₁·α + v₂·α² + ...))
-```
+Bus IDs are declared in [`prover/src/tables/types.rs`](../../prover/src/tables/types.rs) as a `#[repr(u64)]` enum:
 
-Where:
-- `z` and `α` are random challenges sampled via Fiat-Shamir
-- `v₀, v₁, v₂, ...` are the column values in that row
+```rust
+#[repr(u64)]
+pub enum BusId {
+    IsByte = 0,
+    IsHalfword,
+    IsB20,
+    AndByte,
+    OrByte,
+    XorByte,
+    Msb8,
+    Msb16,
+    Zero,
+    Hwsl,
+    Lt,
+    Mul,
+    Dvrm,
+    Shift,
+    Memw,
+    Load,
+    Memory,
+    // ...
+}
+```
 
-The linear combination `v₀ + v₁·α + v₂·α² + ...` compresses multiple columns into one value, and `z` shifts it to enable the logarithmic derivative form.
+Each value is a `u64` discriminant, auto-incremented from 0. `BusInteraction::new` takes `impl Into<u64>` so you pass `BusId::AndByte` directly.
 
-### Running Sum
+## Multiplicity
 
-For each table interaction, we build an auxiliary column that accumulates fingerprints:
+How many copies of the token a row contributes. Most rows are `One`, but tables that deduplicate or have flag-gated participation use richer forms:
 
-```
-s[i+1] = s[i] + multiplicity[i] / (z - linear_combination[i])
+```rust
+pub enum Multiplicity {
+    One,                              // 1
+    Column(usize),                    // col[i]
+    Sum(usize, usize),                // col[a] + col[b]
+    Negated(usize),                   // 1 - col[i]    (col must be a bit)
+    Diff(usize, usize),               // col[a] - col[b]
+    Sum3(usize, usize, usize),        // col[a] + col[b] + col[c]
+    Linear(Vec<LinearTerm>),          // arbitrary signed combination
+}
 ```
 
-Where `multiplicity` indicates how many times this row participates in the lookup:
-- **Positive** for rows being "looked up" (proving side)
-- **Negative** for rows doing the "looking" (assuming side)
+`Linear` is the escape hatch — it supports signed coefficients and large unsigned coefficients (e.g. `2^{-32} mod p`), and is how interactions like `μ − read2 − read4 − read8` are expressed.
 
-### Bus Balancing
+## Bus values (the token payload)
 
-The key property: if all lookups are valid, the sum of all fingerprints across all tables equals zero. This is because every "send" (negative multiplicity) has a matching "receive" (positive multiplicity).
+Each entry in `BusInteraction.values` is a `BusValue`:
 
-```
-Σ (sends) + Σ (receives) = 0
+```rust
+pub enum BusValue {
+    Packed { start_column: usize, packing: Packing },
+    Linear(Vec<LinearTerm>),
+}
 ```
 
-This is verified by checking that the final values of all running-sum columns sum to zero.
+- `Packed` reads consecutive trace columns and combines them via a `Packing` formula (powers of 2). For example, `Packing::Word2L` at `start_column = 4` reads columns 4 and 5 and computes `c₄ + 2¹⁶·c₅`, producing one bus element representing a 32-bit word.
+- `Linear` is an arbitrary signed linear combination over columns and constants — used when the value is a flag, a constant tag, or a derived expression that doesn't fit a `Packing`.
 
-## Multi-Table Challenge Sharing
+The `Packing` enum supports primitive shapes (`Direct`, `Word2L`, `Word4L`) and compound shapes (`DWordHL`, `DWordBL`, `QuadHL`, …) that produce multiple bus elements. A 64-bit double-word stored as 4 half-words is one `BusValue::Packed { packing: DWordHL, .. }` that yields two bus elements.
 
-For the bus to balance correctly, **all tables must use the same random challenges** `(z, α)`. This is critical for security and correctness.
+## Two-stage value combination
 
-### Protocol Flow
+A token's contribution to the bus is computed in **two stages**:
 
-1. **Commit all main traces**: Each table commits its main execution trace to the transcript
-2. **Sample shared challenges**: After ALL main traces are committed, sample `z` and `α` once
-3. **Build auxiliary traces**: Each table builds its running-sum columns using the shared challenges
-4. **Commit auxiliary traces**: Each table commits its auxiliary trace
-5. **Continue STARK protocol**: Proceed with composition polynomial, FRI, etc.
+1. **Limb packing.** Within each `BusValue`, columns are combined using powers of 2 according to the chosen `Packing`. This is how multi-limb values are formed from their column-level representation (e.g. assembling a 32-bit word from four 8-bit byte columns).
 
-### Why Share Challenges?
+2. **Bus fingerprint.** All bus elements from the interaction — starting with the `bus_id`, then the elements produced by each `BusValue` — are folded together using powers of a single challenge α:
 
-If tables used different challenges:
-- Table A computes fingerprints with `(z₁, α₁)`
-- Table B computes fingerprints with `(z₂, α₂)`
-- The fingerprints don't match even for identical values
-- The bus cannot balance, and valid proofs become impossible
+   ```
+   fingerprint = z − (bus_id + α·v₁ + α²·v₂ + … + α^(k−1)·v_{k−1})
+   ```
 
-By sharing challenges, fingerprints for the same values are identical across tables, enabling the bus to balance.
+   The interaction's contribution at this row is `± multiplicity / fingerprint`, with the sign coming from `is_sender`.
 
-## Implementation
+The `bus_id` is the first bus element. This is what makes tokens on different buses non-interfering: two interactions on `BusId::Mul` and `BusId::Lt` have different fingerprints even when all the data values match.
 
-### Challenge Constants
+## Challenges
 
-```rust
-// Index of the `z` challenge - evaluation point for fingerprints
-pub const LOGUP_CHALLENGE_Z: usize = 0;
+LogUp uses two challenges sampled from the transcript after the main trace is committed:
 
-// Index of the `α` challenge - base for linear combination
-pub const LOGUP_CHALLENGE_ALPHA: usize = 1;
+- `z` — read as `challenges[0]`; the subtractor in the fingerprint denominator.
+- `α` — read as `challenges[LOGUP_CHALLENGE_ALPHA]` where `LOGUP_CHALLENGE_ALPHA = 1`; the base for the powers-of-α combination of bus elements.
 
-// Total number of LogUp challenges
-pub const LOGUP_NUM_CHALLENGES: usize = 2;
-```
+The total challenge count is `LOGUP_NUM_CHALLENGES = 2`. There is no separate `LOGUP_CHALLENGE_Z` constant — `z` is just the first challenge by convention.
 
-### Table Interactions
+## Bus balance
 
-Each AIR defines its lookup interactions via `TableInteraction`:
+For every bus to be sound, across all tables and all rows:
 
-```rust
-pub struct TableInteraction {
-    pub flag_columns: Vec<usize>,   // Columns indicating participation (multiplicity)
-    pub value_columns: Vec<usize>,  // Columns containing the looked-up values
-}
+```
+Σ (over senders)    multiplicity / fingerprint
+−
+Σ (over receivers)  multiplicity / fingerprint
+= 0
 ```
 
-### Auxiliary Trace
+In code this becomes: every table contributes a per-bus running sum (the "table contribution") to its auxiliary trace; the verifier checks that the sum of table contributions equals the expected bus balance. For most buses the expected balance is zero. The `Commit` bus is an exception: its expected balance is recomputed by the verifier from the public output bytes (see `compute_commit_bus_offset` in [`prover/src/lib.rs`](../../prover/src/lib.rs)) so that tampering with the proof's public output is caught.
 
-The auxiliary trace contains one running-sum column per interaction, plus optionally a grand-sum column that aggregates all interactions.
+When `--features debug-checks` is on, [`crypto/stark/src/bus_debug.rs`](../../crypto/stark/src/bus_debug.rs) prints per-bus sender vs. receiver sums to help diagnose imbalances during development.
 
-## Security Considerations
+## Implementation pointers
 
-1. **Challenge derivation**: Challenges must be sampled via Fiat-Shamir after all main traces are committed to prevent manipulation
-2. **Shared challenges**: All tables in a multi-table proof MUST use identical challenges
-3. **Field size**: The field must be large enough that random challenges don't accidentally cause fingerprint collisions
+- Interaction shape, packing, balance: [`crypto/stark/src/lookup.rs`](../../crypto/stark/src/lookup.rs)
+- Named bus IDs used by the VM: [`prover/src/tables/types.rs`](../../prover/src/tables/types.rs)
+- Per-table interactions: [`prover/src/tables/`](../../prover/src/tables/) (one file per table)
+- Verifier-side bus offset for the COMMIT bus: [`prover/src/lib.rs`](../../prover/src/lib.rs) (`compute_commit_bus_offset`)
+- Debug-checks bus diagnostics: [`crypto/stark/src/bus_debug.rs`](../../crypto/stark/src/bus_debug.rs)
diff --git a/docs/cryptography/proof_system.md b/docs/cryptography/proof_system.md
index 8683c1747..89b427426 100644
--- a/docs/cryptography/proof_system.md
+++ b/docs/cryptography/proof_system.md
@@ -8,17 +8,20 @@ The proof system is the component responsible for generating the certificate of
 5. Have short proofs.
 
 This section will cover the basic cryptographic primitives needed for the proof system and a description of the whole proof system and arguments used. Core concepts are:
-1. [Finite field](./finite_field.md)
-2. [Polynomials](./polynomials.md)
-3. [Extension field](./extension_field.md)
-4. [Hash function](./hash_function.md)
-5. [Fast-Fourier transform](./fast_fourier_transform.md)
-6. [Reed-Solomon codes](./reed_solomon_codes.md)
-7. [Constraint](./constraint.md)
-8. [Algebraic intermediate representation](./air.md)
-9. [Interactive oracle proof](./iop.md)
-10. [Fast Reed-Solomon Interactive Oracle Proof of Proximity (FRI)](./fri.md)
-11. [Provable security and conjectured security](./security.md)
-12. [Lookup argument](./lookup.md)
+
+> **Note:** the chapters below are a work in progress.
+
+1. Finite field
+2. Polynomials
+3. Extension field
+4. Hash function
+5. Fast-Fourier transform
+6. Reed-Solomon codes
+7. Constraint
+8. Algebraic intermediate representation
+9. Interactive oracle proof
+10. Fast Reed-Solomon Interactive Oracle Proof of Proximity (FRI)
+11. Provable security and conjectured security
+12. Lookup argument
 
 The flow of the proof system is described in the following section. 
\ No newline at end of file
diff --git a/docs/general_flow.md b/docs/general_flow.md
index dae345683..deee5e4fe 100644
--- a/docs/general_flow.md
+++ b/docs/general_flow.md
@@ -1,14 +1,20 @@
-# Description
-
-The different components that form the pipeline for proving the correctness of the execution of a given program on an input stream are:
-1. The source code of the program, written in high-level language
-2. The program binary, ready for the virtual machine
-3. The execution record of the binary over the VM architecture for a given input
-4. The witness of the computation, generated from the execution record. Typically, this will consist of several tables, called trace tables.
-5. The proof of validity of the witness for some language and VM architecture
-
-The steps are as follows:
-1. The *compiler* transforms the program into the binary.
-2. The *executor* takes a binary, an input stream and an VM architecture and produces the execution record.
-3. The *witness generator* transforms the execution record into a witness compatible with the chosen arithmetisation and constraint system.
-4. The *proof system* takes the witness and the constraint system, and produces a (set of) proof(s) that the former satisfies the latter.
\ No newline at end of file
+# Overview of VM flow
+
+The Lambda VM proves correct execution of a RISC-V (RV64IM) program against an input stream. The pipeline has five artifacts and four transformations.
+
+## Artifacts
+
+1. **Source code** — high-level Rust (using [`syscalls/`](../syscalls/) for guest-host I/O) or RISC-V assembly.
+2. **ELF binary** — the program in the VM's ISA, ready to load.
+3. **Execution record** — per-instruction logs emitted by running the ELF on the VM.
+4. **Witness** — a set of trace tables (CPU, decode, MEMW, LOAD, bitwise, branch, LT, shift, MUL, DVRM, page, register, halt, commit, keccak, …) derived from the execution record. Each table is an AIR (Algebraic Intermediate Representation); tables are linked by LogUp lookup arguments.
+5. **Proof** — a multi-table STARK proof (transparent, hash-based, post-quantum secure) that the witness satisfies all AIR constraints and lookup arguments. Low-degree of the witness polynomials is verified via FRI.
+
+## Transformations
+
+1. **Compiler** — `rustc` cross-compiles to the custom RISC-V target spec ([`executor/programs/riscv64im-lambda-vm-elf.json`](../executor/programs/riscv64im-lambda-vm-elf.json)) and produces the ELF. The `lambda-vm-syscalls` crate exposes guest-side syscalls (`commit`, `get_private_input`, `print_string`, `keccak_permute`, `sys_halt`).
+2. **Executor** ([`executor/`](../executor/)) — loads the ELF, runs the program against the VM's memory and register state, handles syscalls and precompiles (e.g. Keccak), and emits the per-instruction logs.
+3. **Witness generator** ([`prover/src/tables/`](../prover/src/tables/)) — turns the logs into trace tables, populates AIR columns, and computes the LogUp auxiliary columns that connect tables.
+4. **Proof system** ([`crypto/stark/`](../crypto/stark/)) — commits to each table's trace via Merkle trees, samples challenges via Fiat-Shamir, and runs FRI for the low-degree test. Produces a `MultiProof`; the verifier replays the transcript and checks all AIR and lookup constraints.
+
+For a deeper dive into each component see the [proof system overview](./cryptography/proof_system.md).
diff --git a/executor/README.md b/executor/README.md
new file mode 100644
index 000000000..df9911038
--- /dev/null
+++ b/executor/README.md
@@ -0,0 +1,53 @@
+# Lambda VM Executor
+
+RISC-V (RV64IM) emulator for the Lambda VM. Loads ELF binaries, runs them against an in-memory VM state, and emits the per-instruction execution logs that the [prover](../prover) turns into a STARK trace.
+
+Published as `executor`. Used directly by the CLI and the prover; you can also drive it from Rust.
+
+## Usage
+
+```rust
+use executor::elf::Elf;
+use executor::vm::execution::Executor;
+
+let elf_bytes = std::fs::read("program.elf")?;
+let program = Elf::load(&elf_bytes)?;
+let executor = Executor::new(&program, /* private input */ vec![])?;
+let result = executor.run()?;
+
+println!("Executed {} instructions", result.logs.len());
+```
+
+For chunked execution (useful when you don't want to hold all logs in memory), drive the executor via `executor.resume()` in a loop until it yields `None`, then call `executor.finish()`. See [`bin/cli/src/main.rs`](../bin/cli/src/main.rs) for an example.
+
+## Example programs
+
+The repo ships ready-to-use guest programs in three flavours, all compiled by Makefile targets at the repo root:
+
+- [`programs/asm/`](./programs/asm/) — raw RISC-V assembly. Built with `make compile-programs-asm` into `program_artifacts/asm/`.
+- [`programs/rust/`](./programs/rust/) — Rust guest projects (`fibonacci`, `keccak`, `hashmap`, …). Built with `make compile-programs-rust` into `program_artifacts/rust/`. Requires the pinned nightly toolchain and sysroot — see the root [`README.md`](../README.md).
+- [`programs/bench/`](./programs/bench/) — benchmark programs. Built with `make compile-bench`.
+
+The custom RISC-V target spec used for Rust guests lives at [`programs/riscv64im-lambda-vm-elf.json`](./programs/riscv64im-lambda-vm-elf.json).
+
+## Tests
+
+```sh
+# Compile all programs and run executor tests
+make test-executor
+
+# Just the asm tests
+make test-asm
+
+# Just the Rust tests
+make test-rust
+```
+
+To add a new test:
+
+- **ASM**: add a `.s` file under [`programs/asm/`](./programs/asm/) and a matching entry in [`tests/asm.rs`](./tests/asm.rs).
+- **Rust**: add a cargo project under [`programs/rust/<name>/`](./programs/rust/) (the directory and the `Cargo.toml` package name must match) and a matching entry in [`tests/rust.rs`](./tests/rust.rs).
+
+## Flamegraphs
+
+The executor includes a flamegraph generator (`executor::flamegraph::FlamegraphGenerator`) that produces folded-stack output by instruction count. Drive it via the CLI: `cli execute <elf> --flamegraph stacks.txt`. See [`bin/cli/README.md`](../bin/cli/README.md) for details.
diff --git a/prover/README.md b/prover/README.md
index e69de29bb..b523c1c3b 100644
--- a/prover/README.md
+++ b/prover/README.md
@@ -0,0 +1,54 @@
+# Lambda VM Prover
+
+STARK prover for the Lambda VM. Proves correct execution of RISC-V ELF binaries by generating a multi-table STARK proof (CPU, decode, bitwise, branch, LT, shift, MUL, DVRM, MEMW, LOAD, page, register, halt, commit, keccak) and provides the matching native verifier.
+
+Published as `lambda-vm-prover`. Consumed by [`bin/cli`](../bin/cli) and by the benchmarks; you can also use it directly from Rust.
+
+## Usage
+
+```rust
+use lambda_vm_prover as prover;
+
+let elf_bytes = std::fs::read("program.elf").unwrap();
+
+let proof = prover::prove(&elf_bytes).unwrap();
+assert!(prover::verify(&proof, &elf_bytes).unwrap());
+```
+
+With private inputs:
+
+```rust
+let private_inputs = std::fs::read("input.bin").unwrap();
+let proof = prover::prove_with_inputs(&elf_bytes, &private_inputs).unwrap();
+```
+
+## Public API
+
+| Function | Description |
+|---|---|
+| `prove(elf)` | Generate a proof with default options (blowup = 2). |
+| `prove_with_inputs(elf, private)` | Same, with private input bytes. |
+| `prove_with_options(elf, opts, max_rows)` | Custom proof options and max-rows config. |
+| `prove_with_options_and_inputs(...)` | Most general entry point. |
+| `verify(proof, elf)` | Verify a proof with default options. |
+| `verify_with_options(proof, elf, opts)` | Verify with caller-chosen options (the verifier enforces its own security parameters, not the prover's). |
+| `prove_and_verify(elf)` | Prove + verify in one call (convenience). |
+| `count_elements(elf, private)` | Build traces and return `(main, aux)` field-element counts without running the proof step. |
+
+The proof bundle type is `VmProof`, containing the multi-table STARK proof, the public output bytes committed by the guest, and the metadata the verifier needs to reconstruct the AIR configuration (table chunk counts, runtime page ranges, number of private-input pages).
+
+## Features
+
+| Feature | Description |
+|---|---|
+| `parallel` (default) | Rayon parallelism across tables and FFTs. |
+| `debug-checks` | Runs `validate_trace` and prints a per-bus LogUp balance report after proving. Forwarded to `crypto/stark`. |
+| `instruments` | Per-phase timing and heap-usage report (execute, trace build, AIR construction, prove). |
+
+To run the test suite with debug output:
+
+```sh
+cargo test --release -p lambda-vm-prover --features debug-checks -- --nocapture
+```
+
+See the root [`README.md`](../README.md) for the end-to-end workflow (compiling guest programs, the CLI wrapper, benchmarks).
diff --git a/spec/README.md b/spec/README.md
index 127e528c8..da844e801 100644
--- a/spec/README.md
+++ b/spec/README.md
@@ -1,11 +1,19 @@
-# LambdaVM specification
-This repository contains specification for [`LambdaVM`](https://github.com/yetanotherco/lambda_vm).
-The specification is written in [`Typst`](https://typst.app/) and can be rendered by [`shiroa`](https://myriad-dreamin.github.io/shiroa/) as either a file (pdf) or a wiki (html).
+# Lambda VM Specification
 
-## Installation & Development setup
-1. [Install `Typst`](https://github.com/typst/typst?tab=readme-ov-file#installation).
-2. [Install `shiroa`](https://myriad-dreamin.github.io/shiroa/guide/installation.html).
-3. Clone this repository.
-4. Open the repository in a terminal and execute `shiroa serve`.
+Formal specification of the Lambda VM. Covers the per-chip AIR constraints (CPU, decode, bitwise, branch, LT, shift, MUL, DVRM, MEMW, LOAD, page, register, halt, commit, keccak), the memory argument, and the LogUp lookup framework that links the tables.
 
-At this point, the wiki version is hosted locally and is actively updated as you modify the specification files.
+The specification is written in [Typst](https://typst.app/) and rendered as either a PDF or a browsable HTML wiki using [shiroa](https://myriad-dreamin.github.io/shiroa/).
+
+## Rendering it locally
+
+1. [Install Typst](https://github.com/typst/typst?tab=readme-ov-file#installation).
+2. [Install shiroa](https://myriad-dreamin.github.io/shiroa/guide/installation.html).
+3. From this directory, run:
+
+   ```sh
+   shiroa serve
+   ```
+
+   shiroa will host the HTML wiki locally and live-reload as you edit the `.typ` source files.
+
+To produce a PDF instead, see the shiroa documentation for the `build` command.
diff --git a/syscalls/README.md b/syscalls/README.md
new file mode 100644
index 000000000..fa5758741
--- /dev/null
+++ b/syscalls/README.md
@@ -0,0 +1,52 @@
+# Lambda VM Syscalls (Guest SDK)
+
+Guest-side library for programs that run inside the Lambda VM. Provides the syscalls and the default entry point that let a Rust program interact with the host: read private input, commit public output, halt, and invoke precompiles.
+
+Published as `lambda-vm-syscalls`. Intended to be used from RISC-V (RV64IM) guest binaries cross-compiled with the toolchain described in the root [`README.md`](../README.md).
+
+## What it provides
+
+| Function | Purpose |
+|---|---|
+| `commit(bytes: &[u8])` | Append bytes to the **public output** that the verifier checks. |
+| `get_private_input() -> Vec<u8>` | Read the host-supplied private input bytes (memory-mapped at `0xFF000000`). |
+| `sys_halt() -> !` | Terminate execution cleanly. Called automatically after `main` by the default entry point. |
+| `keccak_permute(state: &mut [u64; 25])` | Keccak-f[1600] permutation precompile. |
+
+The crate also provides a default `_start` that initialises the allocator, calls `main`, and halts.
+
+> **Note:** the `print_string` syscall is temporarily unavailable — calling it in a guest will cause proof verification to fail. Tracked as a follow-up.
+
+## Example
+
+A minimal guest that reads private input and commits a (non-secret) summary of it:
+
+```rust
+use lambda_vm_syscalls::syscalls;
+
+pub fn main() {
+    let input = syscalls::get_private_input();
+
+    // Anything passed to `commit` becomes part of the proof's public output.
+    // Don't echo private input here — commit a derived value instead.
+    let len = (input.len() as u32).to_le_bytes();
+    syscalls::commit(&len);
+}
+```
+
+See [`executor/programs/rust/`](../executor/programs/rust/) for more example guests (`fibonacci`, `keccak`, `hashmap`, …).
+
+## Building a guest
+
+Guests are compiled with a pinned nightly toolchain for the custom RISC-V target `riscv64im-lambda-vm-elf.json`. The simplest path is to drop a new project under `executor/programs/rust/<name>/` and run `make compile-programs-rust` from the repo root.
+
+See the root [`README.md`](../README.md) for the full toolchain setup (sysroot, nightly pin, target spec).
+
+## Unsupported `std` functions
+
+The following functions are stubbed and **panic at runtime** if called — Lambda VM does not provide stdin or command-line arguments:
+
+- `sys_read` (so `io::Read` for `Stdin` is not available)
+- `sys_argc`, `sys_argv`
+
+To pass data into a guest, use `get_private_input()` instead.