diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
new file mode 100644
index 0000000..ee67f19
--- /dev/null
+++ b/.github/workflows/docs.yml
@@ -0,0 +1,64 @@
+name: Docs
+
+on:
+ push:
+ branches: [main]
+ pull_request:
+ branches: [main]
+ workflow_dispatch:
+
+permissions:
+ contents: write
+
+
+jobs:
+ build:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+
+ - name: Set up Python
+ uses: actions/setup-python@v5
+ with:
+ python-version: "3.11"
+ cache: pip
+
+ - name: Install package with docs extra
+ run: |
+ python -m pip install --upgrade pip
+ pip install -e ".[docs]"
+
+ - name: Build docs (strict)
+ run: zensical build --strict -f docs/mkdocs.yml
+
+ - name: Upload built site
+ uses: actions/upload-artifact@v4
+ with:
+ name: site
+ path: docs/site
+ retention-days: 7
+
+ deploy:
+ needs: build
+ if: github.event_name == 'push' && github.ref == 'refs/heads/main'
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+ with:
+ fetch-depth: 0
+
+ - name: Set up Python
+ uses: actions/setup-python@v5
+ with:
+ python-version: "3.11"
+ cache: pip
+
+ - name: Install package with docs extra
+ run: |
+ python -m pip install --upgrade pip
+ pip install -e ".[docs]"
+
+ - name: Deploy to GitHub Pages
+ run: zensical gh-deploy --force --clean -f docs/mkdocs.yml
diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
new file mode 100644
index 0000000..110a253
--- /dev/null
+++ b/.github/workflows/lint.yml
@@ -0,0 +1,27 @@
+name: Lint
+
+on:
+ push:
+ branches: [main, dev]
+ pull_request:
+ branches: [main, dev]
+
+jobs:
+ ruff:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: actions/checkout@v4
+
+ - name: Set up Python
+ uses: actions/setup-python@v5
+ with:
+ python-version: "3.11"
+
+ - name: Install ruff
+ run: pip install ruff
+
+ - name: Run ruff check
+ run: ruff check .
+
+ - name: Run ruff format check
+ run: ruff format --check .
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
new file mode 100644
index 0000000..eb50440
--- /dev/null
+++ b/.github/workflows/test.yml
@@ -0,0 +1,30 @@
+name: Test
+
+on:
+ push:
+ branches: [main, dev]
+ pull_request:
+ branches: [main, dev]
+
+jobs:
+ test:
+ runs-on: ubuntu-latest
+ strategy:
+ matrix:
+ python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+
+ steps:
+ - uses: actions/checkout@v4
+
+ - name: Set up Python ${{ matrix.python-version }}
+ uses: actions/setup-python@v5
+ with:
+ python-version: ${{ matrix.python-version }}
+
+ - name: Install dependencies
+ run: |
+ python -m pip install --upgrade pip
+ pip install -e ".[dev]"
+
+ - name: Run tests
+ run: pytest
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..97b2cf6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,104 @@
+# Cortex
+
+Lightweight Python pub/sub over ZeroMQ, for robotics and beyond.
+
+**[Documentation](https://sudoRicheek.github.io/cortex/)** · [Quickstart](https://sudoRicheek.github.io/cortex/getting-started/quickstart/) · [API Reference](https://sudoRicheek.github.io/cortex/reference/)
+
+## Overview
+
+Cortex is a pub/sub communication layer built on ZeroMQ IPC. Nodes publish typed messages on named topics; subscribers receive them via async callbacks. A small discovery daemon handles endpoint resolution so publishers and subscribers find each other automatically.
+
+- **Typed messages** with 64-bit fingerprint verification — no silent type mismatches
+- **Zero-copy frames** for NumPy arrays and PyTorch tensors over IPC
+- **uvloop-backed async** for low tail latency on Linux/macOS
+- **Simple API**: `Node`, `Publisher`, `Subscriber`, rate-based `Executor`
+
+```
+┌──────────────────────────────────────┐
+│ Discovery Daemon │
+│ ipc:///tmp/cortex_discovery │
+└──────┬───────────────────────┬───────┘
+ │ REQ/REP (register) │ REQ/REP (lookup)
+┌──────┴──────┐ ┌──────┴──────┐
+│ Publisher │─PUB/SUB─│ Subscriber │
+└─────────────┘ IPC └─────────────┘
+```
+
+## Installation
+
+```bash
+git clone https://github.com/sudoRicheek/cortex.git
+cd cortex
+pip install -e "." # core
+pip install -e ".[torch]" # + PyTorch
+```
+
+## Quick Start
+
+```bash
+cortex-discovery # terminal 1: start the discovery daemon
+```
+
+```python
+# publisher.py
+import numpy as np
+from cortex import Node, ArrayMessage
+
+node = Node("sensor")
+pub = node.create_publisher("/sensor/data", ArrayMessage)
+pub.publish(ArrayMessage(data=np.random.randn(640, 480, 3).astype("f4"), name="frame"))
+```
+
+```python
+# subscriber.py
+from cortex import Node, ArrayMessage
+
+def on_msg(msg: ArrayMessage, header):
+ print(f"got {msg.name}: {msg.data.shape}")
+
+node = Node("proc")
+node.create_subscriber("/sensor/data", ArrayMessage, callback=on_msg)
+node.spin()
+```
+
+Custom message types, rate-based executors, multi-node systems — see the **[docs](https://sudoRicheek.github.io/cortex/)**.
+
+## Messages
+
+Define messages as plain dataclasses — registration, fingerprinting, and serialization are automatic:
+
+```python
+from dataclasses import dataclass
+import numpy as np
+from cortex.messages.base import Message
+
+@dataclass
+class RobotState(Message):
+ position: np.ndarray # zero-copy over IPC
+ joint_angles: np.ndarray
+ is_moving: bool
+```
+
+Built-ins cover the common cases: `StringMessage`, `ArrayMessage`, `ImageMessage`, `PointCloudMessage`, `PoseMessage`, `TensorMessage`, and more. See the [Messages reference](https://sudoRicheek.github.io/cortex/components/messages/).
+
+## Examples
+
+See the `examples/` directory for complete, runnable scripts. To try one:
+
+```bash
+python -m cortex.discovery.daemon # Terminal 1
+python examples/publisher_numpy.py # Terminal 2
+python examples/subscriber_numpy.py # Terminal 3
+```
+
+Full walkthroughs in the [Tutorials](https://sudoRicheek.github.io/cortex/tutorials/custom-messages/).
+
+## Testing
+
+```bash
+pytest
+```
+
+## License
+
+Apache 2.0 — see [LICENSE](LICENSE).
diff --git a/benchmarks/__init__.py b/benchmarks/__init__.py
new file mode 100644
index 0000000..6b3e41d
--- /dev/null
+++ b/benchmarks/__init__.py
@@ -0,0 +1 @@
+"""Benchmarks for Cortex framework."""
diff --git a/benchmarks/bench_all.py b/benchmarks/bench_all.py
new file mode 100644
index 0000000..5487cd9
--- /dev/null
+++ b/benchmarks/bench_all.py
@@ -0,0 +1,371 @@
+#!/usr/bin/env python3
+"""
+Comprehensive benchmark suite for Cortex.
+
+Runs all benchmarks and generates a summary report.
+"""
+
+import argparse
+import json
+
+# Make src/ importable without installing the package
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+import numpy as np
+import zmq
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+
+from bench_latency import run_latency_benchmark
+from bench_throughput import run_throughput_benchmark
+
+from cortex.messages.standard import ArrayMessage
+
+
+def run_all_benchmarks(quick: bool = False) -> dict:
+    """Run the complete benchmark suite.
+
+    Args:
+        quick: Cut message counts roughly 10x for a fast smoke run.
+    """
+
+ results = {
+ "timestamp": datetime.now().isoformat(),
+ "system_info": get_system_info(),
+ "benchmarks": {},
+ }
+
+ print("\n" + "=" * 80)
+ print("CORTEX BENCHMARK SUITE")
+ print("=" * 80)
+
+ # 1. Latency benchmarks
+ print("\n[1/4] Running latency benchmarks...")
+
+ latency_configs = [
+ {
+ "num_messages": 1000,
+ "payload_size": 64,
+ "rate_hz": 1000,
+ "name": "small_payload",
+ },
+ {
+ "num_messages": 1000,
+ "payload_size": 1024,
+ "rate_hz": 1000,
+ "name": "medium_payload",
+ },
+ {
+ "num_messages": 500,
+ "payload_size": 65536,
+ "rate_hz": 500,
+ "name": "large_payload",
+ },
+ {"num_messages": 5000, "payload_size": 256, "rate_hz": 0, "name": "max_rate"},
+ ]
+
+ results["benchmarks"]["latency"] = {}
+ for config in latency_configs:
+        name = config.pop("name")
+        if quick:
+            config["num_messages"] = max(100, config["num_messages"] // 10)
+ print(f" - {name}...", end=" ", flush=True)
+ try:
+ result = run_latency_benchmark(**config)
+ results["benchmarks"]["latency"][name] = result
+ if "error" not in result:
+ print(
+ f"mean={result['latency_mean_us']:.1f}µs, p99={result['latency_p99_us']:.1f}µs"
+ )
+ else:
+ print("ERROR")
+ except Exception as e:
+ print(f"FAILED: {e}")
+ results["benchmarks"]["latency"][name] = {"error": str(e)}
+
+ # 2. Throughput benchmarks
+ print("\n[2/4] Running throughput benchmarks...")
+
+ throughput_configs = [
+ {
+ "num_messages": 10000,
+ "array_shape": (10,),
+ "dtype": "float32",
+ "name": "tiny_array",
+ },
+ {
+ "num_messages": 5000,
+ "array_shape": (100, 100),
+ "dtype": "float32",
+ "name": "small_array",
+ },
+ {
+ "num_messages": 1000,
+ "array_shape": (512, 512),
+ "dtype": "float32",
+ "name": "medium_array",
+ },
+ {
+ "num_messages": 200,
+ "array_shape": (1024, 1024),
+ "dtype": "float32",
+ "name": "large_array",
+ },
+ ]
+
+ results["benchmarks"]["throughput"] = {}
+ for config in throughput_configs:
+        name = config.pop("name")
+        if quick:
+            config["num_messages"] = max(100, config["num_messages"] // 10)
+ print(f" - {name}...", end=" ", flush=True)
+ try:
+ result = run_throughput_benchmark(**config)
+ results["benchmarks"]["throughput"][name] = result
+ if "error" not in result:
+ print(
+ f"{result['throughput_msg_per_s']:.0f} msg/s, {result['throughput_mb_per_s']:.1f} MB/s"
+ )
+ else:
+ print("ERROR")
+ except Exception as e:
+ print(f"FAILED: {e}")
+ results["benchmarks"]["throughput"][name] = {"error": str(e)}
+
+ # 3. Image-like data benchmarks
+ print("\n[3/4] Running image data benchmarks...")
+
+ image_configs = [
+ {
+ "num_messages": 1000,
+ "array_shape": (480, 640, 3),
+ "dtype": "uint8",
+ "name": "vga_rgb",
+ },
+ {
+ "num_messages": 500,
+ "array_shape": (720, 1280, 3),
+ "dtype": "uint8",
+ "name": "720p_rgb",
+ },
+ {
+ "num_messages": 200,
+ "array_shape": (1080, 1920, 3),
+ "dtype": "uint8",
+ "name": "1080p_rgb",
+ },
+ ]
+
+ results["benchmarks"]["images"] = {}
+ for config in image_configs:
+        name = config.pop("name")
+        if quick:
+            config["num_messages"] = max(100, config["num_messages"] // 10)
+ print(f" - {name}...", end=" ", flush=True)
+ try:
+ result = run_throughput_benchmark(**config)
+ results["benchmarks"]["images"][name] = result
+ if "error" not in result:
+ fps = result["throughput_msg_per_s"]
+ mbps = result["throughput_mb_per_s"]
+ print(f"{fps:.1f} fps, {mbps:.1f} MB/s")
+ else:
+ print("ERROR")
+ except Exception as e:
+ print(f"FAILED: {e}")
+ results["benchmarks"]["images"][name] = {"error": str(e)}
+
+ # 4. Serialization overhead
+ print("\n[4/4] Measuring serialization overhead...")
+ results["benchmarks"]["serialization"] = measure_serialization_overhead()
+
+ return results
+
+
+def get_system_info() -> dict:
+ """Get system information."""
+ import platform
+
+ return {
+ "platform": platform.system(),
+ "platform_release": platform.release(),
+ "processor": platform.processor(),
+ "python_version": platform.python_version(),
+ "numpy_version": np.__version__,
+ }
+
+
+def measure_serialization_overhead() -> dict:
+ """Measure wire serialization overhead using multipart transport frames."""
+
+ results = {}
+
+ test_cases = [
+ ("1KB_array", np.random.randn(256).astype(np.float32)),
+ ("100KB_array", np.random.randn(256, 100).astype(np.float32)),
+ ("1MB_array", np.random.randn(512, 512).astype(np.float32)),
+ ("4MB_array", np.random.randn(1024, 1024).astype(np.float32)),
+ ]
+
+ for name, data in test_cases:
+ message = ArrayMessage(data=data)
+ topic = b"/benchmark/serialization"
+ endpoint = f"inproc://cortex_serialization_{name}"
+ context = zmq.Context.instance()
+ sender = context.socket(zmq.PAIR)
+ receiver = context.socket(zmq.PAIR)
+ sender.setsockopt(zmq.LINGER, 0)
+ receiver.setsockopt(zmq.LINGER, 0)
+ receiver.bind(endpoint)
+ sender.connect(endpoint)
+
+ def frame_size_bytes(frames: list[object]) -> int:
+ total = 0
+ for frame in frames:
+ if hasattr(frame, "nbytes"):
+ total += int(frame.nbytes)
+ else:
+ total += len(frame)
+ return total
+
+ # Warm up
+ for _ in range(10):
+ sender.send_multipart([topic, *message.to_frames()], copy=False)
+ warmup_frames = receiver.recv_multipart(copy=False)
+ ArrayMessage.from_frames(warmup_frames[1:])
+
+ # Benchmark A->B wire transfer and B-side decode
+ iterations = 100
+ wire_total = 0.0
+ decode_total = 0.0
+ frames = []
+
+ for _ in range(iterations):
+ wire_start = time.perf_counter()
+ sender.send_multipart([topic, *message.to_frames()], copy=False)
+ frames = receiver.recv_multipart(copy=False)
+ wire_end = time.perf_counter()
+
+ decode_start = wire_end
+ ArrayMessage.from_frames(frames[1:])
+ decode_end = time.perf_counter()
+
+ wire_total += wire_end - wire_start
+ decode_total += decode_end - decode_start
+
+ serialize_time = (wire_total / iterations) * 1000 # ms (A->B wire path)
+ deserialize_time = (decode_total / iterations) * 1000 # ms (B-side decode)
+
+ # Use real wire bytes including topic frame.
+ data_size_bytes = frame_size_bytes(frames)
+ data_size_kb = data_size_bytes / 1024
+
+ results[name] = {
+ "data_size_kb": data_size_kb,
+ "wire_size_bytes": data_size_bytes,
+ "serialize_ms": serialize_time,
+ "deserialize_ms": deserialize_time,
+ "total_ms": serialize_time + deserialize_time,
+ # Throughput is intentionally omitted here because inproc multipart
+ # transport with copy=False can look unrealistically high and is
+ # often misread as physical link bandwidth.
+ "roundtrip_ms": serialize_time + deserialize_time,
+ }
+
+ print(
+ f" - {name}: to_wire={serialize_time:.3f}ms, from_wire={deserialize_time:.3f}ms"
+ )
+
+ sender.close()
+ receiver.close()
+
+ return results
+
+
+def print_summary(results: dict) -> None:
+ """Print a summary of all benchmark results."""
+
+ print("\n" + "=" * 80)
+ print("BENCHMARK SUMMARY")
+ print("=" * 80)
+
+ # Latency summary
+ print("\n📊 LATENCY (microseconds)")
+ print("-" * 60)
+ print(f"{'Test':<20} {'Mean':>10} {'P50':>10} {'P99':>10} {'Max':>10}")
+ print("-" * 60)
+
+ for name, data in results["benchmarks"].get("latency", {}).items():
+ if "error" not in data:
+ print(
+ f"{name:<20} {data['latency_mean_us']:>10.1f} "
+ f"{data['latency_p50_us']:>10.1f} "
+ f"{data['latency_p99_us']:>10.1f} "
+ f"{data['latency_max_us']:>10.1f}"
+ )
+
+ # Throughput summary
+ print("\n📊 THROUGHPUT")
+ print("-" * 60)
+ print(f"{'Test':<20} {'Msg/s':>12} {'MB/s':>10} {'Loss %':>10}")
+ print("-" * 60)
+
+ for name, data in results["benchmarks"].get("throughput", {}).items():
+ if "error" not in data:
+ print(
+ f"{name:<20} {data['throughput_msg_per_s']:>12,.0f} "
+ f"{data['throughput_mb_per_s']:>10.1f} "
+ f"{data['loss_rate_percent']:>10.2f}"
+ )
+
+ # Image throughput
+ print("\n📊 IMAGE DATA (frames per second)")
+ print("-" * 60)
+ print(f"{'Resolution':<20} {'FPS':>10} {'MB/s':>10} {'Loss %':>10}")
+ print("-" * 60)
+
+ for name, data in results["benchmarks"].get("images", {}).items():
+ if "error" not in data:
+ print(
+ f"{name:<20} {data['throughput_msg_per_s']:>10.1f} "
+ f"{data['throughput_mb_per_s']:>10.1f} "
+ f"{data['loss_rate_percent']:>10.2f}"
+ )
+
+ # Wire serialization overhead
+ print("\n📊 WIRE SERIALIZATION OVERHEAD (MULTIPART)")
+ print("-" * 60)
+ print(
+ f"{'Size':<20} {'Wire Bytes':>12} {'To Wire':>12} {'From Wire':>12} {'Roundtrip':>12}"
+ )
+ print("-" * 60)
+
+ for name, data in results["benchmarks"].get("serialization", {}).items():
+ print(
+ f"{name:<20} {data['wire_size_bytes']:>12,} "
+ f"{data['serialize_ms']:>10.3f}ms "
+ f"{data['deserialize_ms']:>10.3f}ms "
+ f"{data['roundtrip_ms']:>10.3f}ms"
+ )
+
+ print("\n" + "=" * 80)
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Cortex Benchmark Suite")
+ parser.add_argument(
+ "-o", "--output", type=str, default=None, help="Output file for JSON results"
+ )
+ parser.add_argument(
+ "--quick",
+ action="store_true",
+ help="Run quick benchmarks with fewer iterations",
+ )
+
+ args = parser.parse_args()
+
+    results = run_all_benchmarks(quick=args.quick)
+ print_summary(results)
+
+ if args.output:
+ output_path = Path(args.output)
+ with open(output_path, "w") as f:
+ json.dump(results, f, indent=2, default=str)
+ print(f"\nResults saved to: {output_path}")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/benchmarks/bench_latency.py b/benchmarks/bench_latency.py
new file mode 100644
index 0000000..c2be413
--- /dev/null
+++ b/benchmarks/bench_latency.py
@@ -0,0 +1,324 @@
+#!/usr/bin/env python3
+"""
+Latency benchmark for Cortex.
+
+Measures round-trip latency between publisher and subscriber.
+"""
+
+import argparse
+import asyncio
+import multiprocessing as mp
+import statistics
+
+# Make src/ importable without installing the package
+import sys
+import time
+from dataclasses import dataclass
+from pathlib import Path
+
+import numpy as np
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
+
+import cortex
+from cortex.core.publisher import Publisher
+from cortex.core.subscriber import Subscriber
+from cortex.discovery.daemon import DiscoveryDaemon
+from cortex.messages.base import Message
+
+
+@dataclass
+class LatencyMessage(Message):
+ """Message with timestamp for latency measurement."""
+
+ send_time_ns: int
+ sequence: int
+ payload: bytes # Variable size payload
+
+
+def run_discovery_daemon():
+ """Run the discovery daemon in a separate process."""
+ daemon = DiscoveryDaemon()
+ daemon.start()
+
+
+def run_publisher(
+ topic: str,
+ num_messages: int,
+ payload_size: int,
+ rate_hz: float,
+ ready_event,
+ start_event,
+):
+ """Publisher process."""
+ time.sleep(0.5) # Wait for discovery daemon
+
+ pub = Publisher(
+ topic_name=topic,
+ message_type=LatencyMessage,
+ node_name="latency_publisher",
+ )
+
+ # Signal ready
+ ready_event.set()
+
+ # Wait for start signal
+ start_event.wait()
+
+ payload = b"\x00" * payload_size
+ interval = 1.0 / rate_hz if rate_hz > 0 else 0
+
+ for i in range(num_messages):
+ msg = LatencyMessage(
+ send_time_ns=time.time_ns(),
+ sequence=i,
+ payload=payload,
+ )
+ pub.publish(msg)
+
+ if interval > 0:
+ time.sleep(interval)
+
+    # Give ZMQ a moment to flush queued frames before closing
+ time.sleep(0.1)
+ pub.close()
+
+
+def run_subscriber(
+ topic: str,
+ num_messages: int,
+ ready_event,
+ start_event,
+ results_queue,
+):
+ """Subscriber process - runs async receive loop."""
+
+ async def subscriber_main():
+ await asyncio.sleep(0.5) # Wait for discovery daemon
+
+ latencies: list[float] = []
+ received = 0
+
+ sub = Subscriber(
+ topic_name=topic,
+ message_type=LatencyMessage,
+ node_name="latency_subscriber",
+ wait_for_topic=True,
+ topic_timeout=30.0,
+ )
+
+ # Wait for topic to be available before signaling ready
+ if not sub.is_connected:
+ connected = await sub._async_connect()
+ if not connected:
+ results_queue.put({"received": 0, "latencies": [], "error": "timeout"})
+ return
+
+ # Signal ready
+ ready_event.set()
+
+ # Wait for start signal
+ start_event.wait()
+
+ start_time = time.time()
+ timeout = 30.0 # Max wait time
+
+ while received < num_messages and (time.time() - start_time) < timeout:
+ try:
+ result = await asyncio.wait_for(sub.receive(), timeout=1.0)
+
+ if result:
+ msg, _header = result
+ receive_time_ns = time.time_ns()
+ latency_us = (receive_time_ns - msg.send_time_ns) / 1000.0
+ latencies.append(latency_us)
+ received += 1
+ except asyncio.TimeoutError:
+ continue
+
+ sub.close()
+
+ # Send results back
+ results_queue.put(
+ {
+ "received": received,
+ "latencies": latencies,
+ }
+ )
+
+ cortex.run(subscriber_main())
+
+
+def run_latency_benchmark(
+ num_messages: int = 1000,
+ payload_size: int = 1024,
+ rate_hz: float = 1000.0,
+) -> dict:
+ """
+ Run the latency benchmark.
+
+ Args:
+ num_messages: Number of messages to send
+ payload_size: Size of payload in bytes
+ rate_hz: Publishing rate (0 for unlimited)
+
+ Returns:
+ Dictionary with benchmark results
+ """
+ topic = "/benchmark/latency"
+
+ # Start discovery daemon
+ discovery_proc = mp.Process(target=run_discovery_daemon, daemon=True)
+ discovery_proc.start()
+ time.sleep(1.0) # Give daemon more time to start and bind socket
+
+ # Events for synchronization
+ pub_ready = mp.Event()
+ sub_ready = mp.Event()
+ start_event = mp.Event()
+ results_queue = mp.Queue()
+
+ # Start subscriber first
+ sub_proc = mp.Process(
+ target=run_subscriber,
+ args=(topic, num_messages, sub_ready, start_event, results_queue),
+ )
+ sub_proc.start()
+
+ # Start publisher
+ pub_proc = mp.Process(
+ target=run_publisher,
+ args=(topic, num_messages, payload_size, rate_hz, pub_ready, start_event),
+ )
+ pub_proc.start()
+
+ # Wait for both to be ready
+ pub_ready.wait(timeout=10)
+ sub_ready.wait(timeout=10)
+
+ # Small delay for connection establishment
+ time.sleep(0.2)
+
+ # Start benchmark
+ benchmark_start = time.time()
+ start_event.set()
+
+ # Wait for completion
+ pub_proc.join(timeout=60)
+ sub_proc.join(timeout=60)
+
+ benchmark_duration = time.time() - benchmark_start
+
+ # Get results
+ results = results_queue.get(timeout=5)
+
+ # Cleanup
+ discovery_proc.terminate()
+ discovery_proc.join(timeout=2)
+
+ # Calculate statistics
+ latencies = results["latencies"]
+
+ if latencies:
+ stats = {
+ "num_messages": num_messages,
+ "payload_size": payload_size,
+ "rate_hz": rate_hz,
+ "received": results["received"],
+ "loss_rate": (num_messages - results["received"]) / num_messages * 100,
+ "duration_s": benchmark_duration,
+ "latency_min_us": min(latencies),
+ "latency_max_us": max(latencies),
+ "latency_mean_us": statistics.mean(latencies),
+ "latency_median_us": statistics.median(latencies),
+ "latency_std_us": statistics.stdev(latencies) if len(latencies) > 1 else 0,
+ "latency_p50_us": np.percentile(latencies, 50),
+ "latency_p90_us": np.percentile(latencies, 90),
+ "latency_p99_us": np.percentile(latencies, 99),
+ "throughput_msg_per_s": results["received"] / benchmark_duration,
+ }
+ else:
+ stats = {
+ "error": "No messages received",
+ "received": 0,
+ }
+
+ return stats
+
+
+def print_results(results: dict) -> None:
+ """Print benchmark results in a formatted way."""
+ print("\n" + "=" * 60)
+ print("LATENCY BENCHMARK RESULTS")
+ print("=" * 60)
+
+ if "error" in results:
+ print(f"ERROR: {results['error']}")
+ return
+
+ print("\nConfiguration:")
+ print(f" Messages: {results['num_messages']:,}")
+ print(f" Payload size: {results['payload_size']:,} bytes")
+ print(f" Target rate: {results['rate_hz']:,.0f} Hz")
+
+ print("\nDelivery:")
+ print(f" Received: {results['received']:,} / {results['num_messages']:,}")
+ print(f" Loss rate: {results['loss_rate']:.2f}%")
+ print(f" Duration: {results['duration_s']:.2f} s")
+ print(f" Throughput: {results['throughput_msg_per_s']:,.0f} msg/s")
+
+ print("\nLatency (microseconds):")
+ print(f" Min: {results['latency_min_us']:,.1f} µs")
+ print(f" Max: {results['latency_max_us']:,.1f} µs")
+ print(f" Mean: {results['latency_mean_us']:,.1f} µs")
+ print(f" Median: {results['latency_median_us']:,.1f} µs")
+ print(f" Std Dev: {results['latency_std_us']:,.1f} µs")
+ print(f" P50: {results['latency_p50_us']:,.1f} µs")
+ print(f" P90: {results['latency_p90_us']:,.1f} µs")
+ print(f" P99: {results['latency_p99_us']:,.1f} µs")
+
+ print("=" * 60 + "\n")
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Cortex Latency Benchmark")
+ parser.add_argument(
+ "-n",
+ "--num-messages",
+ type=int,
+ default=1000,
+ help="Number of messages to send (default: 1000)",
+ )
+ parser.add_argument(
+ "-s",
+ "--payload-size",
+ type=int,
+ default=1024,
+ help="Payload size in bytes (default: 1024)",
+ )
+ parser.add_argument(
+ "-r",
+ "--rate",
+ type=float,
+ default=1000.0,
+ help="Publishing rate in Hz, 0 for unlimited (default: 1000)",
+ )
+
+ args = parser.parse_args()
+
+ print("\nRunning latency benchmark...")
+ print(f" Messages: {args.num_messages}")
+ print(f" Payload: {args.payload_size} bytes")
+ print(f" Rate: {args.rate} Hz")
+
+ results = run_latency_benchmark(
+ num_messages=args.num_messages,
+ payload_size=args.payload_size,
+ rate_hz=args.rate,
+ )
+
+ print_results(results)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/benchmarks/bench_throughput.py b/benchmarks/bench_throughput.py
new file mode 100644
index 0000000..e4a64cd
--- /dev/null
+++ b/benchmarks/bench_throughput.py
@@ -0,0 +1,338 @@
+#!/usr/bin/env python3
+"""
+Throughput benchmark for Cortex framework.
+
+Measures maximum message throughput for different payload sizes.
+"""
+
+import asyncio
+import builtins
+import contextlib
+import logging
+import threading
+import time
+from dataclasses import dataclass
+from typing import Any
+
+import numpy as np
+
+import cortex
+from cortex import Publisher, Subscriber
+from cortex.discovery import DiscoveryDaemon
+from cortex.messages import ArrayMessage
+from cortex.messages.base import MessageHeader
+
+# Reduce logging noise during benchmarks
+logging.getLogger("cortex").setLevel(logging.WARNING)
+
+
+@dataclass
+class ThroughputResult:
+ """Results from a throughput test."""
+
+ payload_size: int
+ messages_sent: int
+ messages_received: int
+ duration: float
+ throughput_msgs: float # messages per second
+ throughput_bytes: float # bytes per second
+ throughput_mbps: float # megabits per second
+ loss_rate: float
+
+
+def run_throughput_test(
+ payload_size: int,
+ duration_seconds: float = 5.0,
+ discovery_address: str = "ipc:///tmp/cortex_benchmark_discovery",
+) -> ThroughputResult:
+ """
+ Run a throughput test with given payload size.
+
+ Args:
+ payload_size: Size of payload in bytes
+ duration_seconds: How long to run the test
+ discovery_address: Discovery daemon address
+
+ Returns:
+ ThroughputResult with benchmark data
+ """
+ topic = "/benchmark/throughput"
+
+ # Create payload (numpy array of given size)
+ # Each float64 is 8 bytes
+ num_elements = max(1, payload_size // 8)
+ payload = np.random.rand(num_elements).astype(np.float64)
+ actual_payload_size = payload.nbytes
+
+ # Counters
+ received_count = 0
+ lock = threading.Lock()
+
+ # Subscriber callback (async)
+ async def on_message(msg: ArrayMessage, header: MessageHeader) -> None:
+ nonlocal received_count
+ with lock:
+ received_count += 1
+
+ # Start discovery daemon
+ daemon = DiscoveryDaemon(address=discovery_address)
+ daemon_thread = threading.Thread(target=daemon.start, daemon=True)
+ daemon_thread.start()
+ time.sleep(1.0) # Give daemon more time to start and bind socket
+
+ sent_count = 0
+
+ try:
+ # Create publisher
+ pub = Publisher(
+ topic_name=topic,
+ message_type=ArrayMessage,
+ node_name="throughput_pub",
+ discovery_address=discovery_address,
+ queue_size=10000, # Large queue for throughput test
+ )
+
+ # Create subscriber
+ sub = Subscriber(
+ topic_name=topic,
+ message_type=ArrayMessage,
+ callback=on_message,
+ node_name="throughput_sub",
+ discovery_address=discovery_address,
+ queue_size=10000,
+ )
+
+ # Wait for connection
+ time.sleep(0.5)
+
+ # Start subscriber in background using asyncio
+ sub_running = True
+
+ def subscriber_loop():
+ async def run_sub():
+ sub.start()
+ while sub_running:
+ try:
+ result = await asyncio.wait_for(sub.receive(), timeout=0.01)
+ if result and sub._callback:
+ msg, header = result
+ await sub._callback(msg, header)
+ except asyncio.TimeoutError:
+ pass
+ except Exception:
+ break
+
+ cortex.run(run_sub())
+
+ sub_thread = threading.Thread(target=subscriber_loop, daemon=True)
+ sub_thread.start()
+
+ # Run publisher at maximum speed for specified duration
+ start_time = time.perf_counter()
+ end_time = start_time + duration_seconds
+
+ message = ArrayMessage(data=payload)
+
+ while time.perf_counter() < end_time:
+ if pub.publish(message):
+ sent_count += 1
+
+ actual_duration = time.perf_counter() - start_time
+
+ # Give subscriber time to catch up
+ time.sleep(0.5)
+ sub_running = False
+ sub_thread.join(timeout=1.0)
+
+ # Calculate results
+ with lock:
+ final_received = received_count
+
+ throughput_msgs = final_received / actual_duration
+ throughput_bytes = throughput_msgs * actual_payload_size
+ throughput_mbps = (throughput_bytes * 8) / 1_000_000 # Convert to megabits
+
+ loss_rate = 1.0 - (final_received / sent_count) if sent_count > 0 else 0.0
+
+ return ThroughputResult(
+ payload_size=actual_payload_size,
+ messages_sent=sent_count,
+ messages_received=final_received,
+ duration=actual_duration,
+ throughput_msgs=throughput_msgs,
+ throughput_bytes=throughput_bytes,
+ throughput_mbps=throughput_mbps,
+ loss_rate=loss_rate,
+ )
+
+ finally:
+ # Cleanup
+ with contextlib.suppress(builtins.BaseException):
+ pub.close()
+ with contextlib.suppress(builtins.BaseException):
+ sub.close()
+ daemon.stop()
+
+
+def run_throughput_benchmark(
+ num_messages: int = 1000,
+ array_shape: tuple[int, ...] = (100, 100),
+ dtype: str = "float32",
+) -> dict[str, Any]:
+ """
+ Run throughput benchmark (compatibility wrapper for bench_all.py).
+
+ Args:
+ num_messages: Number of messages to send
+ array_shape: Shape of array to send
+ dtype: NumPy dtype string
+
+ Returns:
+ Dictionary with benchmark results
+ """
+ import uuid
+
+ # Calculate payload size
+ test_array = np.zeros(array_shape, dtype=dtype)
+ payload_size = test_array.nbytes
+
+ # Estimate duration based on message count
+ # Assume roughly 10000 msg/s for estimation
+ duration = max(1.0, num_messages / 10000)
+
+ # Use unique discovery address to avoid conflicts
+ unique_id = uuid.uuid4().hex[:8]
+ discovery_address = f"ipc:///tmp/cortex_bench_{unique_id}"
+
+ result = run_throughput_test(
+ payload_size=payload_size,
+ duration_seconds=duration,
+ discovery_address=discovery_address,
+ )
+
+ return {
+ "num_messages": num_messages,
+ "array_shape": array_shape,
+ "dtype": dtype,
+ "payload_size_bytes": result.payload_size,
+ "messages_sent": result.messages_sent,
+ "messages_received": result.messages_received,
+ "duration_s": result.duration,
+ "throughput_msg_per_s": result.throughput_msgs,
+ "throughput_mb_per_s": result.throughput_bytes / 1_000_000,
+ "loss_rate_percent": result.loss_rate * 100,
+ }
+
+
+def format_bytes(num_bytes: float) -> str:
+ """Format bytes in human-readable form."""
+ for unit in ["B", "KB", "MB", "GB"]:
+ if abs(num_bytes) < 1024.0:
+ return f"{num_bytes:.1f} {unit}"
+ num_bytes /= 1024.0
+ return f"{num_bytes:.1f} TB"
+
+
+def format_rate(rate: float) -> str:
+ """Format rate in human-readable form."""
+ if rate >= 1_000_000:
+ return f"{rate / 1_000_000:.2f}M"
+ elif rate >= 1_000:
+ return f"{rate / 1_000:.2f}K"
+ else:
+ return f"{rate:.0f}"
+
+
+def main():
+ """Run throughput benchmarks."""
+ print("=" * 70)
+ print("CORTEX THROUGHPUT BENCHMARK")
+ print("=" * 70)
+ print()
+
+ # Test different payload sizes
+ payload_sizes = [
+ 64, # 64 B - small messages
+ 256, # 256 B
+ 1024, # 1 KB
+ 4096, # 4 KB
+ 16384, # 16 KB
+ 65536, # 64 KB
+ 262144, # 256 KB
+ 1048576, # 1 MB - large messages
+ 4194304, # 4 MB - very large (like images)
+ 16777216, # 16 MB - very large (like high-res images)
+ ]
+
+ duration = 3.0 # seconds per test
+ results: list[ThroughputResult] = []
+
+ print(f"Running throughput tests ({duration}s each)...")
+ print()
+
+ for i, size in enumerate(payload_sizes):
+ print(
+ f" [{i + 1}/{len(payload_sizes)}] Testing {format_bytes(size)} payload...",
+ end=" ",
+ flush=True,
+ )
+
+ try:
+ result = run_throughput_test(
+ payload_size=size,
+ duration_seconds=duration,
+ discovery_address=f"ipc:///tmp/cortex_bench_{i}",
+ )
+ results.append(result)
+ print(
+ f"✓ {format_rate(result.throughput_msgs)}/s, {result.throughput_mbps:.1f} Mbps"
+ )
+ except Exception as e:
+ print(f"✗ Error: {e}")
+
+ time.sleep(0.5) # Brief pause between tests
+
+ # Print summary table
+ print()
+ print("=" * 70)
+ print("RESULTS SUMMARY")
+ print("=" * 70)
+ print()
+ print(
+ f"{'Payload':<12} {'Sent':<10} {'Recv':<10} {'Loss':<8} {'Msg/s':<12} {'Throughput':<15}"
+ )
+ print("-" * 70)
+
+ for r in results:
+ print(
+ f"{format_bytes(r.payload_size):<12} "
+ f"{r.messages_sent:<10,} "
+ f"{r.messages_received:<10,} "
+ f"{r.loss_rate * 100:>5.1f}% "
+ f"{format_rate(r.throughput_msgs):<12}/s "
+ f"{r.throughput_mbps:>8.1f} Mbps"
+ )
+
+ print("-" * 70)
+ print()
+
+ # Find peak throughput
+ if results:
+ peak_msgs = max(results, key=lambda r: r.throughput_msgs)
+ peak_bytes = max(results, key=lambda r: r.throughput_bytes)
+
+ print("Peak Performance:")
+ print(
+ f" Messages: {format_rate(peak_msgs.throughput_msgs)}/s @ {format_bytes(peak_msgs.payload_size)} payload"
+ )
+ print(
+ f" Bandwidth: {peak_bytes.throughput_mbps:.1f} Mbps @ {format_bytes(peak_bytes.payload_size)} payload"
+ )
+ print(f" ({peak_bytes.throughput_bytes / 1_000_000:.1f} MB/s)")
+
+ print()
+ print("=" * 70)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/docs/components/discovery.md b/docs/components/discovery.md
new file mode 100644
index 0000000..607d2b6
--- /dev/null
+++ b/docs/components/discovery.md
@@ -0,0 +1,128 @@
+# Discovery
+
+> **Source:** [`cortex.discovery.daemon`](../reference/discovery/daemon.md),
+> [`cortex.discovery.client`](../reference/discovery/client.md),
+> [`cortex.discovery.protocol`](../reference/discovery/protocol.md)
+
+Discovery is Cortex's control plane: a single long-lived process that maps
+topic names to ZMQ endpoints. It sits off the data path — once a subscriber
+has an endpoint, messages flow publisher → subscriber directly without the
+daemon's involvement.
+
+## Moving parts
+
+```mermaid
+flowchart LR
+ subgraph DP[discovery package]
+ PR[protocol.py DiscoveryRequest / DiscoveryResponse / TopicInfo]
+ DM[daemon.py DiscoveryDaemon ZMQ REP loop]
+ CL[client.py DiscoveryClient ZMQ REQ wrapper]
+ end
+
+ CL -- msgpack REQ --> DM
+ DM -- msgpack REP --> CL
+ PR -.-> DM
+ PR -.-> CL
+```
+
+Everyone agrees on the wire format via `protocol.py`. The daemon runs a
+single-threaded REP loop. The client speaks REQ from every publisher and
+subscriber in the graph.
+
+## Daemon
+
+Implemented in [`DiscoveryDaemon`][cortex.discovery.daemon.DiscoveryDaemon].
+
+Key behaviors:
+
+- Binds `zmq.REP` at `ipc:///tmp/cortex/discovery.sock` by default.
+- Maintains `_topics: dict[str, TopicInfo]` — **one publisher per topic**.
+- `RCVTIMEO=1000` on the socket so the loop can check `_running` and shut
+  down cleanly on Ctrl-C. REP is lock-step by design, so the daemon serves
+  one request at a time — a slow client blocks all others.
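+
+Running one programmatically has the same effect as the `cortex-discovery`
+CLI entry point. A minimal sketch, assuming constructor defaults:
+
+```python
+from cortex.discovery.daemon import DiscoveryDaemon
+
+daemon = DiscoveryDaemon()  # binds the default discovery endpoint
+daemon.start()              # blocking REP loop; Ctrl-C shuts it down cleanly
+```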
+
+### State transitions
+
+```mermaid
+stateDiagram-v2
+ [*] --> Starting
+ Starting --> Running: bind OK
+ Running --> Running: REGISTER → insert
+ Running --> Running: LOOKUP → read
+ Running --> Running: UNREGISTER → delete
+ Running --> Running: LIST → snapshot
+ Running --> Stopping: SIGINT / SHUTDOWN
+ Stopping --> [*]: close socket, unlink .sock
+```
+
+### Registry semantics
+
+| Case | Result |
+| -------------------------------------- | ------------------ |
+| New topic | Insert → OK |
+| Same topic, same `publisher_node` | Overwrite → OK (re-registration) |
+| Same topic, different `publisher_node` | Reject → ALREADY_EXISTS |
+| UNREGISTER missing topic | NOT_FOUND |
+
+## Client
+
+Implemented in [`DiscoveryClient`][cortex.discovery.client.DiscoveryClient].
+
+Thin REQ wrapper around the protocol. Important operational detail: **REQ
+sockets stick after a timeout** — they block subsequent sends waiting for a
+reply that never came. The client handles this by closing and recreating the
+socket on every timeout (`_reconnect`). Callers don't see it.
+
+### REQ timeout recovery
+
+```mermaid
+flowchart TD
+ S[send request] --> W[wait RCVTIMEO]
+ W -->|reply| OK[return DiscoveryResponse]
+ W -->|timeout| T[zmq.Again]
+ T --> C[close REQ socket]
+ C --> N[create fresh REQ same endpoint]
+ N -->|attempts < retries| S
+ N -->|exhausted| F[raise TimeoutError]
+```
+
+### Polling helpers
+
+- [`lookup_topic(name)`][cortex.discovery.client.DiscoveryClient.lookup_topic] —
+ one-shot, returns `None` on miss.
+- [`wait_for_topic(name, timeout, poll_interval)`][cortex.discovery.client.DiscoveryClient.wait_for_topic] —
+ blocking poll loop (time.sleep).
+- [`wait_for_topic_async(name, timeout, poll_interval)`][cortex.discovery.client.DiscoveryClient.wait_for_topic_async] —
+ async poll loop (asyncio.sleep). This is what [`Subscriber`][cortex.core.subscriber.Subscriber]
+ uses when `wait_for_topic=True`.
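+
+A typical lookup flow with these helpers (constructor defaults assumed;
+whether `wait_for_topic` returns `None` or raises on timeout depends on the
+API):
+
+```python
+from cortex.discovery.client import DiscoveryClient
+
+client = DiscoveryClient()
+info = client.lookup_topic("/camera/image")  # one-shot; None on miss
+if info is None:
+    # Blocking poll until a publisher registers the topic
+    info = client.wait_for_topic("/camera/image", timeout=10.0, poll_interval=0.1)
+if info is not None:
+    print(info.address, info.fingerprint)
+```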
+
+## Protocol
+
+Implemented in [`cortex.discovery.protocol`](../reference/discovery/protocol.md).
+
+| Type | Purpose |
+| -------------------------------------------------------------------- | ----------------------------------------- |
+| [`DiscoveryCommand`][cortex.discovery.protocol.DiscoveryCommand] | `REGISTER_TOPIC` / `UNREGISTER_TOPIC` / `LOOKUP_TOPIC` / `LIST_TOPICS` / `SHUTDOWN` |
+| [`DiscoveryStatus`][cortex.discovery.protocol.DiscoveryStatus] | `OK` / `NOT_FOUND` / `ALREADY_EXISTS` / `ERROR` |
+| [`TopicInfo`][cortex.discovery.protocol.TopicInfo] | name, address, message_type, fingerprint, publisher_node |
+| [`DiscoveryRequest`][cortex.discovery.protocol.DiscoveryRequest] | command + optional topic_info / topic_name |
+| [`DiscoveryResponse`][cortex.discovery.protocol.DiscoveryResponse] | status, message, topic_info, topics |
+
+All payloads are msgpack. `TopicInfo` is nested as a packed sub-blob so
+discovery responses stay flat.
+
+## Known limitations
+
+Summarized here, detailed in [critique.md](../critique.md):
+
+- One-publisher-per-topic.
+- No heartbeats or leases — crashed publishers leave stale entries.
+- Single-threaded REP — slow client starves others.
+- `retries=1` in the client is off by one; the effective retry count today is zero.
+- Daemon state lost on restart; publishers do not auto-re-register.
+
+## See also
+
+- [Concepts → Discovery protocol](../concepts/discovery-protocol.md)
+- [Getting started → Running the discovery daemon](../getting-started/discovery-daemon.md)
+- [Critique](../critique.md)
diff --git a/docs/components/messages.md b/docs/components/messages.md
new file mode 100644
index 0000000..619ad96
--- /dev/null
+++ b/docs/components/messages.md
@@ -0,0 +1,110 @@
+# Messages
+
+> **Source:** [`cortex.messages.base`](../reference/messages/base.md),
+> [`cortex.messages.standard`](../reference/messages/standard.md)
+
+Messages are just `@dataclass`es that inherit from
+[`Message`][cortex.messages.base.Message]. Registering with the type system,
+computing a fingerprint, and (de)serialization all happen automatically.
+
+## Anatomy of a message
+
+```mermaid
+classDiagram
+ class Message {
+ +fingerprint() int
+ +to_bytes() bytes
+ +to_frames() list
+ +from_bytes(data) tuple
+ +from_frames(frames) tuple
+ +decode(bytes) tuple [static]
+ -_build_header()
+ -_field_names() tuple
+ -_field_values() list
+ -_next_sequence() int
+ }
+ class MessageHeader {
+ +fingerprint: int
+ +timestamp_ns: int
+ +sequence: int
+ +to_bytes() bytes
+ +from_bytes(data) MessageHeader
+ +size() int
+ }
+ class MessageType {
+ +register(cls)
+ +get(fingerprint) type
+ +get_all() dict
+ }
+ Message ..> MessageHeader : emits
+ Message ..> MessageType : auto-registers on subclass
+```
+
+## Defining a custom message
+
+```python
+from dataclasses import dataclass
+import numpy as np
+from cortex.messages.base import Message
+
+@dataclass
+class JointTrajectory(Message):
+ timestamp: float
+ positions: np.ndarray # shape (N,)
+ velocities: np.ndarray # shape (N,)
+ frame_id: str = ""
+```
+
+That is the entire contract. The class is registered into
+[`MessageType._registry`][cortex.messages.base.MessageType] by fingerprint at
+import time, and gains:
+
+- `JointTrajectory.fingerprint()` — 64-bit ID.
+- `msg.to_frames()` / `JointTrajectory.from_frames(frames)` — the transport path.
+- `msg.to_bytes()` / `JointTrajectory.from_bytes(data)` — the legacy blob path.
+- `Message.decode(blob)` — class dispatch via fingerprint registry.
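+
+A quick round trip through the transport path, assuming `from_frames` returns
+a `(message, header)` pair as the class diagram above indicates:
+
+```python
+msg = JointTrajectory(
+    timestamp=0.0,
+    positions=np.zeros(7, dtype=np.float32),
+    velocities=np.zeros(7, dtype=np.float32),
+)
+frames = msg.to_frames()  # msgpack metadata + raw array buffers
+decoded, header = JointTrajectory.from_frames(frames)
+assert header.fingerprint == JointTrajectory.fingerprint()
+```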
+
+## Sequence numbering
+
+!!! warning "Class-level counter"
+ `Message._sequence_counter` is shared across **all publisher instances** of
+ the same message class in the process. Two `ArrayMessage` publishers
+ interleave sequence numbers. Per-topic gap detection therefore needs a
+ per-publisher counter today; see [critique.md § 12](../critique.md).
+
+## Built-in messages
+
+| Class | Use for |
+| ------------------------------------------------------------------------- | --------------------------------------------- |
+| [`StringMessage`][cortex.messages.standard.StringMessage] | Plain strings |
+| [`IntMessage`][cortex.messages.standard.IntMessage] / [`FloatMessage`][cortex.messages.standard.FloatMessage] | Single scalars |
+| [`BytesMessage`][cortex.messages.standard.BytesMessage] | Opaque binary |
+| [`DictMessage`][cortex.messages.standard.DictMessage] | Nested dicts with arrays/tensors |
+| [`ListMessage`][cortex.messages.standard.ListMessage] | Mixed-type lists |
+| [`ArrayMessage`][cortex.messages.standard.ArrayMessage] | Single NumPy array + name / frame_id |
+| [`MultiArrayMessage`][cortex.messages.standard.MultiArrayMessage] | `dict[str, np.ndarray]` (e.g. points+colors) |
+| [`TensorMessage`][cortex.messages.standard.TensorMessage] | PyTorch tensor (preserves device/grad) |
+| [`MultiTensorMessage`][cortex.messages.standard.MultiTensorMessage] | Named tensor bundle (model I/O) |
+| [`ImageMessage`][cortex.messages.standard.ImageMessage] | Image + encoding + width/height |
+| [`PointCloudMessage`][cortex.messages.standard.PointCloudMessage] | XYZ + optional RGB / intensity / normals |
+| [`PoseMessage`][cortex.messages.standard.PoseMessage] | 6-DoF pose (position + quaternion) |
+| [`TransformMessage`][cortex.messages.standard.TransformMessage] | 4×4 homogeneous transform |
+| [`TimestampMessage`][cortex.messages.standard.TimestampMessage] / [`HeaderMessage`][cortex.messages.standard.HeaderMessage] | ROS-style stamps |
+
+## Encode / decode lifecycle
+
+```mermaid
+flowchart LR
+ A[User builds dataclass] --> B[Publisher.publish]
+ B --> C[message.to_frames]
+ C --> D[[ZMQ multipart send]]
+ D --> E[[ZMQ multipart recv]]
+ E --> F[Message.from_frames]
+ F --> G[user callback msg, header]
+```
+
+## See also
+
+- [Concept: message wire format](../concepts/message-wire-format.md)
+- [Concept: fingerprinting](../concepts/fingerprinting.md)
+- [Tutorial: custom messages](../tutorials/custom-messages.md)
diff --git a/docs/components/node-and-executors.md b/docs/components/node-and-executors.md
new file mode 100644
index 0000000..c022a02
--- /dev/null
+++ b/docs/components/node-and-executors.md
@@ -0,0 +1,206 @@
+# Node & Executors
+
+> **Source:** [`cortex.core.node`](../reference/core/node.md),
+> [`cortex.core.executor`](../reference/core/executor.md)
+
+A [`Node`][cortex.core.node.Node] is the user-facing composition unit: it owns
+a shared ZMQ async context and a collection of publishers, subscribers, and
+timers. Executors provide the scheduling primitives that timers and
+subscriber receive loops run on.
+
+## Responsibilities
+
+```mermaid
+flowchart TB
+ subgraph NodeResp[Node]
+ CTX[shared zmq.asyncio.Context]
+ PUBS[Publishers dict]
+ SUBS[Subscribers dict]
+ TIMERS[Timers list]
+ end
+
+ NodeResp -- create_publisher --> P[Publisher]
+ NodeResp -- create_subscriber --> S[Subscriber]
+ NodeResp -- create_timer --> RE[RateExecutor]
+ NodeResp -- run / close --> Lifecycle
+
+ P -. uses .-> CTX
+ S -. uses .-> CTX
+```
+
+One node = one process boundary in practice. Nothing stops you from running
+multiple nodes in the same process (`asyncio.gather(*(n.run() for n in nodes))`,
+see [`examples/multi_node_system.py`](https://github.com/sudoRicheek/cortex/blob/main/examples/multi_node_system.py)),
+but remember they share the same event loop — a slow callback in one still
+blocks the others.
+
+## Lifecycle
+
+```mermaid
+stateDiagram-v2
+ [*] --> Constructed: Node(name)
+ Constructed --> Configured: create_publisher/subscriber/timer
+ Configured --> Running: await node.run()
+ Running --> Running: timers fire, callbacks dispatch
+ Running --> Stopping: node.stop() or cancel
+ Stopping --> Closed: await node.close()
+ Closed --> [*]: context terminated
+```
+
+### `node.run()`
+
+Spawns one asyncio task per timer and one per callback-bearing subscriber,
+then `asyncio.gather`s them. Returns when all tasks complete or the node is
+stopped.
+
+```python
+async with Node("my_node") as node:
+ node.create_publisher("/x", IntMessage)
+ node.create_subscriber("/y", IntMessage, callback=on_y)
+ await node.run() # blocks until cancelled
+# __aexit__ calls close() automatically
+```
+
+### `node.close()`
+
+Stops all executors, cancels outstanding tasks, closes every publisher and
+subscriber (each of which unregisters/unbinds their own socket), and
+terminates the shared ZMQ context. Idempotent.
+
+## Executors
+
+Two flavours, both subclasses of `BaseExecutor`.
+
+```mermaid
+classDiagram
+ class BaseExecutor {
+        <<abstract>>
+ +func: AsyncCallback
+ +start()
+ +stop()
+ +run(*args, **kwargs)
+ #_run_impl()*
+ }
+ class AsyncExecutor {
+ +_run_impl()
+ }
+ class RateExecutor {
+ +rate_hz: float
+ +interval: float
+ +_run_impl()
+ }
+ BaseExecutor <|-- AsyncExecutor
+ BaseExecutor <|-- RateExecutor
+```
+
+### `AsyncExecutor`
+
+"Run this coroutine as fast as possible, yielding between iterations."
+
+```mermaid
+flowchart LR
+ Start --> Check{running?}
+ Check -- no --> End
+ Check -- yes --> Call[await func]
+ Call -- exception --> Log[log error]
+ Log --> Sleep
+ Call --> Sleep[await sleep 0]
+ Sleep --> Check
+```
+
+Used by `Subscriber.run` to drive the receive-dispatch loop.
+
+### `RateExecutor`
+
+"Run this coroutine at a constant rate, catching up on overruns."
+
+```mermaid
+flowchart TD
+ Start[next = perf_counter] --> Loop{running?}
+ Loop -- no --> End
+ Loop -- yes --> Now[now = perf_counter]
+ Now --> Due{now >= next?}
+ Due -- yes --> Call[await func]
+ Call --> Advance[next += interval]
+ Advance --> Behind{next < now?}
+ Behind -- yes --> Reset[next = now + interval]
+ Behind -- no --> Wait
+ Reset --> Wait
+ Due -- no --> Wait[await sleep next - now]
+ Wait --> Loop
+```
+
+The catch-up branch silently drops ticks — if your 100 Hz callback takes
+20 ms once, you do not get two callbacks back-to-back; you skip one tick.
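+
+The scheduling arithmetic boils down to something like this, a sketch of the
+flowchart above rather than the actual `RateExecutor` source:
+
+```python
+import asyncio
+import time
+
+async def rate_loop(func, rate_hz: float, stop: asyncio.Event) -> None:
+    interval = 1.0 / rate_hz
+    next_tick = time.perf_counter()
+    while not stop.is_set():
+        now = time.perf_counter()
+        if now >= next_tick:
+            await func()
+            next_tick += interval
+            if next_tick < now:  # overran: skip the missed tick(s)
+                next_tick = now + interval
+        else:
+            await asyncio.sleep(next_tick - now)
+```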
+
+!!! warning "Redundant yield"
+ Today there is an `await asyncio.sleep(0)` inside the loop *and*
+ `await asyncio.sleep(max(0, dt))` at the bottom. That generates an extra
+ wakeup per tick. See [critique § 15](../critique.md).
+
+## Timer usage
+
+```python
+node.create_timer(1.0 / 30, self.publish_frame) # 30 Hz
+node.create_timer(1.0, self.log_stats) # 1 Hz
+```
+
+Timers are plain async functions — no decorator, no magic. They run in the
+same event loop as subscriber callbacks, so the same head-of-line caveat
+applies.
+
+## Shared ZMQ context
+
+Every publisher and subscriber created through a node **reuses** the node's
+`zmq.asyncio.Context`. This means:
+
+- Socket creation is cheap.
+- I/O threads are shared across all sockets in the node.
+- Terminating the node's context cleanly shuts down all its sockets.
+
+Do not create your own context inside callbacks; you'll leak resources and
+defeat the shared-io-thread optimization.
+
+## Minimal complete node
+
+```python
+from dataclasses import dataclass
+import numpy as np
+import cortex
+from cortex import Node, Message
+from cortex.messages.base import MessageHeader
+
+
+@dataclass
+class Ping(Message):
+ payload: np.ndarray
+ counter: int
+
+
+class Echo(Node):
+ def __init__(self):
+ super().__init__("echo")
+ self.pub = self.create_publisher("/pong", Ping)
+ self.create_subscriber("/ping", Ping, callback=self.on_ping)
+ self._n = 0
+
+ async def on_ping(self, msg: Ping, header: MessageHeader):
+ self._n += 1
+ self.pub.publish(Ping(payload=msg.payload, counter=self._n))
+
+
+async def main():
+ async with Echo() as node:
+ await node.run()
+
+
+if __name__ == "__main__":
+ cortex.run(main())
+```
+
+## See also
+
+- [`cortex.core.node`](../reference/core/node.md)
+- [`cortex.core.executor`](../reference/core/executor.md)
+- [Concepts → Async execution model](../concepts/async-execution-model.md)
+- [Components → Publisher & Subscriber](publisher-subscriber.md)
diff --git a/docs/components/publisher-subscriber.md b/docs/components/publisher-subscriber.md
new file mode 100644
index 0000000..c6b69f8
--- /dev/null
+++ b/docs/components/publisher-subscriber.md
@@ -0,0 +1,280 @@
+# Publisher & Subscriber
+
+> **Source:** [`cortex.core.publisher`](../reference/core/publisher.md),
+> [`cortex.core.subscriber`](../reference/core/subscriber.md)
+
+The data-plane workhorses. A `Publisher` binds a ZMQ `PUB` socket and registers
+with discovery; a `Subscriber` looks up the endpoint, connects a `SUB` socket,
+and drives an async receive loop. Discovery is consulted **once per topic** on
+startup — it is not on the hot path.
+
+## Relationship to the rest of the stack
+
+```mermaid
+flowchart LR
+ Node -.owns.-> P[Publisher]
+ Node -.owns.-> S[Subscriber]
+ P -- register --> DC1[DiscoveryClient]
+ S -- lookup --> DC2[DiscoveryClient]
+ P -- send_multipart --> Sock1[(zmq.PUB IPC)]
+ Sock1 -. IPC .-> Sock2[(zmq.SUB)]
+ S -- recv_multipart --> Sock2
+ M[Message] -- to_frames --> P
+ S -- from_frames --> M
+```
+
+## Publisher
+
+### Construction
+
+Always create via [`Node.create_publisher`][cortex.core.node.Node.create_publisher] —
+direct construction works but skips the shared ZMQ context reuse and the
+node-level registration bookkeeping.
+
+```python
+pub = node.create_publisher(
+ topic_name="/camera/image", # must start with "/"
+ message_type=ImageMessage, # fingerprint is taken from this class
+ queue_size=100, # SNDHWM; drops under backpressure
+)
+```
+
+### Startup sequence
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant U as User
+ participant Pub as Publisher
+ participant FS as /tmp/cortex/topics/
+ participant ZMQ as zmq.PUB
+ participant D as Discovery daemon
+
+ U->>Pub: __init__(topic, msg_cls, ...)
+ Pub->>Pub: address = generate_ipc_address(topic, node)
+ Pub->>FS: mkdir -p; unlink stale .sock
+ Pub->>ZMQ: socket(PUB); setsockopt HWM/LINGER; bind(address)
+ Pub->>D: REGISTER TopicInfo{name, address, fingerprint, node}
+ D-->>Pub: OK / ALREADY_EXISTS
+ Note over Pub: ready; user can publish()
+```
+
+Two things worth calling out:
+
+1. The IPC address is derived deterministically from `node_name` and
+ `topic_name` via [`generate_ipc_address`][cortex.core.publisher.generate_ipc_address]:
+   `ipc:///tmp/cortex/topics/<node_name>__<topic_name>.sock`.
+2. `_setup_socket` unlinks any existing file at that path before binding. That
+ protects against crash-leftover sockets, but also means **two publishers
+ configured with the same `node_name + topic_name` in the same process tree
+ will silently stomp each other** — see [critique § 10](../critique.md).
+
+### Publish path
+
+```mermaid
+flowchart LR
+ Msg[Message dataclass] --> H[build MessageHeader fp, ts, seq]
+ Msg --> V[serialize_message_frames values]
+ H --> F[[frame 1: header 24B]]
+ V --> F2[[frame 2: msgpack metadata]]
+ V --> FN[[frames 3..N: array buffers]]
+ T[[frame 0: topic bytes]]
+ F --> Send
+ F2 --> Send
+ FN --> Send
+ T --> Send
+ Send[send_multipart NOBLOCK] -->|success| Pub[publish count++]
+ Send -->|zmq.Again| Drop[return False]
+```
+
+`publish()` is **synchronous** and returns a boolean:
+
+- `True` — handed to ZMQ successfully.
+- `False` — `zmq.Again`, queue full, message dropped.
+
+Any other exception is logged and swallowed; `publish` still returns `False`.
+For robotics code this "fire and forget" is intentional — the caller decides
+whether to retry based on the return value and the topic's role.
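+
+For topics where a drop matters, retry at the call site. A sketch with an
+arbitrary bounded backoff (values are illustrative):
+
+```python
+import time
+
+def publish_with_retry(pub, msg, attempts: int = 3, backoff_s: float = 0.001) -> bool:
+    for _ in range(attempts):
+        if pub.publish(msg):   # True: handed to ZMQ
+            return True
+        time.sleep(backoff_s)  # SNDHWM hit; give the subscriber a moment
+    return False               # persistent drop; caller decides what it means
+```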
+
+### Async context quirk
+
+`Node` owns a `zmq.asyncio.Context`. The `Publisher` constructor detects this
+and wraps a **sync** `zmq.Context` around the same underlying io threads:
+
+```python
+if isinstance(self._context, zmq.asyncio.Context):
+ self._context: zmq.Context = zmq.Context(self._context)
+```
+
+This keeps `publish()` a normal function call instead of forcing every publish
+to be `await`ed. It is the right performance choice, but it has consequences:
+
+!!! danger "`zmq.PUB` is not thread-safe"
+ Do not call `publish()` on the same `Publisher` from multiple threads
+ (or multiple asyncio tasks that could race on `send_multipart`). Serialize
+ per-publisher calls yourself if you fan out work.
+
+### Lifecycle and cleanup
+
+```mermaid
+stateDiagram-v2
+ [*] --> Bound: bind + register
+ Bound --> Publishing: publish() calls
+ Publishing --> Publishing: more messages
+ Publishing --> Closed: close()
+ Bound --> Closed: close()
+ Closed --> [*]: unregister, unlink .sock file
+```
+
+`Publisher.close()` is best-effort: it unregisters from the daemon (silently
+tolerates a dead daemon), closes the socket, and removes the IPC file.
+Exceptions from any one step do not block the others.
+
+### Statistics
+
+`publisher.publish_count`, `publisher.last_publish_time`, and
+`publisher.is_registered` are exposed for instrumentation. They update on the
+hot path with no locking — read them from the same task that calls `publish()`
+for deterministic numbers.
+
+## Subscriber
+
+### Construction
+
+```python
+sub = node.create_subscriber(
+ topic_name="/camera/image",
+ message_type=ImageMessage,
+ callback=on_image, # async def callback(msg, header)
+ queue_size=10, # RCVHWM
+ wait_for_topic=True, # poll until topic appears
+ topic_timeout=30.0, # abort wait after N seconds
+)
+```
+
+If `callback` is `None`, the subscriber is passive — call `await sub.receive()`
+manually. With a callback, `Node.run()` will drive the receive loop.
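+
+A minimal passive receive loop, assuming `receive()` returns `None` when
+empty and a `(msg, header)` pair otherwise, as the benchmarks use it:
+
+```python
+async def drain(sub) -> None:
+    while True:
+        result = await sub.receive()
+        if result is not None:
+            msg, header = result
+            print(f"seq={header.sequence} {type(msg).__name__}")
+```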
+
+### Startup sequence
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant U as User
+ participant S as Subscriber
+ participant D as DiscoveryClient
+ participant Pub as publisher IPC
+
+ U->>S: __init__(...)
+ S->>D: lookup_topic(name) # non-blocking
+ alt found immediately
+ D-->>S: TopicInfo
+ S->>S: verify fingerprint
+ S->>Pub: SUB connect + SUBSCRIBE topic
+ Note over S: is_connected = True
+ else not found
+ D-->>S: None
+ Note over S: defer; retry in run()
+ end
+
+ U->>S: node.run() schedules sub.run()
+ S->>D: wait_for_topic_async(name, timeout)
+ D-->>S: TopicInfo
+ S->>Pub: SUB connect + SUBSCRIBE topic
+```
+
+The constructor tries a non-blocking lookup first so that when a publisher is
+already up, no polling is needed. The polling fallback only kicks in inside
+`sub.run()` via [`wait_for_topic_async`][cortex.discovery.client.DiscoveryClient.wait_for_topic_async].
+
+### Receive loop
+
+```mermaid
+flowchart LR
+ Loop{{AsyncExecutor}} --> Recv[await recv_multipart copy=False]
+ Recv --> Frames[frames = topic, header, metadata, *buffers]
+ Frames --> Decode[Message.from_frames frames 1..]
+ Decode --> CB[await callback msg, header]
+ CB --> Yield[await asyncio.sleep 0]
+ Yield --> Loop
+```
+
+- `copy=False` means each frame is a `zmq.Frame` — the metadata and array
+ buffers are memoryview-able without a copy. See
+ [`cortex.utils.serialization`](../reference/utils/serialization.md).
+- The one-frame fast path (`len(payload_frames) == 1`) handles legacy
+ publishers still on the single-blob path — it falls back to
+ `from_bytes` on the single payload buffer.
+
+### Head-of-line blocking
+
+The callback runs **inline** in the receive loop. A slow callback stalls
+everything:
+
+```mermaid
+gantt
+ title Receive loop when callback is slow
+ dateFormat X
+ axisFormat %L ms
+ section Messages
+ recv m1 :0, 1
+ decode m1 :1, 2
+ callback m1 (slow!) :active, 2, 50
+ recv m2 (queued on HWM) :crit, 50, 51
+ decode m2 :51, 52
+ callback m2 :52, 55
+```
+
+If callbacks do meaningful work, dispatch them to a task or thread pool:
+
+```python
+import asyncio
+
+background_tasks: set[asyncio.Task] = set()
+
+async def on_image(msg, header):
+    task = asyncio.create_task(process_in_background(msg, header))
+    background_tasks.add(task)  # keep a strong reference so the task isn't GC'd
+    task.add_done_callback(background_tasks.discard)
+```
+
+Or use a bounded queue + worker pattern. The roadmap item in
+[critique § 6](../critique.md) is to lift this into the framework.
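+
+One shape of that pattern: a drop-oldest queue so the receive loop never
+blocks (a sketch; `process` is a hypothetical slow coroutine):
+
+```python
+import asyncio
+
+queue: asyncio.Queue = asyncio.Queue(maxsize=8)
+
+async def on_image(msg, header):
+    if queue.full():
+        queue.get_nowait()  # shed the oldest frame under backpressure
+    # Copy arrays that alias ZMQ frame memory before queueing them
+    queue.put_nowait((msg, header))
+
+async def worker():
+    while True:
+        msg, header = await queue.get()
+        await process(msg, header)  # the slow work, off the receive loop
+```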
+
+### Fingerprint verification
+
+On connect the subscriber compares its class's fingerprint to the one in the
+registry entry. Today a mismatch only logs a warning and **proceeds anyway** —
+downstream decoding will then fail hard. Treat fingerprint warnings as errors
+in your code.
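+
+A defensive check you can add today, using the `topic_info` property from
+the statistics table below:
+
+```python
+info = sub.topic_info
+if info is not None and info.fingerprint != ImageMessage.fingerprint():
+    raise RuntimeError(f"fingerprint mismatch on {info.name}; "
+                       "publisher is sending a different type")
+```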
+
+### Cleanup
+
+`Subscriber.close()` stops the executor, closes the discovery client and SUB
+socket, and flips `is_connected` to `False`. Safe to call multiple times;
+errors are suppressed so teardown does not cascade.
+
+## Statistics and instrumentation
+
+| Property | Publisher | Subscriber |
+| ---------------------------------------- | --------- | ---------- |
+| `publish_count` / `receive_count` | ✓ | ✓ |
+| `last_publish_time` / `last_receive_time`| ✓ | ✓ |
+| `is_registered` / `is_connected` | ✓ | ✓ |
+| `topic_info` | | ✓ |
+
+None of these are atomic; treat them as coarse gauges.
+
+## Common pitfalls
+
+| Symptom | Cause | Fix |
+| ------------------------------------------ | ------------------------------------------------------------------------------------------ | ---------------------------------------- |
+| First N messages not received | ZMQ "slow joiner": SUB not connected yet when PUB started publishing | Let subscriber start first, or sleep briefly before first publish |
+| Subscriber receives nothing, no errors | Topic name mismatch, or forgot to call `node.run()` | Log both sides; run `cortex-discovery --log-level DEBUG` |
+| `publish()` returns `False` repeatedly | Subscriber can't keep up; SNDHWM reached | Increase `queue_size`, or reduce publish rate |
+| Mutating a received array "corrupts" later | Decoded arrays alias ZMQ frame memory | `arr = arr.copy()` before mutating |
+| Two processes stomp each other's socket | Same `node_name + topic_name` | Unique node names per process |
+
+## See also
+
+- [`cortex.core.publisher`](../reference/core/publisher.md)
+- [`cortex.core.subscriber`](../reference/core/subscriber.md)
+- [Concepts → Async execution model](../concepts/async-execution-model.md)
+- [Concepts → Message wire format](../concepts/message-wire-format.md)
+- [Guides → Debugging](../guides/debugging.md)
diff --git a/docs/components/serialization.md b/docs/components/serialization.md
new file mode 100644
index 0000000..485b6b7
--- /dev/null
+++ b/docs/components/serialization.md
@@ -0,0 +1,133 @@
+# Serialization
+
+> **Source:** [`cortex.utils.serialization`](../reference/utils/serialization.md),
+> [`cortex.utils.hashing`](../reference/utils/hashing.md)
+
+Two encodings live side by side: a **multipart / out-of-band** path that the
+transport actually uses, and a **single-blob** path kept for the legacy
+`Message.to_bytes` / `decode` API and tests. Both support the same Python
+types; only their frame layout differs.
+
+## Supported types
+
+| Type | Inline path (`to_bytes`) | OOB path (`to_frames`) |
+| ----------------------------- | ----------------------------- | ----------------------------- |
+| `None` | 1 byte tag | msgpack `nil` |
+| `int`, `float`, `str`, `bool` | msgpack primitive             | msgpack                       |
+| `bytes` | tag + length + bytes | msgpack bin |
+| `list`, `tuple`, `dict` | msgpack with ExtType arrays | msgpack with OOB descriptors |
+| `np.ndarray` | ExtType (inline bytes) | OOB descriptor + extra frame |
+| `torch.Tensor` | ExtType (inline bytes) | OOB descriptor + extra frame |
+
+## The two paths, side by side
+
+=== "OOB multipart (used on the wire)"
+
+ ```mermaid
+ flowchart LR
+ V[values] --> E[_encode_transport_value]
+ E --> Meta[msgpack metadata OOB descriptors for arrays]
+ E --> Bufs[[buffer 0]]
+ E --> Bufs2[[buffer 1]]
+ Meta --> Out[(list of frames)]
+ Bufs --> Out
+ Bufs2 --> Out
+ ```
+
+ The function of interest is
+ [`serialize_message_frames`][cortex.utils.serialization.serialize_message_frames]:
+
+ ```python
+    metadata_bytes, buffers = serialize_message_frames(values)
+    # buffers is a list of contiguous payload frames: [buf0, buf1, ...]
+ ```
+
+ Arrays stay contiguous; ZMQ hands the buffer straight to the kernel.
+
+=== "Inline blob (legacy / `Message.decode`)"
+
+ ```mermaid
+ flowchart LR
+ V[values] --> P[msgpack.packb default=_msgpack_default]
+ P --> Ext[ExtType 1/2 for arrays/tensors bytes embedded]
+ Ext --> Blob[single bytes blob]
+ ```
+
+ The single blob round-trips through `serialize(value)` →
+ `deserialize(data)`. Useful for persisting to disk, caches, or when you
+ need a self-contained payload without tracking extra buffers.
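+
+    A minimal round-trip sketch (the payload value is illustrative):
+
+    ```python
+    import numpy as np
+
+    from cortex.utils.serialization import serialize, deserialize
+
+    blob = serialize({"pose": np.eye(4, dtype="float32")})  # one self-contained bytes blob
+    value = deserialize(blob)                               # dict with the array restored
+    ```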
+
+## OOB descriptors
+
+An out-of-band descriptor is a small dict that takes the place of the array
+inside the msgpack metadata:
+
+```python
+# numpy (example values)
+{"__cortex_oob__": "numpy", "buffer": 0, "dtype": "<f4", "shape": [480, 640, 3]}
+```
+
+Torch tensors get an analogous `"torch"` descriptor that additionally carries
+the original device string (see [PyTorch specifics](#pytorch-specifics)).
+The `buffer` index refers into the ZMQ frames that follow the metadata.
+Nested structures (dict of arrays, list of tensors, etc.) are walked
+recursively by `_encode_transport_value` / `_decode_transport_value`.
+
+## Zero-copy decode
+
+```mermaid
+sequenceDiagram
+    participant Sub as Subscriber
+    participant ZMQ as zmq.Frame
+    participant MV as memoryview
+    participant NP as np.ndarray
+
+    Sub->>ZMQ: recv_multipart(copy=False)
+    ZMQ-->>Sub: frame with .buffer property
+    Sub->>MV: memoryview(frame.buffer)
+    Sub->>NP: np.frombuffer(mv, dtype).reshape(shape)
+    Note over NP: array aliases the ZMQ frame memory
+```
+
+!!! warning "Aliasing caveat"
+ The returned NumPy array is **a view over the ZMQ frame buffer**. It is
+ safe to read as long as the frame lives, which is at least until your
+ callback returns. If you need to:
+
+ - mutate the array, or
+ - keep it past the callback,
+
+ call `arr = arr.copy()` first. This is cheap compared to the savings on
+ the hot path.
+
+## PyTorch specifics
+
+- Tensors are **always moved to CPU** for transport. Transport frames carry
+ the tensor's CPU bytes plus the original device string.
+- On decode, CUDA tensors are moved back to the original device when CUDA is
+ available; otherwise they stay on CPU.
+- `requires_grad` is preserved.
+
+## Fingerprinting
+
+Separate but related: [`compute_fingerprint(cls)`][cortex.utils.hashing.compute_fingerprint]
+computes a 64-bit identity from the module path, class name, and sorted
+`field:type` pairs. Cached per-class in `_fingerprint_cache`. See
+[Concepts → Fingerprinting](../concepts/fingerprinting.md) for the full story.
+
+## When to use each helper
+
+| Helper | Use when |
+| ------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
+| [`serialize_message_frames`][cortex.utils.serialization.serialize_message_frames] | You're building a custom transport that speaks multipart |
+| [`deserialize_message_frames`][cortex.utils.serialization.deserialize_message_frames] | Decoding the above |
+| [`serialize(value)`][cortex.utils.serialization.serialize] / [`deserialize`][cortex.utils.serialization.deserialize] | Persisting a single value to disk / cache |
+| [`serialize_numpy`][cortex.utils.serialization.serialize_numpy] / [`deserialize_numpy`][cortex.utils.serialization.deserialize_numpy] | Raw array round-trip without msgpack overhead |
+| `Message.to_frames` / `Message.from_frames` | Anything inside Cortex itself |
+
+## See also
+
+- [Concepts → Message wire format](../concepts/message-wire-format.md)
+- [Concepts → Fingerprinting](../concepts/fingerprinting.md)
+- [Guides → Performance tuning](../guides/performance-tuning.md)
diff --git a/docs/concepts/architecture.md b/docs/concepts/architecture.md
new file mode 100644
index 0000000..ef5159a
--- /dev/null
+++ b/docs/concepts/architecture.md
@@ -0,0 +1,96 @@
+# Architecture
+
+Cortex has three moving parts: the **discovery daemon**, **publisher** nodes,
+and **subscriber** nodes. They coordinate over ZeroMQ — a REQ/REP control plane
+for discovery and a PUB/SUB data plane for messages.
+
+## High-level view
+
+```mermaid
+flowchart TB
+ subgraph CP[Control plane]
+ DD[Discovery daemon ipc:///tmp/cortex/discovery.sock]
+ end
+
+ subgraph DP[Data plane]
+ direction LR
+ P[Publisher node] -- "PUB / SUB (IPC)" --> S[Subscriber node]
+ end
+
+ P -- REGISTER --> DD
+ S -- LOOKUP --> DD
+ DD -- TopicInfo --> S
+
+ classDef daemon fill:#6366f1,stroke:#312e81,color:#fff
+ classDef node fill:#0ea5e9,stroke:#0369a1,color:#fff
+ class DD daemon
+ class P,S node
+```
+
+## Message journey
+
+Tracing one frame end to end:
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant User as User code
+ participant Pub as Publisher
+ participant Sock as ZMQ PUB socket
+ participant Net as IPC
+ participant SSock as ZMQ SUB socket
+ participant Sub as Subscriber
+ participant CB as async callback
+
+ User->>Pub: publish(Message)
+ Pub->>Pub: build header (fingerprint, ts, seq)
+ Pub->>Pub: encode field values + OOB buffers
+ Pub->>Sock: send_multipart([topic, header, metadata, *buffers])
+ Sock->>Net: zero-copy handoff
+ Net->>SSock: frames delivered
+ SSock->>Sub: recv_multipart(copy=False)
+ Sub->>Sub: Message.from_frames(...)
+ Sub->>CB: await callback(msg, header)
+```
+
+Key invariant: array buffers ride as **separate ZMQ frames**, not inline in the
+metadata. See [Message wire format](message-wire-format.md).
+
+## Process layout
+
+```mermaid
+flowchart LR
+ subgraph P1[Process: sensor]
+ N1[Node shared zmq.asyncio.Context]
+ PUB1[Publisher /sensor/a]
+ PUB2[Publisher /sensor/b]
+ T1[Timer 30 Hz]
+ N1 --> PUB1
+ N1 --> PUB2
+ N1 --> T1
+ end
+
+ subgraph P2[Process: processor]
+ N2[Node]
+ SUB1[Subscriber /sensor/a]
+ SUB2[Subscriber /sensor/b]
+ PUB3[Publisher /processed]
+ N2 --> SUB1
+ N2 --> SUB2
+ N2 --> PUB3
+ end
+
+ PUB1 -.->|IPC| SUB1
+ PUB2 -.->|IPC| SUB2
+```
+
+Each topic gets its own IPC socket under `/tmp/cortex/topics/`. A single `Node`
+shares one `zmq.asyncio.Context` across all its publishers and subscribers, so
+they reuse the same I/O threads instead of each socket paying for a context of
+its own.
+
+## See also
+
+- [Message wire format](message-wire-format.md)
+- [Fingerprinting](fingerprinting.md)
+- [Discovery protocol](discovery-protocol.md)
+- [Async execution model](async-execution-model.md)
diff --git a/docs/concepts/async-execution-model.md b/docs/concepts/async-execution-model.md
new file mode 100644
index 0000000..4449ca7
--- /dev/null
+++ b/docs/concepts/async-execution-model.md
@@ -0,0 +1,88 @@
+# Async execution model
+
+Cortex nodes are asyncio-native. One event loop per process drives all
+publishers, subscribers, and timers for that node. On Linux and macOS,
+[`cortex.run`][cortex.utils.loop.run] prefers `uvloop` for lower tail latency.
+
+## Node task graph
+
+```mermaid
+flowchart TB
+ Loop(((asyncio event loop)))
+ Loop --> T1[Timer 1 RateExecutor]
+ Loop --> T2[Timer 2 RateExecutor]
+ Loop --> S1[Subscriber 1 AsyncExecutor]
+ Loop --> S2[Subscriber 2 AsyncExecutor]
+```
+
+`Node.run()` spawns one task per timer (`RateExecutor`) and one per
+callback-bearing subscriber (`AsyncExecutor`). It then `asyncio.gather`s them
+until cancelled.
+
+## `RateExecutor` cadence
+
+```mermaid
+sequenceDiagram
+ participant L as Event loop
+ participant R as RateExecutor
+ participant CB as callback
+
+ loop every interval
+ L->>R: resume
+ R->>CB: await callback()
+ R->>R: next_exec_time += interval
+ alt fell behind
+ R->>R: next_exec_time = now + interval
+ end
+ R->>L: sleep(next_exec_time - now)
+ end
+```
+
+Catch-up logic **silently drops ticks** when a callback overruns its period —
+something to keep in mind for control loops.
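+
+A minimal sketch of that cadence (illustrative, not the actual `RateExecutor`
+source):
+
+```python
+import asyncio
+import time
+
+async def rate_loop(callback, rate_hz: float):
+    interval = 1.0 / rate_hz
+    next_exec = time.perf_counter()
+    while True:
+        await callback()
+        next_exec += interval
+        now = time.perf_counter()
+        if next_exec < now:                    # fell behind: skip the missed tick(s)
+            next_exec = now + interval
+        await asyncio.sleep(max(0.0, next_exec - now))
+```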
+
+## `AsyncExecutor` receive loop
+
+```mermaid
+sequenceDiagram
+ participant L as Event loop
+ participant A as AsyncExecutor
+ participant S as SUB socket
+ participant CB as user callback
+
+ loop while running
+ L->>A: resume
+ A->>S: await recv_multipart(copy=False)
+ S-->>A: frames
+ A->>A: decode message
+ A->>CB: await callback(msg, header)
+ A->>L: sleep(0) (yield)
+ end
+```
+
+!!! warning "Head-of-line blocking"
+ A slow callback stalls the receive loop. Messages pile up on the SUB HWM
+ and get evicted. If you expect variable-latency work, offload callback
+ bodies to `asyncio.create_task(...)` or a thread pool.
+
+## Publish is sync-inside-async
+
+The `Publisher` uses a sync `zmq.Context` (shadowed onto the node's async
+context). `publish()` is a plain function call — no `await`. This avoids the
+overhead of the async zmq integration on the send path.
+
+!!! danger "Not thread-safe"
+ A `zmq.PUB` socket is not safe to call from multiple threads or tasks
+ concurrently. Serialize calls to `publish()` per publisher.
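+
+If you must fan out across threads, a per-publisher lock is enough. A sketch,
+with a hypothetical `pub` created elsewhere:
+
+```python
+import threading
+
+publish_lock = threading.Lock()    # one lock per Publisher instance
+
+def publish_from_any_thread(msg):
+    with publish_lock:             # serializes access to the PUB socket
+        pub.publish(msg)
+```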
+
+## uvloop
+
+On Unix, importing `cortex.run` checks for `uvloop` and uses it if present.
+Measured impact: modest throughput improvement, meaningful p99 latency
+reduction on high-rate small messages.
+
+## See also
+
+- [`cortex.core.executor`](../reference/core/executor.md)
+- [`cortex.core.node`](../reference/core/node.md)
+- [Components → Node & Executors](../components/node-and-executors.md)
diff --git a/docs/concepts/discovery-protocol.md b/docs/concepts/discovery-protocol.md
new file mode 100644
index 0000000..9a73108
--- /dev/null
+++ b/docs/concepts/discovery-protocol.md
@@ -0,0 +1,111 @@
+# Discovery protocol
+
+The discovery daemon speaks a tiny msgpack-over-REQ/REP protocol. It is not
+on the data path — once a subscriber has the endpoint, messages flow
+publisher → subscriber directly.
+
+## Commands
+
+| Command | Payload required | Returns |
+| ------------------------------ | ------------------------ | ------------------ |
+| `REGISTER_TOPIC` (1) | [`TopicInfo`][cortex.discovery.protocol.TopicInfo] | OK / ALREADY_EXISTS |
+| `UNREGISTER_TOPIC` (2) | `topic_name` or `TopicInfo.name` | OK / NOT_FOUND |
+| `LOOKUP_TOPIC` (3) | `topic_name` | OK + `TopicInfo` / NOT_FOUND |
+| `LIST_TOPICS` (4) | — | OK + `list[TopicInfo]` |
+| `SHUTDOWN` (99) | — | OK; daemon exits |
+
+Status codes: `OK=0`, `NOT_FOUND=1`, `ALREADY_EXISTS=2`, `ERROR=3`.
+
+## `TopicInfo` payload
+
+```python
+@dataclass
+class TopicInfo:
+ name: str # "/camera/image"
+ address: str # "ipc:///tmp/cortex/topics/cam__camera_image.sock"
+ message_type: str # "ImageMessage"
+ fingerprint: int # 64-bit class fingerprint
+ publisher_node: str # "cam"
+```
+
+## Publisher register flow
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant P as Publisher
+ participant D as Daemon REP
+
+    P->>P: bind PUB socket on ipc:///tmp/cortex/topics/<node>__<topic>.sock
+ P->>D: REQ → DiscoveryRequest(REGISTER_TOPIC, TopicInfo{...})
+ D->>D: if topic_name absent: insert; else compare publisher_node
+ alt new
+ D-->>P: OK "Registered topic: /x"
+ else same publisher re-registering
+ D-->>P: OK (overwrite)
+ else different publisher, same topic
+ D-->>P: ALREADY_EXISTS
+ end
+```
+
+## Subscriber lookup flow
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant S as Subscriber
+ participant D as Daemon REP
+ participant P as Publisher
+
+ S->>D: REQ → LOOKUP_TOPIC("/x")
+ alt present
+ D-->>S: OK + TopicInfo
+ S->>P: SUB connect + SUBSCRIBE "/x"
+ else missing
+ D-->>S: NOT_FOUND
+ Note over S: if wait_for_topic: poll every 500 ms until timeout
+ S->>D: retry LOOKUP_TOPIC
+ end
+```
+
+`wait_for_topic_async` implements the retry loop with `asyncio.sleep` so the
+event loop keeps spinning.
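+
+Roughly (a sketch, not the exact client code; parameter names are illustrative):
+
+```python
+import asyncio
+
+async def wait_for_topic_async(client, name, timeout=5.0, poll_interval=0.5):
+    loop = asyncio.get_running_loop()
+    deadline = loop.time() + timeout
+    while loop.time() < deadline:
+        info = client.lookup_topic(name)    # non-blocking lookup
+        if info is not None:
+            return info
+        await asyncio.sleep(poll_interval)  # yield so the event loop keeps spinning
+    return None
+```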
+
+## REQ-socket recovery
+
+ZMQ `REQ` sockets enter a bad state after a missed reply — they block further
+sends. The client detects `zmq.Again` on timeout and rebuilds the socket:
+
+```mermaid
+flowchart TD
+ A[send request] -->|timeout| B[REQ socket stuck]
+ B --> C[close socket]
+ C --> D[recreate socket same endpoint]
+ D --> E[retry up to retries]
+```
+
+See [`DiscoveryClient._reconnect`][cortex.discovery.client.DiscoveryClient].
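+
+A sketch of that close-and-recreate dance in plain pyzmq (the timeout value is
+illustrative):
+
+```python
+import zmq
+
+def rebuild_req(ctx: zmq.Context, endpoint: str, stuck: zmq.Socket) -> zmq.Socket:
+    stuck.close(linger=0)                  # a stuck REQ cannot be reused
+    sock = ctx.socket(zmq.REQ)
+    sock.setsockopt(zmq.RCVTIMEO, 1000)    # fail fast instead of blocking forever
+    sock.setsockopt(zmq.LINGER, 0)
+    sock.connect(endpoint)
+    return sock
+```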
+
+!!! bug "Fencepost in `retries` default"
+ `retries=1` today executes the loop exactly once — i.e. no retry. Bump to
+ `retries=3` in client-side code if you need resilience.
+
+## Failure modes & how Cortex handles them
+
+| Scenario | Behavior |
+| ---------------------------------------- | --------------------------------------------- |
+| Daemon not running when publisher starts | Register fails; publisher still publishes, but no subscriber can find it. |
+| Daemon restarts | All state lost; publishers must re-register. Current design has no auto-re-register. |
+| Publisher crashes | Registry keeps stale `TopicInfo` until someone UNREGISTERs. |
+| Two publishers, same topic | Second registration rejected with `ALREADY_EXISTS`. |
+| Subscriber looks up before publisher | `NOT_FOUND`; caller may `wait_for_topic` to poll. |
+
+Roadmap items (see [critique.md](../critique.md)) to address these: leases with
+heartbeats, multi-publisher support, and notify-on-change.
+
+## See also
+
+- [`cortex.discovery.protocol`](../reference/discovery/protocol.md)
+- [`cortex.discovery.client`](../reference/discovery/client.md)
+- [`cortex.discovery.daemon`](../reference/discovery/daemon.md)
+- [Components → Discovery](../components/discovery.md)
diff --git a/docs/concepts/fingerprinting.md b/docs/concepts/fingerprinting.md
new file mode 100644
index 0000000..b442c22
--- /dev/null
+++ b/docs/concepts/fingerprinting.md
@@ -0,0 +1,104 @@
+# Fingerprinting
+
+Every message class gets a **64-bit identifier** derived from its name and
+field schema. The fingerprint rides in the header of every published message
+and does two jobs:
+
+1. **Type dispatch** — `Message.decode(bytes)` looks up the right class in the
+ [`MessageType`][cortex.messages.base.MessageType] registry.
+2. **Compatibility check** — subscribers verify that the topic they looked up
+ advertises the same fingerprint as the type they were written against.
+
+## Derivation
+
+```mermaid
+flowchart LR
+ A[class.__module__ + qualname] --> C[canonical string]
+ B[sorted list of field:type] --> C
+ C --> H[SHA-256]
+ H --> F[first 8 bytes → u64 big-endian]
+```
+
+Pseudocode:
+
+```python
+import dataclasses
+from hashlib import sha256
+
+fields = sorted(f"{f.name}:{f.type}" for f in dataclasses.fields(cls))
+canonical = f"{cls.__module__}.{cls.__qualname__}|{','.join(fields)}"
+fingerprint = int.from_bytes(sha256(canonical.encode()).digest()[:8], "big")
+```
+
+The result is cached per-class in `_fingerprint_cache`, computed once lazily.
+
+## Registry
+
+`Message.__init_subclass__` auto-registers every concrete subclass into
+[`MessageType._registry`][cortex.messages.base.MessageType] keyed by
+fingerprint. Nothing else to do — decorating your dataclass with
+`@dataclass` and inheriting from `Message` is enough.
+
+```python
+from dataclasses import dataclass
+from cortex.messages.base import Message
+
+@dataclass
+class JointState(Message):
+ positions: list[float]
+ velocities: list[float]
+
+print(hex(JointState.fingerprint()))
+```
+
+## When fingerprints change
+
+The fingerprint is **not stable across edits that touch**:
+
+- Module path or class name (`cortex.messages.standard.ArrayMessage` renamed
+ anywhere).
+- Field names.
+- Field *type annotations as spelled* (see the PEP 563 caveat below).
+
+It is stable across:
+
+- Adding/removing unrelated classes.
+- Reordering methods.
+- Changing docstrings or default values.
+
+## Subscriber check
+
+On connect, the subscriber compares the topic's advertised fingerprint against
+the one it computed from its message class:
+
+```mermaid
+sequenceDiagram
+ participant S as Subscriber
+ participant D as Discovery daemon
+
+ S->>D: LOOKUP /topic
+ D-->>S: TopicInfo(fingerprint=0xABCD...)
+ S->>S: compare with MyMessage.fingerprint()
+ alt mismatch
+ S-->>S: log warning, continue anyway
+ else match
+ S-->>S: connect and subscribe
+ end
+```
+
+!!! warning "Today: mismatch is a warning, not an error"
+ A fingerprint mismatch currently only logs a warning — see [critique.md](../critique.md).
+    Downstream decoding will fail hard. Until that is tightened, keep one shared
+    set of message-class definitions across processes rather than rely on this guard.
+
+## PEP 563 caveat
+
+`field.type` may be a **string** (under `from __future__ import annotations`)
+or a **real type** otherwise. The canonical string differs in the two cases,
+so the same class can fingerprint differently across import environments.
+
+When defining messages shared between processes, either use the same import
+style in both, or rely on the runtime `typing.get_type_hints(cls)` equivalent
+once that lands upstream.
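+
+A quick way to see the difference (plain dataclass, no Cortex involved):
+
+```python
+from dataclasses import dataclass, fields
+
+@dataclass
+class Pose:
+    x: float
+
+print(fields(Pose)[0].type)
+# -> <class 'float'> normally, but the string "float" when the defining module
+#    uses `from __future__ import annotations`; hence differing fingerprints
+```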
+
+## See also
+
+- [`cortex.utils.hashing`](../reference/utils/hashing.md) — `compute_fingerprint`, cache helpers
+- [Message wire format](message-wire-format.md)
+- [Critique § code-level issue 13](../critique.md)
diff --git a/docs/concepts/message-wire-format.md b/docs/concepts/message-wire-format.md
new file mode 100644
index 0000000..48a4ac0
--- /dev/null
+++ b/docs/concepts/message-wire-format.md
@@ -0,0 +1,96 @@
+# Message wire format
+
+Cortex uses **ZeroMQ multipart messages**. Each published message is a list of
+frames rather than a single blob. That lets array payloads ride as raw
+contiguous buffers — no copy into a Python `bytes`, no re-copy by ZMQ.
+
+## Frames on the wire
+
+```mermaid
+flowchart LR
+ F0["Frame 0 topic bytes"] --> F1
+ F1["Frame 1 header (24B) fingerprint • ts_ns • seq"] --> F2
+ F2["Frame 2 msgpack metadata (ordered field values)"] --> F3
+ F3["Frame 3..N raw array buffers (OOB, zero-copy)"]
+```
+
+| Frame | Contents | Size |
+| ------- | ---------------------------- | ------------ |
+| 0 | Topic name (UTF-8) | variable |
+| 1 | [`MessageHeader`][cortex.messages.base.MessageHeader] | **24 bytes** (3 × u64, big-endian) |
+| 2 | msgpack-packed ordered field values; arrays replaced by OOB descriptors | small |
+| 3..N | `np.ndarray.tobytes()` / `tensor.numpy().tobytes()`, contiguous | payload-sized |
+
+## Header layout
+
+```
+offset 0 8 16 24
+ |fp u64 |ts u64 |seq u64 |
+ big-endian throughout
+```
+
+- `fp` — 64-bit message fingerprint, computed from class name and field schema.
+- `ts` — publisher wall-clock in nanoseconds (`time.time_ns()`).
+- `seq` — per-process, per-message-type monotonic counter.
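+
+Decoding the header by hand is a single `struct` call. A sketch of the layout
+above (Cortex's `MessageHeader` encapsulates the equivalent):
+
+```python
+import struct
+
+HEADER = struct.Struct(">QQQ")   # three big-endian u64s = 24 bytes
+
+def parse_header(frame: bytes) -> tuple[int, int, int]:
+    fingerprint, ts_ns, seq = HEADER.unpack(frame)   # frame must be exactly 24 bytes
+    return fingerprint, ts_ns, seq
+```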
+
+## Metadata (Frame 2)
+
+Field values are packed **in declaration order** (not by name), so the receiver
+reconstructs using the dataclass's cached field tuple. This removes per-message
+field-name encoding.
+
+Arrays and tensors appear in the metadata as small dict stand-ins called
+**OOB descriptors**:
+
+```json
+{
+  "__cortex_oob__": "numpy",
+  "buffer": 0,
+  "dtype": "<f4",
+  "shape": [480, 640, 3]
+}
+```
+
+The `buffer` index points into the raw frames that follow the metadata frame.
+
+## Encode path
+
+```mermaid
+sequenceDiagram
+    participant P as Publisher
+    participant M as Message
+    participant S as serializer
+    participant E as _encode_transport_value
+    participant Z as zmq.PUB
+
+    P->>M: build header + collect field values
+    M->>S: values in declaration order
+    S->>E: for each value, walk nested dicts/lists
+    E-->>S: scalar stays inline; array → OOB descriptor + buffer appended
+    S-->>M: (metadata_bytes, [buf0, buf1, ...])
+    M-->>Z: [topic, header, metadata, *buffers]
+```
+
+## The legacy single-blob path
+
+`Message.to_bytes()` / `from_bytes()` / `Message.decode()` still exist. They
+pack *everything* into one msgpack blob using `ExtType` for arrays. That path
+is retained for tests and opportunistic use; the transport always uses the
+multipart path above.
+
+!!! warning "Mismatch trap"
+ Bytes captured from the wire cannot be fed to `Message.decode()` — the wire
+ format is multipart, not a single blob. Use `Message.from_frames(frames)`.
+
+## See also
+
+- [Fingerprinting](fingerprinting.md)
+- [`cortex.utils.serialization`](../reference/utils/serialization.md) — encoding helpers
+- [`cortex.messages.base`](../reference/messages/base.md) — `Message`, `MessageHeader`
diff --git a/docs/concepts/transport-and-qos.md b/docs/concepts/transport-and-qos.md
new file mode 100644
index 0000000..766b669
--- /dev/null
+++ b/docs/concepts/transport-and-qos.md
@@ -0,0 +1,33 @@
+# Transport & QoS
+
+*Stub — deep dive coming in a later pass.*
+
+## Current socket settings
+
+| Socket | Option | Value | Notes |
+| ------------ | ------------- | ----- | ------------------------------------- |
+| Publisher PUB | `SNDHWM` | 10 (default `queue_size`) | Drops under backpressure |
+| Publisher PUB | `LINGER` | 0 | Immediate close |
+| Subscriber SUB | `RCVHWM` | 10 | Oldest messages evicted when full |
+| Subscriber SUB | `LINGER` | 0 | |
+| Daemon REP | `RCVTIMEO` | 1000 ms | Allows Ctrl-C responsiveness |
+| Daemon REP | `LINGER` | 0 | |
+
+## Today's delivery semantics
+
+- Publisher uses `zmq.NOBLOCK`: if the send queue is full, the message is
+ **silently dropped**.
+- Subscriber HWM is a ring buffer: old messages are **silently evicted** on
+ overflow.
+
+This is fine for best-effort telemetry. It is unsafe for control commands.
+
+## Planned QoS profiles
+
+Taking inspiration from DDS, three profiles are enough for most robotics use:
+
+- `best_effort_latest` — conflate; keep only newest (camera frames).
+- `reliable_queue` — publisher blocks or errors (control commands).
+- `dropping_queue` — current behavior with an exposed drop counter (telemetry).
+
+See [critique.md § 4](../critique.md) for rationale.
diff --git a/docs/critique.md b/docs/critique.md
new file mode 100644
index 0000000..6677641
--- /dev/null
+++ b/docs/critique.md
@@ -0,0 +1,146 @@
+# Cortex Critique
+
+A bottom-up review of Cortex as it stands today, with a focus on its viability as a communication library for robotics. This complements [design-review.md](design-review.md) with concrete code-level findings and benchmark observations.
+
+## How Cortex works (bottom-up)
+
+### 1. Fingerprinting — `utils/hashing.py`
+
+A message class's identity is a 64-bit integer:
+
+```
+fingerprint = SHA-256(f"{module}.{qualname}|{','.join(sorted('field:type'))}")[:8]
+```
+
+- Computed lazily and cached in `_fingerprint_cache`.
+- `field.type` is a string when `from __future__ import annotations` is active and a real type otherwise. The fingerprint therefore depends on how the module was imported — fragile for cross-repo use.
+- Field ordering is sorted alphabetically in the fingerprint, but the wire layout uses dataclass declaration order. Two classes could theoretically fingerprint identically but interpret the wire differently.
+
+### 2. Message base — `messages/base.py`
+
+Each dataclass inheriting `Message` is auto-registered via `__init_subclass__` into `MessageType._registry[fingerprint] = cls`.
+
+Wire format (multipart transport, what publishers actually use):
+
+```
+Frame 0: topic bytes (for PUB/SUB filter)
+Frame 1: 24-byte header (fingerprint u64, timestamp_ns u64, sequence u64, big-endian)
+Frame 2: msgpack of ordered field values with OOB descriptors
+Frame 3..N: raw contiguous array buffers (zero-copy)
+```
+
+There is a second, legacy single-blob path (`to_bytes` / `from_bytes`) that embeds array bytes inside a single msgpack blob using ExtType. It is retained for `Message.decode(...)` and tests, but is not what the transport uses.
+
+### 3. Serialization — `utils/serialization.py`
+
+Two strategies coexist:
+
+- `_msgpack_default` / `_msgpack_ext_hook` (inline): arrays/tensors get packed as msgpack ExtType inside the single blob. Used by the legacy path.
+- `_encode_transport_value` / `_decode_transport_value` (out-of-band): each array/tensor is replaced with a tiny dict `{__cortex_oob__: "numpy", buffer: i, dtype, shape}` and its raw bytes are appended as separate ZMQ frames. Reconstruction uses `np.frombuffer(frame.buffer, dtype).reshape(shape)` with no copy.
+
+After the March 2026 optimizations: zero-copy decode, schema-ordered values (field names no longer repeated per message), and cached field-name tuples.
+
+### 4. Discovery — `discovery/daemon.py` and `discovery/client.py`
+
+Single-threaded `zmq.REP` over IPC at `ipc:///tmp/cortex/discovery.sock`.
+
+- Registry is a plain `dict[str, TopicInfo]`, enforcing one publisher per topic.
+- RCVTIMEO=1s so the run loop can poll `_running` for Ctrl-C.
+- Commands: REGISTER, UNREGISTER, LOOKUP, LIST, SHUTDOWN.
+- Request/response payloads are msgpack.
+- Client uses REQ with close-and-recreate on timeout (REQ sockets are stuck after a missed reply).
+
+### 5. Publisher / Subscriber — `core/publisher.py`, `core/subscriber.py`
+
+- **Publisher**: binds a `zmq.PUB` at `ipc:///tmp/cortex/topics/<node>__<topic>.sock`, registers via the discovery client, publishes multipart `[topic, header, metadata, *buffers]` with `zmq.NOBLOCK`. If the `Node` hands it an async context, it wraps a sync `zmq.Context(self._context)` around the same underlying zmq io threads so publishing stays synchronous.
+- **Subscriber**: uses an async context, looks up the topic (optionally waits), connects `zmq.SUB`, sets a topic filter, loops via `AsyncExecutor` doing `recv_multipart(copy=False)` → `Message.from_frames`.
+
+### 6. Node + Executors — `core/node.py`, `core/executor.py`
+
+A `Node` owns a shared `zmq.asyncio.Context`, plus lists of publishers, subscribers, and timers. Each timer gets a `RateExecutor(fn, rate_hz)`. `node.run()` creates asyncio tasks for every timer and every callback-subscriber, then `asyncio.gather`. `RateExecutor` uses `perf_counter` plus `asyncio.sleep(max(0, next-now))`. `cortex.run` prefers uvloop on Unix.
+
+## Benchmark results
+
+Measured on this machine with the in-repo benchmark suite:
+
+| Metric | Value |
+| ------------------------- | --------------------------- |
+| Small-payload latency | mean 556 µs, p99 1075 µs |
+| 64KB latency | mean 919 µs, p99 1.4 ms |
+| Tiny array throughput | 21.8k msg/s |
+| 1MB array throughput | 7.7k msg/s, 8.0 GB/s |
+| 4MB array throughput | 2.25k msg/s, 9.4 GB/s |
+| 1080p RGB frames | 1422 fps, 8.8 GB/s |
+| Raw wire+decode (inproc) | 35 µs roundtrip (4MB array) |
+
+The delta between the **~35 µs raw wire** and **~550 µs end-to-end** is asyncio scheduling, context-switch between publisher timer and subscriber recv, and Python callback dispatch. Serialization is close to memcpy-bandwidth on large payloads — the OOB transport is pulling its weight.
+
+## What can be improved
+
+### Design-level (biggest wins)
+
+1. **Latency floor is too high for control loops.** ~550 µs mean and ~1.5 ms p99 is dominated by `asyncio` + `zmq.asyncio`, not zmq itself. Control topics should be able to opt into a synchronous thread-plus-`zmq.Poller` receive path targeting <100 µs p99. Async should be the default, not the only option.
+
+2. **Discovery is a single REQ/REP chokepoint with stop-the-world semantics.** On crashes, stale topic entries are never reclaimed — a crashed publisher's IPC file stays on disk and the registry keeps pointing at a dead socket. Add leases with heartbeats (publisher renews every N seconds; daemon evicts stale entries), or a peer-gossip model where every node beacons presence. The current daemon has no concurrency — one slow client blocks all others.
+
+3. **One-publisher-per-topic is a hard limit for robotics.** Redundant IMUs, failover, and multi-source fusion are all blocked. The registry should accept N publishers per topic and subscribers should `connect()` to all of them — ZMQ SUB handles fan-in natively.
+
+4. **No backpressure semantics.** `pub.publish()` is `NOBLOCK` and silently drops on HWM. Subscriber HWM=10 on SUB evicts old messages by default. Robotics needs per-topic QoS profiles similar to DDS:
+ - `best_effort_latest` — camera frames: drop old, keep newest (`ZMQ_CONFLATE=1`).
+ - `reliable_queue` — commands: block or surface an error.
+ - `dropping_queue` — telemetry: current behavior, but with a drop counter.
+
+5. **No liveness or drop detection.** A subscriber has no way to know the publisher died. Sequence numbers exist in the header but are never checked for gaps. Automatic gap-counting in Subscriber would be gold for debugging.
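+
+    A user-land stopgap is easy to sketch (hedged; it relies only on the header's `sequence` field and a hypothetical callback):
+
+    ```python
+    expected_seq = None
+
+    async def on_msg(msg, header):
+        global expected_seq
+        if expected_seq is not None and header.sequence > expected_seq:
+            print(f"gap: {header.sequence - expected_seq} message(s) lost")
+        expected_seq = header.sequence + 1
+    ```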
+
+6. **Callback execution blocks the receive loop.** A 10 ms callback accumulates on SUB HWM and drops. Receive, decode, and user-callback execution should be decoupled with a bounded work queue and one or more worker coroutines/threads per subscriber. ROS 2 executors have this distinction for a reason.
+
+7. **Local-only transport in practice.** Addresses are hardcoded `ipc://` paths under `/tmp`. Multi-host robotics (robot ↔ base-station) needs TCP transport in discovery, NIC selection, and topology-aware addressing.
+
+8. **No shared memory for huge payloads.** At 9 GB/s on 4 MB arrays, every subscriber gets a fresh copy. For multi-subscriber camera or LiDAR fan-out, a shared-memory transport (posix shm + ring buffer + zmq for control-plane notifications) would give true zero-copy.
+
+### Code-level issues
+
+9. `publisher.py:91-95` — `zmq.Context(self._context)` creates a shadowed sync context sharing the async context's io threads. Correct, but subtle. `zmq.PUB` is **not thread-safe** — calling `pub.publish()` from multiple asyncio tasks on the same socket is undefined. Needs docs or a lock.
+
+10. `publisher.py:117-118` — the publisher unlinks any existing socket file on startup. If two publishers on the same host use the same node name + topic, the second silently steals the socket. Should fail loudly.
+
+11. `subscriber.py:155-160` — fingerprint mismatch logs a warning and proceeds anyway. That is a silent-data-corruption path. Should refuse to connect.
+
+12. `messages/base.py:109-129` — `_sequence_counter` is **class-level**, shared across every Publisher instance of that message type in the process. Two publishers of `ArrayMessage` interleave sequences — breaking per-topic drop detection. Move it onto the `Publisher`.
+
+13. `utils/hashing.py:34-38` — `field.type` is a string with PEP 563 and a real type otherwise; the resulting fingerprint differs across import environments. Use `typing.get_type_hints(cls)` consistently.
+
+14. `discovery/client.py:78-101` — `retries=1` default means zero retries (loop runs once). Fencepost bug.
+
+15. `core/executor.py:119-147` — `RateExecutor` has both `await asyncio.sleep(0)` inside the loop and `await asyncio.sleep(max(0, dt))` at the bottom. The first is redundant and creates unnecessary wakeups. Catch-up logic silently eats dropped ticks; control loops often need to know.
+
+16. `discovery/daemon.py:87` — RCVTIMEO=1s means Ctrl-C takes up to 1s to take effect and request throughput is throttled. A `zmq.Poller` with a shutdown PAIR socket gives clean immediate shutdown.
+
+17. `messages/standard.py:146-150` — `ImageMessage.__post_init__` auto-fill is non-idempotent across deserialization round-trips. Minor.
+
+18. `discovery/daemon.py:168-177` — same-publisher re-registration is allowed; if its IPC path changed, existing subscribers are never told. Needs a lease or a "changed" notification.
+
+19. **No CI test for cross-process fingerprint stability.** Given how much safety rides on fingerprints, every standard message type deserves a stored golden fingerprint asserted in CI.
+
+20. **`from_bytes` vs `from_frames` asymmetry is a trap.** `Message.decode(bytes)` only handles the inline path. If anyone captures bytes from the wire (the multipart path) and calls `decode()`, it will fail silently. Unify the paths or rename `decode`.
+
+21. **No async publish.** `send_multipart` briefly blocks on HWM/context switch; inside an async timer callback this is a hidden blocking call. An async `publish` variant would help.
+
+### Schema evolution
+
+22. No optional fields, no versioning. For long-lived robotics deployments, add:
+ - field defaults (so fingerprints tolerate missing trailing fields on decode),
+ - an `msg_schema_version: int = 1` convention,
+ - eventually, a real wire schema (FlatBuffers, Cap'n Proto, or generated-from-.fbs dataclasses).
+
+## Summary
+
+Cortex is a well-built, honest small-system IPC library. The **serialization is genuinely fast** — hitting memcpy-bandwidth on 4 MB arrays with zero-copy OOB frames. The **latency floor (~550 µs p50, ~1.5 ms p99)** is limited by asyncio, not zmq. The **discovery, QoS, liveness, and single-host assumptions** are the real blockers for using this as robotics middleware.
+
+Recommended path if adopting Cortex for robotics:
+
+1. Add per-topic QoS profiles with drop counters (1-2 days).
+2. Add a synchronous-threaded subscriber option for low-latency control (1 day).
+3. Add heartbeats/leases and multi-publisher support to discovery (3-5 days).
+4. Add TCP transport and host-aware discovery (2-3 days).
+5. Then consider shared memory and schema evolution.
diff --git a/docs/gen_ref_pages.py b/docs/gen_ref_pages.py
new file mode 100644
index 0000000..a6c4f5a
--- /dev/null
+++ b/docs/gen_ref_pages.py
@@ -0,0 +1,47 @@
+"""Generate one API reference page per module under ``src/cortex/``.
+
+Executed by ``mkdocs-gen-files`` during the build. Emits:
+
+- ``reference/<module path>.md`` for every non-dunder module,
+- ``reference/<package path>/index.md`` for every ``__init__.py``,
+- ``reference/SUMMARY.md`` consumed by ``mkdocs-literate-nav``.
+
+Keeping this generated means adding a new module needs zero doc edits.
+"""
+
+from pathlib import Path
+
+import mkdocs_gen_files
+
+# This script lives at ``docs/gen_ref_pages.py`` and is executed by
+# mkdocs-gen-files with the mkdocs.yml directory as cwd. Anchor to the
+# repo root so the generator finds ``src/cortex`` regardless of cwd.
+REPO_ROOT = Path(__file__).resolve().parent.parent
+SRC_ROOT = REPO_ROOT / "src"
+PACKAGE = "cortex"
+
+nav = mkdocs_gen_files.Nav()
+
+for path in sorted((SRC_ROOT / PACKAGE).rglob("*.py")):
+ module_path = path.relative_to(SRC_ROOT).with_suffix("")
+ doc_path = Path("reference", *module_path.parts[1:]).with_suffix(".md")
+ parts = tuple(module_path.parts)
+
+ if parts[-1] == "__init__":
+ parts = parts[:-1]
+ doc_path = doc_path.with_name("index.md")
+ elif parts[-1].startswith("_"):
+ continue
+
+ nav_parts = parts[1:] if parts[1:] else ("cortex",)
+ nav[nav_parts] = doc_path.relative_to("reference").as_posix()
+
+ identifier = ".".join(parts) if parts else PACKAGE
+ with mkdocs_gen_files.open(doc_path, "w") as f:
+ f.write(f"# `{identifier}`\n\n")
+ f.write(f"::: {identifier}\n")
+
+ mkdocs_gen_files.set_edit_path(doc_path, path.relative_to(REPO_ROOT))
+
+with mkdocs_gen_files.open("reference/SUMMARY.md", "w") as f:
+ f.writelines(nav.build_literate_nav())
diff --git a/docs/getting-started/discovery-daemon.md b/docs/getting-started/discovery-daemon.md
new file mode 100644
index 0000000..518bc55
--- /dev/null
+++ b/docs/getting-started/discovery-daemon.md
@@ -0,0 +1,69 @@
+# Running the Discovery Daemon
+
+The discovery daemon is a lightweight REP service that maintains the registry
+of active topics. Publishers register on startup; subscribers look up the
+endpoint and connect directly.
+
+## Start
+
+=== "As a script"
+
+ ```bash
+ cortex-discovery
+ ```
+
+=== "As a module"
+
+ ```bash
+ python -m cortex.discovery.daemon
+ ```
+
+=== "As a systemd service"
+
+ ```ini title="/etc/systemd/system/cortex-discovery.service"
+ [Unit]
+ Description=Cortex discovery daemon
+ After=network.target
+
+ [Service]
+ Type=simple
+ ExecStart=/usr/bin/env cortex-discovery
+ Restart=on-failure
+ RuntimeDirectory=cortex
+
+ [Install]
+ WantedBy=multi-user.target
+ ```
+
+## Command-line options
+
+| Flag | Default | Description |
+| ------------- | -------------------------------------- | ------------------------------- |
+| `--address` | `ipc:///tmp/cortex/discovery.sock` | ZMQ endpoint to bind |
+| `--log-level` | `INFO` | `DEBUG` / `INFO` / `WARNING` / `ERROR` |
+
+## Lifecycle
+
+```mermaid
+stateDiagram-v2
+ [*] --> Starting: bind REP socket
+ Starting --> Running: socket ready
+ Running --> Running: handle REGISTER / LOOKUP / LIST / UNREGISTER
+ Running --> Stopping: SIGINT or SHUTDOWN command
+ Stopping --> [*]: close socket, unlink ipc file
+```
+
+## Troubleshooting
+
+**"Address already in use"**
+: Another daemon (or a stale socket file) is holding the path.
+ `rm /tmp/cortex/discovery.sock` and restart.
+
+**Subscribers time out looking up topics**
+: Daemon not running, or publisher failed to register. Run with
+ `--log-level DEBUG` and watch for REGISTER / LOOKUP lines.
+
+**Daemon crash leaves stale entries**
+: Today, entries are only removed on explicit UNREGISTER. A crashed
+ publisher's topic stays in the registry pointing at a dead socket.
+ Restarting the daemon clears all state.
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md
new file mode 100644
index 0000000..3ea43a0
--- /dev/null
+++ b/docs/getting-started/installation.md
@@ -0,0 +1,42 @@
+# Installation
+
+## Requirements
+
+- Python **3.10+**
+- Linux or macOS (Windows works but without `uvloop`)
+- ZeroMQ shared library (bundled via `pyzmq`)
+
+## Install from source
+
+```bash
+git clone https://github.com/sudoRicheek/cortex.git
+cd cortex
+pip install -e ".[dev]"
+```
+
+## Optional extras
+
+=== "PyTorch support"
+
+ ```bash
+ pip install -e ".[torch]"
+ ```
+
+ Enables [`TensorMessage`][cortex.messages.standard.TensorMessage] and
+ torch-aware serialization paths.
+
+=== "Everything"
+
+ ```bash
+ pip install -e ".[all]"
+ ```
+
+## Verify
+
+```python
+import cortex
+print(cortex.__version__)
+```
+
+If that prints a version string, you're ready. Continue to the
+[Quickstart](quickstart.md).
diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md
new file mode 100644
index 0000000..470f12e
--- /dev/null
+++ b/docs/getting-started/quickstart.md
@@ -0,0 +1,104 @@
+# Quickstart
+
+A three-terminal pub/sub loop in under two minutes.
+
+## 1. Start the discovery daemon
+
+```bash
+cortex-discovery
+```
+
+Leave it running. This is the single service that maps topic names to
+IPC endpoints.
+
+## 2. Publisher
+
+```python title="pub.py"
+import numpy as np
+import cortex
+from cortex import Node, ArrayMessage
+
+
+class SensorNode(Node):
+ def __init__(self):
+ super().__init__("sensor")
+ self.pub = self.create_publisher("/sensor/data", ArrayMessage)
+ self.count = 0
+ self.create_timer(0.1, self.tick) # 10 Hz
+
+ async def tick(self):
+ data = np.random.randn(64, 64).astype("float32")
+ self.pub.publish(ArrayMessage(data=data, name=f"frame_{self.count}"))
+ self.count += 1
+
+
+async def main():
+ node = SensorNode()
+ try:
+ await node.run()
+ finally:
+ await node.close()
+
+
+if __name__ == "__main__":
+ cortex.run(main())
+```
+
+```bash
+python pub.py
+```
+
+## 3. Subscriber
+
+```python title="sub.py"
+import cortex
+from cortex import Node, ArrayMessage
+from cortex.messages.base import MessageHeader
+
+
+async def on_data(msg: ArrayMessage, header: MessageHeader):
+ print(f"[{header.sequence}] {msg.name} shape={msg.data.shape}")
+
+
+class ViewerNode(Node):
+ def __init__(self):
+ super().__init__("viewer")
+ self.create_subscriber("/sensor/data", ArrayMessage, callback=on_data)
+
+
+async def main():
+ node = ViewerNode()
+ try:
+ await node.run()
+ finally:
+ await node.close()
+
+
+if __name__ == "__main__":
+ cortex.run(main())
+```
+
+```bash
+python sub.py
+```
+
+## What just happened
+
+```mermaid
+sequenceDiagram
+ participant P as Publisher
+ participant D as Discovery daemon
+ participant S as Subscriber
+
+ P->>D: REGISTER /sensor/data -> ipc:///tmp/cortex/topics/...
+ S->>D: LOOKUP /sensor/data
+ D-->>S: ipc:///tmp/cortex/topics/...
+ S->>P: ZMQ SUB connect + SUBSCRIBE "/sensor/data"
+ loop 10 Hz
+ P->>S: multipart [topic, header, metadata, buffer]
+ S->>S: decode + await on_data(msg, header)
+ end
+```
+
+See [Concepts → Architecture](../concepts/architecture.md) for the end-to-end
+picture, or jump into a [custom message tutorial](../tutorials/custom-messages.md).
diff --git a/docs/guides/benchmarks.md b/docs/guides/benchmarks.md
new file mode 100644
index 0000000..f7fad29
--- /dev/null
+++ b/docs/guides/benchmarks.md
@@ -0,0 +1,35 @@
+# Benchmarks
+
+Cortex ships an in-repo benchmark suite at [`benchmarks/`](https://github.com/sudoRicheek/cortex/tree/main/benchmarks).
+
+## Run
+
+```bash
+# Terminal 1
+cortex-discovery
+
+# Terminal 2
+python benchmarks/bench_all.py --output results.json
+```
+
+Individual benchmarks:
+
+- `benchmarks/bench_latency.py` — one-way publisher→subscriber latency.
+- `benchmarks/bench_throughput.py` — messages/sec and MB/sec.
+- `benchmarks/bench_all.py` — full matrix with summary and optional JSON dump.
+
+## Reading results
+
+- `p99` is what matters for real-time-ish workloads; `mean` can hide jitter.
+- For array workloads, `MB/s` approaching memcpy bandwidth is a good sign
+ that zero-copy transport is working.
+- Serialization overhead via `inproc` sockets with `copy=False` is reported
+ separately — that isolates the encode/decode path from the network path.
+
+## Tips
+
+- Pin publisher and subscriber to separate cores for stable latency numbers.
+- Disable Turbo-Boost / set CPU governor to `performance` for reproducible
+ runs.
+- Always measure with the discovery daemon also running (it is off the hot
+ path but can steal a little cache).
diff --git a/docs/guides/debugging.md b/docs/guides/debugging.md
new file mode 100644
index 0000000..881e5e9
--- /dev/null
+++ b/docs/guides/debugging.md
@@ -0,0 +1,58 @@
+# Debugging
+
+## Subscriber hangs on startup
+
+Most likely: the daemon is not running, or the topic name is mistyped.
+`DiscoveryClient.wait_for_topic_async` polls every 500 ms until the topic
+appears or the timeout fires.
+
+```bash
+cortex-discovery --log-level DEBUG
+```
+
+Watch for `LOOKUP topic: /x -> NOT FOUND`.
+
+## Publisher "works" but subscriber receives nothing
+
+ZMQ PUB drops messages for which no matching SUB is connected yet. If your
+publisher starts first and publishes immediately, the first few messages are
+lost — this is the classic ZMQ slow-joiner problem.
+
+Workarounds:
+
+- Have the publisher wait briefly after bind before publishing the first
+  message (see the sketch below).
+- Have the subscriber wait-for-topic (the default) so it comes up after the
+ publisher registered.
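+
+A sketch of that brief wait (the 0.2 s value and `first_msg` are illustrative):
+
+```python
+import asyncio
+
+async def publish_first(pub, first_msg):
+    # Give slow-joining SUBs a moment to finish connecting before the
+    # very first publish; subsequent messages need no delay.
+    await asyncio.sleep(0.2)
+    pub.publish(first_msg)
+```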
+
+## Stale `/tmp/cortex/topics/*.sock` files
+
+If a publisher exits uncleanly, its IPC socket file remains. Cortex's
+`Publisher._setup_socket` unlinks any existing file at the same path on the
+**next bind** — so restarting the publisher fixes it. Otherwise:
+
+```bash
+rm /tmp/cortex/topics/<node>__<topic>.sock
+```
+
+## Daemon state survives restarts — but doesn't
+
+The registry is **in-memory**. Restarting the daemon wipes all state;
+publishers do not auto-re-register today. Restart your publishers after
+restarting the daemon.
+
+## Fingerprint mismatch warning
+
+If you see
+`Message type mismatch for /x: expected FooMessage, got BarMessage` —
+the topic was registered with a different message class. Either rename the
+topic or align the classes.
+
+## Debug logging
+
+```python
+import logging
+logging.basicConfig(level=logging.DEBUG)
+```
+
+Cortex uses standard `logging`. Interesting loggers: `cortex.publisher`,
+`cortex.subscriber`, `cortex.node`, `cortex.discovery`, `cortex.discovery.client`.
diff --git a/docs/guides/performance-tuning.md b/docs/guides/performance-tuning.md
new file mode 100644
index 0000000..ce8828f
--- /dev/null
+++ b/docs/guides/performance-tuning.md
@@ -0,0 +1,37 @@
+# Performance tuning
+
+Current measured numbers on the repo's benchmark suite (single workstation):
+
+| Workload | Throughput / latency |
+| --------------------- | ------------------------------- |
+| Small payload latency | mean 556 µs, p99 1075 µs |
+| 1MB array throughput | 7.7k msg/s, 8.0 GB/s |
+| 4MB array throughput | 2.25k msg/s, 9.4 GB/s |
+| 1080p RGB | 1422 fps, 8.8 GB/s |
+
+See [Benchmarks guide](benchmarks.md) to reproduce.
+
+## Copy-on-use
+
+Decoded NumPy arrays **alias the ZMQ frame memory**. That is what makes
+large-payload throughput close to memcpy bandwidth — but it means:
+
+- If you intend to mutate the array, `arr = arr.copy()` first.
+- If you intend to hold the array past the callback, copy it first.
+
+## Queue sizing
+
+Per-socket HWM defaults to 10. Increase `queue_size` on high-rate producers
+whose subscribers are known to be slow — but remember that ZMQ drops silently
+at the HWM.
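+
+`queue_size` maps onto `SNDHWM` at publisher creation (values here are
+illustrative):
+
+```python
+pub = node.create_publisher(
+    topic_name="/camera/image",
+    message_type=ImageMessage,
+    queue_size=100,   # SNDHWM; still drops silently once full
+)
+```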
+
+## When to prefer the inline path
+
+Single tiny messages (primitives only, < 1 KB) see no benefit from multipart.
+The inline `to_bytes` path is still fine there. Publishers always use
+multipart today.
+
+## uvloop
+
+Installed by default on Unix. Drops tail latency on high-rate small messages
+noticeably. No action needed.
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..a0a2a64
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,83 @@
+# Cortex
+
+**A lightweight Python framework for inter-process communication over ZeroMQ.**
+
+Cortex is a pub/sub layer designed to feel obvious. Nodes publish typed messages on named topics; subscribers receive them via async callbacks. A tiny discovery daemon tells subscribers where to connect. Native support for NumPy arrays and PyTorch tensors keeps robotics- and ML-shaped payloads fast.
+
+<div class="grid cards" markdown>
+
+- :material-rocket-launch-outline: **[Getting started](getting-started/quickstart.md)**
+
+ Install, start the daemon, publish your first message in under two minutes.
+
+- :material-book-open-variant: **[Concepts](concepts/architecture.md)**
+
+ How the wire format, fingerprinting, discovery handshake, and async execution fit together.
+
+- :material-puzzle-outline: **[Components](components/messages.md)**
+
+ Deep dives into the Messages, Discovery, and Core modules.
+
+- :material-api: **[API reference](reference/index.md)**
+
+ Auto-generated from the source. Always matches the code on `main`.
+
+</div>
Discovery is Cortex's control plane: a single long-lived process that maps
+topic names to ZMQ endpoints. It sits off the data path — once a subscriber
+has an endpoint, messages flow publisher → subscriber directly without the
+daemon's involvement.
Everyone agrees on the wire format via protocol.py. The daemon runs a
+single-threaded REP loop. The client speaks REQ from every publisher and
+subscriber in the graph.
Implemented in [DiscoveryDaemon][cortex.discovery.daemon.DiscoveryDaemon].
+
Key behaviors:
+
+
Binds zmq.REP at ipc:///tmp/cortex/discovery.sock by default.
+
Maintains _topics: dict[str, TopicInfo] — one publisher per topic.
+
RCVTIMEO=1000 on the socket so the loop can check _running for clean
+ Ctrl-C. This also means the daemon is naturally single-request-at-a-time —
+ a slow client blocks all others.
Implemented in [DiscoveryClient][cortex.discovery.client.DiscoveryClient].
+
Thin REQ wrapper around the protocol. Important operational detail: REQ
+sockets stick after a timeout — they block subsequent sends waiting for a
+reply that never came. The client handles this by closing and recreating the
+socket on every timeout (_reconnect). Callers don't see it.
[wait_for_topic_async(name, timeout, poll_interval)][cortex.discovery.client.DiscoveryClient.wait_for_topic_async] —
+ async poll loop (asyncio.sleep). This is what [Subscriber][cortex.core.subscriber.Subscriber]
+ uses when wait_for_topic=True.
Messages are just @dataclasses that inherit from
+[Message][cortex.messages.base.Message]. Registering with the type system,
+computing a fingerprint, and (de)serialization all happen automatically.
That is the entire contract. The class is registered into
+[MessageType._registry][cortex.messages.base.MessageType] by fingerprint at
+import time, and gains:
+
+
JointTrajectory.fingerprint() — 64-bit ID.
+
msg.to_frames() / JointTrajectory.from_frames(frames) — the transport path.
+
msg.to_bytes() / JointTrajectory.from_bytes(data) — the legacy blob path.
+
Message.decode(blob) — class dispatch via fingerprint registry.
Message._sequence_counter is shared across all publisher instances of
+the same message class in the process. Two ArrayMessage publishers
+interleave sequence numbers. Per-topic gap detection therefore needs a
+per-publisher counter today; see critique.md § 12.
A [Node][cortex.core.node.Node] is the user-facing composition unit: it owns
+a shared ZMQ async context and a collection of publishers, subscribers, and
+timers. Executors provide the scheduling primitives that timers and
+subscriber receive loops run on.
One node = one process boundary in practice. Nothing stops you running
+multiple nodes in the same process (asyncio.gather([n.run() for n in nodes]),
+see examples/multi_node_system.py),
+but remember they share the same event loop — a slow callback in one still
+blocks the others.
Spawns one asyncio task per timer and one per callback-bearing subscriber,
+then asyncio.gathers them. Returns when all tasks complete or the node is
+stopped.
Stops all executors, cancels outstanding tasks, closes every publisher and
+subscriber (each of which unregisters/unbinds their own socket), and
+terminates the shared ZMQ context. Idempotent.
"Run this coroutine at a constant rate, catching up on overruns."
+
flowchart TD
+ Start[next = perf_counter] --> Loop{running?}
+ Loop -- no --> End
+ Loop -- yes --> Now[now = perf_counter]
+ Now --> Due{now >= next?}
+ Due -- yes --> Call[await func]
+ Call --> Advance[next += interval]
+ Advance --> Behind{next < now?}
+ Behind -- yes --> Reset[next = now + interval]
+ Behind -- no --> Wait
+ Reset --> Wait
+ Due -- no --> Wait[await sleep next - now]
+ Wait --> Loop
+
The catch-up branch silently drops ticks — if your 100 Hz callback takes
+20 ms once, you do not get two callbacks back-to-back; you skip one tick.
+
+
Redundant yield
+
Today there is an await asyncio.sleep(0) inside the loop and
+await asyncio.sleep(max(0, dt)) at the bottom. That generates an extra
+wakeup per tick. See critique § 15.
Timers are plain async functions — no decorator, no magic. They run in the
+same event loop as subscriber callbacks, so the same head-of-line caveat
+applies.
The data-plane workhorses. A Publisher binds a ZMQ PUB socket and registers
+with discovery; a Subscriber looks up the endpoint, connects a SUB socket,
+and drives an async receive loop. Discovery is consulted once per topic on
+startup — it is not on the hot path.
Always create via [Node.create_publisher][cortex.core.node.Node.create_publisher] —
+direct construction works but skips the shared ZMQ context reuse and the
+node-level registration bookkeeping.
+
pub=node.create_publisher(
+topic_name="/camera/image",# must start with "/"
+message_type=ImageMessage,# fingerprint is taken from this class
+queue_size=100,# SNDHWM; drops under backpressure
+)
+
sequenceDiagram
+ autonumber
+ participant U as User
+ participant Pub as Publisher
+ participant FS as /tmp/cortex/topics/
+ participant ZMQ as zmq.PUB
+ participant D as Discovery daemon
+
+ U->>Pub: __init__(topic, msg_cls, ...)
+ Pub->>Pub: address = generate_ipc_address(topic, node)
+ Pub->>FS: mkdir -p; unlink stale .sock
+ Pub->>ZMQ: socket(PUB); setsockopt HWM/LINGER; bind(address)
+ Pub->>D: REGISTER TopicInfo{name, address, fingerprint, node}
+ D-->>Pub: OK / ALREADY_EXISTS
+ Note over Pub: ready; user can publish()
+
Two things worth calling out:
+
+
The IPC address is derived deterministically from node_name and
+ topic_name via [generate_ipc_address][cortex.core.publisher.generate_ipc_address]:
+ ipc:///tmp/cortex/topics/<node>__<topic-with-slashes-as-underscores>.sock.
+
_setup_socket unlinks any existing file at that path before binding. That
+ protects against crash-leftover sockets, but also means two publishers
+ configured with the same node_name + topic_name in the same process tree
+ will silently stomp each other — see critique § 10.
Any other exception is logged and swallowed; publish still returns False.
+For robotics code this "fire and forget" is intentional — the caller decides
+whether to retry based on the return value and the topic's role.
This keeps publish() a normal function call instead of forcing every publish
+to be awaited. It is the right performance choice, but it has consequences:
+
+
zmq.PUB is not thread-safe
+
Do not call publish() on the same Publisher from multiple threads
+(or multiple asyncio tasks that could race on send_multipart). Serialize
+per-publisher calls yourself if you fan out work.
Publisher.close() is best-effort: it unregisters from the daemon (silently
+tolerates a dead daemon), closes the socket, and removes the IPC file.
+Exceptions from any one step do not block the others.
publisher.publish_count, publisher.last_publish_time, and
+publisher.is_registered are exposed for instrumentation. They update on the
+hot path with no locking — read them from the same task that calls publish()
+for deterministic numbers.
sequenceDiagram
+ autonumber
+ participant U as User
+ participant S as Subscriber
+ participant D as DiscoveryClient
+ participant Pub as publisher IPC
+
+ U->>S: __init__(...)
+ S->>D: lookup_topic(name) # non-blocking
+ alt found immediately
+ D-->>S: TopicInfo
+ S->>S: verify fingerprint
+ S->>Pub: SUB connect + SUBSCRIBE topic
+ Note over S: is_connected = True
+ else not found
+ D-->>S: None
+ Note over S: defer; retry in run()
+ end
+
+ U->>S: node.run() schedules sub.run()
+ S->>D: wait_for_topic_async(name, timeout)
+ D-->>S: TopicInfo
+ S->>Pub: SUB connect + SUBSCRIBE topic
+
The constructor tries a non-blocking lookup first so that when a publisher is
+already up, no polling is needed. The polling fallback only kicks in inside
+sub.run() via [wait_for_topic_async][cortex.discovery.client.DiscoveryClient.wait_for_topic_async].
copy=False means each frame is a zmq.Frame — the metadata and array
+ buffers are memoryview-able without a copy. See
+ cortex.utils.serialization.
+
The one-frame fast path (len(payload_frames) == 1) handles legacy
+ publishers still on the single-blob path — it falls back to
+ from_bytes on the single payload buffer.
On connect the subscriber compares its class's fingerprint to the one in the
+registry entry. Today a mismatch only logs a warning and proceeds anyway —
+downstream decoding will then fail hard. Treat fingerprint warnings as errors
+in your code.
Subscriber.close() stops the executor, closes the discovery client and SUB
+socket, and flips is_connected to False. Safe to call multiple times;
+errors are suppressed so teardown does not cascade.
Two encodings live side by side: a multipart / out-of-band path that the
+transport actually uses, and a single-blob path kept for the legacy
+Message.to_bytes / decode API and tests. Both support the same Python
+types; only their frame layout differs.
flowchart LR
+ V[values] --> E[_encode_transport_value]
+ E --> Meta[msgpack metadata<br/>OOB descriptors for arrays]
+ E --> Bufs[[buffer 0]]
+ E --> Bufs2[[buffer 1]]
+ Meta --> Out[(list of frames)]
+ Bufs --> Out
+ Bufs2 --> Out
+
The function of interest is
+[serialize_message_frames][cortex.utils.serialization.serialize_message_frames]:
Arrays stay contiguous; ZMQ hands the buffer straight to the kernel.
+
+
+
flowchart LR
+ V[values] --> P[msgpack.packb<br/>default=_msgpack_default]
+ P --> Ext[ExtType 1/2 for arrays/tensors<br/>bytes embedded]
+ Ext --> Blob[single bytes blob]
+
The single blob round-trips through serialize(value) →
+deserialize(data). Useful for persisting to disk, caches, or when you
+need a self-contained payload without tracking extra buffers.
The buffer index refers into the ZMQ frames that follow the metadata.
+Nested structures (dict of arrays, list of tensors, etc.) are walked
+recursively by _encode_transport_value / _decode_transport_value.
```mermaid
sequenceDiagram
    participant Sub as Subscriber
    participant ZMQ as zmq.Frame
    participant MV as memoryview
    participant NP as np.ndarray

    Sub->>ZMQ: recv_multipart(copy=False)
    ZMQ-->>Sub: frame with .buffer property
    Sub->>MV: memoryview(frame.buffer)
    Sub->>NP: np.frombuffer(mv, dtype).reshape(shape)
    Note over NP: array aliases the ZMQ frame memory
```
+
+
**Aliasing caveat**

The returned NumPy array is a view over the ZMQ frame buffer. It is
safe to read as long as the frame lives, which is at least until your
callback returns. If you need to:

- mutate the array, or
- keep it past the callback,

call arr = arr.copy() first. This is cheap compared to the savings on
the hot path.
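A safe callback pattern, sketched (the callback and field names are
illustrative, not part of the API):

```python
latest = None

async def on_image(msg, header):
    global latest
    # Detach from the ZMQ frame before keeping or mutating the array.
    latest = msg.data.copy()
```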
Separate but related: [compute_fingerprint(cls)][cortex.utils.hashing.compute_fingerprint]
+computes a 64-bit identity from the module path, class name, and sorted
+field:type pairs. Cached per-class in _fingerprint_cache. See
+Concepts → Fingerprinting for the full story.
Cortex has three moving parts: the discovery daemon, publisher nodes,
+and subscriber nodes. They coordinate over ZeroMQ — a REQ/REP control plane
+for discovery and a PUB/SUB data plane for messages.
Each topic gets its own IPC socket under /tmp/cortex/topics/. A single Node
+shares one zmq.asyncio.Context across all its publishers and subscribers to
+avoid per-socket io thread overhead.
Cortex nodes are asyncio-native. One event loop per process drives all
+publishers, subscribers, and timers for that node. On Linux and macOS,
+[cortex.run][cortex.utils.loop.run] prefers uvloop for lower tail latency.
Node.run() spawns one task per timer (RateExecutor) and one per
+callback-bearing subscriber (AsyncExecutor). It then asyncio.gathers them
+until cancelled.
```mermaid
sequenceDiagram
    participant L as Event loop
    participant A as AsyncExecutor
    participant S as SUB socket
    participant CB as user callback

    loop while running
        L->>A: resume
        A->>S: await recv_multipart(copy=False)
        S-->>A: frames
        A->>A: decode message
        A->>CB: await callback(msg, header)
        A->>L: sleep(0) (yield)
    end
```
+
+
**Head-of-line blocking**

A slow callback stalls the receive loop. Messages pile up on the SUB HWM
and get evicted. If you expect variable-latency work, offload callback
bodies to asyncio.create_task(...) or a thread pool.
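A minimal offload sketch (process() is a stand-in for your own work):

```python
import asyncio

async def on_msg(msg, header):
    # Return immediately; the receive loop keeps draining the socket.
    asyncio.create_task(process(msg))

async def process(msg):
    await asyncio.sleep(0.01)  # stand-in for variable-latency work
```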
The Publisher uses a sync zmq.Context (shadowed onto the node's async
+context). publish() is a plain function call — no await. This avoids the
+overhead of the async zmq integration on the send path.
+
+
**Not thread-safe**

A zmq.PUB socket is not safe to call from multiple threads or tasks
concurrently. Serialize calls to publish() per publisher.
On Unix, importing cortex.run checks for uvloop and uses it if present.
+Measured impact: modest throughput improvement, meaningful p99 latency
+reduction on high-rate small messages.
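For reference, a process can opt in explicitly with uvloop's documented API;
cortex.run performs an equivalent check for you on Unix:

```python
import asyncio

try:
    import uvloop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass  # fall back to the stock asyncio event loop
```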
The discovery daemon speaks a tiny msgpack-over-REQ/REP protocol. It is not
+on the data path — once a subscriber has the endpoint, messages flow
+publisher → subscriber directly.
```mermaid
sequenceDiagram
    autonumber
    participant P as Publisher
    participant D as Daemon REP

    P->>P: bind PUB socket on ipc:///tmp/cortex/topics/<node>__<topic>.sock
    P->>D: REQ → DiscoveryRequest(REGISTER_TOPIC, TopicInfo{...})
    D->>D: if topic_name absent: insert; else compare publisher_node
    alt new
        D-->>P: OK "Registered topic: /x"
    else same publisher re-registering
        D-->>P: OK (overwrite)
    else different publisher, same topic
        D-->>P: ALREADY_EXISTS
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant S as Subscriber
    participant D as Daemon REP
    participant P as Publisher

    S->>D: REQ → LOOKUP_TOPIC("/x")
    alt present
        D-->>S: OK + TopicInfo
        S->>P: SUB connect + SUBSCRIBE "/x"
    else missing
        D-->>S: NOT_FOUND
        Note over S: if wait_for_topic:<br/>poll every 500 ms until timeout
        S->>D: retry LOOKUP_TOPIC
    end
```
+
wait_for_topic_async implements the retry loop with asyncio.sleep so the
+event loop keeps spinning.
Every message class gets a 64-bit identifier derived from its name and
+field schema. The fingerprint rides in the header of every published message
+and does two jobs:
+
+
1. Type dispatch — Message.decode(bytes) looks up the right class in the
   [MessageType][cortex.messages.base.MessageType] registry.
2. Compatibility check — subscribers verify that the topic they looked up
   advertises the same fingerprint as the type they were written against.
```mermaid
flowchart LR
    A[class.__module__ + qualname] --> C[canonical string]
    B[sorted list of field:type] --> C
    C --> H[SHA-256]
    H --> F[first 8 bytes → u64 big-endian]
```
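In code, the recipe above looks roughly like this (a sketch of the flowchart;
the exact canonicalization in cortex.utils.hashing may differ):

```python
import hashlib

def fingerprint(module: str, qualname: str, fields: list[tuple[str, str]]) -> int:
    canonical = f"{module}.{qualname}|" + ",".join(
        f"{name}:{type_}" for name, type_ in sorted(fields)
    )
    digest = hashlib.sha256(canonical.encode()).digest()
    return int.from_bytes(digest[:8], "big")  # first 8 bytes as u64
```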
Message.__init_subclass__ auto-registers every concrete subclass into
+[MessageType._registry][cortex.messages.base.MessageType] keyed by
+fingerprint. Nothing else to do — decorating your dataclass with
+@dataclass and inheriting from Message is enough.
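For example (field names are illustrative; the import path for Message is
assumed, check the API reference):

```python
from dataclasses import dataclass

import numpy as np

from cortex import Message  # assumed export path

@dataclass
class LidarScan(Message):
    ranges: np.ndarray  # registered and fingerprinted automatically
    frame_id: str
```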
On connect, the subscriber compares the topic's advertised fingerprint against
+the one it computed from its message class:
+
```mermaid
sequenceDiagram
    participant S as Subscriber
    participant D as Discovery daemon

    S->>D: LOOKUP /topic
    D-->>S: TopicInfo(fingerprint=0xABCD...)
    S->>S: compare with MyMessage.fingerprint()
    alt mismatch
        S-->>S: log warning, continue anyway
    else match
        S-->>S: connect and subscribe
    end
```
+
+
**Today: mismatch is a warning, not an error**

A fingerprint mismatch currently only logs a warning — see critique.md.
Downstream decoding will fail hard. Until that is tightened, prefer to
re-exchange type definitions between processes rather than rely on this guard.
field.type may be a string (under from __future__ import annotations)
+or a real type otherwise. The canonical string differs in the two cases,
+so the same class can fingerprint differently across import environments.
+
When defining messages shared between processes, either use the same import
+style in both, or rely on the runtime typing.get_type_hints(cls) equivalent
+once that lands upstream.
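A sketch of that normalization, resolving annotations to real types before
canonicalizing (the helper name is ours):

```python
import typing
from dataclasses import fields

def canonical_fields(cls) -> list[tuple[str, str]]:
    # get_type_hints resolves PEP 563 strings and plain annotations alike.
    hints = typing.get_type_hints(cls)
    return sorted((f.name, str(hints[f.name])) for f in fields(cls))
```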
Cortex uses ZeroMQ multipart messages. Each published message is a list of
+frames rather than a single blob. That lets array payloads ride as raw
+contiguous buffers — no copy into a Python bytes, no re-copy by ZMQ.
Field values are packed in declaration order (not by name), so the receiver
+reconstructs using the dataclass's cached field tuple. This removes per-message
+field-name encoding.
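Both sides can be pictured as the following sketch (helper names are ours):

```python
from dataclasses import fields

def values_in_order(msg) -> list:
    # Sender: positional values, no field names on the wire.
    return [getattr(msg, f.name) for f in fields(msg)]

def rebuild(cls, values):
    # Receiver: the same declaration order reconstructs the message.
    return cls(*values)
```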
+
Arrays and tensors appear in the metadata as small dict stand-ins called
+OOB descriptors:
```mermaid
sequenceDiagram
    participant U as User
    participant M as Message.to_frames
    participant S as serialize_message_frames
    participant E as _encode_transport_value
    participant Z as ZMQ send_multipart

    U->>M: build header + collect field values
    M->>S: values in declaration order
    S->>E: for each value, walk nested dicts/lists
    E-->>S: scalar stays inline; array → OOB descriptor + buffer appended
    S-->>M: (metadata_bytes, [buf0, buf1, ...])
    M-->>Z: [topic, header, metadata, *buffers]
```
Message.to_bytes() / from_bytes() / Message.decode() still exist. They
+pack everything into one msgpack blob using ExtType for arrays. That path
+is retained for tests and opportunistic use; the transport always uses the
+multipart path above.
+
+
**Mismatch trap**

Bytes captured from the wire cannot be fed to Message.decode() — the wire
format is multipart, not a single blob. Use Message.from_frames(frames).
# Cortex Critique

A bottom-up review of Cortex as it stands today, with a focus on its
viability as a communication library for robotics. This complements
design-review.md with concrete code-level findings and benchmark
observations.
field.type is a string when from __future__ import annotations is active and a real type otherwise. The fingerprint therefore depends on how the module was imported — fragile for cross-repo use.
+
Field ordering is sorted alphabetically in the fingerprint, but the wire layout uses dataclass declaration order. Two classes could theoretically fingerprint identically but interpret the wire differently.
### 2. Message base — messages/base.py

Each dataclass inheriting Message is auto-registered via __init_subclass__ into MessageType._registry[fingerprint] = cls.
+
Wire format (multipart transport, what publishers actually use):
+
```
Frame 0:    topic bytes (for PUB/SUB filter)
Frame 1:    24-byte header (fingerprint u64, timestamp_ns u64, sequence u64, big-endian)
Frame 2:    msgpack of ordered field values with OOB descriptors
Frame 3..N: raw contiguous array buffers (zero-copy)
```
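The header is three big-endian u64s, which pack and unpack cleanly with
struct (a sketch; the helper name is ours):

```python
import struct
import time

HEADER = struct.Struct(">QQQ")  # fingerprint, timestamp_ns, sequence
assert HEADER.size == 24

def pack_header(fingerprint: int, sequence: int) -> bytes:
    return HEADER.pack(fingerprint, time.time_ns(), sequence)

fingerprint, timestamp_ns, sequence = HEADER.unpack(pack_header(0xABCD, 7))
```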
+
+
There is a second, legacy single-blob path (to_bytes / from_bytes) that embeds array bytes inside a single msgpack blob using ExtType. It is retained for Message.decode(...) and tests, but is not what the transport uses.
_msgpack_default / _msgpack_ext_hook (inline): arrays/tensors get packed as msgpack ExtType inside the single blob. Used by the legacy path.
+
_encode_transport_value / _decode_transport_value (out-of-band): each array/tensor is replaced with a tiny dict {__cortex_oob__: "numpy", buffer: i, dtype, shape} and its raw bytes are appended as separate ZMQ frames. Reconstruction uses np.frombuffer(frame.buffer, dtype).reshape(shape) with no copy.
+
+
After the March 2026 optimizations: zero-copy decode, schema-ordered values (field names no longer repeated per message), and cached field-name tuples.
+
### 4. Discovery — discovery/daemon.py and discovery/client.py
+
- Single-threaded zmq.REP over IPC at ipc:///tmp/cortex/discovery.sock.
- Registry is a plain dict[str, TopicInfo], enforcing one publisher per topic.
- RCVTIMEO=1s so the run loop can poll _running for Ctrl-C.
Publisher: binds a zmq.PUB at ipc:///tmp/cortex/topics/<node>__<topic>.sock, registers via the discovery client, publishes multipart [topic, header, metadata, *buffers] with zmq.NOBLOCK. If the Node hands it an async context, it wraps a sync zmq.Context(self._context) around the same underlying zmq io threads so publishing stays synchronous.
+
Subscriber: uses an async context, looks up the topic (optionally waits), connects zmq.SUB, sets a topic filter, loops via AsyncExecutor doing recv_multipart(copy=False) → Message.from_frames.
A Node owns a shared zmq.asyncio.Context, plus lists of publishers, subscribers, and timers. Each timer gets a RateExecutor(fn, rate_hz). node.run() creates asyncio tasks for every timer and every callback-subscriber, then asyncio.gather. RateExecutor uses perf_counter plus asyncio.sleep(max(0, next-now)). cortex.run prefers uvloop on Unix.
Measured on this machine with the in-repo benchmark suite:

| Metric | Value |
| --- | --- |
| Small-payload latency | mean 556 µs, p99 1075 µs |
| 64 KB latency | mean 919 µs, p99 1.4 ms |
| Tiny array throughput | 21.8k msg/s |
| 1 MB array throughput | 7.7k msg/s, 8.0 GB/s |
| 4 MB array throughput | 2.25k msg/s, 9.4 GB/s |
| 1080p RGB frames | 1422 fps, 8.8 GB/s |
| Raw wire+decode (inproc) | 35 µs roundtrip (4 MB array) |

The delta between the ~35 µs raw wire and ~550 µs end-to-end is asyncio scheduling, context-switch between publisher timer and subscriber recv, and Python callback dispatch. Serialization is close to memcpy-bandwidth on large payloads — the OOB transport is pulling its weight.
## What can be improved

### Design-level (biggest wins)

Latency floor is too high for control loops. ~550 µs mean and ~1.5 ms p99 is dominated by asyncio + zmq.asyncio, not zmq itself. Control topics should be able to opt into a synchronous thread-plus-zmq.Poller receive path targeting <100 µs p99. Async should be the default, not the only option.
+
+
+
Discovery is a single REQ/REP chokepoint with stop-the-world semantics. On crashes, stale topic entries are never reclaimed — a crashed publisher's IPC file stays on disk and the registry keeps pointing at a dead socket. Add leases with heartbeats (publisher renews every N seconds; daemon evicts stale entries), or a peer-gossip model where every node beacons presence. The current daemon has no concurrency — one slow client blocks all others.
+
+
+
One-publisher-per-topic is a hard limit for robotics. Redundant IMUs, failover, and multi-source fusion are all blocked. The registry should accept N publishers per topic and subscribers should connect() to all of them — ZMQ SUB handles fan-in natively.
+
+
+
No backpressure semantics. pub.publish() is NOBLOCK and silently drops on
HWM. Subscriber HWM=10 on SUB evicts old messages by default. Robotics needs
per-topic QoS profiles similar to DDS:

- best_effort_latest — camera frames: drop old, keep newest (ZMQ_CONFLATE=1).
- reliable_queue — commands: block or surface an error.
- dropping_queue — telemetry: current behavior, but with a drop counter.
+
+
+
No liveness or drop detection. A subscriber has no way to know the publisher died. Sequence numbers exist in the header but are never checked for gaps. Automatic gap-counting in Subscriber would be gold for debugging.
+
+
+
Callback execution blocks the receive loop. A 10 ms callback accumulates on SUB HWM and drops. Receive, decode, and user-callback execution should be decoupled with a bounded work queue and one or more worker coroutines/threads per subscriber. ROS 2 executors have this distinction for a reason.
+
+
+
Local-only transport in practice. Addresses are hardcoded ipc:// paths under /tmp. Multi-host robotics (robot ↔ base-station) needs TCP transport in discovery, NIC selection, and topology-aware addressing.
+
+
+
No shared memory for huge payloads. At 9 GB/s on 4 MB arrays, every subscriber gets a fresh copy. For multi-subscriber camera or LiDAR fan-out, a shared-memory transport (posix shm + ring buffer + zmq for control-plane notifications) would give true zero-copy.
publisher.py:91-95 — zmq.Context(self._context) creates a shadowed sync context sharing the async context's io threads. Correct, but subtle. zmq.PUB is not thread-safe — calling pub.publish() from multiple asyncio tasks on the same socket is undefined. Needs docs or a lock.
+
+
+
publisher.py:117-118 — the publisher unlinks any existing socket file on startup. If two publishers on the same host use the same node name + topic, the second silently steals the socket. Should fail loudly.
+
+
+
subscriber.py:155-160 — fingerprint mismatch logs a warning and proceeds anyway. That is a silent-data-corruption path. Should refuse to connect.
+
+
+
messages/base.py:109-129 — _sequence_counter is class-level, shared across every Publisher instance of that message type in the process. Two publishers of ArrayMessage interleave sequences — breaking per-topic drop detection. Move it onto the Publisher.
+
+
+
utils/hashing.py:34-38 — field.type is a string with PEP 563 and a real type otherwise; the resulting fingerprint differs across import environments. Use typing.get_type_hints(cls) consistently.
+
+
+
discovery/client.py:78-101 — retries=1 default means zero retries (loop runs once). Fencepost bug.
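The bug in miniature (names are illustrative, not the client's actual code):

```python
retries = 1
for attempt in range(retries):      # runs once: one try, zero retries
    pass

for attempt in range(1 + retries):  # intended: one try plus N retries
    pass
```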
+
+
+
core/executor.py:119-147 — RateExecutor has both await asyncio.sleep(0) inside the loop and await asyncio.sleep(max(0, dt)) at the bottom. The first is redundant and creates unnecessary wakeups. Catch-up logic silently eats dropped ticks; control loops often need to know.
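A catch-up-aware rate loop with a single sleep per tick, surfacing dropped
ticks instead of eating them (a sketch, not the library's RateExecutor):

```python
import asyncio
import time

async def rate_loop(func, rate_hz: float):
    interval = 1.0 / rate_hz
    next_tick = time.perf_counter()
    while True:
        await func()
        next_tick += interval
        now = time.perf_counter()
        if next_tick < now:  # overran the period: skip and report
            missed = int((now - next_tick) / interval) + 1
            print(f"dropped {missed} tick(s)")
            next_tick = now + interval
        await asyncio.sleep(next_tick - now)  # the only sleep per tick
```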
+
+
+
discovery/daemon.py:87 — RCVTIMEO=1s means Ctrl-C takes up to 1s to take effect and request throughput is throttled. A zmq.Poller with a shutdown PAIR socket gives clean immediate shutdown.
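The suggested pattern, sketched with standard pyzmq calls (socket names and
the inproc endpoint are ours):

```python
import zmq

ctx = zmq.Context.instance()
rep = ctx.socket(zmq.REP)
rep.bind("ipc:///tmp/cortex/discovery.sock")  # documented default path
stop = ctx.socket(zmq.PAIR)
stop.bind("inproc://daemon-stop")

poller = zmq.Poller()
poller.register(rep, zmq.POLLIN)
poller.register(stop, zmq.POLLIN)

running = True
while running:
    for sock, _ in poller.poll():  # wakes on either socket, no timeout spin
        if sock is stop:
            running = False  # one byte on the PAIR = immediate shutdown
        else:
            request = rep.recv()
            rep.send(b"")  # request handling elided
```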
+
+
+
messages/standard.py:146-150 — ImageMessage.__post_init__ auto-fill is non-idempotent across deserialization round-trips. Minor.
+
+
+
discovery/daemon.py:168-177 — same-publisher re-registration is allowed; if its IPC path changed, existing subscribers are never told. Needs a lease or a "changed" notification.
+
+
+
No CI test for cross-process fingerprint stability. Given how much safety rides on fingerprints, every standard message type deserves a stored golden fingerprint asserted in CI.
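Such a test could be as small as the following sketch (the fingerprint values
are placeholders, and the import location is assumed):

```python
import cortex.messages as messages  # assumed module path

GOLDEN = {
    "IntMessage": 0x0123456789ABCDEF,  # placeholder, not the real value
}

def test_fingerprints_are_stable():
    for name, expected in GOLDEN.items():
        cls = getattr(messages, name)
        assert cls.fingerprint() == expected, f"{name} fingerprint drifted"
```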
+
+
+
from_bytes vs from_frames asymmetry is a trap. Message.decode(bytes) only
handles the inline path. If anyone captures bytes from the wire (the
multipart path) and calls decode(), it will fail silently. Unify the paths
or rename decode.
+
+
No async publish. send_multipart briefly blocks on HWM/context switch;
inside an async timer callback this is a hidden blocking call. An async
publish variant would help.
Cortex is a well-built, honest small-system IPC library. The serialization is genuinely fast — hitting memcpy-bandwidth on 4 MB arrays with zero-copy OOB frames. The latency floor (~550 µs p50, ~1.5 ms p99) is limited by asyncio, not zmq. The discovery, QoS, liveness, and single-host assumptions are the real blockers for using this as robotics middleware.
+
Recommended path if adopting Cortex for robotics:
+
+
1. Add per-topic QoS profiles with drop counters (1-2 days).
2. Add a synchronous-threaded subscriber option for low-latency control (1 day).
3. Add heartbeats/leases and multi-publisher support to discovery (3-5 days).
4. Add TCP transport and host-aware discovery (2-3 days).
5. Then consider shared memory and schema evolution.
\ No newline at end of file
diff --git a/docs/site/gen_ref_pages.py b/docs/site/gen_ref_pages.py
new file mode 100644
index 0000000..a6c4f5a
--- /dev/null
+++ b/docs/site/gen_ref_pages.py
@@ -0,0 +1,47 @@
+"""Generate one API reference page per module under ``src/cortex/``.
+
+Executed by ``mkdocs-gen-files`` during the build. Emits:
+
+- ``reference/<module>.md`` for every public (non-underscore) module,
+- ``reference/<package>/index.md`` for every ``__init__.py``,
+- ``reference/SUMMARY.md`` consumed by ``mkdocs-literate-nav``.
+
+Keeping this generated means adding a new module needs zero doc edits.
+"""
+
+from pathlib import Path
+
+import mkdocs_gen_files
+
+# This script lives at ``docs/gen_ref_pages.py`` and is executed by
+# mkdocs-gen-files with the mkdocs.yml directory as cwd. Anchor to the
+# repo root so the generator finds ``src/cortex`` regardless of cwd.
+REPO_ROOT = Path(__file__).resolve().parent.parent
+SRC_ROOT = REPO_ROOT / "src"
+PACKAGE = "cortex"
+
+nav = mkdocs_gen_files.Nav()
+
+for path in sorted((SRC_ROOT / PACKAGE).rglob("*.py")):
+ module_path = path.relative_to(SRC_ROOT).with_suffix("")
+ doc_path = Path("reference", *module_path.parts[1:]).with_suffix(".md")
+ parts = tuple(module_path.parts)
+
+ if parts[-1] == "__init__":
+ parts = parts[:-1]
+ doc_path = doc_path.with_name("index.md")
+ elif parts[-1].startswith("_"):
+ continue
+
+ nav_parts = parts[1:] if parts[1:] else ("cortex",)
+ nav[nav_parts] = doc_path.relative_to("reference").as_posix()
+
+ identifier = ".".join(parts) if parts else PACKAGE
+ with mkdocs_gen_files.open(doc_path, "w") as f:
+ f.write(f"# `{identifier}`\n\n")
+ f.write(f"::: {identifier}\n")
+
+ mkdocs_gen_files.set_edit_path(doc_path, path.relative_to(REPO_ROOT))
+
+with mkdocs_gen_files.open("reference/SUMMARY.md", "w") as f:
+ f.writelines(nav.build_literate_nav())
diff --git a/docs/site/getting-started/discovery-daemon/index.html b/docs/site/getting-started/discovery-daemon/index.html
new file mode 100644
index 0000000..f81e399
--- /dev/null
+++ b/docs/site/getting-started/discovery-daemon/index.html
@@ -0,0 +1,1896 @@
The discovery daemon is a lightweight REP service that maintains the registry
+of active topics. Publishers register on startup; subscribers look up the
+endpoint and connect directly.
**"Address already in use"**

Another daemon (or a stale socket file) is holding the path.
rm /tmp/cortex/discovery.sock and restart.

**Subscribers time out looking up topics**

Daemon not running, or publisher failed to register. Run with
--log-level DEBUG and watch for REGISTER / LOOKUP lines.

**Daemon crash leaves stale entries**

Today, entries are only removed on explicit UNREGISTER. A crashed
publisher's topic stays in the registry pointing at a dead socket.
Restarting the daemon clears all state.
Most likely: the daemon is not running, or the topic name is mistyped.
+DiscoveryClient.wait_for_topic_async polls every 500 ms until the topic
+appears or the timeout fires.
+
```
cortex-discovery --log-level DEBUG
```
+
+
Watch for LOOKUP topic: /x -> NOT FOUND.
+
Publisher "works" but subscriber receives nothing¶
+
ZMQ PUB drops messages for which no matching SUB is connected yet. If your
+publisher starts first and publishes immediately, the first few messages are
+lost — this is the classic ZMQ slow-joiner problem.
+
Workarounds:

- Have the publisher wait briefly after bind before publishing the first
  message (see the sketch below).
- Have the subscriber wait-for-topic (the default) so it comes up after the
  publisher registered.
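A minimal version of the first workaround (the 0.2 s figure is a judgment
call, not a Cortex constant):

```python
import time

pub = node.create_publisher("/sensor/data", IntMessage)
time.sleep(0.2)  # give late-joining SUB sockets time to connect
pub.publish(IntMessage(1))  # constructor shape assumed for illustration
```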
If a publisher exits uncleanly, its IPC socket file remains. Cortex's
Publisher._setup_socket unlinks any existing file at the same path on the
next bind — so restarting the publisher fixes it. Otherwise, remove it by
hand: rm /tmp/cortex/topics/<stale-socket>.sock
The registry is in-memory. Restarting the daemon wipes all state;
+publishers do not auto-re-register today. Restart your publishers after
+restarting the daemon.
If you see
+Message type mismatch for /x: expected FooMessage, got BarMessage —
+the topic was registered with a different message class. Either rename the
+topic or align the classes.
Per-socket HWM defaults to 10. Increase queue_size on high-rate producers
+whose subscribers are known to be slow — but remember that ZMQ drops silently
+at the HWM.
Single tiny messages (primitives only, < 1 KB) see no benefit from multipart.
+The inline to_bytes path is still fine there. Publishers always use
+multipart today.
A lightweight Python framework for inter-process communication over ZeroMQ.
+
Cortex is a pub/sub layer designed to feel obvious. Nodes publish typed messages on named topics; subscribers receive them via async callbacks. A tiny discovery daemon tells subscribers where to connect. Native support for NumPy arrays and PyTorch tensors keeps robotics- and ML-shaped payloads fast.
Cortex targets single-host process graphs today. See design-review.md
+and critique.md for an honest account of current limits and the
+roadmap toward multi-host robotics use.
- **Getting started** — Install, start the daemon, publish your first message
  in under two minutes.
- **Concepts** — How the wire format, fingerprinting, discovery handshake,
  and async execution fit together.
- **Components** — Deep dives into the Messages, Discovery, and Core modules.
- **API reference** — Auto-generated from the source. Always matches the code
  on main.
Discovery is Cortex's control plane: a single long-lived process that maps topic names to ZMQ endpoints. It sits off the data path — once a subscriber has an endpoint, messages flow publisher → subscriber directly without the daemon's involvement.
Everyone agrees on the wire format via protocol.py. The daemon runs a single-threaded REP loop. The client speaks REQ from every publisher and subscriber in the graph.
- Binds zmq.REP at ipc:///tmp/cortex/discovery.sock by default.
- Maintains _topics: dict[str, TopicInfo] — one publisher per topic.
- RCVTIMEO=1000 on the socket so the loop can check _running for clean
  Ctrl-C. This also means the daemon is naturally single-request-at-a-time —
  a slow client blocks all others.
","path":["Components","Discovery"],"tags":[]},{"location":"components/discovery/#registry-semantics","level":3,"title":"Registry semantics","text":"Case Result New topic Insert → OK Same topic, same publisher_node Overwrite → OK (re-registration) Same topic, different publisher_node Reject → ALREADY_EXISTS UNREGISTER missing topic NOT_FOUND","path":["Components","Discovery"],"tags":[]},{"location":"components/discovery/#client","level":2,"title":"Client","text":"
Implemented in DiscoveryClient.
Thin REQ wrapper around the protocol. Important operational detail: REQ sockets stick after a timeout — they block subsequent sends waiting for a reply that never came. The client handles this by closing and recreating the socket on every timeout (_reconnect). Callers don't see it.
Messages are just @dataclasses that inherit from Message. Registering with the type system, computing a fingerprint, and (de)serialization all happen automatically.
","path":["Components","Messages"],"tags":[]},{"location":"components/messages/#anatomy-of-a-message","level":2,"title":"Anatomy of a message","text":"
","path":["Components","Messages"],"tags":[]},{"location":"components/messages/#defining-a-custom-message","level":2,"title":"Defining a custom message","text":"
Message._sequence_counter is shared across all publisher instances of the same message class in the process. Two ArrayMessage publishers interleave sequence numbers. Per-topic gap detection therefore needs a per-publisher counter today; see critique.md § 12.
A Node is the user-facing composition unit: it owns a shared ZMQ async context and a collection of publishers, subscribers, and timers. Executors provide the scheduling primitives that timers and subscriber receive loops run on.
One node = one process boundary in practice. Nothing stops you running multiple nodes in the same process (asyncio.gather([n.run() for n in nodes]), see examples/multi_node_system.py), but remember they share the same event loop — a slow callback in one still blocks the others.
Spawns one asyncio task per timer and one per callback-bearing subscriber, then asyncio.gathers them. Returns when all tasks complete or the node is stopped.
async with Node(\"my_node\") as node:\n node.create_publisher(\"/x\", IntMessage)\n node.create_subscriber(\"/y\", IntMessage, callback=on_y)\n await node.run() # blocks until cancelled\n# __aexit__ calls close() automatically\n
Stops all executors, cancels outstanding tasks, closes every publisher and subscriber (each of which unregisters/unbinds their own socket), and terminates the shared ZMQ context. Idempotent.
\"Run this coroutine at a constant rate, catching up on overruns.\"
```mermaid
flowchart TD
    Start[next = perf_counter] --> Loop{running?}
    Loop -- no --> End
    Loop -- yes --> Now[now = perf_counter]
    Now --> Due{now >= next?}
    Due -- yes --> Call[await func]
    Call --> Advance[next += interval]
    Advance --> Behind{next < now?}
    Behind -- yes --> Reset[next = now + interval]
    Behind -- no --> Wait
    Reset --> Wait
    Due -- no --> Wait[await sleep next - now]
    Wait --> Loop
```
The catch-up branch silently drops ticks — if your 100 Hz callback takes 20 ms once, you do not get two callbacks back-to-back; you skip one tick.
**Redundant yield**

Today there is an await asyncio.sleep(0) inside the loop and await asyncio.sleep(max(0, dt)) at the bottom. That generates an extra wakeup per tick. See critique § 15.
Timers are plain async functions — no decorator, no magic. They run in the same event loop as subscriber callbacks, so the same head-of-line caveat applies.
The data-plane workhorses. A Publisher binds a ZMQ PUB socket and registers with discovery; a Subscriber looks up the endpoint, connects a SUB socket, and drives an async receive loop. Discovery is consulted once per topic on startup — it is not on the hot path.
","path":["Components","Publisher & Subscriber"],"tags":[]},{"location":"components/publisher-subscriber/#relationship-to-the-rest-of-the-stack","level":2,"title":"Relationship to the rest of the stack","text":"
```mermaid
flowchart LR
    Node -.owns.-> P[Publisher]
    Node -.owns.-> S[Subscriber]
    P -- register --> DC1[DiscoveryClient]
    S -- lookup --> DC2[DiscoveryClient]
    P -- send_multipart --> Sock1[(zmq.PUB<br/>IPC)]
    Sock1 -. IPC .-> Sock2[(zmq.SUB)]
    S -- recv_multipart --> Sock2
    M[Message] -- to_frames --> P
    S -- from_frames --> M
```
Always create via Node.create_publisher — direct construction works but skips the shared ZMQ context reuse and the node-level registration bookkeeping.
```python
pub = node.create_publisher(
    topic_name="/camera/image",  # must start with "/"
    message_type=ImageMessage,   # fingerprint is taken from this class
    queue_size=100,              # SNDHWM; drops under backpressure
)
```
```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant Pub as Publisher
    participant FS as /tmp/cortex/topics/
    participant ZMQ as zmq.PUB
    participant D as Discovery daemon

    U->>Pub: __init__(topic, msg_cls, ...)
    Pub->>Pub: address = generate_ipc_address(topic, node)
    Pub->>FS: mkdir -p; unlink stale .sock
    Pub->>ZMQ: socket(PUB); setsockopt HWM/LINGER; bind(address)
    Pub->>D: REGISTER TopicInfo{name, address, fingerprint, node}
    D-->>Pub: OK / ALREADY_EXISTS
    Note over Pub: ready; user can publish()
```
Two things worth calling out:
- The IPC address is derived deterministically from node_name and topic_name
  via generate_ipc_address:
  ipc:///tmp/cortex/topics/<node>__<topic-with-slashes-as-underscores>.sock.
- _setup_socket unlinks any existing file at that path before binding. That
  protects against crash-leftover sockets, but also means two publishers
  configured with the same node_name + topic_name in the same process tree
  will silently stomp each other — see critique § 10.
Any other exception is logged and swallowed; publish still returns False. For robotics code this \"fire and forget\" is intentional — the caller decides whether to retry based on the return value and the topic's role.
Node owns a zmq.asyncio.Context. The Publisher constructor detects this and wraps a sync zmq.Context around the same underlying io threads:
```python
if isinstance(self._context, zmq.asyncio.Context):
    self._context: zmq.Context = zmq.Context(self._context)
```
","path":["Components","Publisher & Subscriber"],"tags":[]},{"location":"components/publisher-subscriber/#common-pitfalls","level":2,"title":"Common pitfalls","text":"Symptom Cause Fix First N messages not received ZMQ \"slow joiner\": SUB not connected yet when PUB started publishing Let subscriber start first, or sleep briefly before first publish Subscriber receives nothing, no errors Topic name mismatch, or forgot to call node.run() Log both sides; run cortex-discovery --log-level DEBUGpublish() returns False repeatedly Subscriber can't keep up; SNDHWM reached Increase queue_size, or reduce publish rate Mutating a received array \"corrupts\" later Decoded arrays alias ZMQ frame memory arr = arr.copy() before mutating Two processes stomp each other's socket Same node_name + topic_name Unique node names per process","path":["Components","Publisher & Subscriber"],"tags":[]},{"location":"components/publisher-subscriber/#see-also","level":2,"title":"See also","text":"
","path":["Components","Serialization"],"tags":[]},{"location":"components/serialization/#supported-types","level":2,"title":"Supported types","text":"Type Inline path (to_bytes) OOB path (to_frames) None 1 byte tag msgpack nilint, float, str, bool msgpack PRIMITIVE msgpack bytes tag + length + bytes msgpack bin list, tuple, dict msgpack with ExtType arrays msgpack with OOB descriptors np.ndarray ExtType (inline bytes) OOB descriptor + extra frame torch.Tensor ExtType (inline bytes) OOB descriptor + extra frame","path":["Components","Serialization"],"tags":[]},{"location":"components/serialization/#the-two-paths-side-by-side","level":2,"title":"The two paths, side by side","text":"OOB multipart (used on the wire)Inline blob (legacy / Message.decode)
","path":["Components","Serialization"],"tags":[]},{"location":"components/serialization/#when-to-use-each-helper","level":2,"title":"When to use each helper","text":"Helper Use when serialize_message_frames You're building a custom transport that speaks multipart deserialize_message_frames Decoding the above serialize(value) / deserialize Persisting a single value to disk / cache serialize_numpy / deserialize_numpy Raw array round-trip without msgpack overhead Message.to_frames / Message.from_frames Anything inside Cortex itself","path":["Components","Serialization"],"tags":[]},{"location":"components/serialization/#see-also","level":2,"title":"See also","text":"
The RateExecutor tick, at a glance:

```mermaid
sequenceDiagram
    participant L as Event loop
    participant R as RateExecutor
    participant CB as callback

    loop every interval
        L->>R: resume
        R->>CB: await callback()
        R->>R: next_exec_time += interval
        alt fell behind
            R->>R: next_exec_time = now + interval
        end
        R->>L: sleep(next_exec_time - now)
    end
```
Catch-up logic silently drops ticks when a callback overruns its period — something to keep in mind for control loops.
","path":["Concepts","Discovery protocol"],"tags":[]},{"location":"concepts/discovery-protocol/#commands","level":2,"title":"Commands","text":"Command Payload required Returns REGISTER_TOPIC (1) TopicInfo OK / ALREADY_EXISTS UNREGISTER_TOPIC (2) topic_name or TopicInfo.name OK / NOT_FOUND LOOKUP_TOPIC (3) topic_name OK + TopicInfo / NOT_FOUND LIST_TOPICS (4) — OK + list[TopicInfo]SHUTDOWN (99) — OK; daemon exits
Status codes: OK=0, NOT_FOUND=1, ALREADY_EXISTS=2, ERROR=3.
ZMQ REQ sockets enter a bad state after a missed reply — they block further sends. The client detects zmq.Again on timeout and rebuilds the socket:
```mermaid
flowchart TD
    A[send request] -->|timeout| B[REQ socket stuck]
    B --> C[close socket]
    C --> D[recreate socket<br/>same endpoint]
    D --> E[retry up to retries]
```
See DiscoveryClient._reconnect.
**Fencepost in retries default**

retries=1 today executes the loop exactly once — i.e. no retry. Bump to retries=3 in client-side code if you need resilience.
","path":["Concepts","Discovery protocol"],"tags":[]},{"location":"concepts/discovery-protocol/#failure-modes-how-cortex-handles-them","level":2,"title":"Failure modes & how Cortex handles them","text":"Scenario Behavior Daemon not running when publisher starts Register fails; publisher still publishes, but no subscriber can find it. Daemon restarts All state lost; publishers must re-register. Current design has no auto-re-register. Publisher crashes Registry keeps stale TopicInfo until someone UNREGISTERs. Two publishers, same topic Second registration rejected with ALREADY_EXISTS. Subscriber looks up before publisher NOT_FOUND; caller may wait_for_topic to poll.
Roadmap items (see critique.md) to address these: leases with heartbeats, multi-publisher support, and notify-on-change.
Taking inspiration from DDS, three profiles are enough for most robotics use:
- best_effort_latest — conflate; keep only newest (camera frames).
- reliable_queue — publisher blocks or errors (control commands).
- dropping_queue — current behavior with an exposed drop counter (telemetry).

See critique.md § 4 for rationale.
","path":["Concepts","Transport & QoS"],"tags":[]},{"location":"getting-started/discovery-daemon/","level":1,"title":"Running the Discovery Daemon","text":"
The discovery daemon is a lightweight REP service that maintains the registry of active topics. Publishers register on startup; subscribers look up the endpoint and connect directly.
","path":["Getting started","Running the Discovery Daemon"],"tags":[]},{"location":"getting-started/discovery-daemon/#start","level":2,"title":"Start","text":"As a scriptAs a moduleAs a systemd service
","path":["Getting started","Running the Discovery Daemon"],"tags":[]},{"location":"getting-started/discovery-daemon/#troubleshooting","level":2,"title":"Troubleshooting","text":"\"Address already in use\" Another daemon (or a stale socket file) is holding the path. rm /tmp/cortex/discovery.sock and restart. Subscribers time out looking up topics Daemon not running, or publisher failed to register. Run with --log-level DEBUG and watch for REGISTER / LOOKUP lines. Daemon crash leaves stale entries Today, entries are only removed on explicit UNREGISTER. A crashed publisher's topic stays in the registry pointing at a dead socket. Restarting the daemon clears all state.","path":["Getting started","Running the Discovery Daemon"],"tags":[]},{"location":"getting-started/installation/","level":1,"title":"Installation","text":"","path":["Getting started","Installation"],"tags":[]},{"location":"getting-started/installation/#requirements","level":2,"title":"Requirements","text":"
Python 3.10+
Linux or macOS (Windows works but without uvloop)
ZeroMQ shared library (bundled via pyzmq)
","path":["Getting started","Installation"],"tags":[]},{"location":"getting-started/installation/#install-from-source","level":2,"title":"Install from source","text":"
","path":["Getting started","Quickstart"],"tags":[]},{"location":"getting-started/quickstart/#what-just-happened","level":2,"title":"What just happened","text":"
sequenceDiagram\n participant P as Publisher\n participant D as Discovery daemon\n participant S as Subscriber\n\n P->>D: REGISTER /sensor/data -> ipc:///tmp/cortex/topics/...\n S->>D: LOOKUP /sensor/data\n D-->>S: ipc:///tmp/cortex/topics/...\n S->>P: ZMQ SUB connect + SUBSCRIBE \"/sensor/data\"\n loop 10 Hz\n P->>S: multipart [topic, header, metadata, buffer]\n S->>S: decode + await on_data(msg, header)\n end
See Concepts → Architecture for the end-to-end picture, or jump into a custom message tutorial.
Pin publisher and subscriber to separate cores for stable latency numbers (a pinning sketch follows this list).
Disable Turbo-Boost / set CPU governor to performance for reproducible runs.
Always measure with the discovery daemon also running (it is off the hot path but can steal a little cache).
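On Linux the pinning can be done from inside the benchmark script itself; a minimal sketch (core numbers are machine-specific):

import os

# Pin this process to core 2; run publisher and subscriber with disjoint sets.
os.sched_setaffinity(0, {2})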
","path":["Guides","Benchmarks"],"tags":[]},{"location":"guides/debugging/","level":1,"title":"Debugging","text":"","path":["Guides","Debugging"],"tags":[]},{"location":"guides/debugging/#subscriber-hangs-on-startup","level":2,"title":"Subscriber hangs on startup","text":"
Most likely: the daemon is not running, or the topic name is mistyped. DiscoveryClient.wait_for_topic_async polls every 500 ms until the topic appears or the timeout fires.
cortex-discovery --log-level DEBUG\n
Watch for LOOKUP topic: /x -> NOT FOUND.
","path":["Guides","Debugging"],"tags":[]},{"location":"guides/debugging/#publisher-works-but-subscriber-receives-nothing","level":2,"title":"Publisher \"works\" but subscriber receives nothing","text":"
ZMQ PUB drops messages for which no matching SUB is connected yet. If your publisher starts first and publishes immediately, the first few messages are lost — this is the classic ZMQ slow-joiner problem.
Workarounds (the first is sketched after this list):
Have the publisher wait briefly after bind before publishing the first message.
Have the subscriber wait-for-topic (the default) so it comes up only after the publisher has registered.
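A minimal sketch of the first workaround, using the node API shown elsewhere in these docs (the 0.2 s settle time is an arbitrary assumption; tune it for your setup):

import asyncio

async def run(node, message_cls):
    pub = node.create_publisher("/sensor/data", message_cls)
    await asyncio.sleep(0.2)  # let early subscribers connect before the first publish
    while True:
        pub.publish(message_cls())  # assumes a message type with field defaults
        await asyncio.sleep(0.1)    # 10 Hz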
If a publisher exits uncleanly, its IPC socket file remains. Cortex's Publisher._setup_socket unlinks any existing file at the same path on the next bind — so restarting the publisher fixes it. Otherwise:
rm /tmp/cortex/topics/<stale-socket>.sock\n
","path":["Guides","Debugging"],"tags":[]},{"location":"guides/debugging/#daemon-state-survives-restarts-but-doesnt","level":2,"title":"Daemon state survives restarts — but doesn't","text":"
The registry is in-memory. Restarting the daemon wipes all state; publishers do not auto-re-register today. Restart your publishers after restarting the daemon.
If you see Message type mismatch for /x: expected FooMessage, got BarMessage — the topic was registered with a different message class. Either rename the topic or align the classes.
Per-socket HWM defaults to 10. Increase queue_size on high-rate producers whose subscribers are known to be slow — but remember that ZMQ drops silently at the HWM.
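For example (TelemetryMessage and the node handle are stand-ins):

# Trade memory for headroom on a bursty producer; ZMQ still drops
# silently once even this larger HWM fills up.
pub = node.create_publisher("/telemetry", TelemetryMessage, queue_size=1000)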
","path":["Guides","Performance tuning"],"tags":[]},{"location":"guides/performance-tuning/#when-to-prefer-the-inline-path","level":2,"title":"When to prefer the inline path","text":"
Single tiny messages (primitives only, < 1 KB) see no benefit from multipart; the inline to_bytes path is still fine there. Publishers always use multipart today, so this matters only when you serialize by hand (e.g. for logging or tests).
A message in Cortex is any dataclass that inherits from Message. Auto-registration, fingerprinting, and (de)serialization are all derived from the dataclass definition — you write the schema once, publishers and subscribers speak the same wire format.
Put your message definitions in a module both the publisher and subscriber import. The fingerprint is computed from module.qualname + field names/types; an identical re-declaration in two different modules produces different fingerprints.
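A minimal example, using the cortex.messages.base.Message base class named in the API reference (the field choices are illustrative):

from dataclasses import dataclass

import numpy as np

from cortex.messages.base import Message


@dataclass
class SensorReading(Message):
    sensor_id: str
    timestamp: float
    values: np.ndarray  # rides as a zero-copy OOB frame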
","path":["Tutorials","Custom messages"],"tags":[]},{"location":"tutorials/custom-messages/#how-the-dataclass-becomes-a-wire-message","level":2,"title":"How the dataclass becomes a wire message","text":"
flowchart LR\n DC[dataclass fields] --> FP[fingerprint]\n DC --> ORD[declaration order]\n ORD --> Enc[serialize_message_frames<br/>values in order]\n Enc --> Meta[metadata frame]\n Enc --> Bufs[array frames]\n FP --> Hdr[24-byte header]\n Hdr --> Wire[(multipart send)]\n Meta --> Wire\n Bufs --> Wire
See Concepts → Message wire format for the full picture.
","path":["Tutorials","Custom messages"],"tags":[]},{"location":"tutorials/custom-messages/#supported-field-types","level":2,"title":"Supported field types","text":"Field type Notes int / float / bool / str Plain msgpack primitives bytes msgpack bin list[...] / tuple[...] Walked recursively dict[str, Any] Walked recursively; arrays inside are still OOB np.ndarray OOB frame; zero-copy decode torch.Tensor OOB frame; CPU-transported, device restored on decode Optional nested Message Not first-class today — flatten instead","path":["Tutorials","Custom messages"],"tags":[]},{"location":"tutorials/custom-messages/#evolution-what-breaks-the-fingerprint","level":2,"title":"Evolution: what breaks the fingerprint","text":"
Changing any of these changes the fingerprint and makes old and new publishers/subscribers incompatible (a concrete example follows these lists):
Renaming the class, its module, or any field
Adding a field (even with a default)
Removing a field
Changing a field's annotation text
Safe to change without breaking:
Reordering methods, adding methods
Editing docstrings or defaults
Changing unrelated classes in the same module
See critique.md § 22 for the roadmap on first-class schema evolution.
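To make the first list concrete: re-declaring the SensorReading sketched earlier with one extra field yields a different fingerprint, even though the field has a default, so old and new peers reject each other:

from dataclasses import dataclass

import numpy as np

from cortex.messages.base import Message


@dataclass
class SensorReading(Message):
    sensor_id: str
    timestamp: float
    values: np.ndarray
    frame_id: int = 0  # the added field; defaults do not preserve compatibility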
A walk-through of examples/multi_node_system.py — a sensor → processor → monitor pipeline with custom messages, multiple publishers and subscribers, and periodic status reporting.
Four nodes run in a single Python process, each on the same asyncio event loop via asyncio.gather. Cortex's IPC transport does not care that they share a process — the data still rides through real ZMQ sockets.
","path":["Tutorials","Multi-node system"],"tags":[]},{"location":"tutorials/multi-node-system/#running-the-whole-graph","level":2,"title":"Running the whole graph","text":"
async def main():\n sensor_nodes = [SensorNode(sid, publish_rate=10.0) for sid in [\"lidar\", \"camera\"]]\n processor_node = ProcessorNode([\"lidar\", \"camera\"])\n monitor_node = MonitorNode()\n all_nodes = [*sensor_nodes, processor_node, monitor_node]\n\n await asyncio.sleep(1.0) # let topics register and subscribers connect\n\n try:\n await asyncio.gather(*[n.run() for n in all_nodes])\n finally:\n for n in all_nodes:\n await n.close()\n
sequenceDiagram\n participant L as lidar\n participant C as camera\n participant P as processor\n participant M as monitor\n\n par at 10 Hz\n L->>P: SensorReading\n and\n C->>P: SensorReading\n end\n P->>M: ProcessedData (per reading)\n Note over M: counts per second\n M->>M: publish SystemStatus (1 Hz)
","path":["Tutorials","Multi-node system"],"tags":[]},{"location":"tutorials/multi-node-system/#run-it-yourself","level":2,"title":"Run it yourself","text":"
Cortex treats NumPy arrays as first-class payloads. Array bytes travel as separate ZMQ frames and are reconstructed with np.frombuffer on the receiver — no intermediate bytes object, no extra copy.
","path":["Tutorials","NumPy arrays & images"],"tags":[]},{"location":"tutorials/numpy-and-images/#pattern-publisher-that-emits-synthetic-frames","level":2,"title":"Pattern: publisher that emits synthetic frames","text":"camera.py
","path":["Tutorials","NumPy arrays & images"],"tags":[]},{"location":"tutorials/numpy-and-images/#pattern-subscriber-that-processes-frames","level":2,"title":"Pattern: subscriber that processes frames","text":"viewer.py
import numpy as np\nimport cortex\nfrom cortex import Node, ArrayMessage\nfrom cortex.messages.base import MessageHeader\n\n\nasync def on_frame(msg: ArrayMessage, header: MessageHeader):\n # msg.data aliases the ZMQ frame buffer — copy before mutating\n frame = msg.data.copy()\n frame[..., 0] = 0 # zero out red channel\n print(f\"[{header.sequence}] {msg.name} mean={frame.mean():.1f}\")\n\n\nclass Viewer(Node):\n def __init__(self):\n super().__init__(\"viewer\")\n self.create_subscriber(\"/cam/frame\", ArrayMessage, callback=on_frame)\n\n\ncortex.run(Viewer().run())\n
","path":["Tutorials","NumPy arrays & images"],"tags":[]},{"location":"tutorials/numpy-and-images/#aliasing-rule-of-thumb","level":2,"title":"Aliasing rule of thumb","text":"
flowchart LR\n A[recv multipart<br/>copy=False] --> B[np.frombuffer view]\n B --> C{Do you...}\n C -->|only read inside callback| OK[Use as-is: fastest]\n C -->|mutate| CP[arr = arr.copy]\n C -->|keep past callback| CP\n C -->|pass to another thread| CP\n CP --> Safe[safe, owned copy]
TensorMessage lets you pipe tensors between processes with the same zero-copy multipart transport used for NumPy arrays. Device and requires_grad metadata are preserved; the bytes travel via the CPU side of the tensor.
flowchart LR\n A[torch.Tensor<br/>cuda:0, grad=True] --> B[encode: .detach.cpu.numpy<br/>contiguous]\n B --> C[OOB frame + metadata<br/>device_str, requires_grad, dtype, shape]\n C -. IPC .-> D[decode: np.frombuffer<br/>torch.from_numpy]\n D --> E{cuda available?}\n E -- yes --> F[move to device_str]\n E -- no --> G[stay on CPU]\n F --> H[requires_grad_ True if flagged]\n G --> H
Attribute Transported dtype ✓ exact shape ✓ device ✓ string; restored on decode if available requires_grad ✓ grad (the actual gradient) ✗ not sent autograd graph ✗ not sent (detach() is implicit)","path":["Tutorials","PyTorch tensors"],"tags":[]},{"location":"tutorials/pytorch-tensors/#multi-tensor-payloads","level":2,"title":"Multi-tensor payloads","text":"
When you need several tensors together — e.g. a model's inputs and outputs — use MultiTensorMessage:
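A sketch of the idea; the exact constructor signature is an assumption, so check the API reference:

import torch

from cortex import MultiTensorMessage  # import path assumed

batch = {
    "inputs": torch.randn(4, 3, 224, 224),
    "logits": torch.randn(4, 1000),
}
# Hypothetical usage: a named mapping of tensors travelling as one multipart
# message, one OOB frame per tensor.
msg = MultiTensorMessage(tensors=batch, name="forward_pass")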
Even for two processes on the same GPU, tensors are DMA'd to CPU on send and back to GPU on receive. That is a copy on each side. Cortex does not currently support CUDA IPC — for tight in-process handoffs, prefer a torch.multiprocessing queue or shared CUDA memory.
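For contrast, here is a same-machine GPU handoff that avoids the host round-trip entirely. This is standard PyTorch, not a Cortex API:

import torch
import torch.multiprocessing as mp


def consumer(q):
    t = q.get()  # arrives backed by the same CUDA allocation (CUDA IPC)
    print(t.device, t.sum().item())


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn is required for CUDA tensors
    q = ctx.Queue()
    p = ctx.Process(target=consumer, args=(q,))
    p.start()
    q.put(torch.ones(1024, device="cuda"))  # shared via CUDA IPC, no CPU copy
    p.join()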
Install with the torch extra
TensorMessage raises on construction if PyTorch is not installed. Use pip install -e \".[torch]\".
A message in Cortex is any dataclass that inherits from
+[Message][cortex.messages.base.Message]. Auto-registration, fingerprinting,
+and (de)serialization are all derived from the dataclass definition — you
+write the schema once, publishers and subscribers speak the same wire format.
Put your message definitions in a module both the publisher and
+subscriber import. The fingerprint is computed from
+module.qualname + field names/types; an identical re-declaration in
+two different modules produces different fingerprints.
A walk-through of examples/multi_node_system.py —
+a sensor → processor → monitor pipeline with custom messages, multiple
+publishers and subscribers, and periodic status reporting.
Four nodes run in a single Python process, each on the same asyncio event
+loop via asyncio.gather. Cortex's IPC transport does not care that they
+share a process — the data still rides through real ZMQ sockets.
sequenceDiagram
+ participant L as lidar
+ participant C as camera
+ participant P as processor
+ participant M as monitor
+
+ par at 10 Hz
+ L->>P: SensorReading
+ and
+ C->>P: SensorReading
+ end
+ P->>M: ProcessedData (per reading)
+ Note over M: counts per second
+ M->>M: publish SystemStatus (1 Hz)
Cortex treats NumPy arrays as first-class payloads. Array bytes travel as
+separate ZMQ frames and are reconstructed with np.frombuffer on the
+receiver — no intermediate bytes object, no extra copy.
[TensorMessage][cortex.messages.standard.TensorMessage] lets you pipe
+tensors between processes with the same zero-copy multipart transport used
+for NumPy arrays. Device and requires_grad metadata are preserved; the
+bytes travel via the CPU side of the tensor.
+import torch
+import cortex
+from cortex import Node, TensorMessage
+
+
+class Inference(Node):
+    def __init__(self):
+        super().__init__("inference")
+        self.pub = self.create_publisher("/model/features", TensorMessage)
+        self.create_timer(1 / 30, self.tick)
+
+    async def tick(self):
+        # Fake feature tensor; could be output of a real model
+        feats = torch.randn(4, 256, 7, 7, device="cuda" if torch.cuda.is_available() else "cpu")
+        self.pub.publish(TensorMessage(data=feats, name="layer4_feats"))
+
+
+cortex.run(Inference().run())
+
flowchart LR
+ A[torch.Tensor<br/>cuda:0, grad=True] --> B[encode: .detach.cpu.numpy<br/>contiguous]
+ B --> C[OOB frame + metadata<br/>device_str, requires_grad, dtype, shape]
+ C -. IPC .-> D[decode: np.frombuffer<br/>torch.from_numpy]
+ D --> E{cuda available?}
+ E -- yes --> F[move to device_str]
+ E -- no --> G[stay on CPU]
+ F --> H[requires_grad_ True if flagged]
+ G --> H
Even for two processes on the same GPU, tensors are DMA'd to CPU on send
+and back to GPU on receive. That is a copy on each side. Cortex does not
+currently support CUDA IPC — for tight in-process handoffs, prefer a
+torch.multiprocessing queue or shared CUDA memory.
+
+
+
Install with the torch extra
+
TensorMessage raises on construction if PyTorch is not installed. Use
+pip install -e ".[torch]".