NanoStream is an ultra-low-latency streaming alignment engine for Oxford Nanopore Technologies (ONT) raw signal processing. It enables real-time "Read Until" (adaptive sampling) molecule ejection with sub-millisecond reaction times.
Built on modern Java (Project Panama & Vector API), it abandons traditional Enterprise Java patterns in favor of High-Frequency Trading (HFT) architectures: no heap allocations (and therefore no GC activity) on the hot path, hardware SIMD vectorization, cache-line padding, and lock-free thread synchronization.
- Zero-GC Hot Path: 100% of signal processing, DTW alignment, and output formatting happens in off-heap memory (`MemorySegment`, `Arena`). No garbage collector pauses (tested with EpsilonGC and Generational ZGC).
- Hardware SIMD Acceleration: Uses `jdk.incubator.vector` to compute Dynamic Time Warping (DTW) distances with hardware FMA (Fused Multiply-Add) instructions.
- HFT Synchronization: Employs a custom lock-free MPSC Ring Buffer with explicit cache-line padding to eliminate false sharing and L1 cache invalidation.
- Mechanical Sympathy: Thread affinity (`AffinityLock`) binds the SIMD orchestrator to a physical CPU core, while branchless code generates optimal `CMOV` instructions.
- Double-Buffered I/O: Uses Sequence Barriers to read Apache Arrow IPC (POD5) files seamlessly without blocking the computational engine.
- Enterprise SPI: Zero-overhead plugin system for real-time target matching logic (e.g., AMR detection).
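
To make the zero-GC idea from the first bullet concrete, here is a minimal sketch of a DTW distance computed entirely over off-heap `MemorySegment` buffers, with two rolling rows allocated once from an `Arena` so the inner loop allocates nothing. It is not NanoStream's kernel: it is scalar rather than SIMD, unbanded, and assumes an absolute-difference cost. The class and method names (`OffHeapDtwSketch`, `dtw`, `align`) are hypothetical. It assumes Java 22 or later, where the Foreign Function & Memory API is final.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration, not NanoStream's actual kernel. Requires Java 22+.
final class OffHeapDtwSketch {
    private static final ValueLayout.OfFloat F32 = ValueLayout.JAVA_FLOAT;

    // Unbanded DTW with |a - b| cost (an assumption) over two reusable
    // off-heap rows of m + 1 floats each. Nothing is allocated in here.
    static float dtw(MemorySegment signal, int n, MemorySegment ref, int m,
                     MemorySegment prev, MemorySegment curr) {
        // Row 0: only D(0,0) = 0 is a valid starting cell.
        prev.setAtIndex(F32, 0, 0f);
        for (int j = 1; j <= m; j++) {
            prev.setAtIndex(F32, j, Float.POSITIVE_INFINITY);
        }
        for (int i = 1; i <= n; i++) {
            float s = signal.getAtIndex(F32, i - 1);
            curr.setAtIndex(F32, 0, Float.POSITIVE_INFINITY);
            for (int j = 1; j <= m; j++) {
                float cost = Math.abs(s - ref.getAtIndex(F32, j - 1));
                float best = Math.min(prev.getAtIndex(F32, j - 1),
                             Math.min(prev.getAtIndex(F32, j),
                                      curr.getAtIndex(F32, j - 1)));
                curr.setAtIndex(F32, j, cost + best);
            }
            // Swap the two rows by reference; no copying, no allocation.
            MemorySegment tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev.getAtIndex(F32, m); // after the final swap, prev is the last row
    }

    // Demo-only wrapper: copies heap arrays off-heap once, then aligns.
    static float align(float[] signal, float[] ref) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment s = arena.allocateFrom(F32, signal);
            MemorySegment r = arena.allocateFrom(F32, ref);
            MemorySegment prev = arena.allocate(F32, ref.length + 1L);
            MemorySegment curr = arena.allocate(F32, ref.length + 1L);
            return dtw(s, signal.length, r, ref.length, prev, curr);
        } // closing the arena frees all four segments deterministically
    }

    public static void main(String[] args) {
        System.out.println(align(new float[] {1f, 2f, 3f}, new float[] {1f, 2f, 2f, 3f})); // 0.0
        System.out.println(align(new float[] {0f}, new float[] {1f, 3f}));                 // 4.0
    }
}
```

In a long-running engine the rows would presumably be allocated once per worker and reused across reads, rather than per call as the demo wrapper does.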
Hardware: AWS c7i.2xlarge (Intel Xeon 4th Gen Sapphire Rapids, AVX-512), Ubuntu 22.04. Workload: 10,000 continuous DTW alignments (Raw signal length: 4000, Reference length: 10000).
How NanoStream compares against a standard Java implementation and a highly optimized C++ baseline (simulating tools like Uncalled or Sigmap):
| Implementation | Throughput (Alignments/sec) | P99 Latency (ms) | Max GC Pause | Heap Alloc Rate |
|---|---|---|---|---|
| NanoStream (Java Panama + SIMD) | ~42,500 | 0.18 ms | 0 ms (Zero-GC) | 0 MB/sec |
| Native C++ (AVX2 + O3) | ~44,000 | 0.15 ms | N/A | N/A |
| Standard Java (Heap arrays + Math.min) | ~8,200 | 12.50 ms | 45 ms (G1GC) | ~2.4 GB/sec |
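Conclusion: NanoStream reaches roughly 97% of the C++ baseline's throughput (~42,500 vs. ~44,000 alignments/sec) with a comparable P99 latency (0.18 ms vs. 0.15 ms) and no GC pauses, while offering the memory safety and large ecosystem of the JVM for writing biological detection plugins.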
The system is split into multiple concurrent layers interacting via a lock-free Ring Buffer:
- POD5-Dispatcher (Virtual Thread): Reads compressed raw signals using `libvbz`, maps them to off-heap memory, and publishes coordinates to the Ring Buffer.
- SIMD-Orchestrator (Platform Thread + CPU Affinity): Spins on the Ring Buffer, executes SIMD Banded DTW directly on raw pointers, and evaluates plugins.
- Hardware Controller (Virtual Thread): Dispatches UDP commands back to the sequencer hardware to eject non-target pores (`EJECT_PORE`).
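
The hand-off between these layers can be sketched as a bounded multi-producer, single-consumer ring buffer. The version below is a swapped-in illustration, not NanoStream's implementation: it uses per-slot sequence numbers in the style of Dmitry Vyukov's bounded queue. The `long` payload standing in for an off-heap coordinate is an assumption, the names `MpscRingSketch`, `offer`, `poll`, and `runDemo` are hypothetical, and cache-line padding is omitted for brevity.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch of a lock-free MPSC ring buffer; padding omitted for brevity.
final class MpscRingSketch {
    static final long EMPTY = Long.MIN_VALUE; // sentinel: payloads must never equal this

    private final int mask;
    private final long[] slots;               // payload, e.g. an off-heap offset (assumption)
    private final AtomicLongArray sequences;  // per-slot ticket that hands the slot back and forth
    private final AtomicLong tail = new AtomicLong(); // next position producers claim
    private long head;                        // touched only by the single consumer

    MpscRingSketch(int capacityPow2) {
        if (Integer.bitCount(capacityPow2) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        mask = capacityPow2 - 1;
        slots = new long[capacityPow2];
        sequences = new AtomicLongArray(capacityPow2);
        for (int i = 0; i < capacityPow2; i++) sequences.set(i, i);
    }

    // Called by any producer thread. Returns false when the ring is full.
    boolean offer(long value) {
        while (true) {
            long t = tail.get();
            int idx = (int) (t & mask);
            long diff = sequences.get(idx) - t;
            if (diff == 0) {
                if (tail.compareAndSet(t, t + 1)) {   // claim position t
                    slots[idx] = value;
                    sequences.set(idx, t + 1);        // volatile store publishes the payload
                    return true;
                }
            } else if (diff < 0) {
                return false;                         // consumer has not freed this slot yet
            }
            // diff > 0: another producer moved tail first; retry with a fresh read
        }
    }

    // Called only by the single consumer thread. Returns EMPTY when nothing is ready.
    long poll() {
        int idx = (int) (head & mask);
        if (sequences.get(idx) != head + 1) return EMPTY;
        long value = slots[idx];
        sequences.set(idx, head + mask + 1);          // free the slot for the next lap
        head++;
        return value;
    }

    // Demo: several producers each publish 1..perProducer; the caller consumes and sums.
    static long runDemo(int producers, int perProducer) {
        MpscRingSketch ring = new MpscRingSketch(64);
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                for (long v = 1; v <= perProducer; v++) {
                    while (!ring.offer(v)) Thread.onSpinWait();
                }
            });
            threads[p].start();
        }
        long sum = 0;
        long total = (long) producers * perProducer;
        for (long received = 0; received < total; ) {
            long v = ring.poll();
            if (v == EMPTY) { Thread.onSpinWait(); continue; }
            sum += v;
            received++;
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(2, 1000)); // 2 * (1000 * 1001 / 2) = 1001000
    }
}
```

The consumer spins instead of blocking, which mirrors the "spins on the Ring Buffer" behavior described for the SIMD-Orchestrator. A production version would also pad `tail` and `head` onto separate cache lines to avoid false sharing.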