A concept model architecture that I would like to see implemented, tested, and expanded. I am open-sourcing the concept because I don't have the time to carry out this research on my own; PRs are welcome. This is the very first iteration of the concept, so there are still many rough edges.
Status: Initial implementation available (Perceiver IO encoder/decoder, HRM planner/worker with LTC integration, and a forward-pass smoke test). See models/ and tests/.
Repository README (research-oriented)
PerceiverIO–HRM–LTC Hybrid — A multimodal, hierarchical, and continuous-time architecture for scalable long-context reasoning and stable temporal control
Authors / Maintainers: Talon Bernard van Vuuren
Contact / Maintainer email: talbvvtrueuser9174@gmail.com
This document describes a research-oriented reference implementation and experiment plan for a hybrid architecture that combines three complementary literatures:
- Perceiver / Perceiver IO: modality-agnostic, scalable encoder/decoder that compresses large and heterogeneous inputs into a fixed-size latent workspace and supports flexible query-based decoding.
- Hierarchical Reasoning Model (HRM): a multi-timescale planner/worker recurrent design that separates slow, abstract planning from fast, detailed execution.
- Liquid Time-Constant (LTC) networks: continuous-time recurrent cells with learnable time constants that produce stable, adaptive temporal dynamics suitable for streaming, irregularly sampled, or event-driven data.
I provide a conceptual specification, implementation scaffold, training recipes, prioritized experiments, ablations, evaluation metrics, and practical engineering notes intended to let a researcher or contributor reproduce, extend, and test the hybrid concept on reasoning, long-context, multimodal, and continuous-control tasks.
Modern AI problems increasingly require systems that can (a) consume very large heterogeneous inputs (long documents, videos, multimodal sensor streams), (b) perform structured, multi-step reasoning and planning, and (c) operate robustly over irregular or continuous time. Transformers and standard RNNs each handle some of these requirements well but tend to struggle when all are needed at once (e.g., long-range memory + continuous-time dynamics + hierarchical decision-making).
This hybrid aims to combine strengths:
- Perceiver IO for compressing and unifying arbitrary inputs and enabling flexible outputs.
- HRM for explicit hierarchical, iterative reasoning separating planning (abstract) from execution (detailed).
- LTC for numerically stable continuous-time execution dynamics in the low-level worker.
Primary research goals:
- Test whether the hybrid improves sample efficiency on structured reasoning tasks.
- Validate robust long-context performance (language and multimodal).
- Demonstrate improved handling of irregularly sampled / streaming data.
- Provide a modular codebase for ablation studies and further research.
- This research README (kept research-oriented) plus the architecture diagram below.
- An initial PyTorch implementation of the core modules:
  - models/perceiver.py: Perceiver IO encoder/decoder.
  - models/hrm.py: HRM glue, i.e. the High-Level Planner (transformer) and Low-Level Worker (LTC-based) with write-back.
  - models/layers/attention.py and models/layers/ltc.py: attention stack and a local LTC cell.
  - models/latent_flow.py: end-to-end LatentFlow model wrapper.
  - models/config.py: configuration dataclasses.
- A smoke test in tests/test_latent_flow_forward.py that validates a forward pass and tensor shapes.
- Training scripts and datasets are planned; contributions welcome.
    [ Input modalities (text, image, audio, video, sensors) ]
                             |
                       (embed tokens)
                             |
                 Perceiver IO cross-attention
                             |
           Perceiver latent workspace L (N_lat x D_lat)
                             |
            +--------------------------------+
            |                                |
    High-Level Planner (HLP)       Perceiver Latents (shared)
      (slow, recurrent)                      |
            |                                |
      plan tokens P <-----------------> cross-attend
            |                                |
    Low-Level Worker (LLW) <---- cross-attention ----> L
    (continuous-time LTC cell(s))
            |
    outputs / internal writes
            |
    Perceiver IO decoder (output queries)
            |
    structured outputs (tokens, maps, controls)
Key flow notes:
- Inputs are flattened and projected into token embeddings for Perceiver cross-attention.
- Perceiver produces a fixed-size latent workspace L that the HRM uses as the shared context.
- The HRM comprises: (1) a High-Level Planner (HLP) that operates at a slow timescale and emits plan tokens, and (2) a Low-Level Worker (LLW) implemented as one or more LTC cells that integrate continuously between planner updates.
- Output queries (Perceiver IO decoder) may read both L and HRM internal states to produce structured outputs such as next-token predictions or control commands.
- Input: sequence of heterogeneous tokens X = {x_i} (variable length, modality-labeled).
- Input embedding: per-token linear projection + optional modality-specific positional encodings (time, 2D grid, waveform index).
- Cross-attend to a latent L ∈ R^{N_lat×D_lat} via a standard cross-attention block (queries=latent, keys/values=inputs).
- Apply a stack of latent self-attention blocks to L.
- Decoder: for each requested output q_j (could be autoregressive token queries or structured queries), cross-attend q_j to L. Decoder may also cross-attend HRM outputs.
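The encoder steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in models/perceiver.py: the class name `TinyPerceiverEncoder`, its argument names, and the tiny default sizes are assumptions chosen for readability, and positional encodings are omitted.

```python
import torch
import torch.nn as nn


class TinyPerceiverEncoder(nn.Module):
    """Illustrative sketch: variable-length inputs -> fixed-size latent array."""

    def __init__(self, d_in, n_lat=8, d_lat=32, n_heads=4, n_self_blocks=2):
        super().__init__()
        # Learned latent array L of shape (N_lat, D_lat).
        self.latents = nn.Parameter(torch.randn(n_lat, d_lat) * 0.02)
        # Per-token linear projection of the inputs.
        self.embed = nn.Linear(d_in, d_lat)
        self.cross_attn = nn.MultiheadAttention(d_lat, n_heads, batch_first=True)
        self.self_blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_lat, n_heads, dim_feedforward=2 * d_lat,
                dropout=0.0, batch_first=True,
            )
            for _ in range(n_self_blocks)
        ])

    def forward(self, x):
        # x: (batch, n_tokens, d_in); n_tokens may differ between calls.
        tokens = self.embed(x)
        q = self.latents.unsqueeze(0).expand(x.shape[0], -1, -1)
        # queries = latent, keys/values = inputs
        attended, _ = self.cross_attn(q, tokens, tokens)
        lat = q + attended
        for block in self.self_blocks:
            lat = block(lat)
        return lat  # (batch, n_lat, d_lat)


enc = TinyPerceiverEncoder(d_in=16)
short_out = enc(torch.randn(2, 10, 16))
long_out = enc(torch.randn(2, 500, 16))
print(short_out.shape, long_out.shape)
```

The latent shape does not depend on the number of input tokens, which is the property the HRM relies on when it treats L as a shared workspace.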
- High-Level Planner (HLP):
  - Input: summary view of L (pooled vector(s) or attended queries) + previous HLP state h_H^{t-1}.
  - Computation: small recurrent transformer or LSTM producing new planner state h_H^t and a set of plan tokens P^t = {p_k}.
  - Timescale: slow; updates every K inference ticks (K≥1). HLP can also run iteratively for M planning steps per input.
- Low-Level Worker (LLW):
  - Implementation: one or several LTC cells parameterized to accept plan tokens P^t and context from L.
  - Dynamics: continuous-time integration dh_L/dt = f(h_L, P^t, context(L), t; θ) where f is implemented by the LTC cell.
  - Execution: the LLW runs n_micro_steps (or integrates for a time window) for each HLP plan step, attending back to L at chosen micro-step intervals.
- HLP ⇄ LLW interface:
  - Cross-attention: HLP can attend to LLW states (pooled) and LLW can attend to HLP plan tokens.
  - Gating: learned gates modulate how much HLP plans influence LLW dynamics.
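To make the LLW dynamics concrete, here is a simplified LTC-style cell using the fused semi-implicit Euler update from Hasani et al., h ← (h + Δt·f·A) / (1 + Δt·(1/τ + f)). It is a sketch under stated assumptions, not the cell in models/layers/ltc.py: `SimpleLTCCell` is a hypothetical name, the gate f is a single sigmoid layer, and the conditioning vector `u` stands in for whatever combination of plan tokens and latent context the worker actually receives.

```python
import torch
import torch.nn as nn


class SimpleLTCCell(nn.Module):
    """Simplified LTC-style cell with one fused solver step per call."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_in + d_hidden, d_hidden)
        self.A = nn.Parameter(torch.randn(d_hidden))        # per-unit target level
        self.log_tau = nn.Parameter(torch.zeros(d_hidden))  # tau = exp(log_tau) > 0

    def forward(self, h, u, dt=0.05):
        # f in (0, 1) modulates both the decay rate and the drive toward A.
        f = torch.sigmoid(self.gate(torch.cat([h, u], dim=-1)))
        tau = torch.exp(self.log_tau)
        numerator = h + dt * f * self.A
        denominator = 1.0 + dt * (1.0 / tau + f)
        return numerator / denominator


torch.manual_seed(0)
cell = SimpleLTCCell(d_in=6, d_hidden=8)
h = torch.zeros(2, 8)
u = torch.randn(2, 6)  # stand-in for pooled plan tokens + latent context
for _ in range(16):    # n_micro_steps per planner step
    h = cell(h, u, dt=0.05)

# Irregular sampling: dt may also be a per-example tensor.
h_irregular = cell(h, u, dt=torch.tensor([[0.01], [0.20]]))
```

Because f is positive and the denominator is at least 1, each unit stays within the larger of its initial magnitude and |A|, which is one reason this update tends to behave well for uneven step sizes.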
- Structured Perceiver IO queries obtain outputs via cross-attention to L and optionally HRM pooled states.
- For autoregressive tasks (language modeling), decode iteratively with token queries that include previous token embeddings and HRM context.
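One way to let output queries read both L and a pooled HRM state is to append the HRM state to the latents as an extra memory slot before cross-attending. The sketch below illustrates that idea only; `TinyQueryDecoder` is a hypothetical name, and it assumes the pooled HRM state has already been projected to the latent width.

```python
import torch
import torch.nn as nn


class TinyQueryDecoder(nn.Module):
    """Illustrative query-based decoder over latents plus an optional HRM slot."""

    def __init__(self, d_query, d_lat, d_out, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(d_query, d_lat)
        self.attn = nn.MultiheadAttention(d_lat, n_heads, batch_first=True)
        self.head = nn.Linear(d_lat, d_out)

    def forward(self, queries, latents, hrm_state=None):
        memory = latents
        if hrm_state is not None:
            # Assumption: hrm_state is (batch, d_lat); add it as one extra slot.
            memory = torch.cat([latents, hrm_state.unsqueeze(1)], dim=1)
        q = self.q_proj(queries)
        out, attn_weights = self.attn(q, memory, memory)
        return self.head(out), attn_weights


decoder = TinyQueryDecoder(d_query=16, d_lat=32, d_out=10)
latents = torch.randn(2, 8, 32)   # L: (batch, N_lat, D_lat)
hrm_state = torch.randn(2, 32)    # pooled HRM state
queries = torch.randn(2, 5, 16)   # five output queries per example
logits, attn_weights = decoder(queries, latents, hrm_state)
```

The last column of `attn_weights` shows how strongly each query reads the HRM slot, which can be a cheap diagnostic for whether the decoder uses HRM context at all.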
# high-level forward pass pseudocode
def forward(inputs):
    L = perceiver_encoder(inputs)  # cross-attend inputs -> latent
    h_H = HLP.init_state()
    h_L = LLW.init_state()
    for t in range(T_planner_steps):
        # Planner step (slow)
        plan_tokens, h_H = HLP.step(L_summary(L), h_H)
        # Worker integrates continuously
        for _ in range(n_micro_steps):
            h_L = LLW.integrate(h_L, plan_tokens, L)
        # optionally write back to L or produce intermediates
        if iterate:  # optional iterative refinement
            L = perceiver_writeback(L, h_L)
    outputs = perceiver_decoder(L, h_H, h_L)
    return outputs

Notes: LLW.integrate is an LTC cell call that does continuous dynamics for a fixed micro-step size (or uses an ODE solver). perceiver_writeback is an optional cross-attention from LLW/HRM into the Perceiver latent to support iterative refinement.
- Primary framework: PyTorch (recommended for easier LTC integration and debugging)
- Optional: JAX/Flax for research-scale Perceiver implementations if you prefer
- Suggested Python packages (conda / pip):
  - torch (>=1.13)
  - numpy, tqdm, pyyaml
  - einops (tensor rearrangements)
  - datasets (Hugging Face datasets for front-loading tasks)
  - wandb or tensorboard for logging
  - LTC reference implementation (link in repo or install local module); this project expects an LTC layer implementation in models/layers/ltc.py.
Example environment install (pip):
python -m venv venv
source venv/bin/activate
pip install torch numpy einops pyyaml tqdm datasets wandb
# add LTC lib if published or clone the LTC reference and install as editable
Perceiver
- N_lat = 512
- D_lat = 1024
- Latent_self_blocks = 6
- Latent_attention_heads = 16
HLP
- D_H = 1024
- planner_layers = 2
- planner_update_interval K = 4 (update planner every 4 inference ticks)
LLW (LTC)
- D_L = 512
- n_micro_steps = 16 (micro-steps per planner step)
- time_step = 0.05 (simulated dt per micro-step) — tune based on task
- LTC cell initializer: small-time-constants bias (avoid extremely large τ)
Optimization
- Optimizer: AdamW
- LR: 3e-4 (warmup 2k steps, linear decay)
- Batch size: small experiments 16–64 depending on compute
- Grad clip: 1.0
- Weight decay: 0.01
- Mixed precision: recommended but monitor LTC numerics
These are intentionally conservative; scale up/down as you experiment.
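The optimization settings above map onto standard PyTorch components roughly as follows. This is a sketch rather than the project's training script: `make_optimizer_and_schedule` and `train_step` are hypothetical helper names, and `total_steps` is an assumed value because the list above does not specify a training length.

```python
import torch


def make_optimizer_and_schedule(model, lr=3e-4, weight_decay=0.01,
                                warmup_steps=2000, total_steps=100_000):
    """AdamW with linear warmup followed by linear decay to zero."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        remaining = max(0, total_steps - step)
        return remaining / max(1, total_steps - warmup_steps)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched


def train_step(model, opt, sched, loss, max_grad_norm=1.0):
    """One update with gradient clipping at max_grad_norm."""
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    opt.step()
    sched.step()


# Tiny demonstration with a shortened schedule (10 warmup steps, 100 total).
model = torch.nn.Linear(4, 2)
opt, sched = make_optimizer_and_schedule(model, warmup_steps=10, total_steps=100)
lrs = []
for _ in range(100):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    train_step(model, opt, sched, loss)
    lrs.append(sched.get_last_lr()[0])
```

Plotting `lrs` is a quick way to confirm the warmup and decay shape before committing to a long run.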
Primary loss: task dependent (cross-entropy for language, MSE for regression, binary or softmax cross-entropy for classification).
Auxiliary losses (encouraged):
- Iterative consistency: supervise intermediate outputs or apply L2 penalty between successive HRM outputs when teacher data exists.
- Perceiver reconstruction: small autoencoding loss on a sampled subset of inputs to encourage latent fidelity.
- LLW temporal prediction: predict short-horizon next-step to stabilize LTC dynamics.
- LTC stability regularizer: L2 penalty on learned time-constants or bounded-scaling penalty.
Curriculum & schedule:
- Start with teacher-forced planner & worker (drive LLW with ground-truth plans if possible).
- Move to scheduled sampling: gradually replace teacher plan tokens with model outputs.
- Anneal to free-running HRM where planner and worker operate solely on model-internal signals.
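The three curriculum stages can be expressed as a single teacher-forcing probability that is annealed from 1 to 0, with plan tokens mixed per example. The sketch below assumes ground-truth plan tokens are available as a tensor; `teacher_prob` and `mix_plan_tokens` are hypothetical helpers, and the linear anneal is just one possible schedule.

```python
import torch


def teacher_prob(step, anneal_steps):
    """Linearly anneal the teacher-forcing probability from 1.0 to 0.0."""
    return max(0.0, 1.0 - step / anneal_steps)


def mix_plan_tokens(teacher_plan, model_plan, p_teacher, generator=None):
    """Per example, use the teacher plan with probability p_teacher."""
    batch = teacher_plan.shape[0]
    draw = torch.rand(batch, generator=generator)
    use_teacher = (draw < p_teacher).to(teacher_plan.device)
    # Reshape to (batch, 1, 1, ...) so the mask broadcasts over plan tokens.
    mask = use_teacher.view(batch, *([1] * (teacher_plan.dim() - 1)))
    return torch.where(mask, teacher_plan, model_plan)


teacher_plan = torch.ones(4, 3, 8)   # (batch, n_plan_tokens, d_plan)
model_plan = torch.zeros(4, 3, 8)
fully_teacher = mix_plan_tokens(teacher_plan, model_plan, teacher_prob(0, 1000))
free_running = mix_plan_tokens(teacher_plan, model_plan, teacher_prob(1000, 1000))
```

With `p_teacher = 1.0` the worker is fully teacher-forced, and with `p_teacher = 0.0` it runs on model plans only, so the same training loop covers all three stages.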
Checkpointing & evaluation: save checkpoints frequently (every epoch / N steps), and maintain validation roll-outs with both teacher-forced and free-running modes to detect divergence.
- Synthetic iterative reasoning (fast prototyping):
  - Sudoku generator (multiple difficulty levels), graph shortest-path with increasing graph size, small SAT instances.
  - Metrics: solve accuracy, steps-to-solution, sample efficiency.
- Long-context language:
  - Datasets: PG-19, BookSum, Long Range Arena (text variants), NarrativeQA for long-document QA.
  - Metrics: perplexity / bits-per-token, QA F1/EM, summarization ROUGE.
- Multimodal VQA over long video:
  - Datasets: YouTube-VIS style or long-video QA datasets.
  - Metrics: VQA accuracy, temporal grounding.
- Streaming sensor / event data:
  - Event camera datasets, irregular-sampled clinical time-series (MIMIC-III vitals subset), telemetry anomaly datasets.
  - Metrics: detection AUC, latency-to-detection, RMSE for forecasting.
- Hierarchical control (simulator):
  - Environments: MuJoCo / Brax / PyBullet task requiring hierarchical goals (multi-stage pick & place, navigation+manipulation).
  - Metrics: success rate, smoothness, sample-efficiency.
Want a single concrete starter? Begin with Sudoku: it’s small, clearly measures hierarchical planning vs execution, and many baselines exist.
Run a carefully controlled ablation matrix to validate claims:
- Full model (Perceiver IO + HRM + LTC).
- Perceiver IO only + standard decoder (no HRM, no LTC).
- Perceiver IO + HRM where LLW is GRU (no LTC).
- Perceiver IO + LLW(LTC) but no HLP (no hierarchical planner).
- HRM + LTC but replace Perceiver with a Transformer encoder (to test importance of Perceiver's input scaling).
Measure sample efficiency (#samples to reach threshold), final accuracy, inference latency, and iteration stability.
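It can help to encode the ablation matrix as data so each run records exactly which components were swapped. The sketch below is one possible encoding; `AblationVariant`, its field names, and the variant names are hypothetical and are not taken from models/config.py.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AblationVariant:
    name: str
    encoder: str   # "perceiver_io" or "transformer"
    use_hlp: bool  # whether the high-level planner is present
    worker: str    # "ltc", "gru", or "none"


ABLATIONS = [
    AblationVariant("full", "perceiver_io", True, "ltc"),
    AblationVariant("perceiver_only", "perceiver_io", False, "none"),
    AblationVariant("hrm_gru_worker", "perceiver_io", True, "gru"),
    AblationVariant("ltc_no_planner", "perceiver_io", False, "ltc"),
    AblationVariant("transformer_encoder", "transformer", True, "ltc"),
]


def changed_fields(variant, reference=ABLATIONS[0]):
    """List which components differ from the full model."""
    return [
        f.name for f in fields(variant)
        if f.name != "name"
        and getattr(variant, f.name) != getattr(reference, f.name)
    ]


for variant in ABLATIONS:
    print(variant.name, changed_fields(variant))
```

Variants that change exactly one field isolate a single component; the Perceiver-only baseline changes two (planner and worker), so differences against it should be read with that in mind.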
- Task accuracy: problem-specific metric (accuracy, success rate, EM, BLEU, ROUGE, etc.).
- Sample efficiency: steps/samples to reach specified performance thresholds.
- Iteration stability: for iterative tasks, measure whether later iterations improve or degrade a validation metric (plot metric vs iteration).
- Temporal robustness: evaluate on irregularly subsampled versions of inputs; measure degradation.
- Compute & latency: mean latency per forward pass and memory envelope.
- Interpretability / alignment: correlation of planner tokens with human-level subgoals (requires human analysis).
Plot learning curves, iteration curves, and ablation bars.
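For the temporal-robustness metric, an irregularly subsampled evaluation set can be produced by randomly dropping time steps and passing the resulting gaps to the model as step sizes. The helper below is a hypothetical sketch of that procedure for a single sequence; the name `irregular_subsample` and the Bernoulli dropping scheme are assumptions.

```python
import torch


def irregular_subsample(x, t, keep_prob, generator=None):
    """Randomly drop time steps from one sequence.

    x: (T, D) observations, t: (T,) increasing timestamps.
    Returns the kept observations, their timestamps, and the gap dt
    preceding each kept sample (dt[0] is 0 by convention).
    """
    keep = torch.rand(x.shape[0], generator=generator) < keep_prob
    keep[0] = True  # always keep the first sample so the result is non-empty
    x_kept, t_kept = x[keep], t[keep]
    dt = torch.diff(t_kept, prepend=t_kept[:1])
    return x_kept, t_kept, dt


gen = torch.Generator().manual_seed(0)
t = torch.arange(100, dtype=torch.float32) * 0.05
x = torch.randn(100, 4)
x_kept, t_kept, dt = irregular_subsample(x, t, keep_prob=0.5, generator=gen)
```

Evaluating the same checkpoint at several `keep_prob` values gives a degradation curve that can be compared between the LTC worker and a discrete-time baseline.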
- LTC numerics: LTC uses continuous dynamics; certain solvers or FP16 can destabilize training. Keep solver internals in FP32 if using AMP.
- Gradient explosion: use gradient clipping and norm penalties on recurrent weights.
- Too many HRM iterations: inference can be slow—limit iterations, or distill into a single-pass student model.
- Representation mismatch: if Perceiver latents are too small, HRM will starve; monitor latent capacity and activations.
- Dataset bias & overfitting: HRM can memorize small tasks — use strong held-out evaluation and regularization.
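The advice to keep solver internals in FP32 under AMP can be implemented by disabling autocast locally around the recurrent cell. The wrapper below is a minimal sketch; `FP32Island` is a hypothetical name, and the linear layer in the usage lines merely stands in for an LTC cell.

```python
import torch
import torch.nn as nn


class FP32Island(nn.Module):
    """Run the wrapped module in float32 even inside an autocast region."""

    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, *inputs):
        device_type = inputs[0].device.type
        # Locally disable autocast and upcast any reduced-precision inputs.
        with torch.autocast(device_type=device_type, enabled=False):
            inputs = [
                x.float() if torch.is_floating_point(x) else x
                for x in inputs
            ]
            return self.module(*inputs)


stand_in_cell = nn.Linear(4, 4)  # placeholder for an LTC cell
island = FP32Island(stand_in_cell)
x = torch.randn(2, 4)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = island(x)
print(y.dtype)
```

The rest of the model (attention blocks, projections) can still run in reduced precision; only the wrapped cell pays the FP32 cost.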
/ (root)
├─ README.md                      # this file
├─ requirements.txt
├─ configs/                       # YAML hyperparameter/config files for experiments
├─ models/
│  ├─ perceiver.py
│  ├─ hrm.py                      # HLP + LLW glue
│  └─ layers/
│     ├─ ltc.py
│     ├─ attention.py
│     └─ utils.py
├─ data/
│  └─ datasets/                   # dataset drivers and preproc
├─ experiments/
│  ├─ sudoku_experiment.py
│  └─ long_context_narrative.py
├─ scripts/
│  ├─ train.py
│  └─ eval.py
└─ docs/
   └─ design_notes.md
Note: training scripts are still planned (see the list of included components), so the train and eval commands below show the intended interface.
- Clone the repo and create the environment.
git clone <this-repo>
cd <this-repo>
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Run a tiny smoke test (toy data) to validate the forward/backward pass:
python scripts/train.py --config configs/tiny_sudoku.yaml --dry_run True
- Run a small training run (fast):
python scripts/train.py --config configs/sudoku_small.yaml
- Evaluate and produce reports:
python scripts/eval.py --checkpoint runs/sudoku_small/checkpoint_last.pt --report_path results/sudoku_report.json

- Seed RNGs (torch, numpy, python) and log seeds.
- Save optimizer state and scheduler at each checkpoint.
- Use deterministic dataloaders where possible for debugging.
- Log complete config files with each run (save YAML).
- Keep validation & test splits fixed and documented.
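A small helper covering the seeding item might look like the following. It is a sketch: `seed_everything` is a hypothetical name, and it only seeds the three RNGs listed above without enabling any deterministic-kernel settings.

```python
import random

import numpy as np
import torch


def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch RNGs; return a record for logging."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    return {"seed": seed}


run_info = seed_everything(1234)
first = (random.random(), float(np.random.rand()), torch.rand(3))
seed_everything(1234)
second = (random.random(), float(np.random.rand()), torch.rand(3))
print(run_info, first[0] == second[0])
```

Storing the returned record alongside the saved YAML config keeps the seed attached to each run.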
We welcome contributions. Suggested workflow:
- Fork repo, code locally.
- Run unit tests and linting.
- Implement feature or experiment branch named feat/<short-desc> or exp/<task>-<brief>.
- Open a PR with description, results, and a minimal reproduction script.
Include experimental results, hyperparams, and discussion in PR notes.
This project tests the following hypotheses:
- A hybrid Perceiver IO + HRM + LTC will improve sample efficiency on iterative symbolic or algorithmic reasoning tasks compared to Perceiver-only or Transformer baselines.
- LTC-equipped low-level workers will provide superior robustness and accuracy on irregularly sampled and streaming temporal data compared to discrete-time RNN workers.
- Perceiver latents + HRM planning will improve long-context performance (books, long videos) versus typical Transformers due to compressed workspace and iterative planning.
Each hypothesis is testable using the ablation plan and evaluation metrics included above.
- Distillation: distill iterative HRM behavior into a single-pass student model for low-latency inference.
- Sparse routing in HLP (Mixture-of-Experts) for scaling planner capacity.
- Differentiable memory and retrieval-augmented HRM for long-term facts.
- Hardware-aware LTC solvers and neuromorphic-friendly variants for low-power edge deployment.
- Perceiver / Perceiver IO — "Perceiver: General Perception with Iterative Attention" / "Perceiver IO" — Jaegle et al.
- Hierarchical Reasoning Model (HRM): Wang et al., 2025 (arXiv:2506.21734), a hierarchical multi-timescale reasoning preprint.
- Liquid Time-Constant Networks (LTC) — Hasani et al. / related continuous-time recurrent network papers.
- LTC-SE & continuous-time NN surveys — improvements and engineering notes.
@article{jaegle2021perceiver, title = {Perceiver: General Perception with Iterative Attention}, author = {Jaegle, Andrew and Gimeno, Felix and Brock, Andrew and Zisserman, Andrew and Vinyals, Oriol and Carreira, Jo{\~a}o}, year = {2021}, journal = {arXiv}, volume = {2103.03206}, doi = {10.48550/arXiv.2103.03206}, url = {https://arxiv.org/abs/2103.03206} }
@article{jaegle2021perceiverio, title = {Perceiver IO: A General Architecture for Structured Inputs \& Outputs}, author = {Jaegle, Andrew and Borgeaud, Sebastian and Alayrac, Jean-Baptiste and Doersch, Carl and Ionescu, Catalin and Ding, David and Koppula, Skanda and Zoran, Daniel and Brock, Andrew and Shelhamer, Evan and H{\'e}naff, Olivier and Botvinick, Matthew M. and Zisserman, Andrew and Vinyals, Oriol and Carreira, Jo{\~a}o}, year = {2021}, journal = {arXiv}, volume = {2107.14795}, doi = {10.48550/arXiv.2107.14795}, url = {https://arxiv.org/abs/2107.14795} }
@article{hasani2020ltc_arxiv, title = {Liquid Time-constant Networks}, author = {Hasani, Ramin and Lechner, Mathias and Amini, Alexander and Rus, Daniela and Grosu, Radu}, year = {2020}, journal = {arXiv}, volume = {2006.04439}, doi = {10.48550/arXiv.2006.04439}, url = {https://arxiv.org/abs/2006.04439} }
@inproceedings{hasani2021ltc_aaai, title = {Liquid Time-constant Networks}, author = {Hasani, Ramin and Lechner, Mathias and Amini, Alexander and Rus, Daniela and Grosu, Radu}, booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence}, year = {2021}, volume = {35}, pages = {7657--7666}, url = {https://ojs.aaai.org/index.php/AAAI/article/view/16936} }
@article{hasani2022closedform, title = {Closed-form Continuous-time Neural Networks}, author = {Hasani, Ramin and Lechner, Mathias and Amini, Alexander and Liebenwein, Lucas and Ray, Aaron and Tschaikowski, Max and Teschl, Gerald and Rus, Daniela}, journal = {Nature Machine Intelligence}, year = {2022}, volume = {4}, number = {11}, pages = {992--1003}, doi = {10.1038/s42256-022-00556-7}, url = {https://doi.org/10.1038/s42256-022-00556-7} }
@article{bidollahkhani2023ltcse, title = {LTC-SE: Expanding the Potential of Liquid Time-Constant Neural Networks for Scalable AI and Embedded Systems}, author = {Bidollahkhani, Michael and Atasoy, Ferhat and Abdellatef, Hamdan}, year = {2023}, journal = {arXiv}, volume = {2304.08691}, doi = {10.48550/arXiv.2304.08691}, url = {https://arxiv.org/abs/2304.08691} }
@article{wang2025hrm, title = {Hierarchical Reasoning Model (HRM)}, author = {Wang, Guan and Li, Jin and Sun, Yuhao and Chen, Xing and Liu, Changling and Wu, Yue and Lu, Meng and Song, Sen and Yadkori, Yasin Abbasi}, year = {2025}, journal = {arXiv}, volume = {2506.21734}, doi = {10.48550/arXiv.2506.21734}, url = {https://arxiv.org/abs/2506.21734} }
If you use this work, open an issue or PR. For research collaboration, contact the maintainer(s) listed at the top of this file.
Thank you for reading.
