fix: replay mode loops by default instead of stopping at end (#1494)
Merged — spomichter merged 1 commit into dev, Mar 9, 2026
Conversation
`ReplayConnection` now defaults to `loop=True`, so replay data loops continuously. This also fixes a timing bug in `LegacyPickleStore.stream()` where looped iterations would fire instantly because the reference timestamps were never reset on loop restart. The stream now detects backwards timestamp jumps and resets its timing references.
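The timing fix can be sketched as a small delay computation: on the first message, or whenever the timestamp jumps backwards (the iterator wrapped around to the start of the recording), both reference clocks are re-anchored. This is an illustrative sketch — the names do not match the actual dimos internals.

```python
import time

def compute_delay(ts: float, state: dict) -> float:
    """Return how long to wait before emitting the message stamped `ts`.

    `state` carries the reference timestamps between calls. A backwards
    timestamp jump means the replay looped, so the references are reset;
    otherwise a looped pass would replay with zero delay between messages.
    (Hypothetical helper — not the dimos source.)
    """
    now = time.monotonic()
    if state["prev_ts"] is None or ts < state["prev_ts"]:
        # First message, or loop restart: re-anchor both clocks.
        state["start_local_time"] = now
        state["start_replay_time"] = ts
    state["prev_ts"] = ts
    # Emit when wall-clock elapsed matches replay-time elapsed.
    target = state["start_local_time"] + (ts - state["start_replay_time"])
    return max(0.0, target - now)
```

Without the `ts < prev_ts` check, the second pass computes delays against the first pass's anchors, yielding large negative values that are floored to zero — exactly the "fires instantly" bug described above.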
Contributor
Greptile Summary

This PR fixes replay mode so it loops continuously by default instead of stopping after one pass, which is the expected behavior during development and testing. The changes are small, focused, and correct.

Minor note: the loop-restart detection condition

Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant RC as ReplayConnection
    participant LPS as LegacyPickleStore.stream()
    participant SCHED as TimeoutScheduler
    participant OBS as Observer
    RC->>LPS: stream(loop=True)
    LPS->>LPS: iterate_ts(loop=True)
    LPS->>OBS: on_next(first_data)
    Note over LPS: start_local_time = now()<br/>start_replay_time = first_ts<br/>prev_ts = first_ts
    loop For each message
        LPS->>SCHED: schedule_relative(delay, emit)
        SCHED-->>LPS: emit fires
        LPS->>OBS: on_next(data)
        LPS->>LPS: next_message = next(iterator)
    end
    Note over LPS: End of recording — iterator yields<br/>first msg of next loop (ts jumps back)
    LPS->>LPS: detect ts < prev_ts (loop restart)
    Note over LPS: start_local_time = now()<br/>start_replay_time = ts (reset!)
    LPS->>SCHED: schedule_relative(≈0, emit)
    SCHED-->>LPS: emit fires immediately
    LPS->>OBS: on_next(first_data of loop 2)
    loop Second loop continues with correct timing
        LPS->>SCHED: schedule_relative(delay, emit)
        SCHED-->>LPS: emit fires
        LPS->>OBS: on_next(data)
    end
```
Last reviewed commit: afff234
spomichter added a commit that referenced this pull request — Mar 9, 2026
…6, DIM-687) (#1451)

* feat(cli): add daemon mode for dimos run (DIM-681)

* fix: address greptile review — fd leak, wrong PID, fabricated log path
  - close devnull/stderr_file after dup2 (fd leak)
  - remove PID from pre-fork output (printed parent PID, not daemon PID)
  - show log_dir not log_dir/dimos.jsonl (file doesn't exist yet)
  - keep tests in tests/ (dimos/conftest.py breaks isolated tests)

* feat(cli): add dimos stop and dimos status commands (DIM-682, DIM-684)
  - dimos status — shows running instances with run-id, pid, blueprint, uptime, log dir
  - dimos stop — sends SIGTERM (or SIGKILL with --force), waits 5s, escalates if needed
  - --pid to target specific instance, --all to stop everything
  - cleans registry on stop
  - also adds get_most_recent() to run_registry. 8 new tests covering sigterm, sigkill, escalation, dead process cleanup, most-recent lookup.

* test: add e2e daemon lifecycle tests with PingPong blueprint
  lightweight 2-module blueprint (PingModule + PongModule) that needs no hardware, no LFS data, and no replay files. tests real forkserver workers and module deployment. covers:
  - single worker lifecycle (build -> health check -> registry -> stop)
  - multiple workers (2 workers, both alive)
  - health check detects dead worker (SIGKILL -> detect failure)
  - registry entry JSON field roundtrip
  - stale entry cleanup (dead PIDs removed, live entries kept)

* fix: rename stderr.log to daemon.log (addresses greptile review)
  both stdout and stderr redirect to the same file after daemonize(), so stderr.log was misleading. daemon.log better describes the contents.

* fix: resolve mypy type errors in stop command (DIM-681)

* feat: per-run log directory with unified main.jsonl (DIM-685)
  folds DIM-685 into daemon PR to avoid merge conflicts on dimos.py.
  - set_run_log_dir() before blueprint.build() routes structlog to <log_base>/<run-id>/main.jsonl
  - workers inherit DIMOS_RUN_LOG_DIR env var via forkserver
  - FileHandler replaces RotatingFileHandler (multi-process rotation unsafe)
  - fallback: env var check -> legacy per-pid files
  - 6 unit tests for routing logic

* fix: migrate existing FileHandlers when set_run_log_dir is called
  setup_logger() runs at import time throughout the codebase, creating FileHandlers pointing to the legacy log path. set_run_log_dir() was resetting the global path but not updating these existing handlers, so main.jsonl was created but stayed empty (0 bytes).
  fix: iterate all stdlib loggers and redirect their FileHandlers to the new per-run path.
  verified: main.jsonl now receives structured JSON logs (1050 bytes, 5 lines in test run).

* chore: move daemon tests to dimos/core/ for CI discovery
  testpaths in pyproject.toml is ['dimos'], so tests/ at repo root wouldn't be picked up by CI. moved all 3 test files to dimos/core/ alongside existing core tests. all 41 tests pass with conftest active.

* chore: mark e2e daemon tests as slow
  matches convention from test_worker.py — forkserver-based tests are marked slow so they run in CI but are skipped in local default pytest.
  - local default: 36 tests (unit only)
  - CI (-m 'not (tool or mujoco)'): 41 tests (unit + e2e)

* test: add CLI integration tests for dimos stop and dimos status (DIM-682, DIM-684)
  16 tests using typer CliRunner with real subprocess PIDs:
  status (7 tests):
  - no instances, running instance details, uptime formatting
  - MCP port, multiple instances, dead PID filtering
  stop (9 tests):
  - default most recent, --pid, --pid not found
  - --all, --all empty, --force SIGKILL
  - already-dead cleanup, SIGTERM verification

* test: add e2e CLI tests against real running blueprint (DIM-682, DIM-684)
  3 new e2e tests that exercise dimos status and stop against a live PingPong blueprint with real forkserver workers:
  - status shows live blueprint details (run_id, PID, blueprint name)
  - registry entry findable after real build, workers alive
  - coord.stop() kills workers, registry cleaned

* fix: address paul's review comments
  - move system imports to top of run(), logger.info before heavy imports
  - remove hardcoded MCP port line from daemon output
  - add n_workers/n_modules properties to ModuleCoordinator (public API)
  - single-instance model: remove --all/--pid from stop, simplify status
  - use _get_user_data_dir() for XDG-compliant registry/log paths
  - remove mcp_port from RunEntry (should be in GlobalConfig)
  - inline _shutdown_handler as closure in install_signal_handlers
  - add SIGKILL wait poll (2s) to avoid zombie race with port conflict check
  - replace handler._open() private API with new FileHandler construction
  - fix e2e _clean_registry to monkeypatch REGISTRY_DIR
  - reset logging_config module globals in test fixture
  - move test imports to module level

* fix: drop daemon.log, redirect all stdio to /dev/null
  structlog FileHandler writes to main.jsonl — daemon.log only ever captured signal shutdown messages. no useful content.
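The "migrate existing FileHandlers" fix above can be sketched with the stdlib `logging` API: walk every registered logger, close each `FileHandler` pointing at the legacy path, and replace it with one aimed at the per-run file. This is an illustrative sketch under that description — names and structure do not match the dimos source.

```python
import logging
import os

def redirect_file_handlers(new_log_path: str) -> int:
    """Point every existing logging.FileHandler at `new_log_path`.

    Handlers created at import time keep writing to the legacy path unless
    they are migrated, leaving the new per-run file empty (0 bytes).
    Returns the number of handlers moved. (Hypothetical helper.)
    """
    os.makedirs(os.path.dirname(new_log_path) or ".", exist_ok=True)
    moved = 0
    # Root logger plus every named logger registered with the manager;
    # loggerDict also holds PlaceHolder entries, which have no handlers.
    loggers = [logging.getLogger()] + [
        lg for lg in logging.Logger.manager.loggerDict.values()
        if isinstance(lg, logging.Logger)
    ]
    for lg in loggers:
        for handler in list(lg.handlers):
            if isinstance(handler, logging.FileHandler):
                lg.removeHandler(handler)
                handler.close()
                replacement = logging.FileHandler(new_log_path)
                replacement.setFormatter(handler.formatter)
                replacement.setLevel(handler.level)
                lg.addHandler(replacement)
                moved += 1
    return moved
```

Constructing a fresh `FileHandler` (rather than calling the private `handler._open()`) matches the review note above about avoiding private APIs.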
* fix: restore LOG_BASE_DIR import, remove duplicate set_run_log_dir import
  fixes mypy name-defined error from import reorganization

* fix: address remaining paul review comments
  - simplify health_check to single alive check (build is synchronous)
  - remove --health-timeout flag (no longer polling)
  - add workers property to ModuleCoordinator (public API)
  - separate try-excepts for kill/wait in sleeper fixture cleanup
  - move repeated imports to top in test_per_run_logs

* fix: address all remaining paul review comments
  - convert all test helpers to fixtures with cleanup
  - replace os.environ['CI']='1' at import time with monkeypatch fixture
  - replace all time.sleep() with polling loops in tests
  - mark slow stop tests @pytest.mark.slow
  - move stop_entry logic from CLI to run_registry
  - remove # type: ignore, properly type _stop_entry
  - remove duplicate port conflict test
  - remove redundant monkeypatch cleanup in test_per_run_logs
  - list(glob()) to avoid mutation during iteration in cleanup_stale
  - XDG_STATE_HOME compliant _get_state_dir() for registry/log paths
  - remove redundant cli_config_overrides in e2e builds
  - move duplicate imports to module level in e2e tests

* fix: remove module docstring from test_daemon.py

* feat: MCP server enhancements, dimos mcp CLI, agent context, stress tests (DIM-686, DIM-687)
  MCP server:
  - new dimos/status and dimos/list_modules JSON-RPC methods
  CLI:
  - dimos mcp list-tools/call/status/modules commands
  - uses stdlib urllib (no requests dependency)
  agent context (DIM-687):
  - generate_context() writes context.md on dimos run
  stress tests (23 tests):
  - MCP lifecycle, tools, error handling, rapid calls, restart cycles
  - all tests use fixtures with cleanup, poll loops (no sleep)
  Closes DIM-686, Closes DIM-687

* fix: address greptile review on PR #1451
  - remove dimos restart from agent context (not in this branch yet)
  - handle JSON-RPC errors in dimos mcp call (show error, exit 1)
  - pass skills as parameter to dimos/status and dimos/list_modules handlers
  - fix hardcoded port in curl example (use mcp_port parameter)
  - fix double stop() in test_mcp_dead_after_stop (standalone coordinator)
  - use tmp_path for log_dir in mcp_entry fixture (test isolation)

* feat: dimos agent-send CLI + MCP method
  - dimos/agent_send MCP method publishes on /human_input LCM channel
  - dimos agent-send 'message' CLI wraps the MCP call
  - 4 new tests: MCP send, empty message, CLI send, no-server

* fix: address greptile review round 2
  - escape f-string curly braces in curl example (agent_context.py)
  - fix double stop() in test_registry_cleanup_after_stop
  - add JSON-RPC error handling to all MCP CLI commands
  - add type annotation for LCM transport
  - add agent-send to generated context.md CLI commands

* feat: module IO introspection via MCP + CLI
  - dimos/module_io MCP method: skills grouped by module with schema
  - dimos mcp module-io CLI: human-readable module/skill listing
  - 2 new tests

* fix: daemon context generation + standalone e2e stress tests
  Bug fixes found during e2e testing:
  - worker.pid: catch AssertionError after daemonize() when is_alive() asserts _parent_pid == os.getpid() (double-fork changes PID, but stored process.pid is still valid)
  - agent_context: fix f-string ValueError from triple braces in curl example — split into f-string + plain string concat
  New standalone test scripts (no pytest):
  - e2e_mcp_killtest.py: SIGKILL/restart stress test (3 cycles, verifies MCP recovers after crash, tests all endpoints)
  - e2e_devex_test.py: full developer experience test simulating OpenClaw agent workflow (daemon start → CLI ops → logs → stop)
  Register stress-test blueprint in all_blueprints.py for `dimos run stress-test --daemon` support.
* refactor: strip module-io, fix greptile review issues 7-13
  - Remove dimos/module_io handler + 'dimos mcp module-io' CLI command (showed skill descriptions, not actual IO — misleading)
  - Remove unused rpc_calls param from _handle_dimos_agent_send
  - Add JSON-RPC error checking to mcp_list_tools, mcp_status, mcp_modules
  - Fix empty tool content: exit 0 with '(no output)' instead of exit 1
  - Add Content-Type header to curl example in agent context
  - Fix double-stop in test_registry_cleanup_after_stop
  - Remove module-io references from standalone test scripts

* cleanup: remove agent_context.py, fix final greptile nits
  - Delete agent_context.py entirely — runtime info is available via `dimos status` and `dimos mcp status`, no need for a separate file
  - Remove generate_context() calls from CLI (daemon + foreground paths)
  - Fix non-deterministic module list in dimos/status (set → dict.fromkeys)
  - Remove unused heartbeat: Out[str] from StressTestModule
  - Remove dead list literal from e2e_mcp_killtest.py

* fix: address latest greptile review round
  - Remove stale module-io step from e2e_devex_test (command was removed)
  - Fix step numbering in devex test (7-10 sequential, no gaps)
  - Fix double remove() on registry entry — let fixture handle cleanup
  - Remove misleading 'non-default to avoid conflicts' port comment

* fix: resolve mypy errors in worker.py and stress_test_module
  - worker.py: use typed local variable for process.pid instead of direct return (fixes no-any-return)
  - stress_test_module.py: add return type annotation to start()

* perf: class-scoped MCP fixtures, 125s → 51s test runtime
  - Make mcp_shared fixture class-scoped: blueprint starts once per class instead of once per test (~4s setup saved per test)
  - Move no-server tests (dead_after_stop, cli_no_server_error, agent_send_cli_no_server) to dedicated TestMCPNoServer class to avoid port conflict with shared fixture
  - Add full type annotations to all fixtures and helpers
  - Add docstring explaining performance rationale
  - Remove unused mcp_blueprint fixture (replaced by mcp_shared)

* fix: resolve remaining CI failures (mypy + all_blueprints)
  - Fix mypy errors in standalone test scripts (e2e_devex_test.py, e2e_mcp_killtest.py): typed CompletedProcess, multiprocessing.Event, dict params, assert pid not None before os.kill
  - Regenerate all_blueprints.py (stress-test entry now alphabetically sorted)

* refactor: McpAdapter class + convert custom methods to @Skill tools
  Address Paul review comments on PR #1451:
  - New McpAdapter class replacing 3 duplicated _mcp_call implementations
  - Convert dimos/status, list_modules, agent_send to @Skill on McpServer
  - CLI thin wrappers over McpAdapter, added --json-args flag
  - worker.py: os.kill(pid, 0) for pid check
  - Renamed test files (demo_ prefix for non-CI, integration for pytest)
  - transport.start()/stop(), removed skill_count, requests in pyproject
  29/29 tests pass locally (41s).

* fix: alphabetical order in all_blueprints.py for demo-mcp-stress-test

* fix: catch HTTPError in McpAdapter, guard None pid in Worker
  - McpAdapter.call(): catch requests.HTTPError and re-raise as McpError so CLI callers get clean error messages instead of raw tracebacks
  - Worker.pid: check for None before os.kill() — unstarted processes have pid=None which would raise TypeError (not caught by OSError)

* fix: server_status returns main process PID, not worker PID
  McpServer runs in a forkserver worker, so os.getpid() returns the worker PID. Read from RunEntry instead to get the main daemon PID that dimos stop/status actually needs.

* refactor: use click.ParamType for --arg parsing in mcp call
  Replace manual string splitting with _KeyValueType click.ParamType per Paul's review suggestion. Validation and JSON coercion now handled by click's type system instead of inline loop.
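The core of the `--arg KEY=VALUE` parsing with JSON coercion described in the last commit can be sketched as a plain function, independent of click (the real code wraps this logic in a `click.ParamType`; this standalone version is illustrative only).

```python
import json

def parse_key_value(arg: str) -> tuple[str, object]:
    """Parse 'KEY=VALUE' into (key, value), JSON-decoding the value when
    possible so `forward=0.5` yields a float and `flag=true` a bool; a
    value that is not valid JSON stays a plain string.

    Input without '=' raises ValueError, which a click ParamType would
    convert into a usage error (click exits with code 2 on parameter
    validation failures). (Hypothetical helper, not the dimos source.)
    """
    key, sep, raw = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {arg!r}")
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        value = raw  # bare strings like name=go2 stay strings
    return key, value
```

Delegating validation to click's type system (rather than an inline loop) means malformed arguments are rejected before the command body runs, which is why the related test below checks for exit code 2 rather than 1.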
* fix: viewer_backend → viewer rename + KeyValueType test fix
  - Update test_mcp_integration.py and demo_mcp_killtest.py to use 'viewer' instead of 'viewer_backend' (renamed in #1477)
  - Fix test_cli_call_tool_wrong_arg_format: exit code 2 (click ParamType validation) instead of 1, check KEY=VALUE in output
  - Merge dev to pick up #1477 rename and #1494 replay loop

* fix: mypy arg-type error for KeyValueType dict(args)
spomichter added a commit that referenced this pull request — Mar 12, 2026
Release v0.0.11

82 PRs, 10 contributors, 396 files changed. This release brings a production CLI, MCP tooling, temporal memory, and first-class support for coding agents. Dask has been removed. The entire stack now runs from `dimos run` through `dimos stop`.

### Agent-Native Development

DimOS is now built to be driven by coding agents. Point OpenClaw, Claude Code, or Cursor at [AGENTS.md](AGENTS.md) and they can build, run, and debug Dimensional applications using the CLI and MCP interfaces directly.

- **AGENTS.md** — comprehensive onboarding doc: architecture, CLI reference, skill rules, blueprint quick-reference. Your agent reads this and starts coding.
- **MCP server** — all `@skill` methods exposed as HTTP tools. External agents call `dimos mcp call relative_move --arg forward=0.5` or connect via JSON-RPC.
- **MCP CLI** — `dimos mcp list-tools`, `dimos mcp call`, `dimos mcp status`, `dimos mcp modules`
- **Agent context logging** — MCP tool calls and agent messages logged to per-run JSONL for debugging and replay.

### CLI & Daemon

Full process lifecycle — no more Ctrl-C in tmux.

- `dimos run --daemon` — background execution with health checks and run registry
- `dimos stop [--force]` — graceful shutdown with SIGTERM → SIGKILL fallback
- `dimos restart` — replays the original CLI arguments
- `dimos status` — PID, blueprint, uptime, MCP port
- `dimos log -f` — structured per-run logs with follow, JSON output, filtering
- `dimos show-config` — resolved GlobalConfig with source tracing

### Temporal-Spatial Memory

Robots in physical space ingest hours of video and lidar. Temporal-spatial memory gives them a human-like understanding of the world — causal object relationships, entity tracking through time and physical space, and the ability to answer complex temporal queries: *Who spends the most time in the kitchen? What time on average do I wake up? Which set of switches toggles the main lights? Who was at the office at 9am last Thursday?*

Traditional frame-level embeddings (CLIP, ViT) lose temporal context and don't scale beyond a handful of frames. Video transformers are expensive and don't operate in RGB-D. Dimensional agents work with video + lidar natively, tracking entities across hours and days.

```bash
dimos --replay --replay-dir unitree_go2_office_walk2 run unitree-go2-temporal-memory
```

### Interactive Viewer

Custom Rerun fork (`dimos-viewer`) is now the default. Click-to-navigate: click a point in the 3D view → PointStamped → A* planner → robot moves.

- Camera | 3D split layout on Go2, G1, and drone blueprints
- Native keyboard teleop in the viewer
- `--viewer rerun|rerun-web|rerun-connect|foxglove|none`

### Drone Support

Drone blueprints modernized to match Go2 composition pattern. `drone-basic` and `drone-agentic` work with replay, Rerun, and the full CLI.

```bash
dimos --replay run drone-basic
dimos --replay run drone-agentic
```

### More

- **Go2 fleet control** — multi-robot with `--robot-ips` (#1487)
- **Replay `--replay-dir`** — select dataset, loops by default (#1519, #1494)
- **Interactive install** — `curl -fsSL .../install.sh | bash` (#1395)
- **Nix on non-Debian Linux** (#1472)
- **Remove Dask** — native worker pool (#1365)
- **Remove asyncio dependency** (#1367)
- **Perceive loop** — continuous observation module for agents (#1411)
- **Worker resource monitor** — `dtop` TUI (#1378)
- **G1 agent wiring fix** (#1518)
- **Rerun rate limiting** — prevents viewer OOM on continuous streams (#1509, #1521)
- **RotatingFileHandler** — prevents unbounded log growth (#1492)
- **Test coverage** (#1397), draft PR CI skip (#1398), manipulation test fixes (#1522)

### Breaking Changes

- `--viewer-backend` renamed to `--viewer`
- Dask removed — blueprints using Dask workers need migration to native worker pool
- Default viewer changed from `rerun-web` to `rerun` (native dimos-viewer)

### Contributors

@spomichter, @PaulNechifor, @ruthwikdasyam, @summeryang, @MustafaBhadsorawala, @leshy, @sambull, @JeffHykin, @RadientBrain

## Contributor License Agreement

- [x] I have read and approved the [CLA](https://github.com/dimensionalOS/dimos/blob/main/CLA.md).
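The SIGTERM → SIGKILL fallback behind `dimos stop [--force]` amounts to: send SIGTERM, poll for the process to disappear, and escalate after a timeout. A minimal sketch (hypothetical helper, not the shipped implementation — the real command also maintains a run registry):

```python
import os
import signal
import time

def stop_process(pid: int, timeout: float = 5.0, force: bool = False) -> bool:
    """Stop `pid`: SIGTERM first (SIGKILL immediately with force=True),
    escalating to SIGKILL if it is still alive after `timeout` seconds.
    Returns True once the process is gone (or was already gone).
    """
    try:
        os.kill(pid, signal.SIGKILL if force else signal.SIGTERM)
    except ProcessLookupError:
        return True  # already dead
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0: existence check, sends nothing
        except ProcessLookupError:
            return True
        time.sleep(0.05)
    try:
        os.kill(pid, signal.SIGKILL)  # graceful shutdown failed; escalate
    except ProcessLookupError:
        return True
    time.sleep(0.1)  # brief wait so the kill takes effect before returning
    return True
```

Note that `os.kill(pid, 0)` still succeeds for zombie children, so a caller that spawned the target must also reap it (via `wait()`), matching the "SIGKILL wait poll to avoid zombie race" fix mentioned in the referenced commits.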
Problem
Replay mode plays through the recorded data once and then stops. For development and testing, you want continuous playback so the robot keeps streaming data.
Solution
- `ReplayConnection` now defaults to `loop=True` in its replay config
- `LegacyPickleStore.stream()`: looped iterations fired instantly because the reference timestamps were never reset on loop restart; the stream now detects backwards timestamp jumps and resets timing

2 files changed, +10 −2
How to Test
```bash
dimos --replay run unitree-go2-basic  # data should continuously loop instead of stopping after ~10s
```

Contributor License Agreement