feat: implement checkpoint recovery strategy#366
Conversation
Adds the Checkpoint Recovery strategy (Engine design §6.6, Strategy 2) which persists AgentContext snapshots after each completed turn and resumes from the last checkpoint on crash, preserving progress. After max_resume_attempts failures, falls back to fail-and-reassign. New modules: - engine/checkpoint/ — models, callback factory, strategy - persistence/sqlite/ — checkpoint and heartbeat repositories - observability/events/checkpoint.py — lifecycle event constants Key changes: - RecoveryResult extended with checkpoint_context_json, resume_attempt, can_resume fields (backward-compatible defaults) - ReactLoop and PlanExecuteLoop accept optional checkpoint_callback - AgentEngine wires checkpoint repos, builds callback-enabled loops, and resumes from checkpoint on recoverable errors - SQLite schema V6 adds checkpoints and heartbeats tables - PersistenceBackend protocol gains checkpoints/heartbeats properties - 100 new unit tests, full suite green at 94% coverage
Pre-reviewed by 15 agents, 42 findings addressed: - Fix missing completion_config and effective_autonomy on resume path - Add MemoryError/RecursionError re-raise in strategy checkpoint load - Add asyncio.Lock to protect _resume_counts from TOCTOU race - Clean up resume counts in fallback path to prevent memory leak - Validate checkpoint_repo and heartbeat_repo are provided together - Log warning for unsupported loop types in checkpoint injection - Skip heartbeat update when checkpoint save fails (prevent limbo) - Guard against turn-zero checkpoint via explicit check - Sanitize error message in LLM reconciliation ChatMessage - Replace assert with explicit RuntimeError for checkpoint_context_json - Add cross-field validator for checkpoint_context_json/resume_attempt - Clean up checkpoints after successful resume completion - Thread completion_config and effective_autonomy through recovery chain - Extract duplicate checkpoint callback blocks into helper method - Add public config property to PlanExecuteLoop (remove SLF001 noqa) - Use EXECUTION_CHECKPOINT_CALLBACK_FAILED event for error paths - Add structured error info to callback exception logging - Use NotBlankStr for repo params and callback_factory identifiers - Move cursor.rowcount read before commit in repos - Fix HeartbeatRepository.get_stale to accept AwareDatetime - Drop DESC from composite index (portability for SQLite < 3.47) - Document intentional FK absence in migrations - Document heartbeat_interval_seconds as reserved for future use - Update CLAUDE.md package structure and logging event docs - Update PersistenceBackend and RecoveryResult docstrings - Fix various docstring accuracy issues across modules - Add FakePersistenceBackend checkpoint/heartbeat repos in test conftest - Add protocol compliance assertions for CheckpointRepository/HeartbeatRepository - Add test_checkpoint_events_exist assertions - Fix _FakeCostRecordRepository.aggregate missing task_id (pre-existing) - Verify checkpoint/heartbeat model content in callback factory tests - Fix test_strategy.py return type annotation
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds a checkpoint-based crash recovery subsystem: models (Checkpoint, Heartbeat, CheckpointConfig), repositories and sqlite implementations, migration to schema v6, a CheckpointRecoveryStrategy, per-turn checkpoint callbacks wired into execution loops and AgentEngine resume/cleanup flows, resume helpers, observability events, and comprehensive tests and docs updates. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
✨ Simplify code
📝 Coding Plan
Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the system's resilience by introducing a comprehensive checkpoint recovery mechanism. It allows agent executions to gracefully recover from unexpected interruptions by saving their state periodically and resuming from the last valid checkpoint. This ensures that long-running or critical agent tasks can withstand failures without losing significant progress, leading to more reliable and robust operations. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Pull request overview
This PR implements the checkpoint recovery strategy (§6.6 Strategy 2) for agent crash recovery. It adds per-turn AgentContext persistence, heartbeat-based failure detection, resume from last checkpoint with reconciliation, and max resume attempts with fallback to fail-and-reassign.
Changes:
- New
engine/checkpoint/package withCheckpoint,Heartbeat,CheckpointConfigmodels,CheckpointRecoveryStrategy,CheckpointCallbacktype alias, andmake_checkpoint_callbackfactory - SQLite persistence layer additions:
CheckpointRepositoryandHeartbeatRepositoryprotocols with SQLite implementations, V6 migration forcheckpointsandheartbeatstables - Integration of checkpoint callback into
ReactLoopandPlanExecuteLoop, and resume path inAgentEnginewithcompletion_config/effective_autonomythreading
Reviewed changes
Copilot reviewed 34 out of 35 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/ai_company/engine/checkpoint/__init__.py |
Package init re-exporting checkpoint types |
src/ai_company/engine/checkpoint/callback.py |
CheckpointCallback type alias |
src/ai_company/engine/checkpoint/callback_factory.py |
Factory closure for persisting checkpoints/heartbeats per turn |
src/ai_company/engine/checkpoint/models.py |
Frozen Pydantic models: Checkpoint, Heartbeat, CheckpointConfig |
src/ai_company/engine/checkpoint/strategy.py |
CheckpointRecoveryStrategy with async-safe resume counter |
src/ai_company/engine/recovery.py |
New checkpoint_context_json, resume_attempt, can_resume fields on RecoveryResult |
src/ai_company/engine/react_loop.py |
Checkpoint callback invocation after each turn |
src/ai_company/engine/plan_execute_loop.py |
Checkpoint callback invocation via extracted _invoke_checkpoint_callback |
src/ai_company/engine/agent_engine.py |
Resume path, loop-with-callback factory, paired-repo validation |
src/ai_company/engine/__init__.py |
Re-exports for new checkpoint types |
src/ai_company/persistence/repositories.py |
CheckpointRepository and HeartbeatRepository protocols |
src/ai_company/persistence/protocol.py |
Added checkpoints and heartbeats properties to PersistenceBackend |
src/ai_company/persistence/sqlite/checkpoint_repo.py |
SQLite CheckpointRepository implementation |
src/ai_company/persistence/sqlite/heartbeat_repo.py |
SQLite HeartbeatRepository implementation |
src/ai_company/persistence/sqlite/backend.py |
Wires new repos into SQLitePersistenceBackend |
src/ai_company/persistence/sqlite/__init__.py |
Re-exports for new SQLite repo classes |
src/ai_company/persistence/sqlite/migrations.py |
V6 migration: checkpoints and heartbeats tables with indexes |
src/ai_company/observability/events/checkpoint.py |
13 checkpoint/heartbeat/recovery event constants |
src/ai_company/observability/events/execution.py |
5 new checkpoint callback & resume event constants |
src/ai_company/observability/events/persistence.py |
16 new persistence-layer checkpoint/heartbeat event constants |
docs/design/engine.md |
Removes "Planned" warning from checkpoint strategy docs |
README.md |
Adds Persistence→Engine dependency edge |
CLAUDE.md |
Updates engine description and event name list |
tests/unit/engine/checkpoint/test_models.py |
Model creation, validation, and constraint tests |
tests/unit/engine/checkpoint/test_callback_factory.py |
Callback factory boundary, heartbeat, and error handling tests |
tests/unit/engine/checkpoint/test_strategy.py |
Strategy protocol, resume, fallback, and counter tests |
tests/unit/engine/test_recovery_checkpoint_fields.py |
RecoveryResult checkpoint field tests |
tests/unit/persistence/sqlite/test_migrations_v6.py |
V6 migration table/index/idempotency tests |
tests/unit/persistence/sqlite/test_checkpoint_repo.py |
SQLite checkpoint repo CRUD tests |
tests/unit/persistence/sqlite/test_heartbeat_repo.py |
SQLite heartbeat repo CRUD tests |
tests/unit/persistence/test_protocol.py |
Fake repo protocol conformance tests |
tests/unit/persistence/test_migrations_v2.py |
Updated schema version assertions to 6 |
tests/unit/observability/test_events.py |
New checkpoint event module discovery and value tests |
tests/unit/api/conftest.py |
Fake repos for API test backend |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| msg = "checkpoint_context_json is None but can_resume was True" | ||
| raise RuntimeError(msg) | ||
|
|
||
| execution_id = f"{agent_id}:{task_id}" |
There was a problem hiding this comment.
Code Review
This is an excellent and comprehensive implementation of the checkpoint recovery strategy. The changes are well-structured across new modules, persistence layers, and the agent engine. The attention to detail is impressive, particularly in handling potential race conditions with asyncio.Lock for resume counts and preventing inconsistent states by linking heartbeat updates to successful checkpoint saves. The addition of thorough unit tests for the new models, repositories, strategy, and callback factory ensures the robustness of this critical feature.
I have one suggestion regarding minor code duplication for invoking the checkpoint callback, which could be refactored for better maintainability.
| if self._checkpoint_callback is not None: | ||
| try: | ||
| await self._checkpoint_callback(ctx) | ||
| except MemoryError, RecursionError: | ||
| raise | ||
| except Exception as exc: | ||
| logger.exception( | ||
| EXECUTION_CHECKPOINT_CALLBACK_FAILED, | ||
| execution_id=ctx.execution_id, | ||
| turn=turn_number, | ||
| error=f"{type(exc).__name__}: {exc}", | ||
| ) |
There was a problem hiding this comment.
This logic for invoking the checkpoint callback is duplicated from PlanExecuteLoop._invoke_checkpoint_callback. To improve maintainability and avoid code duplication, consider extracting this logic into a shared helper function, for example in src/ai_company/engine/loop_helpers.py.
As an intermediate step, you could extract this block into a private _invoke_checkpoint_callback method within this class, mirroring the implementation in PlanExecuteLoop.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #366 +/- ##
==========================================
- Coverage 93.90% 93.50% -0.41%
==========================================
Files 447 456 +9
Lines 20819 21313 +494
Branches 2011 2044 +33
==========================================
+ Hits 19551 19929 +378
- Misses 981 1088 +107
- Partials 287 296 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Greptile SummaryThis PR implements §6.6 Strategy 2 checkpoint recovery for the agent engine: per-turn Key findings from this review:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant ExecLoop as ExecutionLoop
participant CB as CheckpointCallback
participant CPRepo as CheckpointRepository
participant HBRepo as HeartbeatRepository
participant Engine as AgentEngine
participant Strategy as CheckpointRecoveryStrategy
participant Fallback as FailAndReassignStrategy
ExecLoop->>CB: invoke after each completed turn
CB->>CB: skip if turn==0 or turn mod N != 0
CB->>CPRepo: save(Checkpoint)
alt save succeeded
CB->>HBRepo: save(Heartbeat)
else save failed
CB->>CB: log warning, skip heartbeat
end
Note over ExecLoop,Engine: Crash or ERROR termination
Engine->>Strategy: recover(task_execution, error_message, context)
Strategy->>CPRepo: get_latest(execution_id)
alt no checkpoint or attempts exhausted
Strategy->>CPRepo: delete_by_execution
Strategy->>HBRepo: delete
Strategy->>Fallback: recover(...)
Fallback-->>Engine: RecoveryResult can_resume=False
else checkpoint found and attempt available
Strategy->>Strategy: increment _resume_counts[execution_id]
Strategy-->>Engine: RecoveryResult can_resume=True
end
alt can_resume is True
Engine->>Engine: deserialize_and_reconcile checkpoint_json
Engine->>Engine: inject SYSTEM reconciliation message
Engine->>ExecLoop: execute(checkpoint_ctx)
ExecLoop-->>Engine: ExecutionResult
alt termination reason is not ERROR
Engine->>Strategy: clear_resume_count(execution_id)
Engine->>CPRepo: delete_by_execution
Engine->>HBRepo: delete
end
end
|
There was a problem hiding this comment.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/ai_company/engine/agent_engine.py (1)
395-421:⚠️ Potential issue | 🟠 MajorRecovered/resumed execution can skip final cost recording.
record_execution_costs(...)runs before recovery. If Line 671 replaces the error result with a resumed result, the resumed portion’s final costs are not recorded in this pipeline pass.Also applies to: 671-677
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 395 - 421, The current pipeline calls record_execution_costs(...) before calling _apply_recovery(...) which can replace an ERROR termination with a resumed/executed result and thus skip final cost recording; modify the flow so that after apply_post_execution_transitions(...) and after calling self._apply_recovery(...) (when execution_result.termination_reason was ERROR and a recovery/resume occurred) you call record_execution_costs(...) again on the new execution_result (using the same identity, agent_id, task_id, tracker=self._cost_tracker) to ensure final costs for the resumed execution are recorded; use the symbols execution_result, TerminationReason.ERROR, pre_recovery_ctx/pre_recovery_status to detect recovery and re-run record_execution_costs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Line 154: The guide references a non-existent symbol EXECUTION_LOOP_STARTED;
update the text and import example to use the correct exported constant
EXECUTION_LOOP_START from ai_company.observability.events.execution (i.e.,
replace any occurrence of EXECUTION_LOOP_STARTED with EXECUTION_LOOP_START and
adjust the import example to `from ai_company.observability.events.execution
import EXECUTION_LOOP_START`) so the listed event constant matches the actual
export in src/ai_company/observability/events/execution.py.
In `@src/ai_company/engine/agent_engine.py`:
- Around line 696-704: The resume path in _resume_from_checkpoint currently runs
the agent loop directly and builds a budget checker ad hoc, bypassing the
engine's timeout and budget orchestration; change _resume_from_checkpoint to
reuse the existing execution orchestration by invoking _run_loop_with_timeout
(the same wrapper used for fresh runs) and call or delegate to _execute (or its
budget-checker factory) to construct the budget-enforcer-specific checker so
resumed executions are subject to the same timeout/budget logic as new runs;
ensure you pass recovery_result, agent_id, task_id, completion_config, and
effective_autonomy through the same parameters that
_run_loop_with_timeout/_execute expect and remove the ad-hoc loop and checker
creation in favor of those shared helpers (references: _resume_from_checkpoint,
_run_loop_with_timeout, _execute, budget checker creation).
- Around line 696-814: The _resume_from_checkpoint function is too large and
should be split into focused helpers: extract deserialization into a helper
load_checkpoint_from_json(recovery_result: RecoveryResult) that validates
checkpoint_context_json and returns AgentContext; extract the reconciliation
injection into build_reconciliation_message(checkpoint_ctx: AgentContext) and
apply_reconciliation(checkpoint_ctx, message) which returns the new context;
extract loop execution into run_resume_loop(agent_id, task_id, checkpoint_ctx,
completion_config, effective_autonomy) that constructs the loop, tool_invoker,
budget_checker and awaits loop.execute returning ExecutionResult; and extract
checkpoint cleanup into cleanup_checkpoints_after_success(execution_id) which
handles _checkpoint_repo.delete_by_execution and logs warnings. Refactor
_resume_from_checkpoint to call these helpers in sequence and keep its
try/except only around the minimal orchestration, preserving current logging,
MemoryError/RecursionError re-raises, and the resume-count-clear logic using
self._recovery_strategy and checkpoint_ctx.execution_id.
- Around line 721-723: Add a structured log entry (logger.error or
logger.warning) immediately before raising the RuntimeError in the resume
precondition check for recovery_result.checkpoint_context_json, including
agent_id, task_id, and resume_attempt to provide context; locate the check
around recovery_result.checkpoint_context_json in agent_engine.py (the block
that currently raises "checkpoint_context_json is None but can_resume was True")
and call the module/class logger (e.g., logger.error(..., extra={...}) or
logger.warning(..., extra={...})) with a clear message and the fields agent_id,
task_id and resume_attempt, then raise the same RuntimeError as before.
- Around line 803-813: The checkpoint cleanup is running unconditionally after
loop.execute in AgentEngine, which deletes checkpoints even when the returned
ExecutionResult has termination_reason == ERROR; modify the post-execution block
in agent_engine.py (the code that calls loop.execute and then uses
self._checkpoint_repo.delete_by_execution) to only attempt deletion when
result.termination_reason indicates success (e.g., not ERROR) — check
ExecutionResult.termination_reason against the ERROR constant before calling
delete_by_execution, and keep the existing exception handler that logs
EXECUTION_CHECKPOINT_CALLBACK_FAILED with execution_id and error details.
In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 64-171: The recover method is doing too many things; split it into
smaller functions: extract checkpoint-loading (the try/except around
self._checkpoint_repo.get_latest and logging) into a helper like
_load_latest_checkpoint(execution_id, task_id) that returns the checkpoint or
None, and extract the resume-attempt reservation and check (the async with
self._resume_lock block that reads/modifies self._resume_counts and logs
fallback) into a helper like _reserve_resume_attempt(execution_id, task_id) that
returns the current resume_attempt or raises/returns a sentinel to indicate
fallback; keep recover() to orchestrate calling _load_latest_checkpoint,
delegating to self._delegate_to_fallback when needed, calling
context.to_snapshot(), and assembling the RecoveryResult (using existing names
recovery fields: task_execution, strategy_type, context_snapshot, error_message,
checkpoint_context_json, resume_attempt); ensure all original logging
(CHECKPOINT_LOAD_FAILED, CHECKPOINT_RECOVERY_NO_CHECKPOINT,
CHECKPOINT_RECOVERY_FALLBACK, CHECKPOINT_RECOVERY_RESUME, CHECKPOINT_LOADED) and
exception passthrough for MemoryError/RecursionError remain in the new helpers.
- Around line 98-110: The current blanket "except Exception" around
self._checkpoint_repo.get_latest masks unexpected bugs; change it to catch only
the repository's persistence error(s) (e.g., CheckpointPersistenceError or the
repo-specific exception type) and log the exception details in the structured
logger call (use CHECKPOINT_LOAD_FAILED with execution_id, task_id, and the
caught exception info), leave the existing MemoryError and RecursionError
re-raises intact, and re-raise any other unexpected exceptions instead of
treating them as "no checkpoint".
In `@src/ai_company/persistence/sqlite/heartbeat_repo.py`:
- Around line 39-49: When persisting and when querying heartbeats, normalize the
timestamp to UTC so lexicographic string comparisons work: in the save path
where heartbeat.model_dump(...) is used (the method that calls self._db.execute
for INSERT OR REPLACE) convert heartbeat.last_heartbeat_at to an ISO UTC string
(e.g. .astimezone(datetime.timezone.utc).isoformat() or equivalent) before
dumping/parametrizing the data, and likewise in get_stale() normalize compared
timestamps to UTC before building the query parameters; update the places that
call self._db.execute/commit for heartbeats and the get_stale() query code to
always use that UTC-normalized ISO string.
In `@tests/unit/api/conftest.py`:
- Around line 347-360: The fake repository's get_latest method should enforce
the real repo contract by rejecting calls with both execution_id and task_id as
None; modify get_latest (in the test fake checkpoint repo) to raise a ValueError
when neither execution_id nor task_id is provided, then keep the existing
filtering on self._checkpoints and return None or the max by turn_number as
before when at least one filter is supplied.
In `@tests/unit/engine/checkpoint/test_callback_factory.py`:
- Around line 147-162: Add parametrized cases covering turn=0 to assert the
factory skips saving at turn zero; update the pytest.mark.parametrize tuple in
tests/unit/engine/checkpoint/test_callback_factory.py (the
("persist_every_n","turn","should_save") parameter list) to include entries like
(1, 0, False) and at least one other (e.g. (3, 0, False)) so the behavior
guarded by the condition in the callback factory (the check using turn == 0 or
turn % config.persist_every_n_turns != 0) is explicitly exercised.
In `@tests/unit/engine/test_recovery_checkpoint_fields.py`:
- Around line 43-50: Add a failure-case test that ensures
RecoveryResult._validate_checkpoint_consistency() rejects cases where only one
of checkpoint_context_json or resume_attempt is set: construct RecoveryResult
with checkpoint_context_json set but resume_attempt omitted (and vice versa) and
assert it raises the appropriate exception (e.g., ValueError) using
pytest.raises; add the same negative test alongside the existing positive cases
around the current blocks that create RecoveryResult (the spots that set
checkpoint_context_json and resume_attempt together) so the invariant is covered
in all three places.
In `@tests/unit/persistence/sqlite/test_migrations_v6.py`:
- Around line 18-21: The test duplicate asserts SCHEMA_VERSION == 6 in
TestSchemaVersion::test_schema_version_is_six; remove this duplicate test from
this file so the canonical check remains only in the existing test (the one in
the other test module), i.e., delete the TestSchemaVersion class or at least the
test_schema_version_is_six method referencing SCHEMA_VERSION to avoid redundant
assertions and maintenance overhead.
In `@tests/unit/persistence/test_migrations_v2.py`:
- Around line 31-32: The test test_schema_version_is_six should not hardcode the
literal 6; replace the equality assertion with a non-version-specific check
using the SCHEMA_VERSION symbol (or remove the exact-version assertion here and
move/keep exact-version checks in the newest migration test). Update both
occurrences (test_schema_version_is_six and the similar assertion at the other
location) to reference SCHEMA_VERSION or to assert behavior relative to the v2
baseline rather than == 6 so future migration bumps won’t force churn in this
v2-focused test.
---
Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 395-421: The current pipeline calls record_execution_costs(...)
before calling _apply_recovery(...) which can replace an ERROR termination with
a resumed/executed result and thus skip final cost recording; modify the flow so
that after apply_post_execution_transitions(...) and after calling
self._apply_recovery(...) (when execution_result.termination_reason was ERROR
and a recovery/resume occurred) you call record_execution_costs(...) again on
the new execution_result (using the same identity, agent_id, task_id,
tracker=self._cost_tracker) to ensure final costs for the resumed execution are
recorded; use the symbols execution_result, TerminationReason.ERROR,
pre_recovery_ctx/pre_recovery_status to detect recovery and re-run
record_execution_costs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0f947cd0-dfd7-489e-989f-6bed19b02c7f
📒 Files selected for processing (35)
CLAUDE.mdREADME.mddocs/design/engine.mdsrc/ai_company/engine/__init__.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/recovery.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/observability/events/execution.pysrc/ai_company/observability/events/persistence.pysrc/ai_company/persistence/protocol.pysrc/ai_company/persistence/repositories.pysrc/ai_company/persistence/sqlite/__init__.pysrc/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/persistence/sqlite/migrations.pytests/unit/api/conftest.pytests/unit/engine/checkpoint/__init__.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/engine/checkpoint/test_models.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/observability/test_events.pytests/unit/persistence/sqlite/test_checkpoint_repo.pytests/unit/persistence/sqlite/test_heartbeat_repo.pytests/unit/persistence/sqlite/test_migrations_v6.pytests/unit/persistence/test_migrations_v2.pytests/unit/persistence/test_protocol.py
💤 Files with no reviewable changes (1)
- docs/design/engine.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Agent
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
- GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT usefrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax:except A, B:(no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names:example-provider,example-large-001,example-medium-001,example-small-001. Tests must usetest-provider,test-small-001, etc.
Files:
tests/unit/engine/checkpoint/test_models.pysrc/ai_company/observability/events/execution.pytests/unit/persistence/test_migrations_v2.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/persistence/sqlite/__init__.pysrc/ai_company/engine/checkpoint/__init__.pytests/unit/engine/test_recovery_checkpoint_fields.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/react_loop.pytests/unit/persistence/sqlite/test_heartbeat_repo.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/engine/__init__.pysrc/ai_company/engine/agent_engine.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/engine/checkpoint/test_strategy.pysrc/ai_company/persistence/protocol.pytests/unit/persistence/sqlite/test_migrations_v6.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/persistence/sqlite/backend.pytests/unit/api/conftest.pysrc/ai_company/persistence/sqlite/migrations.pytests/unit/observability/test_events.pytests/unit/persistence/sqlite/test_checkpoint_repo.pysrc/ai_company/persistence/repositories.pysrc/ai_company/engine/checkpoint/strategy.pytests/unit/persistence/test_protocol.pysrc/ai_company/observability/events/persistence.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Tests must use markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: useasyncio_mode = "auto"— no manual@pytest.mark.asyncioneeded.
Test timeout: 30 seconds per test.
Prefer@pytest.mark.parametrizefor testing similar cases.
Files:
tests/unit/engine/checkpoint/test_models.pytests/unit/persistence/test_migrations_v2.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/persistence/sqlite/test_heartbeat_repo.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/persistence/sqlite/test_migrations_v6.pytests/unit/api/conftest.pytests/unit/observability/test_events.pytests/unit/persistence/sqlite/test_checkpoint_repo.pytests/unit/persistence/test_protocol.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction plusMappingProxyTypewrapping for read-only enforcement.
Fordict/listfields in frozen Pydantic models, rely onfrozen=Truefor field reassignment prevention and usecopy.deepcopy()at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 withBaseModel,model_validator,computed_field,ConfigDict. Use@computed_fieldfor derived values instead of storing redundant fields (e.g.,TokenUsage.total_tokens).
UseNotBlankStrfromcore.typesfor all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over barecreate_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER useimport logging,logging.getLogger(), orprint()in application code.
Always useloggeras the variable name for loggers (not_logger, notlog).
Event names must always use constants from domain-specific modules underai_company.observability.events(e.g.,PROVIDER_CALL_STARTfromevents.provider,BUDGET_RECORD_ADDEDfromevents.budget). Import dire...
Files:
src/ai_company/observability/events/execution.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/persistence/sqlite/__init__.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/react_loop.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/engine/__init__.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/persistence/protocol.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/migrations.pysrc/ai_company/persistence/repositories.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/observability/events/persistence.py
src/ai_company/**/[!_]*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Every module with business logic MUST have:
from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Files:
src/ai_company/observability/events/execution.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/react_loop.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/persistence/protocol.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/migrations.pysrc/ai_company/persistence/repositories.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/observability/events/persistence.py
🧠 Learnings (10)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/observability/events/execution.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/react_loop.pytests/unit/observability/test_events.pysrc/ai_company/observability/events/persistence.pyCLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/ai_company/**/[!_]*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
src/ai_company/engine/plan_execute_loop.pyCLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
src/ai_company/engine/checkpoint/models.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Structured logging: always use `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : NEVER use `import logging`, `logging.getLogger()`, or `print()` in application code.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Always use `logger` as the variable name for loggers (not `_logger`, not `log`).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : DEBUG logging for object creation, internal flow, and entry/exit of key functions.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Pure data models, enums, and re-exports do NOT need logging.
Applied to files:
CLAUDE.md
🧬 Code graph analysis (22)
tests/unit/engine/checkpoint/test_models.py (1)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-62)CheckpointConfig(85-117)Heartbeat(65-82)
src/ai_company/persistence/sqlite/__init__.py (2)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
SQLiteCheckpointRepository(27-174)src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
SQLiteHeartbeatRepository(26-162)
src/ai_company/engine/checkpoint/__init__.py (3)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-125)src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-62)CheckpointConfig(85-117)Heartbeat(65-82)src/ai_company/engine/checkpoint/strategy.py (1)
CheckpointRecoveryStrategy(33-199)
tests/unit/engine/test_recovery_checkpoint_fields.py (1)
src/ai_company/engine/recovery.py (4)
RecoveryResult(30-102)can_resume(100-102)recover(114-133)recover(152-206)
src/ai_company/engine/plan_execute_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
_checkpoint_callback(63-123)src/ai_company/engine/context.py (1)
AgentContext(87-307)
src/ai_company/engine/checkpoint/models.py (2)
src/ai_company/tools/base.py (1)
description(138-140)src/ai_company/engine/parallel_models.py (2)
agent_id(79-81)task_id(87-89)
src/ai_company/engine/checkpoint/callback_factory.py (4)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-62)CheckpointConfig(85-117)Heartbeat(65-82)src/ai_company/engine/context.py (1)
AgentContext(87-307)src/ai_company/persistence/repositories.py (11)
CheckpointRepository(484-533)HeartbeatRepository(537-595)save(48-57)save(114-123)save(170-180)save(207-216)save(284-294)save(335-344)save(415-424)save(487-496)save(540-549)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
src/ai_company/engine/recovery.py (1)
src/ai_company/engine/task_execution.py (1)
TaskExecution(60-231)
src/ai_company/engine/react_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
_checkpoint_callback(63-123)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
tests/unit/persistence/sqlite/test_heartbeat_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
Heartbeat(65-82)src/ai_company/persistence/sqlite/heartbeat_repo.py (5)
SQLiteHeartbeatRepository(26-162)save(36-61)get(63-88)get_stale(90-121)delete(123-145)tests/unit/persistence/sqlite/conftest.py (1)
migrated_db(26-34)
src/ai_company/persistence/sqlite/checkpoint_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
Checkpoint(25-62)src/ai_company/persistence/errors.py (1)
QueryError(33-34)src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
save(36-61)_row_to_model(147-162)get(63-88)
src/ai_company/engine/__init__.py (3)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-62)CheckpointConfig(85-117)Heartbeat(65-82)src/ai_company/engine/checkpoint/strategy.py (1)
CheckpointRecoveryStrategy(33-199)src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-125)
tests/unit/engine/checkpoint/test_callback_factory.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-125)src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(85-117)
src/ai_company/persistence/protocol.py (1)
src/ai_company/persistence/repositories.py (2)
CheckpointRepository(484-533)HeartbeatRepository(537-595)
tests/unit/persistence/sqlite/test_migrations_v6.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
run_migrations(337-405)tests/unit/persistence/test_migrations_v2.py (2)
test_schema_version_is_six(31-32)memory_db(21-26)
src/ai_company/engine/checkpoint/callback.py (1)
src/ai_company/engine/context.py (1)
AgentContext(87-307)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
Heartbeat(65-82)src/ai_company/persistence/errors.py (1)
QueryError(33-34)src/ai_company/persistence/repositories.py (11)
get(59-71)get(218-230)get(346-358)get(426-438)get(551-563)get_stale(565-581)delete(95-107)delete(260-272)delete(396-408)delete(468-480)delete(583-595)
src/ai_company/persistence/sqlite/backend.py (5)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
SQLiteCheckpointRepository(27-174)src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
SQLiteHeartbeatRepository(26-162)src/ai_company/persistence/protocol.py (2)
checkpoints(146-148)heartbeats(151-153)tests/unit/api/conftest.py (2)
checkpoints(472-473)heartbeats(476-477)tests/unit/persistence/test_protocol.py (2)
checkpoints(309-310)heartbeats(313-314)
tests/unit/api/conftest.py (4)
src/ai_company/engine/checkpoint/models.py (2)
Checkpoint(25-62)Heartbeat(65-82)src/ai_company/persistence/sqlite/checkpoint_repo.py (2)
get_latest(68-132)delete_by_execution(134-157)src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
get(63-88)get_stale(90-121)delete(123-145)src/ai_company/persistence/protocol.py (2)
checkpoints(146-148)heartbeats(151-153)
tests/unit/persistence/sqlite/test_checkpoint_repo.py (1)
src/ai_company/persistence/sqlite/checkpoint_repo.py (3)
save(37-66)get_latest(68-132)delete_by_execution(134-157)
src/ai_company/persistence/repositories.py (3)
src/ai_company/engine/checkpoint/models.py (2)
Checkpoint(25-62)Heartbeat(65-82)src/ai_company/persistence/sqlite/checkpoint_repo.py (2)
get_latest(68-132)delete_by_execution(134-157)src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
get(63-88)get_stale(90-121)delete(123-145)
src/ai_company/engine/checkpoint/strategy.py (6)
src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(85-117)src/ai_company/engine/context.py (2)
AgentContext(87-307)to_snapshot(280-302)src/ai_company/engine/recovery.py (4)
RecoveryResult(30-102)RecoveryStrategy(106-137)recover(114-133)recover(152-206)src/ai_company/engine/task_execution.py (1)
TaskExecution(60-231)src/ai_company/persistence/repositories.py (7)
CheckpointRepository(484-533)get_latest(498-519)get(59-71)get(218-230)get(346-358)get(426-438)get(551-563)src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
get_latest(68-132)
🪛 LanguageTool
CLAUDE.md
[style] ~154-~154: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...
(EG_NO_COMMA)
🔇 Additional comments (19)
tests/unit/observability/test_events.py (1)
11-15: LGTM!The checkpoint event imports, domain module discovery extension, and new test assertions follow the established patterns in this test file. Event constants are correctly imported from the domain-specific module
ai_company.observability.events.checkpoint.Also applies to: 227-227, 254-258
src/ai_company/engine/react_loop.py (2)
70-74: LGTM!The new
__init__method correctly accepts an optional checkpoint callback and stores it as an instance attribute. The type hint uses the forward reference pattern appropriately withTYPE_CHECKING.
209-220: LGTM!MemoryErrorcorrectly refers to the built-in.The checkpoint callback invocation with error handling is correct:
- Re-raises
MemoryError(built-in, for out-of-memory conditions) andRecursionError(non-recoverable system errors)- Logs and swallows other exceptions with structured context
- Uses PEP 758 except syntax correctly
The codebase defines a domain-specific
ai_company.memory.errors.MemoryErrorclass, butreact_loop.pydoes not import it, so the bareMemoryErrorhere unambiguously refers to the built-in Python exception.tests/unit/persistence/sqlite/test_migrations_v6.py (1)
24-135: LGTM!The V6 migration tests are comprehensive and well-structured:
- Table creation verification for both
checkpointsandheartbeats- Column schema validation with expected field sets
- Index creation verification with appropriate naming patterns
- Idempotency test ensuring migrations can run twice safely
All test classes are properly marked with
@pytest.mark.unitand follow the established async test patterns.src/ai_company/persistence/sqlite/__init__.py (1)
7-12: LGTM!The new checkpoint and heartbeat repository exports are correctly added following the existing pattern. The
__all__list maintains alphabetical ordering.Also applies to: 26-26, 28-28
tests/unit/engine/checkpoint/test_models.py (1)
1-268: LGTM!Comprehensive test coverage for the checkpoint models:
Checkpoint: ID generation, immutability, JSON validation, turn number constraints, NotBlankStr validationHeartbeat: Field assignment and immutabilityCheckpointConfig: Default values, custom values, immutability, and field constraintsThe use of
@pytest.mark.parametrizefor blank/whitespace field rejection (lines 156-170) follows best practices. All test classes are properly marked with@pytest.mark.unit.src/ai_company/engine/checkpoint/__init__.py (1)
1-23: LGTM!Clean module initializer with appropriate public API surface:
- Exports all relevant checkpoint types:
Checkpoint,CheckpointCallback,CheckpointConfig,CheckpointRecoveryStrategy,Heartbeat,make_checkpoint_callback- Module docstring accurately describes the checkpoint recovery functionality
__all__is alphabetically orderedtests/unit/persistence/sqlite/test_checkpoint_repo.py (1)
1-238: LGTM!Comprehensive test coverage for
SQLiteCheckpointRepository:
- Save/retrieve roundtrip with field equality verification
- Highest turn number selection when multiple checkpoints exist
- Filtering by
execution_id,task_id, and combined filters- Edge cases: no matching checkpoint (returns
None), no filters (raisesValueError)- Upsert behavior verification
- Deletion with count validation and isolation checks
The helper function
_make_checkpointuses keyword-only arguments for clarity, and all tests are properly marked with@pytest.mark.unit.tests/unit/engine/checkpoint/test_callback_factory.py (1)
77-382: LGTM!Comprehensive test coverage for the checkpoint callback factory:
- Boundary turn detection with various
persist_every_n_turnsconfigurations- Heartbeat update behavior (on save vs. on skip)
- Error handling: non-fatal errors swallowed,
MemoryError/RecursionErrorpropagated- Checkpoint failure correctly skips heartbeat to avoid inconsistent state
The tests use
test-providerandtest-small-001as per coding guidelines, and all test classes are properly marked with@pytest.mark.unit.src/ai_company/engine/checkpoint/callback_factory.py (1)
1-125: LGTM!The callback factory is well-structured with proper best-effort error handling. The PEP 758
except A, B:syntax for Python 3.14+ is correctly applied, and the logic properly:
- Skips turn 0 and non-boundary turns
- Persists checkpoints before updating heartbeats (avoiding limbo states)
- Propagates critical errors while swallowing recoverable ones
src/ai_company/observability/events/checkpoint.py (1)
1-25: LGTM!Event constants are well-organized into logical groups (checkpoint lifecycle, heartbeat lifecycle, recovery flow) with consistent naming conventions.
tests/unit/engine/checkpoint/test_strategy.py (1)
1-438: LGTM!Comprehensive test coverage for
CheckpointRecoveryStrategyincluding:
- Protocol compliance verification
- Resume behavior with valid checkpoints
- Fallback scenarios (no checkpoint, max attempts exhausted, repo errors)
- Per-execution counter tracking and independence
- Custom fallback strategy injection
Test structure follows guidelines with proper markers and timeout configuration.
src/ai_company/persistence/repositories.py (1)
483-595: LGTM!The
CheckpointRepositoryandHeartbeatRepositoryprotocols are well-defined with:
- Consistent method signatures matching existing repository patterns
- Clear preconditions documented (e.g.,
get_latestrequires at least one filter)- Appropriate return types (optional for single lookups, tuples for collections)
- Proper error documentation
src/ai_company/engine/plan_execute_loop.py (2)
86-97: LGTM!Clean integration of checkpoint callback into the loop constructor with appropriate optional typing and a useful
configproperty for external access.
768-792: LGTM!The
_invoke_checkpoint_callbackmethod correctly:
- Guards against None callback
- Propagates critical errors (MemoryError, RecursionError)
- Logs and swallows other exceptions to prevent checkpoint failures from interrupting execution
- Includes useful error context in the log
src/ai_company/engine/__init__.py (1)
31-38: LGTM!Checkpoint-related entities are properly re-exported from the engine package, maintaining the established pattern for public API exposure.
src/ai_company/engine/recovery.py (1)
61-102: LGTM!Well-designed extension to
RecoveryResult:
checkpoint_context_jsonandresume_attemptfields with appropriate defaults- Consistency validator ensures both are set together or neither (prevents invalid states)
can_resumecomputed field provides a clean API for downstream consumerssrc/ai_company/engine/checkpoint/models.py (1)
1-117: LGTM!Well-designed checkpoint models following frozen Pydantic patterns:
Checkpoint: Proper validation ofcontext_jsonas valid JSON, auto-generated ID, UTC timestampsHeartbeat: Clean liveness tracking structureCheckpointConfig: Sensible defaults with appropriate constraints (allowsmax_resume_attempts=0for disabling resume)All models correctly use
NotBlankStrfor identifiers andAwareDatetimefor timestamps.src/ai_company/engine/agent_engine.py (1)
174-183: Good guard for paired checkpoint dependencies.The constructor validation at Line 174 prevents partial checkpoint wiring and fails fast with a clear message.
| async def _resume_from_checkpoint( | ||
| self, | ||
| recovery_result: RecoveryResult, | ||
| agent_id: str, | ||
| task_id: str, | ||
| *, | ||
| completion_config: CompletionConfig | None = None, | ||
| effective_autonomy: EffectiveAutonomy | None = None, | ||
| ) -> ExecutionResult: |
There was a problem hiding this comment.
Resume path bypasses engine timeout/budget orchestration.
At Line 758, resume executes the loop directly instead of going through _run_loop_with_timeout, so resumed executions are no longer governed by the same timeout control path. Also, the budget checker is built ad hoc instead of reusing the _execute path (including budget-enforcer-specific checker creation).
Suggested direction
- result = await loop.execute(
- context=checkpoint_ctx,
- provider=self._provider,
- ...
- )
+ result = await self._run_loop_with_timeout(
+ ctx=checkpoint_ctx,
+ agent_id=agent_id,
+ task_id=task_id,
+ completion_config=completion_config,
+ budget_checker=budget_checker, # built using same policy as _execute
+ tool_invoker=...,
+ start=resume_start,
+ timeout_seconds=timeout_seconds,
+ )Also applies to: 756-773
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/agent_engine.py` around lines 696 - 704, The resume
path in _resume_from_checkpoint currently runs the agent loop directly and
builds a budget checker ad hoc, bypassing the engine's timeout and budget
orchestration; change _resume_from_checkpoint to reuse the existing execution
orchestration by invoking _run_loop_with_timeout (the same wrapper used for
fresh runs) and call or delegate to _execute (or its budget-checker factory) to
construct the budget-enforcer-specific checker so resumed executions are subject
to the same timeout/budget logic as new runs; ensure you pass recovery_result,
agent_id, task_id, completion_config, and effective_autonomy through the same
parameters that _run_loop_with_timeout/_execute expect and remove the ad-hoc
loop and checker creation in favor of those shared helpers (references:
_resume_from_checkpoint, _run_loop_with_timeout, _execute, budget checker
creation).
Critical fixes:
- execution_id mismatch: use checkpoint_ctx.execution_id for cleanup
instead of fabricated f"{agent_id}:{task_id}" (Copilot, Greptile,
code-reviewer, python-reviewer, silent-failure, security)
- PEP 758 except syntax in models.py
- Extract _resume_from_checkpoint (118→35 lines) into helpers
- Extract recover() (108→30 lines) into smaller methods
- Move _make_loop_with_callback to checkpoint/resume.py
- Create engine/checkpoint/resume.py for resume orchestration helpers
Major fixes:
- Include actual error_message in reconciliation (not hardcoded text)
- Move _delegate_to_fallback outside lock scope in strategy
- Clean up heartbeat after successful resume (Greptile)
- Conditional checkpoint cleanup — skip on ERROR (CodeRabbit)
- Fix EXECUTION_LOOP_STARTED → EXECUTION_LOOP_START in CLAUDE.md
- Fix PARALLEL_EXECUTION_STARTED → PARALLEL_GROUP_START in CLAUDE.md
- Add 3 missing checkpoint fields to RecoveryResult table in design spec
- Remove non-existent storage field from YAML config example
- Update reconciliation description to match implementation
- Make clear_resume_count async with lock for thread safety
- Normalize heartbeat timestamps to UTC for correct comparisons
- Narrow except Exception to PersistenceError in strategy
- Split make_checkpoint_callback (92→3 functions under 50 lines)
Medium fixes:
- Use CHECKPOINT_DELETE_FAILED event constant for cleanup warnings
- Add exc_info=True on cleanup warning logs
- Log before raising RuntimeError on None checkpoint_context_json
- Type clear_resume_count parameter as NotBlankStr
- Add JSON validation to RecoveryResult.checkpoint_context_json
- Bound _resume_counts dict with _MAX_TRACKED_EXECUTIONS eviction
- Update RecoveryStrategy protocol docstring for checkpoint recovery
- Add checkpoint recovery to README capabilities
- Update ReactLoop checkpoint_callback docstring
- FakeCheckpointRepository.get_latest validates no-filter ValueError
- Add test for RecoveryResult._validate_checkpoint_consistency
- Add CHECKPOINT_UNSUPPORTED_LOOP event constant
Minor fixes:
- Fix checkpoint __init__.py docstring (configurable intervals)
- Fix callback.py type alias docstring (may skip)
- Fix callback_factory docstring (turn-0 skip mention)
- FakeHeartbeatRepository.get_stale sorts by timestamp
- Add turn=0 parametrize cases to test_callback_factory
- Tighten context_json validator to reject arrays/primitives
- Add composite (task_id, turn_number) index
- Add CHECK (turn_number >= 0) constraint to DDL
- Remove duplicate test_schema_version from test_migrations_v2
- Use SCHEMA_VERSION constant in test_migrations_v2
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/ai_company/engine/react_loop.py (1)
209-240:⚠️ Potential issue | 🔴 CriticalCheckpoint is written before tool side effects are finalized.
At Line 209, the callback runs before tool execution. If a crash happens after a tool mutates external state but before the next checkpoint, resume can replay that tool path and duplicate side effects.
💡 Safer checkpoint boundary (after tool execution for tool-call turns)
- if self._checkpoint_callback is not None: - try: - await self._checkpoint_callback(ctx) - except MemoryError, RecursionError: - raise - except Exception as exc: - logger.exception( - EXECUTION_CHECKPOINT_CALLBACK_FAILED, - execution_id=ctx.execution_id, - turn=turn_number, - error=f"{type(exc).__name__}: {exc}", - ) - if not response.tool_calls: + await self._run_checkpoint_callback(ctx, turn_number) return self._handle_completion(ctx, response, turns) @@ - return await execute_tool_calls( + post_tool = await execute_tool_calls( ctx, tool_invoker, response, turn_number, turns, ) + if isinstance(post_tool, ExecutionResult): + return post_tool + await self._run_checkpoint_callback(post_tool, turn_number) + return post_tool + + async def _run_checkpoint_callback( + self, + ctx: AgentContext, + turn_number: int, + ) -> None: + if self._checkpoint_callback is None: + return + try: + await self._checkpoint_callback(ctx) + except MemoryError, RecursionError: + raise + except Exception as exc: + logger.exception( + EXECUTION_CHECKPOINT_CALLBACK_FAILED, + execution_id=ctx.execution_id, + turn=turn_number, + error=f"{type(exc).__name__}: {exc}", + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/react_loop.py` around lines 209 - 240, The checkpoint callback is currently invoked before tool execution, which risks checkpointing prior to tool side-effects; change the flow so _checkpoint_callback is invoked after tool execution for turns with tool calls: keep the existing call before _handle_completion when response.tool_calls is empty, but remove the pre-tool invocation and instead, after execute_tool_calls(...) returns (and after check_shutdown handling), run await self._checkpoint_callback(ctx) with the same exception handling (re-raise MemoryError/RecursionError, log other exceptions using EXECUTION_CHECKPOINT_CALLBACK_FAILED with execution_id and turn_number). Ensure you call it only once per turn and reference the existing symbols _checkpoint_callback, execute_tool_calls, check_shutdown, clear_last_turn_tool_calls, and _handle_completion to locate where to move the invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/engine.md`:
- Around line 575-579: The earlier bullet about reconciliation currently implies
environment-state reconciliation is implemented; update that bullet in
docs/design/engine.md so it matches the later statement that only a simple
reconciliation message (agent receives system message with resume turn number
and error) is implemented now and that richer reconciliation
(workspace/environment-state detection) is planned future work—replace wording
that suggests full environment-state reconciliation with language explicitly
stating it is not yet implemented and refers readers to the "reconciliation
message" / "resume from a checkpoint" behavior described later.
In `@src/ai_company/engine/agent_engine.py`:
- Around line 741-767: The resumed-execution path in _execute_resumed_loop
builds a budget checker via make_budget_checker(...) directly, which bypasses
any configured BudgetEnforcer logic used in the normal _execute path; change
_execute_resumed_loop to construct the budget checker the same way as _execute
by preferring self._budget_enforcer.make_budget_checker(...) when
self._budget_enforcer is present (fall back to make_budget_checker(...) only if
no enforcer is set), ensuring the same per-agent allocation, downgrade policies,
and in-flight checks are applied for resumed runs.
In `@src/ai_company/engine/checkpoint/resume.py`:
- Around line 157-162: The heartbeat cleanup failure in resume.py is logged
using CHECKPOINT_DELETE_FAILED which mixes event types; change the
logger.warning call inside the heartbeat-delete branch to use a
heartbeat-specific event constant (e.g., HEARTBEAT_DELETE_FAILED or
CHECKPOINT_HEARTBEAT_DELETE_FAILED) instead of CHECKPOINT_DELETE_FAILED, and if
that constant doesn't exist add it to the events/constants module and import it
into src/ai_company/engine/checkpoint/resume.py so the logger call references
the correct heartbeat failure event (preserve the same message, execution_id,
error, and exc_info parameters).
- Line 59: Wrap the call to AgentContext.model_validate_json(checkpoint_json) in
a try/except that catches the validation/deserialization exception, log a
warning or error with context (e.g., the checkpoint identifier or a short
preview of checkpoint_json and the exception) using the module's logger, then
re-raise the exception; specifically update the code around
AgentContext.model_validate_json(checkpoint_json) to emit a process/logger.error
or logger.warning with the exception details before raising.
In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 191-218: The eviction in _reserve_resume_attempt relies on dict
insertion order via self._resume_counts and uses next(iter(...)) which yields
FIFO by insertion time, not least-recently-used; to adopt true LRU behavior
replace the dict with collections.OrderedDict (update its declaration where
_resume_counts is created), call move_to_end(execution_id) whenever you access
or increment an entry in _reserve_resume_attempt (i.e., after reading or
updating resume_count) and perform eviction with popitem(last=False) instead of
next(iter(...)) so the least-recently-used key is removed when
len(self._resume_counts) > _MAX_TRACKED_EXECUTIONS; keep the existing logging
and pop(execution_id, None) when exhausting attempts.
---
Outside diff comments:
In `@src/ai_company/engine/react_loop.py`:
- Around line 209-240: The checkpoint callback is currently invoked before tool
execution, which risks checkpointing prior to tool side-effects; change the flow
so _checkpoint_callback is invoked after tool execution for turns with tool
calls: keep the existing call before _handle_completion when response.tool_calls
is empty, but remove the pre-tool invocation and instead, after
execute_tool_calls(...) returns (and after check_shutdown handling), run await
self._checkpoint_callback(ctx) with the same exception handling (re-raise
MemoryError/RecursionError, log other exceptions using
EXECUTION_CHECKPOINT_CALLBACK_FAILED with execution_id and turn_number). Ensure
you call it only once per turn and reference the existing symbols
_checkpoint_callback, execute_tool_calls, check_shutdown,
clear_last_turn_tool_calls, and _handle_completion to locate where to move the
invocation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b9cf14e2-89ad-4ffd-8ca0-83cd315999f4
📒 Files selected for processing (21)
CLAUDE.mdREADME.mddocs/design/engine.mdsrc/ai_company/engine/agent_engine.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/recovery.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/persistence/sqlite/migrations.pytests/unit/api/conftest.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/engine/checkpoint/test_models.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/persistence/test_migrations_v2.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
- GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (6)
docs/design/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
When approved deviations occur, update the relevant
docs/design/page to reflect the new reality.
Files:
docs/design/engine.md
docs/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Documentation source in
docs/built with Zensical. Design spec indocs/design/(7 pages). Architecture indocs/architecture/. Roadmap indocs/roadmap/. Security indocs/security.md.
Files:
docs/design/engine.md
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT usefrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax:except A, B:(no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names:example-provider,example-large-001,example-medium-001,example-small-001. Tests must usetest-provider,test-small-001, etc.
Files:
tests/unit/engine/checkpoint/test_callback_factory.pytests/unit/persistence/test_migrations_v2.pysrc/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/callback.pytests/unit/api/conftest.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/recovery.pytests/unit/engine/checkpoint/test_models.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/persistence/sqlite/migrations.pytests/unit/engine/test_recovery_checkpoint_fields.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/agent_engine.pytests/unit/engine/checkpoint/test_strategy.pysrc/ai_company/observability/events/checkpoint.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Tests must use markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: useasyncio_mode = "auto"— no manual@pytest.mark.asyncioneeded.
Test timeout: 30 seconds per test.
Prefer@pytest.mark.parametrizefor testing similar cases.
Files:
tests/unit/engine/checkpoint/test_callback_factory.pytests/unit/persistence/test_migrations_v2.pytests/unit/api/conftest.pytests/unit/engine/checkpoint/test_models.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/engine/checkpoint/test_strategy.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction plusMappingProxyTypewrapping for read-only enforcement.
Fordict/listfields in frozen Pydantic models, rely onfrozen=Truefor field reassignment prevention and usecopy.deepcopy()at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 withBaseModel,model_validator,computed_field,ConfigDict. Use@computed_fieldfor derived values instead of storing redundant fields (e.g.,TokenUsage.total_tokens).
UseNotBlankStrfromcore.typesfor all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over barecreate_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER useimport logging,logging.getLogger(), orprint()in application code.
Always useloggeras the variable name for loggers (not_logger, notlog).
Event names must always use constants from domain-specific modules underai_company.observability.events(e.g.,PROVIDER_CALL_STARTfromevents.provider,BUDGET_RECORD_ADDEDfromevents.budget). Import dire...
Files:
src/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/recovery.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/persistence/sqlite/migrations.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/observability/events/checkpoint.py
src/ai_company/**/[!_]*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Every module with business logic MUST have:
from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Files:
src/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/recovery.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/persistence/sqlite/migrations.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/observability/events/checkpoint.py
🧠 Learnings (11)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow exceptions.
Applied to files:
src/ai_company/engine/checkpoint/strategy.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/engine/react_loop.pyCLAUDE.mdsrc/ai_company/observability/events/checkpoint.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Applied to files:
src/ai_company/engine/checkpoint/models.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
src/ai_company/engine/agent_engine.pyCLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/ai_company/**/[!_]*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Structured logging: always use `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : NEVER use `import logging`, `logging.getLogger()`, or `print()` in application code.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Always use `logger` as the variable name for loggers (not `_logger`, not `log`).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : DEBUG logging for object creation, internal flow, and entry/exit of key functions.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Pure data models, enums, and re-exports do NOT need logging.
Applied to files:
CLAUDE.md
🧬 Code graph analysis (12)
tests/unit/engine/checkpoint/test_callback_factory.py (4)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(88-120)src/ai_company/engine/context.py (2)
AgentContext(87-307)from_identity(140-171)src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
save(37-67)
tests/unit/persistence/test_migrations_v2.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
get_user_version(261-265)tests/unit/persistence/sqlite/conftest.py (1)
memory_db(15-22)
src/ai_company/engine/checkpoint/resume.py (6)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)src/ai_company/engine/context.py (2)
AgentContext(87-307)with_message(173-182)src/ai_company/providers/enums.py (1)
MessageRole(6-12)src/ai_company/providers/models.py (1)
ChatMessage(138-210)src/ai_company/engine/loop_protocol.py (1)
ExecutionLoop(151-189)src/ai_company/persistence/repositories.py (2)
CheckpointRepository(484-533)HeartbeatRepository(537-595)
src/ai_company/engine/checkpoint/callback.py (1)
src/ai_company/engine/context.py (1)
AgentContext(87-307)
tests/unit/api/conftest.py (2)
src/ai_company/engine/checkpoint/models.py (2)
Checkpoint(25-65)Heartbeat(68-85)src/ai_company/persistence/protocol.py (2)
checkpoints(146-148)heartbeats(151-153)
src/ai_company/engine/checkpoint/callback_factory.py (4)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-65)CheckpointConfig(88-120)Heartbeat(68-85)src/ai_company/engine/context.py (1)
AgentContext(87-307)src/ai_company/persistence/repositories.py (5)
CheckpointRepository(484-533)HeartbeatRepository(537-595)save(48-57)save(114-123)save(170-180)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
src/ai_company/engine/recovery.py (1)
src/ai_company/engine/task_execution.py (1)
TaskExecution(60-231)
tests/unit/engine/checkpoint/test_models.py (1)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-65)CheckpointConfig(88-120)Heartbeat(68-85)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
Heartbeat(68-85)src/ai_company/persistence/errors.py (1)
QueryError(33-34)src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
_row_to_model(159-174)
src/ai_company/engine/react_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
_checkpoint_callback(63-76)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
src/ai_company/persistence/sqlite/migrations.py (2)
tests/unit/persistence/sqlite/test_user_repo.py (1)
db(22-28)tests/unit/hr/test_persistence.py (1)
db(29-35)
src/ai_company/engine/agent_engine.py (6)
src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(88-120)src/ai_company/engine/checkpoint/resume.py (2)
cleanup_after_resume(119-162)deserialize_and_reconcile(37-76)src/ai_company/engine/checkpoint/strategy.py (1)
CheckpointRecoveryStrategy(41-264)src/ai_company/engine/recovery.py (4)
FailAndReassignStrategy(150-220)RecoveryResult(31-112)RecoveryStrategy(116-147)can_resume(110-112)src/ai_company/persistence/repositories.py (2)
CheckpointRepository(484-533)HeartbeatRepository(537-595)src/ai_company/engine/plan_execute_loop.py (1)
execute(103-181)
🪛 LanguageTool
CLAUDE.md
[style] ~154-~154: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...
(EG_NO_COMMA)
🔇 Additional comments (31)
CLAUDE.md (2)
107-107: LGTM!The engine description accurately reflects the new checkpoint recovery feature with per-turn persistence, heartbeat detection, and the
CheckpointRecoveryStrategy.
154-154: LGTM!The event names list is comprehensive and correctly documents the new checkpoint-related event constants (
CHECKPOINT_SAVED,PERSISTENCE_CHECKPOINT_SAVED, etc.) with the proper import pattern.src/ai_company/persistence/sqlite/migrations.py (2)
224-256: LGTM!The V6 migration DDL is well-structured:
checkpointstable with appropriate constraints (CHECK (turn_number >= 0)) and indexes for efficient lookups byexecution_idandtask_idheartbeatstable withexecution_idas primary key and index onlast_heartbeat_atfor stale detection queries- Good documentation explaining the ascending index choice for SQLite portability and the rationale for no FK constraints on ephemeral recovery data
320-334: LGTM!The
_apply_v6function follows the established migration pattern and is correctly registered in_MIGRATIONS.src/ai_company/engine/checkpoint/models.py (2)
25-65: LGTM!The
Checkpointmodel is well-designed:
- Frozen for immutability
NotBlankStrfor identifier fieldsAwareDatetimeensures timezone-aware timestamps- The
_validate_context_jsonvalidator correctly ensures the serialized context is valid JSON and specifically a JSON object (not primitive or array)
68-120: LGTM!Both
HeartbeatandCheckpointConfigmodels follow the established patterns:
- Frozen configuration with appropriate field constraints
- Clear documentation of reserved fields (
heartbeat_interval_secondsfor future background loop)- Sensible defaults (persist every turn, 30s heartbeat interval, 2 max resume attempts)
tests/unit/api/conftest.py (2)
338-395: LGTM!The fake repository implementations are well-crafted:
FakeCheckpointRepository.get_latestcorrectly enforces the contract requiring at least one ofexecution_idortask_id(Lines 353-355)get_stalesorts bylast_heartbeat_atfor consistent ordering- Both repositories provide complete method coverage matching the protocol expectations
411-412: LGTM!The
FakePersistenceBackendis properly extended with checkpoint and heartbeat repository instances and property accessors, aligning with thePersistenceBackendprotocol.Also applies to: 476-482
src/ai_company/engine/checkpoint/strategy.py (4)
41-71: LGTM!The
CheckpointRecoveryStrategyinitialization is clean:
- Proper dependency injection for
checkpoint_repo,config, and optionalfallback- Default fallback to
FailAndReassignStrategyis sensibleasyncio.Lockfor thread-safe resume count management
72-132: LGTM!The
recover()method is well-structured after refactoring:
- Clear flow: load checkpoint → check if exists → reserve attempt → build result or delegate
- Proper logging at recovery start
- Helper methods handle the complexity, keeping this orchestrator readable
155-189: LGTM!The
_load_latest_checkpointhelper properly:
- Re-raises
MemoryError, RecursionError(PEP 758 syntax)- Catches
PersistenceErrorspecifically and logs before returningNone- Logs at appropriate levels (exception for errors, info for no checkpoint, debug for success)
220-250: LGTM!The
_build_resume_resulthelper correctly assembles theRecoveryResultwith checkpoint context and logs the resume attempt details.src/ai_company/engine/checkpoint/__init__.py (1)
1-29: LGTM!The package
__init__.pycleanly exposes the public API for checkpoint recovery with a clear docstring and explicit__all__definition.README.md (2)
34-34: LGTM!Documentation accurately reflects the new crash recovery capability in the Agent Orchestration section.
114-114: LGTM!The architecture diagram correctly adds the
Persistence -.-> Engineedge to reflect the new checkpoint and heartbeat persistence relationship.tests/unit/persistence/test_migrations_v2.py (1)
58-58: LGTM!Good improvement — using
SCHEMA_VERSIONinstead of hardcoded values makes this test resilient to future migration bumps while still validating that migrations run successfully.src/ai_company/engine/checkpoint/callback.py (1)
13-14: Callback type contract is clean and consistent.The alias is clear and matches the async callback shape used by loop integrations.
tests/unit/engine/test_recovery_checkpoint_fields.py (1)
97-127: Checkpoint consistency negative paths are now properly pinned.Good addition of both failure cases (
json without attempt,attempt without json) to guard the invariant.tests/unit/engine/checkpoint/test_models.py (1)
106-137: JSON validation test coverage is solid.Good coverage for valid object JSON plus invalid array/primitive/invalid-string paths.
tests/unit/engine/checkpoint/test_callback_factory.py (1)
147-163: Turn-0 boundary coverage is correctly added.The parametrized matrix now explicitly validates the “turn 0 always skipped” guard.
tests/unit/engine/checkpoint/test_strategy.py (1)
191-255: Max-attempt fallback behavior is well exercised.The sequence tests for attempts 1/2 then fallback (and the zero-attempt edge) are strong and clear.
src/ai_company/persistence/sqlite/heartbeat_repo.py (2)
37-67: LGTM! UTC normalization properly implemented.The UTC normalization at lines 43-45 correctly addresses the lexicographic comparison issue for timestamps with different timezone offsets. The
savemethod properly convertslast_heartbeat_atto UTC before persisting, ensuring consistent string ordering in SQLite queries.
96-127: LGTM! Threshold normalization ensures consistent stale detection.The
get_stalemethod correctly normalizes the threshold to UTC at line 103, ensuring lexicographic comparisons work correctly regardless of the input timezone.src/ai_company/engine/recovery.py (2)
72-92: LGTM! Cross-field validation ensures checkpoint data consistency.The
_validate_checkpoint_consistencyvalidator correctly enforces the invariant thatcheckpoint_context_jsonandresume_attemptmust be set together (both non-default) or both at defaults. The JSON validation also ensures the checkpoint context is a valid JSON object.
106-112: LGTM! Clean computed property for resume capability.The
can_resumeproperty provides a simple boolean check derived fromcheckpoint_context_json, keeping the API clean and avoiding redundant state.src/ai_company/engine/checkpoint/callback_factory.py (2)
63-76: LGTM! Clean callback implementation with correct sequencing.The callback correctly:
- Skips turn 0 and non-boundary turns
- Only updates heartbeat after checkpoint save succeeds (preventing orphaned heartbeats)
78-104: LGTM! Best-effort checkpoint save with appropriate error isolation.The
_save_checkpointhelper correctly re-raisesMemoryErrorandRecursionError(non-recoverable), while swallowing other exceptions to avoid crashing the execution loop. The return value communicates success to control heartbeat update.src/ai_company/observability/events/checkpoint.py (1)
1-28: LGTM! Well-organized event constants.The constants follow the established naming patterns and provide complete coverage for checkpoint lifecycle, heartbeat lifecycle, loop integration, and recovery flow events.
src/ai_company/engine/agent_engine.py (3)
173-181: LGTM! Proper validation for paired repository configuration.The validation ensures that
checkpoint_repoandheartbeat_repoare either both provided or both omitted, preventing misconfiguration.
691-699: LGTM! Error logging before raising addressed.The code now logs with structured context (
agent_id,task_id,error) at ERROR level before raisingRuntimeError, satisfying the guideline that all error paths must log before raising.
783-795: LGTM! Checkpoint cleanup correctly conditioned on non-ERROR termination.The check at line 783 ensures checkpoints and heartbeats are only deleted after successful resume completion, preserving them for potential subsequent resume attempts if the resumed execution itself ends in an error.
| async def _reserve_resume_attempt( | ||
| self, | ||
| execution_id: str, | ||
| task_id: str, | ||
| ) -> bool: | ||
| """Reserve a resume attempt, returning ``True`` when exhausted.""" | ||
| async with self._resume_lock: | ||
| resume_count = self._resume_counts.get(execution_id, 0) | ||
| if resume_count >= self._config.max_resume_attempts: | ||
| logger.info( | ||
| CHECKPOINT_RECOVERY_FALLBACK, | ||
| execution_id=execution_id, | ||
| task_id=task_id, | ||
| resume_count=resume_count, | ||
| max_resume_attempts=self._config.max_resume_attempts, | ||
| reason="max_resume_attempts_exhausted", | ||
| ) | ||
| self._resume_counts.pop(execution_id, None) | ||
| return True | ||
|
|
||
| self._resume_counts[execution_id] = resume_count + 1 | ||
|
|
||
| # Evict oldest entries when the dict grows too large | ||
| if len(self._resume_counts) > _MAX_TRACKED_EXECUTIONS: | ||
| oldest = next(iter(self._resume_counts)) | ||
| self._resume_counts.pop(oldest, None) | ||
|
|
||
| return False |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider FIFO eviction ordering.
The eviction logic uses next(iter(self._resume_counts)) to remove the "oldest" entry. However, since Python 3.7+ dicts maintain insertion order, this removes the first inserted key, not necessarily the least recently used. If an execution is resumed multiple times, it stays at its original insertion position.
This is acceptable for a safety bound (the goal is preventing unbounded growth, not perfect LRU), but worth noting. If precise eviction semantics matter, an OrderedDict with move-to-end on access would be needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/checkpoint/strategy.py` around lines 191 - 218, The
eviction in _reserve_resume_attempt relies on dict insertion order via
self._resume_counts and uses next(iter(...)) which yields FIFO by insertion
time, not least-recently-used; to adopt true LRU behavior replace the dict with
collections.OrderedDict (update its declaration where _resume_counts is
created), call move_to_end(execution_id) whenever you access or increment an
entry in _reserve_resume_attempt (i.e., after reading or updating resume_count)
and perform eviction with popitem(last=False) instead of next(iter(...)) so the
least-recently-used key is removed when len(self._resume_counts) >
_MAX_TRACKED_EXECUTIONS; keep the existing logging and pop(execution_id, None)
when exhausting attempts.
CodeRabbit (5 new actionable comments on commit 6a8cfa7): - Resume path now uses BudgetEnforcer when configured (was bypassing enforcer-specific features like per-agent allocation, auto-downgrade) - Log deserialization errors before re-raising in resume.py - Use HEARTBEAT_DELETE_FAILED event constant for heartbeat cleanup failures (was incorrectly using CHECKPOINT_DELETE_FAILED) - Fix reconciliation bullet in design spec (was inconsistent with implementation description below it) - Add comment explaining checkpoint placement before tool execution (intentional: preserves LLM response on mid-tool crash) Greptile (3 findings, confidence 3/5): - Clean up orphaned checkpoint/heartbeat rows on max-resume fallback path (delegate_to_fallback now calls cleanup_after_resume) - Add heartbeat_repo to CheckpointRecoveryStrategy for fallback cleanup - Document in-memory resume counter limitation with TODO comment (persisting counter across restarts is a planned improvement) - Wrong heartbeat event constant — already fixed above
There was a problem hiding this comment.
Pull request overview
This PR implements the checkpoint recovery strategy (§6.6 Strategy 2) for agent crash recovery, enabling per-turn AgentContext persistence, heartbeat-based failure detection, and resume from last checkpoint with reconciliation. It falls back to fail-and-reassign after max resume attempts are exhausted.
Changes:
- Adds
Checkpoint,Heartbeat,CheckpointConfigfrozen models,CheckpointRecoveryStrategywith async-safe resume counter, callback factory, and resume/cleanup helpers in a newengine/checkpoint/package - Adds
CheckpointRepositoryandHeartbeatRepositoryprotocols with SQLite implementations, V6 migration (checkpoints + heartbeats tables with indexes), and integrates them into thePersistenceBackendprotocol - Integrates checkpoint callbacks into
ReactLoopandPlanExecuteLoop, adds resume path inAgentEnginewithcompletion_config/effective_autonomythreading, and extendsRecoveryResultwithcheckpoint_context_json,resume_attempt, andcan_resumefields
Reviewed changes
Copilot reviewed 35 out of 36 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/ai_company/engine/checkpoint/__init__.py |
Package init re-exporting all checkpoint public symbols |
src/ai_company/engine/checkpoint/callback.py |
CheckpointCallback type alias definition |
src/ai_company/engine/checkpoint/callback_factory.py |
Factory closure for checkpoint+heartbeat persistence per turn |
src/ai_company/engine/checkpoint/models.py |
Checkpoint, Heartbeat, CheckpointConfig frozen Pydantic models |
src/ai_company/engine/checkpoint/resume.py |
Deserialization, reconciliation, loop-with-callback builder, cleanup helpers |
src/ai_company/engine/checkpoint/strategy.py |
CheckpointRecoveryStrategy with async-safe resume counter and fallback |
src/ai_company/engine/recovery.py |
Extended RecoveryResult with checkpoint fields and consistency validator |
src/ai_company/engine/react_loop.py |
Checkpoint callback integration after LLM response |
src/ai_company/engine/plan_execute_loop.py |
Checkpoint callback integration via _invoke_checkpoint_callback |
src/ai_company/engine/agent_engine.py |
Resume path, loop-with-callback, paired repo validation, completion_config/effective_autonomy threading |
src/ai_company/engine/__init__.py |
Re-exports checkpoint symbols |
src/ai_company/persistence/repositories.py |
CheckpointRepository and HeartbeatRepository protocol definitions |
src/ai_company/persistence/protocol.py |
Added checkpoints and heartbeats properties to PersistenceBackend |
src/ai_company/persistence/sqlite/checkpoint_repo.py |
SQLite checkpoint repository implementation |
src/ai_company/persistence/sqlite/heartbeat_repo.py |
SQLite heartbeat repository implementation |
src/ai_company/persistence/sqlite/backend.py |
Integrated checkpoint/heartbeat repos into SQLite backend |
src/ai_company/persistence/sqlite/__init__.py |
Re-exports new SQLite repos |
src/ai_company/persistence/sqlite/migrations.py |
V6 migration (checkpoints + heartbeats tables with indexes) |
src/ai_company/observability/events/checkpoint.py |
16 checkpoint/heartbeat/recovery event constants |
src/ai_company/observability/events/persistence.py |
Persistence-layer checkpoint/heartbeat event constants |
src/ai_company/observability/events/execution.py |
Execution-layer checkpoint callback and resume event constants |
tests/unit/engine/checkpoint/test_models.py |
Tests for Checkpoint, Heartbeat, CheckpointConfig models |
tests/unit/engine/checkpoint/test_callback_factory.py |
Tests for checkpoint callback factory |
tests/unit/engine/checkpoint/test_strategy.py |
Tests for CheckpointRecoveryStrategy |
tests/unit/engine/test_recovery_checkpoint_fields.py |
Tests for RecoveryResult checkpoint fields |
tests/unit/persistence/sqlite/test_checkpoint_repo.py |
Tests for SQLiteCheckpointRepository |
tests/unit/persistence/sqlite/test_heartbeat_repo.py |
Tests for SQLiteHeartbeatRepository |
tests/unit/persistence/sqlite/test_migrations_v6.py |
Tests for V6 migration |
tests/unit/persistence/test_protocol.py |
Protocol conformance tests for new repos |
tests/unit/persistence/test_migrations_v2.py |
Updated to use SCHEMA_VERSION constant instead of hardcoded 5 |
tests/unit/api/conftest.py |
Fake checkpoint/heartbeat repos for API tests |
tests/unit/observability/test_events.py |
Event constant discovery and existence tests |
docs/design/engine.md |
Updated recovery docs with checkpoint fields, removed "Planned" warning |
README.md |
Added checkpoint resume mention and Persistence→Engine dependency |
CLAUDE.md |
Updated engine description and event name examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| expected = { | ||
| "idx_cp_execution_id", | ||
| "idx_cp_task_id", | ||
| "idx_cp_exec_turn", | ||
| } | ||
| assert expected.issubset(indexes) |
Summary
AgentContextpersistence, heartbeat-based failure detection, resume from last checkpoint with reconciliation, max resume attempts with fallback to fail-and-reassignCheckpoint,Heartbeat,CheckpointConfigfrozen Pydantic models,CheckpointRecoveryStrategywith async-safe resume counter,CheckpointCallbacktype alias, andmake_checkpoint_callbackfactoryCheckpointRepositoryandHeartbeatRepositoryprotocols with SQLite implementations, V6 migration (checkpoints + heartbeats tables with indexes)ReactLoopandPlanExecuteLoop, resume path inAgentEnginewithcompletion_configandeffective_autonomythreadingevents/checkpoint,events/execution, andevents/persistenceCloses #201
Pre-PR Review
Pre-reviewed by 15 agents, 42 findings addressed in a dedicated fix commit:
Key fixes applied:
completion_config/effective_autonomyon resume path (security + correctness)_resume_countsTOCTOU race fixed withasyncio.Lock+ memory leak cleanupRecoveryResultfor checkpoint consistencydelete_by_executionNotBlankStrfor all identifier params,AwareDatetimefor heartbeat thresholdconfigproperty toPlanExecuteLoopTest plan
🤖 Generated with Claude Code