Skip to content

feat: implement checkpoint recovery strategy#366

Closed
Aureliolo wants to merge 4 commits into
mainfrom
feat/checkpoint-recovery
Closed

feat: implement checkpoint recovery strategy#366
Aureliolo wants to merge 4 commits into
mainfrom
feat/checkpoint-recovery

Conversation

@Aureliolo
Copy link
Copy Markdown
Owner

Summary

  • Implement checkpoint recovery strategy (§6.6 Strategy 2) — per-turn AgentContext persistence, heartbeat-based failure detection, resume from last checkpoint with reconciliation, max resume attempts with fallback to fail-and-reassign
  • Add Checkpoint, Heartbeat, CheckpointConfig frozen Pydantic models, CheckpointRecoveryStrategy with async-safe resume counter, CheckpointCallback type alias, and make_checkpoint_callback factory
  • Add CheckpointRepository and HeartbeatRepository protocols with SQLite implementations, V6 migration (checkpoints + heartbeats tables with indexes)
  • Integrate checkpoint callback into ReactLoop and PlanExecuteLoop, resume path in AgentEngine with completion_config and effective_autonomy threading
  • Add 16 checkpoint event constants across events/checkpoint, events/execution, and events/persistence

Closes #201

Pre-PR Review

Pre-reviewed by 15 agents, 42 findings addressed in a dedicated fix commit:

Agent Findings
code-reviewer 13
docs-consistency 6
silent-failure-hunter 5
type-design-analyzer 7
conventions-enforcer 5
security-reviewer 4
comment-analyzer 7
persistence-reviewer 3
async-concurrency 4
test-quality 5
pr-test-analyzer 4
python-reviewer 2
logging-audit 0 (clean)
resilience-audit 0 (clean)
issue-resolution 0 (all RESOLVED)

Key fixes applied:

  • Missing completion_config/effective_autonomy on resume path (security + correctness)
  • _resume_counts TOCTOU race fixed with asyncio.Lock + memory leak cleanup
  • Heartbeat skipped when checkpoint save fails (prevents limbo state)
  • Cross-field validator on RecoveryResult for checkpoint consistency
  • Checkpoint cleanup after successful resume via delete_by_execution
  • Paired-repo validation, unknown-loop-type warning, turn-zero guard
  • Sanitized error message in LLM reconciliation ChatMessage
  • NotBlankStr for all identifier params, AwareDatetime for heartbeat threshold
  • Extracted duplicate callback blocks, added public config property to PlanExecuteLoop
  • Updated CLAUDE.md, README.md, and docstrings for accuracy

Test plan

  • All 7573 tests pass (3 new tests added)
  • 94.43% coverage (80% minimum)
  • mypy strict: 0 errors (933 files)
  • ruff lint + format: clean
  • Pre-commit hooks: all pass
  • Verify checkpoint save/load roundtrip in integration environment
  • Verify resume from checkpoint produces correct agent behavior
  • Verify max_resume_attempts fallback to fail-and-reassign

🤖 Generated with Claude Code

Adds the Checkpoint Recovery strategy (Engine design §6.6, Strategy 2)
which persists AgentContext snapshots after each completed turn and
resumes from the last checkpoint on crash, preserving progress. After
max_resume_attempts failures, falls back to fail-and-reassign.

New modules:
- engine/checkpoint/ — models, callback factory, strategy
- persistence/sqlite/ — checkpoint and heartbeat repositories
- observability/events/checkpoint.py — lifecycle event constants

Key changes:
- RecoveryResult extended with checkpoint_context_json, resume_attempt,
  can_resume fields (backward-compatible defaults)
- ReactLoop and PlanExecuteLoop accept optional checkpoint_callback
- AgentEngine wires checkpoint repos, builds callback-enabled loops,
  and resumes from checkpoint on recoverable errors
- SQLite schema V6 adds checkpoints and heartbeats tables
- PersistenceBackend protocol gains checkpoints/heartbeats properties
- 100 new unit tests, full suite green at 94% coverage
Pre-reviewed by 15 agents, 42 findings addressed:

- Fix missing completion_config and effective_autonomy on resume path
- Add MemoryError/RecursionError re-raise in strategy checkpoint load
- Add asyncio.Lock to protect _resume_counts from TOCTOU race
- Clean up resume counts in fallback path to prevent memory leak
- Validate checkpoint_repo and heartbeat_repo are provided together
- Log warning for unsupported loop types in checkpoint injection
- Skip heartbeat update when checkpoint save fails (prevent limbo)
- Guard against turn-zero checkpoint via explicit check
- Sanitize error message in LLM reconciliation ChatMessage
- Replace assert with explicit RuntimeError for checkpoint_context_json
- Add cross-field validator for checkpoint_context_json/resume_attempt
- Clean up checkpoints after successful resume completion
- Thread completion_config and effective_autonomy through recovery chain
- Extract duplicate checkpoint callback blocks into helper method
- Add public config property to PlanExecuteLoop (remove SLF001 noqa)
- Use EXECUTION_CHECKPOINT_CALLBACK_FAILED event for error paths
- Add structured error info to callback exception logging
- Use NotBlankStr for repo params and callback_factory identifiers
- Move cursor.rowcount read before commit in repos
- Fix HeartbeatRepository.get_stale to accept AwareDatetime
- Drop DESC from composite index (portability for SQLite < 3.47)
- Document intentional FK absence in migrations
- Document heartbeat_interval_seconds as reserved for future use
- Update CLAUDE.md package structure and logging event docs
- Update PersistenceBackend and RecoveryResult docstrings
- Fix various docstring accuracy issues across modules
- Add FakePersistenceBackend checkpoint/heartbeat repos in test conftest
- Add protocol compliance assertions for CheckpointRepository/HeartbeatRepository
- Add test_checkpoint_events_exist assertions
- Fix _FakeCostRecordRepository.aggregate missing task_id (pre-existing)
- Verify checkpoint/heartbeat model content in callback factory tests
- Fix test_strategy.py return type annotation
Copilot AI review requested due to automatic review settings March 13, 2026 22:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 13, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 13, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d5e03b61-27af-4496-be42-1d688f76dfb4

📥 Commits

Reviewing files that changed from the base of the PR and between 6a8cfa7 and 838670e.

📒 Files selected for processing (6)
  • docs/design/engine.md
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/observability/events/checkpoint.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Checkpoint-based crash recovery: resume executions from saved checkpoints with per‑turn state persistence, heartbeat liveness, and loop integration via checkpoint callbacks.
    • SQLite persistence and migrations added for checkpoints and heartbeats.
  • Observability

    • New structured events for checkpoint, heartbeat, persistence, and resume lifecycles.
  • Documentation

    • Engine docs and README updated to describe checkpoint recovery and persistence interactions.
  • Tests

    • Comprehensive unit tests for checkpoint models, repos, strategy, callbacks, and migrations.

Walkthrough

Adds a checkpoint-based crash recovery subsystem: models (Checkpoint, Heartbeat, CheckpointConfig), repositories and sqlite implementations, migration to schema v6, a CheckpointRecoveryStrategy, per-turn checkpoint callbacks wired into execution loops and AgentEngine resume/cleanup flows, resume helpers, observability events, and comprehensive tests and docs updates.

Changes

Cohort / File(s) Summary
Checkpoint Domain & API
src/ai_company/engine/checkpoint/models.py, src/ai_company/engine/checkpoint/callback.py, src/ai_company/engine/checkpoint/__init__.py, src/ai_company/engine/__init__.py
Add immutable Pydantic models (Checkpoint, Heartbeat, CheckpointConfig), CheckpointCallback type alias, resume helpers re-exported, and engine-level re-exports to expose checkpoint API.
Callback Factory & Resume Helpers
src/ai_company/engine/checkpoint/callback_factory.py, src/ai_company/engine/checkpoint/resume.py
Factory to produce per-turn async checkpoint callbacks (save checkpoint + update heartbeat, best-effort error handling) and helpers to deserialize/reconcile, inject callbacks into loops, and cleanup after resume.
Recovery Strategy & RecoveryResult
src/ai_company/engine/checkpoint/strategy.py, src/ai_company/engine/recovery.py
New CheckpointRecoveryStrategy implementing resume-from-latest-checkpoint with per-execution attempt tracking and fallback delegation; RecoveryResult extended with checkpoint_context_json, resume_attempt, and computed can_resume plus validation.
Engine Integration
src/ai_company/engine/agent_engine.py, src/ai_company/engine/plan_execute_loop.py, src/ai_company/engine/react_loop.py
AgentEngine accepts checkpoint_repo/heartbeat_repo/config, wraps execution loops with checkpoint callbacks, adds resume flow (_resume_from_checkpoint, _execute_resumed_loop, _finalize_resume) and propagates completion_config/effective_autonomy; loops accept and invoke checkpoint_callback.
Persistence Protocol & In-Memory Fakes
src/ai_company/persistence/protocol.py, src/ai_company/persistence/repositories.py, tests/unit/api/conftest.py, tests/unit/persistence/test_protocol.py
Add CheckpointRepository and HeartbeatRepository protocols and PersistenceBackend accessors; test fakes and fake backend extended to expose checkpoint/heartbeat repos.
SQLite Repos & Backend
src/ai_company/persistence/sqlite/checkpoint_repo.py, src/ai_company/persistence/sqlite/heartbeat_repo.py, src/ai_company/persistence/sqlite/backend.py, src/ai_company/persistence/sqlite/__init__.py
SQLiteCheckpointRepository and SQLiteHeartbeatRepository implementations (save/get/get_latest/get_stale/delete/upsert), backend wiring to expose repos, and exports updated.
Migrations
src/ai_company/persistence/sqlite/migrations.py, tests/unit/persistence/test_migrations_v2.py
Schema version bumped to 6; add v6 migration DDL for checkpoints and heartbeats and indexes; update tests accordingly.
Observability Events
src/ai_company/observability/events/checkpoint.py, src/ai_company/observability/events/execution.py, src/ai_company/observability/events/persistence.py, tests/unit/observability/test_events.py
New structured event constants for checkpoint lifecycle, heartbeat lifecycle, loop integration, recovery lifecycle, and persistence-layer events; execution events added for checkpoint callback and resume lifecycle.
Tests: Models, Callback, Strategy, Persistence
tests/unit/engine/checkpoint/test_models.py, tests/unit/engine/checkpoint/test_callback_factory.py, tests/unit/engine/checkpoint/test_strategy.py, tests/unit/persistence/sqlite/test_checkpoint_repo.py, tests/unit/persistence/sqlite/test_heartbeat_repo.py, tests/unit/persistence/sqlite/test_migrations_v6.py, tests/unit/engine/test_recovery_checkpoint_fields.py
Add comprehensive unit tests covering models, callback behavior (boundary turns, heartbeat updates, error semantics), recovery strategy (resume, fallback, attempt tracking), sqlite repo behaviors, and migration v6.
Docs & README
CLAUDE.md, README.md, docs/design/engine.md
Document checkpoint recovery in engine design and README, add persistence→engine diagram edge, and update engine design text (RecoveryResult fields and config wording).

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Engine
participant Loop
participant Callback
participant CheckpointRepo
participant HeartbeatRepo
Engine->>Loop: start execution (checkpoint callback injected)
Loop->>Callback: invoke after turn (ctx, turn_number)
Callback->>CheckpointRepo: save(checkpoint)
CheckpointRepo-->>Callback: saved / error
Callback->>HeartbeatRepo: save(heartbeat)
HeartbeatRepo-->>Callback: saved / error
Callback-->>Loop: return
Note right of Engine: on crash, RecoveryStrategy loads latest checkpoint
Engine->>CheckpointRepo: get_latest(execution_id)
CheckpointRepo-->>Engine: checkpoint or None
Engine->>Engine: deserialize_and_reconcile -> restored ctx
Engine->>Loop: resume execution with restored ctx

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • #201: Implements the checkpoint recovery strategy (per-turn context persistence, heartbeat detection, resume-from-latest, max resume attempts and fallback) described in the issue.

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.61% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat: implement checkpoint recovery strategy' clearly and concisely summarizes the main implementation objective of this changeset.
Description check ✅ Passed The PR description comprehensively explains the checkpoint recovery implementation, lists specific changes, references the design spec, and documents the pre-review process and test coverage.
Linked Issues check ✅ Passed The PR fully implements all coding requirements from issue #201: per-turn persistence [checkpoint models, callback factory], heartbeat detection [Heartbeat model, repositories], resume flow [CheckpointRecoveryStrategy, agent engine changes], max attempts fallback [resume counter logic], and protocol compliance [RecoveryStrategy integration].
Out of Scope Changes check ✅ Passed All code changes directly support the checkpoint recovery implementation. Documentation updates (CLAUDE.md, README.md, engine.md) align with implementation. Test additions validate new functionality. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/checkpoint-recovery
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch feat/checkpoint-recovery
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's resilience by introducing a comprehensive checkpoint recovery mechanism. It allows agent executions to gracefully recover from unexpected interruptions by saving their state periodically and resuming from the last valid checkpoint. This ensures that long-running or critical agent tasks can withstand failures without losing significant progress, leading to more reliable and robust operations.

Highlights

  • Checkpoint Recovery Strategy: Implemented a robust checkpoint recovery strategy (Strategy 2 from §6.6) that enables per-turn AgentContext persistence, heartbeat-based failure detection, and the ability to resume execution from the last saved checkpoint with reconciliation. This includes a configurable maximum number of resume attempts before falling back to a fail-and-reassign strategy.
  • New Models and Components: Introduced new Pydantic models: Checkpoint (for state snapshots), Heartbeat (for liveness signals), and CheckpointConfig (for recovery behavior). Also added CheckpointRecoveryStrategy, a CheckpointCallback type alias, and a make_checkpoint_callback factory function.
  • Persistence Layer Integration: Integrated checkpointing into the persistence layer by adding CheckpointRepository and HeartbeatRepository protocols, along with their SQLite implementations. A V6 database migration was included to create the necessary checkpoints and heartbeats tables with appropriate indexes.
  • Execution Loop Integration: The checkpoint callback mechanism has been integrated into both ReactLoop and PlanExecuteLoop to ensure AgentContext snapshots are taken at regular intervals. The AgentEngine now includes a resume path that leverages completion_config and effective_autonomy threading for seamless recovery.
  • Observability Events: Added 16 new event constants across events/checkpoint, events/execution, and events/persistence to provide detailed logging and observability for the checkpointing and recovery process.
Changelog
  • CLAUDE.md
    • Updated the description for the engine/ module to reflect the addition of checkpoint recovery.
    • Added new checkpoint and persistence event constants to the list of recommended event names.
  • README.md
    • Added a dependency arrow from Persistence to Engine in the architecture graph.
  • docs/design/engine.md
    • Removed the 'Planned' warning for the Checkpoint Recovery Strategy 2, indicating its implementation.
  • src/ai_company/engine/init.py
    • Imported and exposed new checkpoint-related models, strategy, and factory function.
  • src/ai_company/engine/agent_engine.py
    • Imported checkpoint-related modules and new execution events.
    • Modified the AgentEngine constructor to accept checkpoint and heartbeat repositories, and a checkpoint configuration, with validation for paired repositories.
    • Updated _execute and _post_execution_pipeline methods to correctly pass effective_autonomy and completion_config.
    • Introduced _make_loop_with_callback to dynamically inject checkpoint callbacks into execution loops.
    • Updated _run_loop_with_timeout to utilize the new callback injection mechanism.
    • Enhanced _apply_recovery to incorporate the checkpoint resume path, delegating to _resume_from_checkpoint.
    • Added _resume_from_checkpoint to handle deserialization of agent context, injection of reconciliation messages, and re-entry into the execution loop, including cleanup of checkpoints upon successful completion.
  • src/ai_company/engine/checkpoint/init.py
    • Added the __init__.py file to define the public API for the new checkpoint module.
  • src/ai_company/engine/checkpoint/callback.py
    • Added CheckpointCallback as a type alias for an asynchronous callable used to persist checkpoints.
  • src/ai_company/engine/checkpoint/callback_factory.py
    • Added make_checkpoint_callback factory function to create a closure responsible for persisting AgentContext snapshots and updating heartbeats based on configuration, with robust error handling.
  • src/ai_company/engine/checkpoint/models.py
    • Added Pydantic models Checkpoint (for agent state snapshots), Heartbeat (for liveness signals), and CheckpointConfig (for configuring checkpointing behavior).
  • src/ai_company/engine/checkpoint/strategy.py
    • Added CheckpointRecoveryStrategy to manage loading checkpoints, tracking resume attempts, and falling back to a default strategy after a configured number of retries, using an asynchronous lock for concurrency safety.
  • src/ai_company/engine/plan_execute_loop.py
    • Imported new checkpoint-related events and the CheckpointCallback type.
    • Modified the PlanExecuteLoop constructor to accept an optional checkpoint_callback.
    • Added a public config property to the loop.
    • Integrated _invoke_checkpoint_callback into _call_planner and _run_step_turn to trigger checkpointing.
  • src/ai_company/engine/react_loop.py
    • Imported new checkpoint-related events and the CheckpointCallback type.
    • Modified the ReactLoop constructor to accept an optional checkpoint_callback.
    • Integrated the _checkpoint_callback into _process_turn_response for state persistence.
  • src/ai_company/engine/recovery.py
    • Imported Self and model_validator for Pydantic model enhancements.
    • Extended RecoveryResult with checkpoint_context_json and resume_attempt fields.
    • Added a model_validator to ensure consistency between checkpoint_context_json and resume_attempt.
    • Introduced a can_resume computed property to indicate if an execution can be resumed from a checkpoint.
  • src/ai_company/observability/events/checkpoint.py
    • Added a new file defining constants for checkpoint lifecycle and recovery flow events.
  • src/ai_company/observability/events/execution.py
    • Added new event constants related to checkpoint callbacks and execution resume states.
  • src/ai_company/observability/events/persistence.py
    • Added new event constants for persistence operations specific to checkpoints and heartbeats.
  • src/ai_company/persistence/protocol.py
    • Imported new checkpoint and heartbeat repository protocols.
    • Added checkpoints and heartbeats properties to the PersistenceBackend protocol.
  • src/ai_company/persistence/repositories.py
    • Imported Checkpoint and Heartbeat models.
    • Added CheckpointRepository and HeartbeatRepository protocols, defining their respective CRUD interfaces.
  • src/ai_company/persistence/sqlite/init.py
    • Imported and exposed SQLiteCheckpointRepository and SQLiteHeartbeatRepository.
  • src/ai_company/persistence/sqlite/backend.py
    • Imported SQLiteCheckpointRepository and SQLiteHeartbeatRepository.
    • Added _checkpoints and _heartbeats attributes to SQLitePersistenceBackend.
    • Modified state management methods (_clear_state, _create_repositories) to include the new repositories.
    • Added checkpoints and heartbeats properties to SQLitePersistenceBackend for access to the new repositories.
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
    • Added SQLiteCheckpointRepository to provide SQLite-based CRUD operations for Checkpoint objects.
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
    • Added SQLiteHeartbeatRepository to provide SQLite-based CRUD operations for Heartbeat objects, including functionality to retrieve stale heartbeats.
  • src/ai_company/persistence/sqlite/migrations.py
    • Updated SCHEMA_VERSION to 6.
    • Added _V6_STATEMENTS to create the checkpoints and heartbeats tables with necessary indexes.
    • Included _apply_v6 in the migration sequence.
  • tests/unit/api/conftest.py
    • Imported Checkpoint and Heartbeat models.
    • Added FakeCheckpointRepository and FakeHeartbeatRepository for in-memory testing.
    • Updated FakePersistenceBackend to incorporate the new fake repositories.
  • tests/unit/engine/checkpoint/test_callback_factory.py
    • Added unit tests for make_checkpoint_callback, verifying turn-based persistence, heartbeat updates, and error handling.
  • tests/unit/engine/checkpoint/test_models.py
    • Added unit tests for Checkpoint, Heartbeat, and CheckpointConfig Pydantic models, covering creation, immutability, and validation.
  • tests/unit/engine/checkpoint/test_strategy.py
    • Added unit tests for CheckpointRecoveryStrategy, validating resume functionality, fallback behavior, and resume attempt tracking.
  • tests/unit/engine/test_recovery_checkpoint_fields.py
    • Added unit tests for the new can_resume and resume_attempt fields in RecoveryResult, ensuring correct behavior and backward compatibility.
  • tests/unit/observability/test_events.py
    • Imported new checkpoint and heartbeat event constants.
    • Updated event discovery tests to include the new 'checkpoint' domain.
    • Added tests to verify the new checkpoint event constants.
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
    • Added unit tests for SQLiteCheckpointRepository, covering save, get latest, and delete operations.
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
    • Added unit tests for SQLiteHeartbeatRepository, covering save, get, get stale, and delete operations.
  • tests/unit/persistence/sqlite/test_migrations_v6.py
    • Added unit tests for the V6 database migration, confirming the creation of checkpoint and heartbeat tables and their indexes.
  • tests/unit/persistence/test_migrations_v2.py
    • Updated schema version assertion from 5 to 6.
  • tests/unit/persistence/test_protocol.py
    • Imported new checkpoint and heartbeat models and repositories.
    • Added _FakeCheckpointRepository and _FakeHeartbeatRepository to the fake backend implementation.
    • Added tests to confirm that the fake checkpoint and heartbeat repositories adhere to their respective protocols.
Activity
  • The pull request underwent a thorough pre-review process by 15 different agents, leading to 42 findings being addressed in a dedicated fix commit.
  • All 7573 existing tests passed, and 3 new tests were added to cover the new functionality.
  • Code coverage stands at 94.43%, exceeding the 80% minimum requirement.
  • The codebase is clean with 0 mypy strict errors and passes all ruff lint and format checks.
  • All pre-commit hooks passed successfully.
  • Remaining verification steps include: verifying checkpoint save/load roundtrip in an integration environment, confirming correct agent behavior upon resuming from a checkpoint, and validating the fallback mechanism when max_resume_attempts is reached.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the checkpoint recovery strategy (§6.6 Strategy 2) for agent crash recovery. It adds per-turn AgentContext persistence, heartbeat-based failure detection, resume from last checkpoint with reconciliation, and max resume attempts with fallback to fail-and-reassign.

Changes:

  • New engine/checkpoint/ package with Checkpoint, Heartbeat, CheckpointConfig models, CheckpointRecoveryStrategy, CheckpointCallback type alias, and make_checkpoint_callback factory
  • SQLite persistence layer additions: CheckpointRepository and HeartbeatRepository protocols with SQLite implementations, V6 migration for checkpoints and heartbeats tables
  • Integration of checkpoint callback into ReactLoop and PlanExecuteLoop, and resume path in AgentEngine with completion_config/effective_autonomy threading

Reviewed changes

Copilot reviewed 34 out of 35 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/ai_company/engine/checkpoint/__init__.py Package init re-exporting checkpoint types
src/ai_company/engine/checkpoint/callback.py CheckpointCallback type alias
src/ai_company/engine/checkpoint/callback_factory.py Factory closure for persisting checkpoints/heartbeats per turn
src/ai_company/engine/checkpoint/models.py Frozen Pydantic models: Checkpoint, Heartbeat, CheckpointConfig
src/ai_company/engine/checkpoint/strategy.py CheckpointRecoveryStrategy with async-safe resume counter
src/ai_company/engine/recovery.py New checkpoint_context_json, resume_attempt, can_resume fields on RecoveryResult
src/ai_company/engine/react_loop.py Checkpoint callback invocation after each turn
src/ai_company/engine/plan_execute_loop.py Checkpoint callback invocation via extracted _invoke_checkpoint_callback
src/ai_company/engine/agent_engine.py Resume path, loop-with-callback factory, paired-repo validation
src/ai_company/engine/__init__.py Re-exports for new checkpoint types
src/ai_company/persistence/repositories.py CheckpointRepository and HeartbeatRepository protocols
src/ai_company/persistence/protocol.py Added checkpoints and heartbeats properties to PersistenceBackend
src/ai_company/persistence/sqlite/checkpoint_repo.py SQLite CheckpointRepository implementation
src/ai_company/persistence/sqlite/heartbeat_repo.py SQLite HeartbeatRepository implementation
src/ai_company/persistence/sqlite/backend.py Wires new repos into SQLitePersistenceBackend
src/ai_company/persistence/sqlite/__init__.py Re-exports for new SQLite repo classes
src/ai_company/persistence/sqlite/migrations.py V6 migration: checkpoints and heartbeats tables with indexes
src/ai_company/observability/events/checkpoint.py 13 checkpoint/heartbeat/recovery event constants
src/ai_company/observability/events/execution.py 5 new checkpoint callback & resume event constants
src/ai_company/observability/events/persistence.py 16 new persistence-layer checkpoint/heartbeat event constants
docs/design/engine.md Removes "Planned" warning from checkpoint strategy docs
README.md Adds Persistence→Engine dependency edge
CLAUDE.md Updates engine description and event name list
tests/unit/engine/checkpoint/test_models.py Model creation, validation, and constraint tests
tests/unit/engine/checkpoint/test_callback_factory.py Callback factory boundary, heartbeat, and error handling tests
tests/unit/engine/checkpoint/test_strategy.py Strategy protocol, resume, fallback, and counter tests
tests/unit/engine/test_recovery_checkpoint_fields.py RecoveryResult checkpoint field tests
tests/unit/persistence/sqlite/test_migrations_v6.py V6 migration table/index/idempotency tests
tests/unit/persistence/sqlite/test_checkpoint_repo.py SQLite checkpoint repo CRUD tests
tests/unit/persistence/sqlite/test_heartbeat_repo.py SQLite heartbeat repo CRUD tests
tests/unit/persistence/test_protocol.py Fake repo protocol conformance tests
tests/unit/persistence/test_migrations_v2.py Updated schema version assertions to 6
tests/unit/observability/test_events.py New checkpoint event module discovery and value tests
tests/unit/api/conftest.py Fake repos for API test backend

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ai_company/engine/agent_engine.py Outdated
msg = "checkpoint_context_json is None but can_resume was True"
raise RuntimeError(msg)

execution_id = f"{agent_id}:{task_id}"
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is an excellent and comprehensive implementation of the checkpoint recovery strategy. The changes are well-structured across new modules, persistence layers, and the agent engine. The attention to detail is impressive, particularly in handling potential race conditions with asyncio.Lock for resume counts and preventing inconsistent states by linking heartbeat updates to successful checkpoint saves. The addition of thorough unit tests for the new models, repositories, strategy, and callback factory ensures the robustness of this critical feature.

I have one suggestion regarding minor code duplication for invoking the checkpoint callback, which could be refactored for better maintainability.

Comment on lines +209 to +220
if self._checkpoint_callback is not None:
try:
await self._checkpoint_callback(ctx)
except MemoryError, RecursionError:
raise
except Exception as exc:
logger.exception(
EXECUTION_CHECKPOINT_CALLBACK_FAILED,
execution_id=ctx.execution_id,
turn=turn_number,
error=f"{type(exc).__name__}: {exc}",
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This logic for invoking the checkpoint callback is duplicated from PlanExecuteLoop._invoke_checkpoint_callback. To improve maintainability and avoid code duplication, consider extracting this logic into a shared helper function, for example in src/ai_company/engine/loop_helpers.py.

As an intermediate step, you could extract this block into a private _invoke_checkpoint_callback method within this class, mirroring the implementation in PlanExecuteLoop.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 76.89243% with 116 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.50%. Comparing base (494013f) to head (838670e).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/ai_company/engine/agent_engine.py 31.91% 30 Missing and 2 partials ⚠️
src/ai_company/engine/checkpoint/resume.py 41.86% 22 Missing and 3 partials ⚠️
...c/ai_company/persistence/sqlite/checkpoint_repo.py 76.47% 16 Missing ⚠️
...rc/ai_company/persistence/sqlite/heartbeat_repo.py 70.90% 16 Missing ⚠️
src/ai_company/engine/plan_execute_loop.py 50.00% 7 Missing and 1 partial ⚠️
src/ai_company/engine/react_loop.py 22.22% 6 Missing and 1 partial ⚠️
src/ai_company/engine/recovery.py 76.92% 5 Missing and 1 partial ⚠️
src/ai_company/engine/checkpoint/strategy.py 94.59% 3 Missing and 1 partial ⚠️
src/ai_company/persistence/sqlite/backend.py 85.71% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #366      +/-   ##
==========================================
- Coverage   93.90%   93.50%   -0.41%     
==========================================
  Files         447      456       +9     
  Lines       20819    21313     +494     
  Branches     2011     2044      +33     
==========================================
+ Hits        19551    19929     +378     
- Misses        981     1088     +107     
- Partials      287      296       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR implements §6.6 Strategy 2 checkpoint recovery for the agent engine: per-turn AgentContext persistence to SQLite, heartbeat-based liveness tracking, and automatic resume from the last checkpoint on crash with a configurable retry limit and fallback to fail-and-reassign. It introduces a full new engine/checkpoint/ module (models, callback factory, resume helpers, and strategy), two new persistence protocols with SQLite implementations, a V6 schema migration, and 16 new observability event constants. The integration is clean — completion_config and effective_autonomy are correctly threaded through the resume path, the TOCTOU race on the resume counter is fixed with an asyncio.Lock, and heartbeat updates are skipped when checkpoint saves fail to prevent limbo state.

Key findings from this review:

  • Logic: _load_latest_checkpoint in CheckpointRecoveryStrategy catches only PersistenceError. Any unexpected exception from the repository (e.g. AttributeError, KeyError) will propagate out of recover(), bypassing the FailAndReassignStrategy fallback entirely and leaving the task in an unrecovered state. The stated intent is to return None on any failure and fall back gracefully.
  • Style: EXECUTION_CHECKPOINT_CALLBACK is defined in events/execution.py but never imported or used in any changed file — it is a dead constant.
  • Style: ReactLoop handles the checkpoint callback inline with a duplicated try/except block, while PlanExecuteLoop correctly extracts this into _invoke_checkpoint_callback. The duplication means future changes to error handling or logging must be applied in two places.
  • Design note: _resume_counts is in-memory only and resets on process restart. Since crash recovery is most relevant after a process crash, the max_resume_attempts limit is effectively a per-process-lifetime guard rather than a true cumulative limit. The code documents this as a known limitation, but the CheckpointConfig.max_resume_attempts field docstring does not yet capture this scope.

Confidence Score: 4/5

  • Safe to merge after addressing the narrow exception catch in _load_latest_checkpoint; all other findings are style/documentation concerns.
  • The PR is large (36 files) but thoroughly pre-reviewed (42 pre-PR findings addressed). The architecture is sound, the asyncio lock correctly prevents TOCTOU on resume counts, heartbeat/checkpoint cleanup is handled in both success and fallback paths, and the SQLite implementation is well-structured. The one logic concern — _load_latest_checkpoint catching only PersistenceError — could cause the fallback strategy to be silently bypassed on unexpected repository errors, but the risk is low given the SQLite repo wraps all errors into QueryError. The max_resume_attempts in-process limitation is documented but is a meaningful behavioral gap for the crash-recovery use case.
  • Pay close attention to src/ai_company/engine/checkpoint/strategy.py — specifically the _load_latest_checkpoint exception handling and the _resume_counts in-memory lifetime.

Important Files Changed

Filename Overview
src/ai_company/engine/checkpoint/strategy.py New CheckpointRecoveryStrategy with asyncio-safe resume counting. Two issues: _load_latest_checkpoint catches only PersistenceError (unexpected exceptions bypass fallback), and _resume_counts resets on process restart making max_resume_attempts ineffective for crash recovery scenarios.
src/ai_company/engine/agent_engine.py Integrates checkpoint resume path into execution pipeline; adds _resume_from_checkpoint, _execute_resumed_loop, and _finalize_resume. completion_config and effective_autonomy are correctly threaded through the resume path. No new issues found here.
src/ai_company/engine/checkpoint/callback_factory.py Factory produces a closure that persists checkpoints and heartbeats. Heartbeat is correctly skipped when checkpoint save fails. All exceptions are best-effort (logged, never propagated except MemoryError/RecursionError). Clean implementation.
src/ai_company/engine/checkpoint/resume.py Standalone helpers for context deserialization, reconciliation message injection, loop callback wiring, and post-resume cleanup. Correctly handles unsupported loop types with a warning. cleanup_after_resume is best-effort.
src/ai_company/engine/react_loop.py Adds optional checkpoint_callback to ReactLoop. Checkpoint is intentionally saved after LLM response but before tool execution. Inline invocation duplicates the same pattern extracted into _invoke_checkpoint_callback in PlanExecuteLoop.
src/ai_company/engine/plan_execute_loop.py Adds checkpoint_callback support and exposes a public config property. Checkpoint invocations are wrapped in a dedicated _invoke_checkpoint_callback method — consistent and clean.
src/ai_company/engine/recovery.py Extends RecoveryResult with checkpoint_context_json, resume_attempt, and a can_resume computed field. Cross-field validator enforces consistency between the two new fields. Clean addition.
src/ai_company/persistence/sqlite/checkpoint_repo.py SQLite implementation of CheckpointRepository. Dynamic WHERE clause is built from hardcoded column names only (noqa S608 justified). Upsert, query, and delete all commit correctly and raise QueryError on failure.
src/ai_company/persistence/sqlite/heartbeat_repo.py SQLite implementation of HeartbeatRepository. Timestamps are normalized to UTC ISO format before storage for reliable lexicographic comparison in get_stale(). All operations wrapped in best-effort error handling.
src/ai_company/persistence/repositories.py Adds CheckpointRepository and HeartbeatRepository protocol definitions. get_latest supports optional execution_id/task_id filters. Protocols are @runtime_checkable for structural typing.
src/ai_company/persistence/sqlite/migrations.py V6 migration adds checkpoints and heartbeats tables with appropriate indexes. The compound (execution_id, turn_number) index supports efficient ORDER BY turn_number DESC LIMIT 1 queries. No FKs to tasks (intentional — ephemeral recovery data).
src/ai_company/observability/events/execution.py Adds checkpoint callback and resume event constants. EXECUTION_CHECKPOINT_CALLBACK is defined but never used anywhere in the changed files.

Sequence Diagram

sequenceDiagram
    participant ExecLoop as ExecutionLoop
    participant CB as CheckpointCallback
    participant CPRepo as CheckpointRepository
    participant HBRepo as HeartbeatRepository
    participant Engine as AgentEngine
    participant Strategy as CheckpointRecoveryStrategy
    participant Fallback as FailAndReassignStrategy

    ExecLoop->>CB: invoke after each completed turn
    CB->>CB: skip if turn==0 or turn mod N != 0
    CB->>CPRepo: save(Checkpoint)
    alt save succeeded
        CB->>HBRepo: save(Heartbeat)
    else save failed
        CB->>CB: log warning, skip heartbeat
    end

    Note over ExecLoop,Engine: Crash or ERROR termination

    Engine->>Strategy: recover(task_execution, error_message, context)
    Strategy->>CPRepo: get_latest(execution_id)
    alt no checkpoint or attempts exhausted
        Strategy->>CPRepo: delete_by_execution
        Strategy->>HBRepo: delete
        Strategy->>Fallback: recover(...)
        Fallback-->>Engine: RecoveryResult can_resume=False
    else checkpoint found and attempt available
        Strategy->>Strategy: increment _resume_counts[execution_id]
        Strategy-->>Engine: RecoveryResult can_resume=True
    end

    alt can_resume is True
        Engine->>Engine: deserialize_and_reconcile checkpoint_json
        Engine->>Engine: inject SYSTEM reconciliation message
        Engine->>ExecLoop: execute(checkpoint_ctx)
        ExecLoop-->>Engine: ExecutionResult
        alt termination reason is not ERROR
            Engine->>Strategy: clear_resume_count(execution_id)
            Engine->>CPRepo: delete_by_execution
            Engine->>HBRepo: delete
        end
    end
Loading

Comments Outside Diff (2)

  1. src/ai_company/engine/checkpoint/strategy.py, line 191-200 (link)

    Non-PersistenceError exceptions bypass the fallback strategy

    _load_latest_checkpoint only catches PersistenceError (and re-raises MemoryError/RecursionError). Any unexpected exception from the repository that is not a PersistenceError — for example an AttributeError, KeyError, or a future aiosqlite.Error that slips past the SQLite repo's own handler — will propagate out of recover() uncaught, bypassing the FailAndReassignStrategy fallback entirely. The task would then fail with an unhandled exception rather than being gracefully recovered.

    The intent stated in the docstring is "returning None on failure", so all unexpected exceptions should be treated the same way:

  2. src/ai_company/engine/checkpoint/strategy.py, line 62-73 (link)

    max_resume_attempts is not enforced across process restarts

    The _resume_counts dict is in-memory only, and the code notes this explicitly. However, crash recovery is triggered by process crashes — precisely the events that reset this counter. This means max_resume_attempts acts as a per-process-lifetime limit rather than a true upper bound on resume attempts. An execution that crashes repeatedly across process restarts will receive up to max_resume_attempts fresh attempts on each restart, effectively making the limit unbounded in the failure scenario it was designed to protect against.

    The comment acknowledges this limitation and calls it a "planned improvement," but it is worth ensuring users of CheckpointRecoveryStrategy are aware that the current max_resume_attempts semantics differ from what the name implies. At minimum, the public docstring for CheckpointConfig.max_resume_attempts could be updated to clarify the in-process scope:

    max_resume_attempts: int = Field(
        default=2,
        ge=0,
        description=(
            "Max resume attempts before fallback (per process lifetime; "
            "resets on process restart — see CheckpointRecoveryStrategy "
            "note on _resume_counts)."
        ),
    )
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/ai_company/engine/checkpoint/strategy.py
Line: 191-200

Comment:
**Non-`PersistenceError` exceptions bypass the fallback strategy**

`_load_latest_checkpoint` only catches `PersistenceError` (and re-raises `MemoryError`/`RecursionError`). Any unexpected exception from the repository that is *not* a `PersistenceError` — for example an `AttributeError`, `KeyError`, or a future `aiosqlite.Error` that slips past the SQLite repo's own handler — will propagate out of `recover()` uncaught, bypassing the `FailAndReassignStrategy` fallback entirely. The task would then fail with an unhandled exception rather than being gracefully recovered.

The intent stated in the docstring is *"returning `None` on failure"*, so all unexpected exceptions should be treated the same way:

```suggestion
        try:
            checkpoint = await self._checkpoint_repo.get_latest(
                execution_id=execution_id,
            )
        except (MemoryError, RecursionError):
            raise
        except Exception:
            logger.exception(
                CHECKPOINT_LOAD_FAILED,
                execution_id=execution_id,
                task_id=task_id,
            )
            return None
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/ai_company/observability/events/execution.py
Line: 67

Comment:
**Unused event constant**

`EXECUTION_CHECKPOINT_CALLBACK` is defined here but is never imported or referenced in any of the changed files (neither `react_loop.py` nor `plan_execute_loop.py` use it — they both use `EXECUTION_CHECKPOINT_CALLBACK_FAILED`). This dead constant should either be used or removed.

```suggestion
```

(Remove the line — it adds noise to the events module and may confuse future maintainers looking for its usage.)

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/ai_company/engine/react_loop.py
Line: 206-221

Comment:
**Inline checkpoint invocation duplicates `_invoke_checkpoint_callback`**

`PlanExecuteLoop` extracts the checkpoint invocation pattern into a dedicated `_invoke_checkpoint_callback` method, but `ReactLoop` keeps the exact same logic inline. The two implementations are functionally identical (same guard, same exception handling, same event constant, same log fields). Having two copies means future changes (e.g. adding a metric counter or changing the log level) must be applied in both places.

Consider extracting the same method in `ReactLoop`:

```python
async def _invoke_checkpoint_callback(
    self,
    ctx: AgentContext,
    turn_number: int,
) -> None:
    if self._checkpoint_callback is None:
        return
    try:
        await self._checkpoint_callback(ctx)
    except (MemoryError, RecursionError):
        raise
    except Exception as exc:
        logger.exception(
            EXECUTION_CHECKPOINT_CALLBACK_FAILED,
            execution_id=ctx.execution_id,
            turn=turn_number,
            error=f"{type(exc).__name__}: {exc}",
        )
```

Then replace the inline block with `await self._invoke_checkpoint_callback(ctx, turn_number)`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/ai_company/engine/checkpoint/strategy.py
Line: 62-73

Comment:
**`max_resume_attempts` is not enforced across process restarts**

The `_resume_counts` dict is in-memory only, and the code notes this explicitly. However, crash recovery is triggered by process crashes — precisely the events that reset this counter. This means `max_resume_attempts` acts as a per-process-lifetime limit rather than a true upper bound on resume attempts. An execution that crashes repeatedly across process restarts will receive up to `max_resume_attempts` fresh attempts on each restart, effectively making the limit unbounded in the failure scenario it was designed to protect against.

The comment acknowledges this limitation and calls it a "planned improvement," but it is worth ensuring users of `CheckpointRecoveryStrategy` are aware that the current `max_resume_attempts` semantics differ from what the name implies. At minimum, the public docstring for `CheckpointConfig.max_resume_attempts` could be updated to clarify the in-process scope:

```python
max_resume_attempts: int = Field(
    default=2,
    ge=0,
    description=(
        "Max resume attempts before fallback (per process lifetime; "
        "resets on process restart — see CheckpointRecoveryStrategy "
        "note on _resume_counts)."
    ),
)
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 838670e

Comment thread src/ai_company/engine/agent_engine.py Outdated
Comment thread src/ai_company/engine/agent_engine.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/ai_company/engine/agent_engine.py (1)

395-421: ⚠️ Potential issue | 🟠 Major

Recovered/resumed execution can skip final cost recording.

record_execution_costs(...) runs before recovery. If Line 671 replaces the error result with a resumed result, the resumed portion’s final costs are not recorded in this pipeline pass.

Also applies to: 671-677

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 395 - 421, The current
pipeline calls record_execution_costs(...) before calling _apply_recovery(...)
which can replace an ERROR termination with a resumed/executed result and thus
skip final cost recording; modify the flow so that after
apply_post_execution_transitions(...) and after calling
self._apply_recovery(...) (when execution_result.termination_reason was ERROR
and a recovery/resume occurred) you call record_execution_costs(...) again on
the new execution_result (using the same identity, agent_id, task_id,
tracker=self._cost_tracker) to ensure final costs for the resumed execution are
recorded; use the symbols execution_result, TerminationReason.ERROR,
pre_recovery_ctx/pre_recovery_status to detect recovery and re-run
record_execution_costs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Line 154: The guide references a non-existent symbol EXECUTION_LOOP_STARTED;
update the text and import example to use the correct exported constant
EXECUTION_LOOP_START from ai_company.observability.events.execution (i.e.,
replace any occurrence of EXECUTION_LOOP_STARTED with EXECUTION_LOOP_START and
adjust the import example to `from ai_company.observability.events.execution
import EXECUTION_LOOP_START`) so the listed event constant matches the actual
export in src/ai_company/observability/events/execution.py.

In `@src/ai_company/engine/agent_engine.py`:
- Around line 696-704: The resume path in _resume_from_checkpoint currently runs
the agent loop directly and builds a budget checker ad hoc, bypassing the
engine's timeout and budget orchestration; change _resume_from_checkpoint to
reuse the existing execution orchestration by invoking _run_loop_with_timeout
(the same wrapper used for fresh runs) and call or delegate to _execute (or its
budget-checker factory) to construct the budget-enforcer-specific checker so
resumed executions are subject to the same timeout/budget logic as new runs;
ensure you pass recovery_result, agent_id, task_id, completion_config, and
effective_autonomy through the same parameters that
_run_loop_with_timeout/_execute expect and remove the ad-hoc loop and checker
creation in favor of those shared helpers (references: _resume_from_checkpoint,
_run_loop_with_timeout, _execute, budget checker creation).
- Around line 696-814: The _resume_from_checkpoint function is too large and
should be split into focused helpers: extract deserialization into a helper
load_checkpoint_from_json(recovery_result: RecoveryResult) that validates
checkpoint_context_json and returns AgentContext; extract the reconciliation
injection into build_reconciliation_message(checkpoint_ctx: AgentContext) and
apply_reconciliation(checkpoint_ctx, message) which returns the new context;
extract loop execution into run_resume_loop(agent_id, task_id, checkpoint_ctx,
completion_config, effective_autonomy) that constructs the loop, tool_invoker,
budget_checker and awaits loop.execute returning ExecutionResult; and extract
checkpoint cleanup into cleanup_checkpoints_after_success(execution_id) which
handles _checkpoint_repo.delete_by_execution and logs warnings. Refactor
_resume_from_checkpoint to call these helpers in sequence and keep its
try/except only around the minimal orchestration, preserving current logging,
MemoryError/RecursionError re-raises, and the resume-count-clear logic using
self._recovery_strategy and checkpoint_ctx.execution_id.
- Around line 721-723: Add a structured log entry (logger.error or
logger.warning) immediately before raising the RuntimeError in the resume
precondition check for recovery_result.checkpoint_context_json, including
agent_id, task_id, and resume_attempt to provide context; locate the check
around recovery_result.checkpoint_context_json in agent_engine.py (the block
that currently raises "checkpoint_context_json is None but can_resume was True")
and call the module/class logger (e.g., logger.error(..., extra={...}) or
logger.warning(..., extra={...})) with a clear message and the fields agent_id,
task_id and resume_attempt, then raise the same RuntimeError as before.
- Around line 803-813: The checkpoint cleanup is running unconditionally after
loop.execute in AgentEngine, which deletes checkpoints even when the returned
ExecutionResult has termination_reason == ERROR; modify the post-execution block
in agent_engine.py (the code that calls loop.execute and then uses
self._checkpoint_repo.delete_by_execution) to only attempt deletion when
result.termination_reason indicates success (e.g., not ERROR) — check
ExecutionResult.termination_reason against the ERROR constant before calling
delete_by_execution, and keep the existing exception handler that logs
EXECUTION_CHECKPOINT_CALLBACK_FAILED with execution_id and error details.

In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 64-171: The recover method is doing too many things; split it into
smaller functions: extract checkpoint-loading (the try/except around
self._checkpoint_repo.get_latest and logging) into a helper like
_load_latest_checkpoint(execution_id, task_id) that returns the checkpoint or
None, and extract the resume-attempt reservation and check (the async with
self._resume_lock block that reads/modifies self._resume_counts and logs
fallback) into a helper like _reserve_resume_attempt(execution_id, task_id) that
returns the current resume_attempt or raises/returns a sentinel to indicate
fallback; keep recover() to orchestrate calling _load_latest_checkpoint,
delegating to self._delegate_to_fallback when needed, calling
context.to_snapshot(), and assembling the RecoveryResult (using existing names
recovery fields: task_execution, strategy_type, context_snapshot, error_message,
checkpoint_context_json, resume_attempt); ensure all original logging
(CHECKPOINT_LOAD_FAILED, CHECKPOINT_RECOVERY_NO_CHECKPOINT,
CHECKPOINT_RECOVERY_FALLBACK, CHECKPOINT_RECOVERY_RESUME, CHECKPOINT_LOADED) and
exception passthrough for MemoryError/RecursionError remain in the new helpers.
- Around line 98-110: The current blanket "except Exception" around
self._checkpoint_repo.get_latest masks unexpected bugs; change it to catch only
the repository's persistence error(s) (e.g., CheckpointPersistenceError or the
repo-specific exception type) and log the exception details in the structured
logger call (use CHECKPOINT_LOAD_FAILED with execution_id, task_id, and the
caught exception info), leave the existing MemoryError and RecursionError
re-raises intact, and re-raise any other unexpected exceptions instead of
treating them as "no checkpoint".

In `@src/ai_company/persistence/sqlite/heartbeat_repo.py`:
- Around line 39-49: When persisting and when querying heartbeats, normalize the
timestamp to UTC so lexicographic string comparisons work: in the save path
where heartbeat.model_dump(...) is used (the method that calls self._db.execute
for INSERT OR REPLACE) convert heartbeat.last_heartbeat_at to an ISO UTC string
(e.g. .astimezone(datetime.timezone.utc).isoformat() or equivalent) before
dumping/parametrizing the data, and likewise in get_stale() normalize compared
timestamps to UTC before building the query parameters; update the places that
call self._db.execute/commit for heartbeats and the get_stale() query code to
always use that UTC-normalized ISO string.

In `@tests/unit/api/conftest.py`:
- Around line 347-360: The fake repository's get_latest method should enforce
the real repo contract by rejecting calls with both execution_id and task_id as
None; modify get_latest (in the test fake checkpoint repo) to raise a ValueError
when neither execution_id nor task_id is provided, then keep the existing
filtering on self._checkpoints and return None or the max by turn_number as
before when at least one filter is supplied.

In `@tests/unit/engine/checkpoint/test_callback_factory.py`:
- Around line 147-162: Add parametrized cases covering turn=0 to assert the
factory skips saving at turn zero; update the pytest.mark.parametrize tuple in
tests/unit/engine/checkpoint/test_callback_factory.py (the
("persist_every_n","turn","should_save") parameter list) to include entries like
(1, 0, False) and at least one other (e.g. (3, 0, False)) so the behavior
guarded by the condition in the callback factory (the check using turn == 0 or
turn % config.persist_every_n_turns != 0) is explicitly exercised.

In `@tests/unit/engine/test_recovery_checkpoint_fields.py`:
- Around line 43-50: Add a failure-case test that ensures
RecoveryResult._validate_checkpoint_consistency() rejects cases where only one
of checkpoint_context_json or resume_attempt is set: construct RecoveryResult
with checkpoint_context_json set but resume_attempt omitted (and vice versa) and
assert it raises the appropriate exception (e.g., ValueError) using
pytest.raises; add the same negative test alongside the existing positive cases
around the current blocks that create RecoveryResult (the spots that set
checkpoint_context_json and resume_attempt together) so the invariant is covered
in all three places.

In `@tests/unit/persistence/sqlite/test_migrations_v6.py`:
- Around line 18-21: The test duplicate asserts SCHEMA_VERSION == 6 in
TestSchemaVersion::test_schema_version_is_six; remove this duplicate test from
this file so the canonical check remains only in the existing test (the one in
the other test module), i.e., delete the TestSchemaVersion class or at least the
test_schema_version_is_six method referencing SCHEMA_VERSION to avoid redundant
assertions and maintenance overhead.

In `@tests/unit/persistence/test_migrations_v2.py`:
- Around line 31-32: The test test_schema_version_is_six should not hardcode the
literal 6; replace the equality assertion with a non-version-specific check
using the SCHEMA_VERSION symbol (or remove the exact-version assertion here and
move/keep exact-version checks in the newest migration test). Update both
occurrences (test_schema_version_is_six and the similar assertion at the other
location) to reference SCHEMA_VERSION or to assert behavior relative to the v2
baseline rather than == 6 so future migration bumps won’t force churn in this
v2-focused test.

---

Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 395-421: The current pipeline calls record_execution_costs(...)
before calling _apply_recovery(...) which can replace an ERROR termination with
a resumed/executed result and thus skip final cost recording; modify the flow so
that after apply_post_execution_transitions(...) and after calling
self._apply_recovery(...) (when execution_result.termination_reason was ERROR
and a recovery/resume occurred) you call record_execution_costs(...) again on
the new execution_result (using the same identity, agent_id, task_id,
tracker=self._cost_tracker) to ensure final costs for the resumed execution are
recorded; use the symbols execution_result, TerminationReason.ERROR,
pre_recovery_ctx/pre_recovery_status to detect recovery and re-run
record_execution_costs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0f947cd0-dfd7-489e-989f-6bed19b02c7f

📥 Commits

Reviewing files that changed from the base of the PR and between 494013f and 2d22e9f.

📒 Files selected for processing (35)
  • CLAUDE.md
  • README.md
  • docs/design/engine.md
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/observability/events/persistence.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/persistence/sqlite/__init__.py
  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/persistence/sqlite/migrations.py
  • tests/unit/api/conftest.py
  • tests/unit/engine/checkpoint/__init__.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/engine/checkpoint/test_models.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/observability/test_events.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • tests/unit/persistence/test_migrations_v2.py
  • tests/unit/persistence/test_protocol.py
💤 Files with no reviewable changes (1)
  • docs/design/engine.md
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Agent
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
  • GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax: except A, B: (no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001. Tests must use test-provider, test-small-001, etc.

Files:

  • tests/unit/engine/checkpoint/test_models.py
  • src/ai_company/observability/events/execution.py
  • tests/unit/persistence/test_migrations_v2.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/persistence/sqlite/__init__.py
  • src/ai_company/engine/checkpoint/__init__.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/react_loop.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/agent_engine.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • src/ai_company/persistence/protocol.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/persistence/sqlite/backend.py
  • tests/unit/api/conftest.py
  • src/ai_company/persistence/sqlite/migrations.py
  • tests/unit/observability/test_events.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/engine/checkpoint/strategy.py
  • tests/unit/persistence/test_protocol.py
  • src/ai_company/observability/events/persistence.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed.
Test timeout: 30 seconds per test.
Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/engine/checkpoint/test_models.py
  • tests/unit/persistence/test_migrations_v2.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • tests/unit/api/conftest.py
  • tests/unit/observability/test_events.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • tests/unit/persistence/test_protocol.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction plus MappingProxyType wrapping for read-only enforcement.
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with BaseModel, model_validator, computed_field, ConfigDict. Use @computed_field for derived values instead of storing redundant fields (e.g., TokenUsage.total_tokens).
Use NotBlankStr from core.types for all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER use import logging, logging.getLogger(), or print() in application code.
Always use logger as the variable name for loggers (not _logger, not log).
Event names must always use constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import dire...

Files:

  • src/ai_company/observability/events/execution.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/persistence/sqlite/__init__.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/migrations.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/observability/events/persistence.py
src/ai_company/**/[!_]*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)

Files:

  • src/ai_company/observability/events/execution.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/migrations.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/observability/events/persistence.py
🧠 Learnings (10)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/ai_company/observability/events/execution.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/react_loop.py
  • tests/unit/observability/test_events.py
  • src/ai_company/observability/events/persistence.py
  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/ai_company/**/[!_]*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • src/ai_company/engine/plan_execute_loop.py
  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Applied to files:

  • src/ai_company/engine/checkpoint/models.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Structured logging: always use `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : NEVER use `import logging`, `logging.getLogger()`, or `print()` in application code.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Always use `logger` as the variable name for loggers (not `_logger`, not `log`).

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : DEBUG logging for object creation, internal flow, and entry/exit of key functions.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (22)
tests/unit/engine/checkpoint/test_models.py (1)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-62)
  • CheckpointConfig (85-117)
  • Heartbeat (65-82)
src/ai_company/persistence/sqlite/__init__.py (2)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • SQLiteCheckpointRepository (27-174)
src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
  • SQLiteHeartbeatRepository (26-162)
src/ai_company/engine/checkpoint/__init__.py (3)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-125)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-62)
  • CheckpointConfig (85-117)
  • Heartbeat (65-82)
src/ai_company/engine/checkpoint/strategy.py (1)
  • CheckpointRecoveryStrategy (33-199)
tests/unit/engine/test_recovery_checkpoint_fields.py (1)
src/ai_company/engine/recovery.py (4)
  • RecoveryResult (30-102)
  • can_resume (100-102)
  • recover (114-133)
  • recover (152-206)
src/ai_company/engine/plan_execute_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • _checkpoint_callback (63-123)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/engine/checkpoint/models.py (2)
src/ai_company/tools/base.py (1)
  • description (138-140)
src/ai_company/engine/parallel_models.py (2)
  • agent_id (79-81)
  • task_id (87-89)
src/ai_company/engine/checkpoint/callback_factory.py (4)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-62)
  • CheckpointConfig (85-117)
  • Heartbeat (65-82)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/persistence/repositories.py (11)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
  • save (48-57)
  • save (114-123)
  • save (170-180)
  • save (207-216)
  • save (284-294)
  • save (335-344)
  • save (415-424)
  • save (487-496)
  • save (540-549)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
src/ai_company/engine/recovery.py (1)
src/ai_company/engine/task_execution.py (1)
  • TaskExecution (60-231)
src/ai_company/engine/react_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • _checkpoint_callback (63-123)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
tests/unit/persistence/sqlite/test_heartbeat_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
  • Heartbeat (65-82)
src/ai_company/persistence/sqlite/heartbeat_repo.py (5)
  • SQLiteHeartbeatRepository (26-162)
  • save (36-61)
  • get (63-88)
  • get_stale (90-121)
  • delete (123-145)
tests/unit/persistence/sqlite/conftest.py (1)
  • migrated_db (26-34)
src/ai_company/persistence/sqlite/checkpoint_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
  • Checkpoint (25-62)
src/ai_company/persistence/errors.py (1)
  • QueryError (33-34)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
  • save (36-61)
  • _row_to_model (147-162)
  • get (63-88)
src/ai_company/engine/__init__.py (3)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-62)
  • CheckpointConfig (85-117)
  • Heartbeat (65-82)
src/ai_company/engine/checkpoint/strategy.py (1)
  • CheckpointRecoveryStrategy (33-199)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-125)
tests/unit/engine/checkpoint/test_callback_factory.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-125)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (85-117)
src/ai_company/persistence/protocol.py (1)
src/ai_company/persistence/repositories.py (2)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
tests/unit/persistence/sqlite/test_migrations_v6.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
  • run_migrations (337-405)
tests/unit/persistence/test_migrations_v2.py (2)
  • test_schema_version_is_six (31-32)
  • memory_db (21-26)
src/ai_company/engine/checkpoint/callback.py (1)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
  • Heartbeat (65-82)
src/ai_company/persistence/errors.py (1)
  • QueryError (33-34)
src/ai_company/persistence/repositories.py (11)
  • get (59-71)
  • get (218-230)
  • get (346-358)
  • get (426-438)
  • get (551-563)
  • get_stale (565-581)
  • delete (95-107)
  • delete (260-272)
  • delete (396-408)
  • delete (468-480)
  • delete (583-595)
src/ai_company/persistence/sqlite/backend.py (5)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • SQLiteCheckpointRepository (27-174)
src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
  • SQLiteHeartbeatRepository (26-162)
src/ai_company/persistence/protocol.py (2)
  • checkpoints (146-148)
  • heartbeats (151-153)
tests/unit/api/conftest.py (2)
  • checkpoints (472-473)
  • heartbeats (476-477)
tests/unit/persistence/test_protocol.py (2)
  • checkpoints (309-310)
  • heartbeats (313-314)
tests/unit/api/conftest.py (4)
src/ai_company/engine/checkpoint/models.py (2)
  • Checkpoint (25-62)
  • Heartbeat (65-82)
src/ai_company/persistence/sqlite/checkpoint_repo.py (2)
  • get_latest (68-132)
  • delete_by_execution (134-157)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
  • get (63-88)
  • get_stale (90-121)
  • delete (123-145)
src/ai_company/persistence/protocol.py (2)
  • checkpoints (146-148)
  • heartbeats (151-153)
tests/unit/persistence/sqlite/test_checkpoint_repo.py (1)
src/ai_company/persistence/sqlite/checkpoint_repo.py (3)
  • save (37-66)
  • get_latest (68-132)
  • delete_by_execution (134-157)
src/ai_company/persistence/repositories.py (3)
src/ai_company/engine/checkpoint/models.py (2)
  • Checkpoint (25-62)
  • Heartbeat (65-82)
src/ai_company/persistence/sqlite/checkpoint_repo.py (2)
  • get_latest (68-132)
  • delete_by_execution (134-157)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
  • get (63-88)
  • get_stale (90-121)
  • delete (123-145)
src/ai_company/engine/checkpoint/strategy.py (6)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (85-117)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • to_snapshot (280-302)
src/ai_company/engine/recovery.py (4)
  • RecoveryResult (30-102)
  • RecoveryStrategy (106-137)
  • recover (114-133)
  • recover (152-206)
src/ai_company/engine/task_execution.py (1)
  • TaskExecution (60-231)
src/ai_company/persistence/repositories.py (7)
  • CheckpointRepository (484-533)
  • get_latest (498-519)
  • get (59-71)
  • get (218-230)
  • get (346-358)
  • get (426-438)
  • get (551-563)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • get_latest (68-132)
🪛 LanguageTool
CLAUDE.md

[style] ~154-~154: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...

(EG_NO_COMMA)

🔇 Additional comments (19)
tests/unit/observability/test_events.py (1)

11-15: LGTM!

The checkpoint event imports, domain module discovery extension, and new test assertions follow the established patterns in this test file. Event constants are correctly imported from the domain-specific module ai_company.observability.events.checkpoint.

Also applies to: 227-227, 254-258

src/ai_company/engine/react_loop.py (2)

70-74: LGTM!

The new __init__ method correctly accepts an optional checkpoint callback and stores it as an instance attribute. The type hint uses the forward reference pattern appropriately with TYPE_CHECKING.


209-220: LGTM! MemoryError correctly refers to the built-in.

The checkpoint callback invocation with error handling is correct:

  • Re-raises MemoryError (built-in, for out-of-memory conditions) and RecursionError (non-recoverable system errors)
  • Logs and swallows other exceptions with structured context
  • Uses PEP 758 except syntax correctly

The codebase defines a domain-specific ai_company.memory.errors.MemoryError class, but react_loop.py does not import it, so the bare MemoryError here unambiguously refers to the built-in Python exception.

tests/unit/persistence/sqlite/test_migrations_v6.py (1)

24-135: LGTM!

The V6 migration tests are comprehensive and well-structured:

  • Table creation verification for both checkpoints and heartbeats
  • Column schema validation with expected field sets
  • Index creation verification with appropriate naming patterns
  • Idempotency test ensuring migrations can run twice safely

All test classes are properly marked with @pytest.mark.unit and follow the established async test patterns.

src/ai_company/persistence/sqlite/__init__.py (1)

7-12: LGTM!

The new checkpoint and heartbeat repository exports are correctly added following the existing pattern. The __all__ list maintains alphabetical ordering.

Also applies to: 26-26, 28-28

tests/unit/engine/checkpoint/test_models.py (1)

1-268: LGTM!

Comprehensive test coverage for the checkpoint models:

  • Checkpoint: ID generation, immutability, JSON validation, turn number constraints, NotBlankStr validation
  • Heartbeat: Field assignment and immutability
  • CheckpointConfig: Default values, custom values, immutability, and field constraints

The use of @pytest.mark.parametrize for blank/whitespace field rejection (lines 156-170) follows best practices. All test classes are properly marked with @pytest.mark.unit.

src/ai_company/engine/checkpoint/__init__.py (1)

1-23: LGTM!

Clean module initializer with appropriate public API surface:

  • Exports all relevant checkpoint types: Checkpoint, CheckpointCallback, CheckpointConfig, CheckpointRecoveryStrategy, Heartbeat, make_checkpoint_callback
  • Module docstring accurately describes the checkpoint recovery functionality
  • __all__ is alphabetically ordered
tests/unit/persistence/sqlite/test_checkpoint_repo.py (1)

1-238: LGTM!

Comprehensive test coverage for SQLiteCheckpointRepository:

  • Save/retrieve roundtrip with field equality verification
  • Highest turn number selection when multiple checkpoints exist
  • Filtering by execution_id, task_id, and combined filters
  • Edge cases: no matching checkpoint (returns None), no filters (raises ValueError)
  • Upsert behavior verification
  • Deletion with count validation and isolation checks

The helper function _make_checkpoint uses keyword-only arguments for clarity, and all tests are properly marked with @pytest.mark.unit.

tests/unit/engine/checkpoint/test_callback_factory.py (1)

77-382: LGTM!

Comprehensive test coverage for the checkpoint callback factory:

  • Boundary turn detection with various persist_every_n_turns configurations
  • Heartbeat update behavior (on save vs. on skip)
  • Error handling: non-fatal errors swallowed, MemoryError/RecursionError propagated
  • Checkpoint failure correctly skips heartbeat to avoid inconsistent state

The tests use test-provider and test-small-001 as per coding guidelines, and all test classes are properly marked with @pytest.mark.unit.

src/ai_company/engine/checkpoint/callback_factory.py (1)

1-125: LGTM!

The callback factory is well-structured with proper best-effort error handling. The PEP 758 except A, B: syntax for Python 3.14+ is correctly applied, and the logic properly:

  • Skips turn 0 and non-boundary turns
  • Persists checkpoints before updating heartbeats (avoiding limbo states)
  • Propagates critical errors while swallowing recoverable ones
src/ai_company/observability/events/checkpoint.py (1)

1-25: LGTM!

Event constants are well-organized into logical groups (checkpoint lifecycle, heartbeat lifecycle, recovery flow) with consistent naming conventions.

tests/unit/engine/checkpoint/test_strategy.py (1)

1-438: LGTM!

Comprehensive test coverage for CheckpointRecoveryStrategy including:

  • Protocol compliance verification
  • Resume behavior with valid checkpoints
  • Fallback scenarios (no checkpoint, max attempts exhausted, repo errors)
  • Per-execution counter tracking and independence
  • Custom fallback strategy injection

Test structure follows guidelines with proper markers and timeout configuration.

src/ai_company/persistence/repositories.py (1)

483-595: LGTM!

The CheckpointRepository and HeartbeatRepository protocols are well-defined with:

  • Consistent method signatures matching existing repository patterns
  • Clear preconditions documented (e.g., get_latest requires at least one filter)
  • Appropriate return types (optional for single lookups, tuples for collections)
  • Proper error documentation
src/ai_company/engine/plan_execute_loop.py (2)

86-97: LGTM!

Clean integration of checkpoint callback into the loop constructor with appropriate optional typing and a useful config property for external access.


768-792: LGTM!

The _invoke_checkpoint_callback method correctly:

  • Guards against None callback
  • Propagates critical errors (MemoryError, RecursionError)
  • Logs and swallows other exceptions to prevent checkpoint failures from interrupting execution
  • Includes useful error context in the log
src/ai_company/engine/__init__.py (1)

31-38: LGTM!

Checkpoint-related entities are properly re-exported from the engine package, maintaining the established pattern for public API exposure.

src/ai_company/engine/recovery.py (1)

61-102: LGTM!

Well-designed extension to RecoveryResult:

  • checkpoint_context_json and resume_attempt fields with appropriate defaults
  • Consistency validator ensures both are set together or neither (prevents invalid states)
  • can_resume computed field provides a clean API for downstream consumers
src/ai_company/engine/checkpoint/models.py (1)

1-117: LGTM!

Well-designed checkpoint models following frozen Pydantic patterns:

  • Checkpoint: Proper validation of context_json as valid JSON, auto-generated ID, UTC timestamps
  • Heartbeat: Clean liveness tracking structure
  • CheckpointConfig: Sensible defaults with appropriate constraints (allows max_resume_attempts=0 for disabling resume)

All models correctly use NotBlankStr for identifiers and AwareDatetime for timestamps.

src/ai_company/engine/agent_engine.py (1)

174-183: Good guard for paired checkpoint dependencies.

The constructor validation at Line 174 prevents partial checkpoint wiring and fails fast with a clear message.

Comment thread CLAUDE.md Outdated
Comment on lines +696 to +704
async def _resume_from_checkpoint(
self,
recovery_result: RecoveryResult,
agent_id: str,
task_id: str,
*,
completion_config: CompletionConfig | None = None,
effective_autonomy: EffectiveAutonomy | None = None,
) -> ExecutionResult:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Resume path bypasses engine timeout/budget orchestration.

At Line 758, resume executes the loop directly instead of going through _run_loop_with_timeout, so resumed executions are no longer governed by the same timeout control path. Also, the budget checker is built ad hoc instead of reusing the _execute path (including budget-enforcer-specific checker creation).

Suggested direction
- result = await loop.execute(
-     context=checkpoint_ctx,
-     provider=self._provider,
-     ...
- )
+ result = await self._run_loop_with_timeout(
+     ctx=checkpoint_ctx,
+     agent_id=agent_id,
+     task_id=task_id,
+     completion_config=completion_config,
+     budget_checker=budget_checker,  # built using same policy as _execute
+     tool_invoker=...,
+     start=resume_start,
+     timeout_seconds=timeout_seconds,
+ )

Also applies to: 756-773

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 696 - 704, The resume
path in _resume_from_checkpoint currently runs the agent loop directly and
builds a budget checker ad hoc, bypassing the engine's timeout and budget
orchestration; change _resume_from_checkpoint to reuse the existing execution
orchestration by invoking _run_loop_with_timeout (the same wrapper used for
fresh runs) and call or delegate to _execute (or its budget-checker factory) to
construct the budget-enforcer-specific checker so resumed executions are subject
to the same timeout/budget logic as new runs; ensure you pass recovery_result,
agent_id, task_id, completion_config, and effective_autonomy through the same
parameters that _run_loop_with_timeout/_execute expect and remove the ad-hoc
loop and checker creation in favor of those shared helpers (references:
_resume_from_checkpoint, _run_loop_with_timeout, _execute, budget checker
creation).

Comment thread src/ai_company/engine/agent_engine.py
Comment thread src/ai_company/engine/agent_engine.py
Comment thread src/ai_company/engine/agent_engine.py Outdated
Comment thread tests/unit/api/conftest.py
Comment thread tests/unit/engine/checkpoint/test_callback_factory.py
Comment thread tests/unit/engine/test_recovery_checkpoint_fields.py
Comment thread tests/unit/persistence/sqlite/test_migrations_v6.py
Comment thread tests/unit/persistence/test_migrations_v2.py Outdated
Critical fixes:
- execution_id mismatch: use checkpoint_ctx.execution_id for cleanup
  instead of fabricated f"{agent_id}:{task_id}" (Copilot, Greptile,
  code-reviewer, python-reviewer, silent-failure, security)
- PEP 758 except syntax in models.py
- Extract _resume_from_checkpoint (118→35 lines) into helpers
- Extract recover() (108→30 lines) into smaller methods
- Move _make_loop_with_callback to checkpoint/resume.py
- Create engine/checkpoint/resume.py for resume orchestration helpers

Major fixes:
- Include actual error_message in reconciliation (not hardcoded text)
- Move _delegate_to_fallback outside lock scope in strategy
- Clean up heartbeat after successful resume (Greptile)
- Conditional checkpoint cleanup — skip on ERROR (CodeRabbit)
- Fix EXECUTION_LOOP_STARTED → EXECUTION_LOOP_START in CLAUDE.md
- Fix PARALLEL_EXECUTION_STARTED → PARALLEL_GROUP_START in CLAUDE.md
- Add 3 missing checkpoint fields to RecoveryResult table in design spec
- Remove non-existent storage field from YAML config example
- Update reconciliation description to match implementation
- Make clear_resume_count async with lock for thread safety
- Normalize heartbeat timestamps to UTC for correct comparisons
- Narrow except Exception to PersistenceError in strategy
- Split make_checkpoint_callback (92→3 functions under 50 lines)

Medium fixes:
- Use CHECKPOINT_DELETE_FAILED event constant for cleanup warnings
- Add exc_info=True on cleanup warning logs
- Log before raising RuntimeError on None checkpoint_context_json
- Type clear_resume_count parameter as NotBlankStr
- Add JSON validation to RecoveryResult.checkpoint_context_json
- Bound _resume_counts dict with _MAX_TRACKED_EXECUTIONS eviction
- Update RecoveryStrategy protocol docstring for checkpoint recovery
- Add checkpoint recovery to README capabilities
- Update ReactLoop checkpoint_callback docstring
- FakeCheckpointRepository.get_latest validates no-filter ValueError
- Add test for RecoveryResult._validate_checkpoint_consistency
- Add CHECKPOINT_UNSUPPORTED_LOOP event constant

Minor fixes:
- Fix checkpoint __init__.py docstring (configurable intervals)
- Fix callback.py type alias docstring (may skip)
- Fix callback_factory docstring (turn-0 skip mention)
- FakeHeartbeatRepository.get_stale sorts by timestamp
- Add turn=0 parametrize cases to test_callback_factory
- Tighten context_json validator to reject arrays/primitives
- Add composite (task_id, turn_number) index
- Add CHECK (turn_number >= 0) constraint to DDL
- Remove duplicate test_schema_version from test_migrations_v2
- Use SCHEMA_VERSION constant in test_migrations_v2
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/ai_company/engine/react_loop.py (1)

209-240: ⚠️ Potential issue | 🔴 Critical

Checkpoint is written before tool side effects are finalized.

At Line 209, the callback runs before tool execution. If a crash happens after a tool mutates external state but before the next checkpoint, resume can replay that tool path and duplicate side effects.

💡 Safer checkpoint boundary (after tool execution for tool-call turns)
-        if self._checkpoint_callback is not None:
-            try:
-                await self._checkpoint_callback(ctx)
-            except MemoryError, RecursionError:
-                raise
-            except Exception as exc:
-                logger.exception(
-                    EXECUTION_CHECKPOINT_CALLBACK_FAILED,
-                    execution_id=ctx.execution_id,
-                    turn=turn_number,
-                    error=f"{type(exc).__name__}: {exc}",
-                )
-
         if not response.tool_calls:
+            await self._run_checkpoint_callback(ctx, turn_number)
             return self._handle_completion(ctx, response, turns)
@@
-        return await execute_tool_calls(
+        post_tool = await execute_tool_calls(
             ctx,
             tool_invoker,
             response,
             turn_number,
             turns,
         )
+        if isinstance(post_tool, ExecutionResult):
+            return post_tool
+        await self._run_checkpoint_callback(post_tool, turn_number)
+        return post_tool
+
+    async def _run_checkpoint_callback(
+        self,
+        ctx: AgentContext,
+        turn_number: int,
+    ) -> None:
+        if self._checkpoint_callback is None:
+            return
+        try:
+            await self._checkpoint_callback(ctx)
+        except MemoryError, RecursionError:
+            raise
+        except Exception as exc:
+            logger.exception(
+                EXECUTION_CHECKPOINT_CALLBACK_FAILED,
+                execution_id=ctx.execution_id,
+                turn=turn_number,
+                error=f"{type(exc).__name__}: {exc}",
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/react_loop.py` around lines 209 - 240, The checkpoint
callback is currently invoked before tool execution, which risks checkpointing
prior to tool side-effects; change the flow so _checkpoint_callback is invoked
after tool execution for turns with tool calls: keep the existing call before
_handle_completion when response.tool_calls is empty, but remove the pre-tool
invocation and instead, after execute_tool_calls(...) returns (and after
check_shutdown handling), run await self._checkpoint_callback(ctx) with the same
exception handling (re-raise MemoryError/RecursionError, log other exceptions
using EXECUTION_CHECKPOINT_CALLBACK_FAILED with execution_id and turn_number).
Ensure you call it only once per turn and reference the existing symbols
_checkpoint_callback, execute_tool_calls, check_shutdown,
clear_last_turn_tool_calls, and _handle_completion to locate where to move the
invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/design/engine.md`:
- Around line 575-579: The earlier bullet about reconciliation currently implies
environment-state reconciliation is implemented; update that bullet in
docs/design/engine.md so it matches the later statement that only a simple
reconciliation message (agent receives system message with resume turn number
and error) is implemented now and that richer reconciliation
(workspace/environment-state detection) is planned future work—replace wording
that suggests full environment-state reconciliation with language explicitly
stating it is not yet implemented and refers readers to the "reconciliation
message" / "resume from a checkpoint" behavior described later.

In `@src/ai_company/engine/agent_engine.py`:
- Around line 741-767: The resumed-execution path in _execute_resumed_loop
builds a budget checker via make_budget_checker(...) directly, which bypasses
any configured BudgetEnforcer logic used in the normal _execute path; change
_execute_resumed_loop to construct the budget checker the same way as _execute
by preferring self._budget_enforcer.make_budget_checker(...) when
self._budget_enforcer is present (fall back to make_budget_checker(...) only if
no enforcer is set), ensuring the same per-agent allocation, downgrade policies,
and in-flight checks are applied for resumed runs.

In `@src/ai_company/engine/checkpoint/resume.py`:
- Around line 157-162: The heartbeat cleanup failure in resume.py is logged
using CHECKPOINT_DELETE_FAILED which mixes event types; change the
logger.warning call inside the heartbeat-delete branch to use a
heartbeat-specific event constant (e.g., HEARTBEAT_DELETE_FAILED or
CHECKPOINT_HEARTBEAT_DELETE_FAILED) instead of CHECKPOINT_DELETE_FAILED, and if
that constant doesn't exist add it to the events/constants module and import it
into src/ai_company/engine/checkpoint/resume.py so the logger call references
the correct heartbeat failure event (preserve the same message, execution_id,
error, and exc_info parameters).
- Line 59: Wrap the call to AgentContext.model_validate_json(checkpoint_json) in
a try/except that catches the validation/deserialization exception, log a
warning or error with context (e.g., the checkpoint identifier or a short
preview of checkpoint_json and the exception) using the module's logger, then
re-raise the exception; specifically update the code around
AgentContext.model_validate_json(checkpoint_json) to emit a process/logger.error
or logger.warning with the exception details before raising.

In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 191-218: The eviction in _reserve_resume_attempt relies on dict
insertion order via self._resume_counts and uses next(iter(...)) which yields
FIFO by insertion time, not least-recently-used; to adopt true LRU behavior
replace the dict with collections.OrderedDict (update its declaration where
_resume_counts is created), call move_to_end(execution_id) whenever you access
or increment an entry in _reserve_resume_attempt (i.e., after reading or
updating resume_count) and perform eviction with popitem(last=False) instead of
next(iter(...)) so the least-recently-used key is removed when
len(self._resume_counts) > _MAX_TRACKED_EXECUTIONS; keep the existing logging
and pop(execution_id, None) when exhausting attempts.

---

Outside diff comments:
In `@src/ai_company/engine/react_loop.py`:
- Around line 209-240: The checkpoint callback is currently invoked before tool
execution, which risks checkpointing prior to tool side-effects; change the flow
so _checkpoint_callback is invoked after tool execution for turns with tool
calls: keep the existing call before _handle_completion when response.tool_calls
is empty, but remove the pre-tool invocation and instead, after
execute_tool_calls(...) returns (and after check_shutdown handling), run await
self._checkpoint_callback(ctx) with the same exception handling (re-raise
MemoryError/RecursionError, log other exceptions using
EXECUTION_CHECKPOINT_CALLBACK_FAILED with execution_id and turn_number). Ensure
you call it only once per turn and reference the existing symbols
_checkpoint_callback, execute_tool_calls, check_shutdown,
clear_last_turn_tool_calls, and _handle_completion to locate where to move the
invocation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b9cf14e2-89ad-4ffd-8ca0-83cd315999f4

📥 Commits

Reviewing files that changed from the base of the PR and between 2d22e9f and 6a8cfa7.

📒 Files selected for processing (21)
  • CLAUDE.md
  • README.md
  • docs/design/engine.md
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/persistence/sqlite/migrations.py
  • tests/unit/api/conftest.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/engine/checkpoint/test_models.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/persistence/test_migrations_v2.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
  • GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (6)
docs/design/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

When approved deviations occur, update the relevant docs/design/ page to reflect the new reality.

Files:

  • docs/design/engine.md
docs/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Documentation source in docs/ built with Zensical. Design spec in docs/design/ (7 pages). Architecture in docs/architecture/. Roadmap in docs/roadmap/. Security in docs/security.md.

Files:

  • docs/design/engine.md
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax: except A, B: (no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001. Tests must use test-provider, test-small-001, etc.

Files:

  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/persistence/test_migrations_v2.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/callback.py
  • tests/unit/api/conftest.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/recovery.py
  • tests/unit/engine/checkpoint/test_models.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/persistence/sqlite/migrations.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/agent_engine.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • src/ai_company/observability/events/checkpoint.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed.
Test timeout: 30 seconds per test.
Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/persistence/test_migrations_v2.py
  • tests/unit/api/conftest.py
  • tests/unit/engine/checkpoint/test_models.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/engine/checkpoint/test_strategy.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction plus MappingProxyType wrapping for read-only enforcement.
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with BaseModel, model_validator, computed_field, ConfigDict. Use @computed_field for derived values instead of storing redundant fields (e.g., TokenUsage.total_tokens).
Use NotBlankStr from core.types for all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER use import logging, logging.getLogger(), or print() in application code.
Always use logger as the variable name for loggers (not _logger, not log).
Event names must always use constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import dire...

Files:

  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/persistence/sqlite/migrations.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/observability/events/checkpoint.py
src/ai_company/**/[!_]*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)

Files:

  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/persistence/sqlite/migrations.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/observability/events/checkpoint.py
🧠 Learnings (11)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow exceptions.

Applied to files:

  • src/ai_company/engine/checkpoint/strategy.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/ai_company/engine/react_loop.py
  • CLAUDE.md
  • src/ai_company/observability/events/checkpoint.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.

Applied to files:

  • src/ai_company/engine/checkpoint/models.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • src/ai_company/engine/agent_engine.py
  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/ai_company/**/[!_]*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Structured logging: always use `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : NEVER use `import logging`, `logging.getLogger()`, or `print()` in application code.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Always use `logger` as the variable name for loggers (not `_logger`, not `log`).

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : DEBUG logging for object creation, internal flow, and entry/exit of key functions.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (12)
tests/unit/engine/checkpoint/test_callback_factory.py (4)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (88-120)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • from_identity (140-171)
src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
  • save (37-67)
tests/unit/persistence/test_migrations_v2.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
  • get_user_version (261-265)
tests/unit/persistence/sqlite/conftest.py (1)
  • memory_db (15-22)
src/ai_company/engine/checkpoint/resume.py (6)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • with_message (173-182)
src/ai_company/providers/enums.py (1)
  • MessageRole (6-12)
src/ai_company/providers/models.py (1)
  • ChatMessage (138-210)
src/ai_company/engine/loop_protocol.py (1)
  • ExecutionLoop (151-189)
src/ai_company/persistence/repositories.py (2)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
src/ai_company/engine/checkpoint/callback.py (1)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
tests/unit/api/conftest.py (2)
src/ai_company/engine/checkpoint/models.py (2)
  • Checkpoint (25-65)
  • Heartbeat (68-85)
src/ai_company/persistence/protocol.py (2)
  • checkpoints (146-148)
  • heartbeats (151-153)
src/ai_company/engine/checkpoint/callback_factory.py (4)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-65)
  • CheckpointConfig (88-120)
  • Heartbeat (68-85)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/persistence/repositories.py (5)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
  • save (48-57)
  • save (114-123)
  • save (170-180)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
src/ai_company/engine/recovery.py (1)
src/ai_company/engine/task_execution.py (1)
  • TaskExecution (60-231)
tests/unit/engine/checkpoint/test_models.py (1)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-65)
  • CheckpointConfig (88-120)
  • Heartbeat (68-85)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
  • Heartbeat (68-85)
src/ai_company/persistence/errors.py (1)
  • QueryError (33-34)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • _row_to_model (159-174)
src/ai_company/engine/react_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • _checkpoint_callback (63-76)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
src/ai_company/persistence/sqlite/migrations.py (2)
tests/unit/persistence/sqlite/test_user_repo.py (1)
  • db (22-28)
tests/unit/hr/test_persistence.py (1)
  • db (29-35)
src/ai_company/engine/agent_engine.py (6)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (88-120)
src/ai_company/engine/checkpoint/resume.py (2)
  • cleanup_after_resume (119-162)
  • deserialize_and_reconcile (37-76)
src/ai_company/engine/checkpoint/strategy.py (1)
  • CheckpointRecoveryStrategy (41-264)
src/ai_company/engine/recovery.py (4)
  • FailAndReassignStrategy (150-220)
  • RecoveryResult (31-112)
  • RecoveryStrategy (116-147)
  • can_resume (110-112)
src/ai_company/persistence/repositories.py (2)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
src/ai_company/engine/plan_execute_loop.py (1)
  • execute (103-181)
🪛 LanguageTool
CLAUDE.md

[style] ~154-~154: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...

(EG_NO_COMMA)

🔇 Additional comments (31)
CLAUDE.md (2)

107-107: LGTM!

The engine description accurately reflects the new checkpoint recovery feature with per-turn persistence, heartbeat detection, and the CheckpointRecoveryStrategy.


154-154: LGTM!

The event names list is comprehensive and correctly documents the new checkpoint-related event constants (CHECKPOINT_SAVED, PERSISTENCE_CHECKPOINT_SAVED, etc.) with the proper import pattern.

src/ai_company/persistence/sqlite/migrations.py (2)

224-256: LGTM!

The V6 migration DDL is well-structured:

  • checkpoints table with appropriate constraints (CHECK (turn_number >= 0)) and indexes for efficient lookups by execution_id and task_id
  • heartbeats table with execution_id as primary key and index on last_heartbeat_at for stale detection queries
  • Good documentation explaining the ascending index choice for SQLite portability and the rationale for no FK constraints on ephemeral recovery data

320-334: LGTM!

The _apply_v6 function follows the established migration pattern and is correctly registered in _MIGRATIONS.

src/ai_company/engine/checkpoint/models.py (2)

25-65: LGTM!

The Checkpoint model is well-designed:

  • Frozen for immutability
  • NotBlankStr for identifier fields
  • AwareDatetime ensures timezone-aware timestamps
  • The _validate_context_json validator correctly ensures the serialized context is valid JSON and specifically a JSON object (not primitive or array)

68-120: LGTM!

Both Heartbeat and CheckpointConfig models follow the established patterns:

  • Frozen configuration with appropriate field constraints
  • Clear documentation of reserved fields (heartbeat_interval_seconds for future background loop)
  • Sensible defaults (persist every turn, 30s heartbeat interval, 2 max resume attempts)
tests/unit/api/conftest.py (2)

338-395: LGTM!

The fake repository implementations are well-crafted:

  • FakeCheckpointRepository.get_latest correctly enforces the contract requiring at least one of execution_id or task_id (Lines 353-355)
  • get_stale sorts by last_heartbeat_at for consistent ordering
  • Both repositories provide complete method coverage matching the protocol expectations

411-412: LGTM!

The FakePersistenceBackend is properly extended with checkpoint and heartbeat repository instances and property accessors, aligning with the PersistenceBackend protocol.

Also applies to: 476-482

src/ai_company/engine/checkpoint/strategy.py (4)

41-71: LGTM!

The CheckpointRecoveryStrategy initialization is clean:

  • Proper dependency injection for checkpoint_repo, config, and optional fallback
  • Default fallback to FailAndReassignStrategy is sensible
  • asyncio.Lock for thread-safe resume count management

72-132: LGTM!

The recover() method is well-structured after refactoring:

  • Clear flow: load checkpoint → check if exists → reserve attempt → build result or delegate
  • Proper logging at recovery start
  • Helper methods handle the complexity, keeping this orchestrator readable

155-189: LGTM!

The _load_latest_checkpoint helper properly:

  • Re-raises MemoryError, RecursionError (PEP 758 syntax)
  • Catches PersistenceError specifically and logs before returning None
  • Logs at appropriate levels (exception for errors, info for no checkpoint, debug for success)

220-250: LGTM!

The _build_resume_result helper correctly assembles the RecoveryResult with checkpoint context and logs the resume attempt details.

src/ai_company/engine/checkpoint/__init__.py (1)

1-29: LGTM!

The package __init__.py cleanly exposes the public API for checkpoint recovery with a clear docstring and explicit __all__ definition.

README.md (2)

34-34: LGTM!

Documentation accurately reflects the new crash recovery capability in the Agent Orchestration section.


114-114: LGTM!

The architecture diagram correctly adds the Persistence -.-> Engine edge to reflect the new checkpoint and heartbeat persistence relationship.

tests/unit/persistence/test_migrations_v2.py (1)

58-58: LGTM!

Good improvement — using SCHEMA_VERSION instead of hardcoded values makes this test resilient to future migration bumps while still validating that migrations run successfully.

src/ai_company/engine/checkpoint/callback.py (1)

13-14: Callback type contract is clean and consistent.

The alias is clear and matches the async callback shape used by loop integrations.

tests/unit/engine/test_recovery_checkpoint_fields.py (1)

97-127: Checkpoint consistency negative paths are now properly pinned.

Good addition of both failure cases (json without attempt, attempt without json) to guard the invariant.

tests/unit/engine/checkpoint/test_models.py (1)

106-137: JSON validation test coverage is solid.

Good coverage for valid object JSON plus invalid array/primitive/invalid-string paths.

tests/unit/engine/checkpoint/test_callback_factory.py (1)

147-163: Turn-0 boundary coverage is correctly added.

The parametrized matrix now explicitly validates the “turn 0 always skipped” guard.

tests/unit/engine/checkpoint/test_strategy.py (1)

191-255: Max-attempt fallback behavior is well exercised.

The sequence tests for attempts 1/2 then fallback (and the zero-attempt edge) are strong and clear.

src/ai_company/persistence/sqlite/heartbeat_repo.py (2)

37-67: LGTM! UTC normalization properly implemented.

The UTC normalization at lines 43-45 correctly addresses the lexicographic comparison issue for timestamps with different timezone offsets. The save method properly converts last_heartbeat_at to UTC before persisting, ensuring consistent string ordering in SQLite queries.


96-127: LGTM! Threshold normalization ensures consistent stale detection.

The get_stale method correctly normalizes the threshold to UTC at line 103, ensuring lexicographic comparisons work correctly regardless of the input timezone.

src/ai_company/engine/recovery.py (2)

72-92: LGTM! Cross-field validation ensures checkpoint data consistency.

The _validate_checkpoint_consistency validator correctly enforces the invariant that checkpoint_context_json and resume_attempt must be set together (both non-default) or both at defaults. The JSON validation also ensures the checkpoint context is a valid JSON object.


106-112: LGTM! Clean computed property for resume capability.

The can_resume property provides a simple boolean check derived from checkpoint_context_json, keeping the API clean and avoiding redundant state.

src/ai_company/engine/checkpoint/callback_factory.py (2)

63-76: LGTM! Clean callback implementation with correct sequencing.

The callback correctly:

  1. Skips turn 0 and non-boundary turns
  2. Only updates heartbeat after checkpoint save succeeds (preventing orphaned heartbeats)

78-104: LGTM! Best-effort checkpoint save with appropriate error isolation.

The _save_checkpoint helper correctly re-raises MemoryError and RecursionError (non-recoverable), while swallowing other exceptions to avoid crashing the execution loop. The return value communicates success to control heartbeat update.

src/ai_company/observability/events/checkpoint.py (1)

1-28: LGTM! Well-organized event constants.

The constants follow the established naming patterns and provide complete coverage for checkpoint lifecycle, heartbeat lifecycle, loop integration, and recovery flow events.

src/ai_company/engine/agent_engine.py (3)

173-181: LGTM! Proper validation for paired repository configuration.

The validation ensures that checkpoint_repo and heartbeat_repo are either both provided or both omitted, preventing misconfiguration.


691-699: LGTM! Error logging before raising addressed.

The code now logs with structured context (agent_id, task_id, error) at ERROR level before raising RuntimeError, satisfying the guideline that all error paths must log before raising.


783-795: LGTM! Checkpoint cleanup correctly conditioned on non-ERROR termination.

The check at line 783 ensures checkpoints and heartbeats are only deleted after successful resume completion, preserving them for potential subsequent resume attempts if the resumed execution itself ends in an error.

Comment thread docs/design/engine.md
Comment thread src/ai_company/engine/agent_engine.py
Comment thread src/ai_company/engine/checkpoint/resume.py Outdated
Comment thread src/ai_company/engine/checkpoint/resume.py
Comment on lines +191 to +218
async def _reserve_resume_attempt(
self,
execution_id: str,
task_id: str,
) -> bool:
"""Reserve a resume attempt, returning ``True`` when exhausted."""
async with self._resume_lock:
resume_count = self._resume_counts.get(execution_id, 0)
if resume_count >= self._config.max_resume_attempts:
logger.info(
CHECKPOINT_RECOVERY_FALLBACK,
execution_id=execution_id,
task_id=task_id,
resume_count=resume_count,
max_resume_attempts=self._config.max_resume_attempts,
reason="max_resume_attempts_exhausted",
)
self._resume_counts.pop(execution_id, None)
return True

self._resume_counts[execution_id] = resume_count + 1

# Evict oldest entries when the dict grows too large
if len(self._resume_counts) > _MAX_TRACKED_EXECUTIONS:
oldest = next(iter(self._resume_counts))
self._resume_counts.pop(oldest, None)

return False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider FIFO eviction ordering.

The eviction logic uses next(iter(self._resume_counts)) to remove the "oldest" entry. However, since Python 3.7+ dicts maintain insertion order, this removes the first inserted key, not necessarily the least recently used. If an execution is resumed multiple times, it stays at its original insertion position.

This is acceptable for a safety bound (the goal is preventing unbounded growth, not perfect LRU), but worth noting. If precise eviction semantics matter, an OrderedDict with move-to-end on access would be needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/checkpoint/strategy.py` around lines 191 - 218, The
eviction in _reserve_resume_attempt relies on dict insertion order via
self._resume_counts and uses next(iter(...)) which yields FIFO by insertion
time, not least-recently-used; to adopt true LRU behavior replace the dict with
collections.OrderedDict (update its declaration where _resume_counts is
created), call move_to_end(execution_id) whenever you access or increment an
entry in _reserve_resume_attempt (i.e., after reading or updating resume_count)
and perform eviction with popitem(last=False) instead of next(iter(...)) so the
least-recently-used key is removed when len(self._resume_counts) >
_MAX_TRACKED_EXECUTIONS; keep the existing logging and pop(execution_id, None)
when exhausting attempts.

CodeRabbit (5 new actionable comments on commit 6a8cfa7):
- Resume path now uses BudgetEnforcer when configured (was bypassing
  enforcer-specific features like per-agent allocation, auto-downgrade)
- Log deserialization errors before re-raising in resume.py
- Use HEARTBEAT_DELETE_FAILED event constant for heartbeat cleanup
  failures (was incorrectly using CHECKPOINT_DELETE_FAILED)
- Fix reconciliation bullet in design spec (was inconsistent with
  implementation description below it)
- Add comment explaining checkpoint placement before tool execution
  (intentional: preserves LLM response on mid-tool crash)

Greptile (3 findings, confidence 3/5):
- Clean up orphaned checkpoint/heartbeat rows on max-resume fallback
  path (delegate_to_fallback now calls cleanup_after_resume)
- Add heartbeat_repo to CheckpointRecoveryStrategy for fallback cleanup
- Document in-memory resume counter limitation with TODO comment
  (persisting counter across restarts is a planned improvement)
- Wrong heartbeat event constant — already fixed above
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the checkpoint recovery strategy (§6.6 Strategy 2) for agent crash recovery, enabling per-turn AgentContext persistence, heartbeat-based failure detection, and resume from last checkpoint with reconciliation. It falls back to fail-and-reassign after max resume attempts are exhausted.

Changes:

  • Adds Checkpoint, Heartbeat, CheckpointConfig frozen models, CheckpointRecoveryStrategy with async-safe resume counter, callback factory, and resume/cleanup helpers in a new engine/checkpoint/ package
  • Adds CheckpointRepository and HeartbeatRepository protocols with SQLite implementations, V6 migration (checkpoints + heartbeats tables with indexes), and integrates them into the PersistenceBackend protocol
  • Integrates checkpoint callbacks into ReactLoop and PlanExecuteLoop, adds resume path in AgentEngine with completion_config/effective_autonomy threading, and extends RecoveryResult with checkpoint_context_json, resume_attempt, and can_resume fields

Reviewed changes

Copilot reviewed 35 out of 36 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/ai_company/engine/checkpoint/__init__.py Package init re-exporting all checkpoint public symbols
src/ai_company/engine/checkpoint/callback.py CheckpointCallback type alias definition
src/ai_company/engine/checkpoint/callback_factory.py Factory closure for checkpoint+heartbeat persistence per turn
src/ai_company/engine/checkpoint/models.py Checkpoint, Heartbeat, CheckpointConfig frozen Pydantic models
src/ai_company/engine/checkpoint/resume.py Deserialization, reconciliation, loop-with-callback builder, cleanup helpers
src/ai_company/engine/checkpoint/strategy.py CheckpointRecoveryStrategy with async-safe resume counter and fallback
src/ai_company/engine/recovery.py Extended RecoveryResult with checkpoint fields and consistency validator
src/ai_company/engine/react_loop.py Checkpoint callback integration after LLM response
src/ai_company/engine/plan_execute_loop.py Checkpoint callback integration via _invoke_checkpoint_callback
src/ai_company/engine/agent_engine.py Resume path, loop-with-callback, paired repo validation, completion_config/effective_autonomy threading
src/ai_company/engine/__init__.py Re-exports checkpoint symbols
src/ai_company/persistence/repositories.py CheckpointRepository and HeartbeatRepository protocol definitions
src/ai_company/persistence/protocol.py Added checkpoints and heartbeats properties to PersistenceBackend
src/ai_company/persistence/sqlite/checkpoint_repo.py SQLite checkpoint repository implementation
src/ai_company/persistence/sqlite/heartbeat_repo.py SQLite heartbeat repository implementation
src/ai_company/persistence/sqlite/backend.py Integrated checkpoint/heartbeat repos into SQLite backend
src/ai_company/persistence/sqlite/__init__.py Re-exports new SQLite repos
src/ai_company/persistence/sqlite/migrations.py V6 migration (checkpoints + heartbeats tables with indexes)
src/ai_company/observability/events/checkpoint.py 16 checkpoint/heartbeat/recovery event constants
src/ai_company/observability/events/persistence.py Persistence-layer checkpoint/heartbeat event constants
src/ai_company/observability/events/execution.py Execution-layer checkpoint callback and resume event constants
tests/unit/engine/checkpoint/test_models.py Tests for Checkpoint, Heartbeat, CheckpointConfig models
tests/unit/engine/checkpoint/test_callback_factory.py Tests for checkpoint callback factory
tests/unit/engine/checkpoint/test_strategy.py Tests for CheckpointRecoveryStrategy
tests/unit/engine/test_recovery_checkpoint_fields.py Tests for RecoveryResult checkpoint fields
tests/unit/persistence/sqlite/test_checkpoint_repo.py Tests for SQLiteCheckpointRepository
tests/unit/persistence/sqlite/test_heartbeat_repo.py Tests for SQLiteHeartbeatRepository
tests/unit/persistence/sqlite/test_migrations_v6.py Tests for V6 migration
tests/unit/persistence/test_protocol.py Protocol conformance tests for new repos
tests/unit/persistence/test_migrations_v2.py Updated to use SCHEMA_VERSION constant instead of hardcoded 5
tests/unit/api/conftest.py Fake checkpoint/heartbeat repos for API tests
tests/unit/observability/test_events.py Event constant discovery and existence tests
docs/design/engine.md Updated recovery docs with checkpoint fields, removed "Planned" warning
README.md Added checkpoint resume mention and Persistence→Engine dependency
CLAUDE.md Updated engine description and event name examples

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +100 to +105
expected = {
"idx_cp_execution_id",
"idx_cp_task_id",
"idx_cp_exec_turn",
}
assert expected.issubset(indexes)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: implement checkpoint recovery strategy

2 participants