feat(composio): incremental sync with per-item persistence for Gmail and Notion#519

Merged: senamakel merged 2 commits into main from feat/composeio-sync on Apr 13, 2026

senamakel (Member) commented Apr 13, 2026

Summary

  • Incremental sync — Gmail and Notion providers now track what has been synced via persistent cursor + synced-ID set in the local KV store, fetching only new/updated items instead of re-downloading everything each cycle.
  • Per-item memory documents — Each email and Notion page is stored as its own memory document (e.g. composio-gmail-msg-{id}, composio-notion-page-{id}) instead of one giant snapshot blob, improving agent recall granularity.
  • Daily request budget — Each provider connection is capped at 500 execute_tool API calls per calendar day (auto-resets), preventing runaway backfills during initial sync.
  • Pagination support — Both providers now follow pagination tokens (nextPageToken for Gmail, next_cursor for Notion) across multiple pages within budget.
  • Edit detection (Notion) — Uses composite {page_id}@{edited_time} keys so pages edited after their last sync are re-persisted.
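
The dedup and keying scheme described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in `sync_state.rs`: the struct layout and the names `mark_synced`, `advance_cursor`, `gmail_doc_id`, `notion_doc_id`, and `notion_sync_key` are hypothetical, and the cursor comparison assumes timestamps are ISO-8601 UTC strings, so that lexicographic order matches chronological order.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the PR's `SyncState`; the real
/// struct may have different fields and method names.
#[derive(Debug, Default)]
pub struct SyncState {
    /// Timestamp of the newest item seen so far (assumed ISO-8601 UTC).
    pub cursor: Option<String>,
    /// Keys of items that have already been persisted.
    pub synced_ids: HashSet<String>,
}

impl SyncState {
    /// Returns true the first time a key is seen, false on repeats.
    pub fn mark_synced(&mut self, key: &str) -> bool {
        self.synced_ids.insert(key.to_string())
    }

    /// Moves the cursor forward only if `candidate` sorts after the current value.
    pub fn advance_cursor(&mut self, candidate: &str) {
        let is_newer = self.cursor.as_deref().map_or(true, |cur| candidate > cur);
        if is_newer {
            self.cursor = Some(candidate.to_string());
        }
    }
}

/// Per-item document ID for a Gmail message, following the pattern in the summary.
pub fn gmail_doc_id(message_id: &str) -> String {
    format!("composio-gmail-msg-{message_id}")
}

/// Per-item document ID for a Notion page, following the pattern in the summary.
pub fn notion_doc_id(page_id: &str) -> String {
    format!("composio-notion-page-{page_id}")
}

/// Composite dedup key for Notion: it changes whenever the page is edited,
/// so an edited page no longer matches its previously synced key.
pub fn notion_sync_key(page_id: &str, edited_time: &str) -> String {
    format!("{page_id}@{edited_time}")
}
```

Under this sketch, a Notion page keeps a stable document ID (so a re-sync overwrites the same memory document), while its dedup key changes with every edit, which is what would trigger the re-persist.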

Changes

| File | What |
| --- | --- |
| src/openhuman/composio/providers/sync_state.rs | New SyncState, DailyBudget, KV-backed load/save, dedup helpers, per-item persist helper |
| src/openhuman/composio/providers/gmail.rs | Rewritten sync loop: cursor-based date filtering, pagination, per-message persistence |
| src/openhuman/composio/providers/notion.rs | Rewritten sync loop: cursor boundary detection, pagination, per-page persistence with title extraction |
| src/openhuman/composio/providers/mod.rs | Added pub mod sync_state |

Test plan

  • cargo check passes
  • All 33 provider unit tests pass (sync_state, gmail, notion, registry)
  • Full cargo test suite passes (2464 passed, 1 pre-existing flaky failure unrelated to this PR)
  • Pre-push hooks pass (format, lint, typecheck, rust check)
  • Manual test: connect Gmail via OAuth → verify incremental sync persists individual message docs
  • Manual test: connect Notion via OAuth → verify per-page docs and edit re-sync
  • Manual test: verify daily budget caps at 500 requests and resets next day
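
The daily-budget behaviour in the last manual test could also be checked without waiting for a real day boundary if the current date is passed in as a parameter. The following is a minimal sketch under that assumption; the field names and the `try_record` / `remaining` methods are hypothetical and may not match the PR's `DailyBudget`.

```rust
/// Hypothetical sketch of a per-connection daily request budget.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DailyBudget {
    /// Calendar day the counter applies to, e.g. "2026-04-13".
    pub day: String,
    /// Requests recorded so far on `day`.
    pub used: u32,
    /// Maximum requests allowed per day.
    pub limit: u32,
}

impl DailyBudget {
    pub fn new(day: &str, limit: u32) -> Self {
        Self { day: day.to_string(), used: 0, limit }
    }

    /// Resets the counter when the calendar day has changed.
    fn roll_over(&mut self, today: &str) {
        if self.day != today {
            self.day = today.to_string();
            self.used = 0;
        }
    }

    /// Requests still available on `today`.
    pub fn remaining(&mut self, today: &str) -> u32 {
        self.roll_over(today);
        self.limit.saturating_sub(self.used)
    }

    /// Records one request; returns false (and records nothing) once exhausted.
    pub fn try_record(&mut self, today: &str) -> bool {
        if self.remaining(today) == 0 {
            return false;
        }
        self.used += 1;
        true
    }
}
```

Passing `today` explicitly keeps the type free of clock dependencies, so a unit test can simulate the cap at 500 and the next-day reset deterministically.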

Summary by CodeRabbit

  • New Features
    • Gmail and Notion providers now support incremental syncing, fetching only new and updated items instead of complete snapshots.
    • Added pagination support for handling large datasets efficiently.
    • Implemented daily request budget tracking to manage API usage limits.
    • Duplicate messages and pages are automatically deduplicated during sync.

…mail and Notion providers

- Enhanced the Gmail and Notion providers to support incremental synchronization with per-item persistence, improving data handling efficiency.
- Introduced a new `SyncState` module to manage persistent sync state, including cursor tracking, synced IDs, and daily request budget management.
- Updated sync logic to load state from a KV store, check daily budget limits, and handle paginated API requests, ensuring robust data retrieval and deduplication.
- Refactored existing sync methods to utilize the new state management, enhancing overall reliability and performance of the providers.
- Improved documentation for the sync process and state management, clarifying the operational flow and usage of the new features.
…otion providers

- Enhanced error handling in the Gmail provider to format error messages more clearly during email fetching.
- Streamlined debug logging in both Gmail and Notion providers to improve readability by consolidating multiline statements into single lines.
- Refactored the `extract_page_title` function in the Notion provider for better clarity in property extraction logic.
- Overall, these changes aim to enhance maintainability and improve the clarity of error reporting and logging across the providers.

coderabbitai Bot (Contributor) commented Apr 13, 2026

Walkthrough

Replaces snapshot-based syncing with incremental, stateful synchronization for Gmail and Notion providers. Introduces a new sync_state.rs module managing persistent per-connection state (cursor, deduplication set, daily request budget), pagination logic, and per-item persistence. Adds helper functions for token extraction and filter conversion.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Sync State Foundation: src/openhuman/composio/providers/sync_state.rs | New module implementing SyncState and DailyBudget for persistent synchronization tracking, request budgeting, deduplication, and per-item persistence helpers; includes KV storage integration and unit tests. |
| Incremental Provider Syncing: src/openhuman/composio/providers/gmail.rs, src/openhuman/composio/providers/notion.rs | Refactored sync flows from snapshot-based to incremental pagination with state loading/saving, daily budget enforcement, per-item deduplication and persistence, cursor advancement, and updated reporting; removed batch persistence helpers. |
| Module Exports: src/openhuman/composio/providers/mod.rs | Exposed new sync_state public module. |

Sequence Diagram

sequenceDiagram
    participant Provider as Gmail/Notion<br/>Provider
    participant Memory as Memory Store<br/>(KV)
    participant API as External API<br/>(Gmail/Notion)
    participant Persist as Memory Persistence
    
    Provider->>Memory: Load SyncState<br/>(cursor, synced_ids, budget)
    Memory-->>Provider: Return persisted state
    
    alt Budget Exhausted
        Provider-->>Provider: Early exit with<br/>budget_exhausted
    else Budget Available
        loop Paginate while budget available
            Provider->>API: Fetch items (paginated)<br/>with optional cursor filter
            API-->>Provider: Return page of items
            
            loop Process each item
                Provider->>Provider: Extract ID, compute sync_key
                alt Item already synced
                    Provider->>Provider: Skip (dedup)
                else New/Updated item
                    Provider->>Persist: Persist single item<br/>as memory document
                    Persist-->>Provider: Document persisted
                    Provider->>Provider: Mark in synced_ids,<br/>record request
                end
            end
            
            Provider->>Provider: Advance cursor to<br/>newest item timestamp
        end
    end
    
    Provider->>Memory: Save updated SyncState<br/>(cursor, synced_ids, budget)
    Memory-->>Provider: State persisted
    Provider-->>Provider: Return sync outcome
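
The flow in the sequence diagram can be approximated in code as shown here. This is a sketch only: `run_sync`, `Item`, `Page`, and `SyncOutcome` are hypothetical names, `fetch` stands in for a paginated `execute_tool` call, and `persist` stands in for the per-item memory write. It also assumes that each page fetch costs one request from the budget and that timestamps compare correctly as strings.

```rust
use std::collections::HashSet;

/// One item returned by the (mock) external API.
#[derive(Debug, Clone)]
pub struct Item {
    pub id: String,
    pub timestamp: String,
}

/// One page of results plus an optional token for the next page.
pub struct Page {
    pub items: Vec<Item>,
    pub next_token: Option<usize>,
}

/// Counters reported after a sync run.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct SyncOutcome {
    pub persisted: usize,
    pub skipped: usize,
    pub requests: u32,
    pub budget_exhausted: bool,
}

/// Hypothetical sync loop; none of these signatures are the PR's real API.
pub fn run_sync(
    synced_ids: &mut HashSet<String>,
    cursor: &mut Option<String>,
    mut budget_left: u32,
    mut fetch: impl FnMut(Option<usize>) -> Page,
    mut persist: impl FnMut(&Item),
) -> SyncOutcome {
    let mut outcome = SyncOutcome::default();
    let mut token = None;
    loop {
        // Stop early when no requests remain for today.
        if budget_left == 0 {
            outcome.budget_exhausted = true;
            break;
        }
        // Assumption: each page fetch consumes one request.
        budget_left -= 1;
        outcome.requests += 1;
        let page = fetch(token);
        for item in &page.items {
            // `insert` returns false when the ID was already present.
            if synced_ids.insert(item.id.clone()) {
                persist(item);
                outcome.persisted += 1;
            } else {
                outcome.skipped += 1;
            }
            // Advance the cursor to the newest timestamp seen so far.
            if cursor.as_deref().map_or(true, |c| item.timestamp.as_str() > c) {
                *cursor = Some(item.timestamp.clone());
            }
        }
        // Follow the pagination token until the API reports no further page.
        match page.next_token {
            Some(next) => token = Some(next),
            None => break,
        }
    }
    outcome
}
```

In a real provider the caller would load `synced_ids`, `cursor`, and the remaining budget from the KV store before the call and save them afterwards, as the first and last steps of the diagram show.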

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • openhuman#509: Introduces provider-based Composio sync architecture that this PR builds upon with incremental state-tracked syncing and per-item persistence patterns applied to Gmail and Notion providers.

Poem

🐰 Hop, hop—the sync state hops along,
With cursor marks and budgets strong,
Each email, page, de-duped with care,
Incremental blessings everywhere! 📬

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately and concisely summarizes the main change: adding incremental sync with per-item persistence for Gmail and Notion providers, which is the primary focus of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/openhuman/composio/providers/sync_state.rs (1)

53-57: Potential unbounded growth of synced_ids set.

For high-volume providers like Gmail, synced_ids could grow to tens of thousands of entries over time. Consider adding a pruning mechanism or a size cap with LRU eviction to prevent excessive memory/storage consumption.

This isn't blocking for the initial implementation, but should be tracked for follow-up.

Would you like me to open an issue to track implementing a pruning strategy for the synced_ids set?
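
One way to bound the set, along the lines of the "RingBuffer+HashSet combo" suggested in the review, is sketched below. Note that this uses plain first-in, first-out eviction rather than true LRU, and `BoundedIdSet` is a hypothetical type, not something in the PR. It relies only on the standard library.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical bounded replacement for a plain `HashSet<String>`:
/// remembers at most `capacity` keys and forgets the oldest first (FIFO).
#[derive(Debug)]
pub struct BoundedIdSet {
    capacity: usize,
    /// Insertion order, oldest key at the front.
    order: VecDeque<String>,
    /// Fast membership lookup for the same keys.
    members: HashSet<String>,
}

impl BoundedIdSet {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), members: HashSet::new() }
    }

    pub fn contains(&self, key: &str) -> bool {
        self.members.contains(key)
    }

    pub fn len(&self) -> usize {
        self.members.len()
    }

    /// Inserts `key`; returns false if it was already present.
    /// When the set is full, the oldest key is evicted first.
    pub fn insert(&mut self, key: &str) -> bool {
        if self.members.contains(key) {
            return false;
        }
        if self.members.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.members.remove(&oldest);
            }
        }
        self.order.push_back(key.to_string());
        self.members.insert(key.to_string());
        true
    }
}
```

The trade-off is that an evicted ID could be persisted again if the API ever returns it, so a cap like this probably only makes sense alongside the cursor filter, which should keep old items out of the fetched pages in the first place.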

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/openhuman/composio/providers/sync_state.rs` around lines 53 - 57, The
synced_ids HashSet<String> can grow unbounded for high-volume providers; replace
or wrap it with a bounded, eviction-capable structure (e.g., an LRU cache) and
perform insertions via that API instead of direct HashSet inserts so old IDs are
pruned; specifically change the synced_ids field (or add a new field in the same
struct) to use a capacity-limited container (for example lru::LruCache<String,
()> or a custom RingBuffer+HashSet combo), update any methods that modify/access
synced_ids to use the new API (insert/check/evict), and ensure serde
(de)serialization is handled (use a serializable wrapper or convert on
save/load) so persistence semantics remain intact.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between a115758 and b34388a.

📒 Files selected for processing (4)
  • src/openhuman/composio/providers/gmail.rs
  • src/openhuman/composio/providers/mod.rs
  • src/openhuman/composio/providers/notion.rs
  • src/openhuman/composio/providers/sync_state.rs

@senamakel senamakel merged commit 08178b5 into main Apr 13, 2026
13 of 14 checks passed
@senamakel senamakel deleted the feat/composeio-sync branch May 2, 2026 13:05
@senamakel senamakel restored the feat/composeio-sync branch May 2, 2026 13:05
@senamakel senamakel deleted the feat/composeio-sync branch May 2, 2026 13:05
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026
…and Notion (tinyhumansai#519)

* feat(composio): implement incremental sync and state management for Gmail and Notion providers

* refactor(composio): improve error handling and logging in Gmail and Notion providers
