
feat(streaming): implement checkpoint coordination with catalog-backed crash recovery #223

Open
luoluoyuyu wants to merge 5 commits into FunctionStream:main from luoluoyuyu:streaming-storage


@luoluoyuyu
Collaborator

No description provided.

…d crash recovery

Introduce a full Chandy-Lamport distributed snapshot mechanism with
catalog-persisted safe epochs for exactly-once crash recovery:

- Add an LSM-Tree state engine with MVCC, Bloom filters, watermark GC,
  and a dedicated I/O thread pool with panic isolation (state/ module)
- Add JobMasterEvent protocol for checkpoint ACK/Decline signaling
- Integrate MemoryController and IoPool into JobManager lifecycle
- Implement checkpoint coordinator: periodic barrier injection,
  ACK collection, and CatalogManager.commit_job_checkpoint() callback
- Extend StreamingTableDefinition proto with checkpoint_interval_ms
  and latest_checkpoint_epoch for durable recovery metadata
- Support user-defined checkpoint interval via SQL WITH clause
- Restore streaming jobs from catalog with precise epoch recovery
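
For readers unfamiliar with the barrier/ACK flow described above, here is a minimal sketch of how a coordinator might track it. It is not the code in this PR: the `JobMasterEvent` variants, their fields, `Outcome`, and every method name are assumptions made for illustration, and the catalog commit is reduced to recording the epoch in memory where the real coordinator would presumably call `CatalogManager.commit_job_checkpoint()`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the PR's JobMasterEvent; the real variants
/// and fields are not shown in the commit message.
#[derive(Debug, Clone, PartialEq)]
enum JobMasterEvent {
    CheckpointAck { epoch: u64, task_id: u32 },
    CheckpointDecline { epoch: u64, task_id: u32 },
}

/// Result of feeding one event to the coordinator (illustrative only).
#[derive(Debug, PartialEq)]
enum Outcome {
    Pending,
    Committed(u64),
    Aborted(u64),
}

struct CheckpointCoordinator {
    tasks: HashSet<u32>,
    /// epoch -> tasks that have not yet acknowledged the barrier
    pending: HashMap<u64, HashSet<u32>>,
    next_epoch: u64,
    /// Highest fully acknowledged epoch; a real system would persist this.
    latest_committed: Option<u64>,
}

impl CheckpointCoordinator {
    fn new(tasks: impl IntoIterator<Item = u32>) -> Self {
        Self {
            tasks: tasks.into_iter().collect(),
            pending: HashMap::new(),
            next_epoch: 1,
            latest_committed: None,
        }
    }

    /// Start a checkpoint and return the epoch to stamp on the barrier.
    /// A timer firing every `checkpoint_interval_ms` would call this.
    fn inject_barrier(&mut self) -> u64 {
        let epoch = self.next_epoch;
        self.next_epoch += 1;
        self.pending.insert(epoch, self.tasks.clone());
        epoch
    }

    /// Handle an ACK or Decline; commit once every task has acknowledged.
    fn on_event(&mut self, event: JobMasterEvent) -> Outcome {
        match event {
            JobMasterEvent::CheckpointAck { epoch, task_id } => {
                let Some(waiting) = self.pending.get_mut(&epoch) else {
                    return Outcome::Pending; // unknown or already finished epoch
                };
                waiting.remove(&task_id);
                if !waiting.is_empty() {
                    return Outcome::Pending;
                }
                self.pending.remove(&epoch);
                // Keep the committed epoch monotonic even if epochs
                // complete out of order.
                let newest = self.latest_committed.map_or(epoch, |e| e.max(epoch));
                self.latest_committed = Some(newest);
                Outcome::Committed(epoch)
            }
            JobMasterEvent::CheckpointDecline { epoch, .. } => {
                // A single decline aborts the whole epoch.
                if self.pending.remove(&epoch).is_some() {
                    Outcome::Aborted(epoch)
                } else {
                    Outcome::Pending
                }
            }
        }
    }
}
```

In this sketch, a job with tasks `1` and `2` reaches `Outcome::Committed(epoch)` only after both have sent `CheckpointAck` for that epoch, and recovery would restart from `latest_committed`. Whether the actual coordinator aborts on the first decline or retries is not stated in the commit message.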

Made-with: Cursor
…e watermark harvesting

- Extend TaskContext with state_dir, memory_controller, io_manager,
  and safe_epoch to bridge operators with the state engine
- Refactor JoinWithExpirationOperator: replace in-memory VecDeque with
  PersistentStateBuffer backed by OperatorStateStore, using composite
  keys [Side(1B) + Timestamp(8B BE)] and BTreeSet timeline index
- Refactor InstantJoinOperator: replace in-memory BTreeMap<SystemTime,
  JoinInstance> with LSM-Tree persistence, split process_watermark into
  3-phase pipeline (harvest -> compute -> cleanup) to eliminate
  interleaved mutable/immutable borrow conflicts
- Both operators now support on_start recovery via restore_metadata and
  snapshot_state via snapshot_epoch for exactly-once semantics
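
To make the composite key layout `[Side(1B) + Timestamp(8B BE)]` and the `BTreeSet` timeline index concrete, here is a rough sketch under stated assumptions. The `Side` enum, the function names, and the use of an unsigned 64-bit timestamp are all hypothetical; the PR's actual types are not shown in the commit message. The reason big-endian matters is that, for unsigned integers, lexicographic byte order then matches numeric order, so a sorted key-value store iterates entries of one side in timestamp order.

```rust
use std::collections::BTreeSet;

/// Hypothetical join side tag; occupies the first byte of the key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Side {
    Left = 0,
    Right = 1,
}

/// Encode [Side (1 byte) | timestamp (8 bytes, big-endian)].
/// Assumes an unsigned timestamp so byte order equals numeric order.
fn encode_key(side: Side, ts: u64) -> [u8; 9] {
    let mut key = [0u8; 9];
    key[0] = side as u8;
    key[1..].copy_from_slice(&ts.to_be_bytes());
    key
}

/// Decode a key produced by `encode_key`; returns None for an unknown side byte.
fn decode_key(key: &[u8; 9]) -> Option<(Side, u64)> {
    let side = match key[0] {
        0 => Side::Left,
        1 => Side::Right,
        _ => return None,
    };
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&key[1..]);
    Some((side, u64::from_be_bytes(ts)))
}

/// Remove and return all timestamps strictly below `watermark`, in
/// ascending order. Treating the watermark as an exclusive bound is an
/// assumption of this sketch.
fn harvest_expired(timeline: &mut BTreeSet<u64>, watermark: u64) -> Vec<u64> {
    // `split_off` returns the elements >= watermark; the rest stay behind.
    let keep = timeline.split_off(&watermark);
    std::mem::replace(timeline, keep).into_iter().collect()
}
```

Collecting the expired timestamps into an owned `Vec` before touching the store is one way to separate a harvest step from the later compute and cleanup steps, since no borrow of the index remains alive while state is mutated. Whether the PR structures its three phases exactly this way is not confirmed by the commit message.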

Made-with: Cursor