fix(client): advance download watermark atomically per batch by dubadub · Pull Request #14 · cooklang/cooklang-sync

dubadub · 2026-04-14T19:53:50Z

Summary

Fixes a bug in the client's incremental-download watermark that permanently skips files after a partial sync.

Root cause: check_download_once used registry::latest_jid = max(file_records.jid) as the watermark. This advances every time a file is saved. The server's list(jid) filter returns only paths whose latest commit id > jid, so if the client saved a high-jid file (id=500) and then crashed before saving a lower-jid file from the same batch (id=300), the next poll requested list(500), got nothing for id=300, and that file was invisible forever.

Fix: Move the watermark into a dedicated sync_state table and advance it only once per batch, inside a DB transaction that also writes all file_records rows. Either everything lands or nothing does — partial crash leaves the watermark pointing at the previous safe value, the whole batch is re-requested on the next poll (chunks already on disk are reused via the cache).

Changes

New migration 2026-04-14-000000_add_sync_state — single-row-per-namespace table holding the download watermark.
registry::get_download_watermark / set_download_watermark + 4 unit tests.
check_download_once: reads from new watermark, disk saves happen up-front, then all file_record writes + watermark update run inside conn.transaction.

Upgrade behaviour

Fresh installs and upgrades start with watermark=0, triggering one full re-sync. That also heals any gaps that accumulated under the old mechanism (chunks already on disk are cache hits, so the re-download is cheap).

Test plan

cargo test --lib (38 tests pass, incl. 4 new)
cargo build --workspace (server still compiles)
cargo clippy --lib --all-features (clean)
Manual: delete device sync.db, cut fresh install, verify all recipes arrive and subsequent increments work
Simulate partial failure: ctrl-c during a download batch, resume, verify no files are lost

Observed before fix

On Android: device had 69 synced paths with jids scattered 261..561792, but was missing Breakfast/Easy Pancakes.cook, Tuscan Chicken, Shakshuka, and recipes referenced by 3 Day Plan XVIII.menu. Deleting the device's sync.db (forcing resync from jid=0) restored all files — confirming the server had them but the watermark was skipping them.

The incremental download watermark was derived from `max(file_records.jid)`, which advances every time a single file is saved. Combined with the server's `list(jid)` filter — which hides any path whose latest commit id is <= `jid` — this permanently skips files if a batch is interrupted partway. If the client saved a high-jid file (say id=500) and then crashed before saving a lower-jid file from the same batch (id=300), the next poll requested `list(500)`, got nothing for id=300, and the file was invisible forever. Move the watermark into a dedicated `sync_state` table and advance it only once per batch, inside a DB transaction that also writes all `file_records` rows. Either everything lands or nothing does; a partial crash leaves the watermark pointing at the previous safe value so the whole batch is re-requested on the next poll (chunks already on disk are reused via the local cache). Fresh installs and upgrades start with watermark=0, triggering one full re-sync. That also heals any gaps that accumulated under the old mechanism.

claude · 2026-04-14T19:56:00Z

Code Review

Overview

This PR fixes a real, well-documented crash-recovery bug: the old max(file_records.jid) watermark could race ahead of lower-jid files from the same batch, making those files permanently invisible to subsequent polls. The new dedicated sync_state table + single-transaction commit is the correct fix. The root-cause analysis and PR description are excellent.

Correctness / Logic

The transaction is correct. Disk writes happen first (lines 315–334), then a single conn.transaction atomically persists all file_records rows and the advanced watermark. A crash before the transaction leaves the watermark untouched; the next poll re-requests the same batch and chunk cache hits make it cheap. This is the right design.

max_id is correct. The watermark advances to max(d.id) across both created and deleted entries, which matches what the server's list(jid) filter expects.

Issues

1. unwrap_or(0) silently swallows errors (syncer.rs:263)

let latest_local = registry::get_download_watermark(conn, namespace_id).unwrap_or(0);

get_download_watermark already returns Ok(0) when no row exists, so the unwrap_or only fires on a real DB error — in which case it silently triggers a full re-sync instead of propagating the error. Every other registry call in this function uses ?. This should be:

let latest_local = registry::get_download_watermark(conn, namespace_id)?;

2. latest_jid is now dead code (registry.rs:76)

The only caller was the old check_download_once line that this PR replaces. latest_jid is no longer called anywhere. It will likely generate a dead_code warning (or Clippy will flag it). Either remove it or add #[allow(dead_code)] with a comment explaining why it's kept.

3. allow_tables_to_appear_in_same_query! is speculative (schema.rs)

diesel::allow_tables_to_appear_in_same_query!(file_records, sync_state,);

sync_state is only ever accessed through its own registry functions — there are no joins between these two tables in the codebase. This macro isn't needed yet. Add it when a cross-table query actually exists; adding it preemptively muddies the schema.

Minor Suggestions

set_download_watermark return type — It returns Result<usize> (Diesel's execute return), but the upsert always touches exactly one row and callers never inspect the count. Returning Result<()> (via .map(|_| ())) would give a cleaner interface and make accidental misuse (e.g. checking the count) impossible.

down.sql — Dropping the table is correct for rollback. Consider whether a migration down should also handle re-seeding file_records or if "re-sync from jid=0 on re-upgrade" is acceptable (it is, given the cheap chunk-cache re-download, but worth a brief comment).

Tests

The four new unit tests are well-structured (Given/When/Then), cover the important cases (missing row → 0, round-trip, upsert-not-duplicate, namespace isolation), and use real SQLite rather than mocks. Good coverage for the new registry layer.

Missing: an integration-level test for check_download_once that verifies the watermark is not advanced when the transaction fails mid-batch. This is hard to test without fault injection, but worth noting as a gap.

Summary


Bug fix correctness	✅
Transaction atomicity	✅
Tests	✅ (unit), ⚠️ no failure-path integration test
`unwrap_or(0)` swallows errors	❌ should be `?`
`latest_jid` dead code	⚠️ remove or annotate
Speculative `allow_tables_to_appear_in_same_query!`	⚠️ remove until needed

The core fix is solid. The unwrap_or(0) issue is the only one worth blocking on — the others are cleanup.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(client): advance download watermark atomically per batch#14

fix(client): advance download watermark atomically per batch#14
dubadub wants to merge 1 commit into
mainfrom
fix/download-watermark

dubadub commented Apr 14, 2026

Uh oh!

claude Bot commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dubadub commented Apr 14, 2026

Summary

Changes

Upgrade behaviour

Test plan

Observed before fix

Uh oh!

claude Bot commented Apr 14, 2026

Code Review

Overview

Correctness / Logic

Issues

Minor Suggestions

Tests

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant