Skip to content

fix(client): advance download watermark atomically per batch#14

Open
dubadub wants to merge 1 commit into
mainfrom
fix/download-watermark
Open

fix(client): advance download watermark atomically per batch#14
dubadub wants to merge 1 commit into
mainfrom
fix/download-watermark

Conversation

@dubadub
Copy link
Copy Markdown
Member

@dubadub dubadub commented Apr 14, 2026

Summary

Fixes a bug in the client's incremental-download watermark that permanently skips files after a partial sync.

Root cause: check_download_once used registry::latest_jid = max(file_records.jid) as the watermark. This advances every time a file is saved. The server's list(jid) filter returns only paths whose latest commit id > jid, so if the client saved a high-jid file (id=500) and then crashed before saving a lower-jid file from the same batch (id=300), the next poll requested list(500), got nothing for id=300, and that file was invisible forever.

Fix: Move the watermark into a dedicated sync_state table and advance it only once per batch, inside a DB transaction that also writes all file_records rows. Either everything lands or nothing does — partial crash leaves the watermark pointing at the previous safe value, the whole batch is re-requested on the next poll (chunks already on disk are reused via the cache).

Changes

  • New migration 2026-04-14-000000_add_sync_state — single-row-per-namespace table holding the download watermark.
  • registry::get_download_watermark / set_download_watermark + 4 unit tests.
  • check_download_once: reads from new watermark, disk saves happen up-front, then all file_record writes + watermark update run inside conn.transaction.

Upgrade behaviour

Fresh installs and upgrades start with watermark=0, triggering one full re-sync. That also heals any gaps that accumulated under the old mechanism (chunks already on disk are cache hits, so the re-download is cheap).

Test plan

  • cargo test --lib (38 tests pass, incl. 4 new)
  • cargo build --workspace (server still compiles)
  • cargo clippy --lib --all-features (clean)
  • Manual: delete device sync.db, cut fresh install, verify all recipes arrive and subsequent increments work
  • Simulate partial failure: ctrl-c during a download batch, resume, verify no files are lost

Observed before fix

On Android: device had 69 synced paths with jids scattered 261..561792, but was missing Breakfast/Easy Pancakes.cook, Tuscan Chicken, Shakshuka, and recipes referenced by 3 Day Plan XVIII.menu. Deleting the device's sync.db (forcing resync from jid=0) restored all files — confirming the server had them but the watermark was skipping them.

The incremental download watermark was derived from `max(file_records.jid)`,
which advances every time a single file is saved. Combined with the
server's `list(jid)` filter — which hides any path whose latest commit id
is <= `jid` — this permanently skips files if a batch is interrupted
partway. If the client saved a high-jid file (say id=500) and then crashed
before saving a lower-jid file from the same batch (id=300), the next
poll requested `list(500)`, got nothing for id=300, and the file was
invisible forever.

Move the watermark into a dedicated `sync_state` table and advance it
only once per batch, inside a DB transaction that also writes all
`file_records` rows. Either everything lands or nothing does; a partial
crash leaves the watermark pointing at the previous safe value so the
whole batch is re-requested on the next poll (chunks already on disk are
reused via the local cache).

Fresh installs and upgrades start with watermark=0, triggering one full
re-sync. That also heals any gaps that accumulated under the old
mechanism.
@claude
Copy link
Copy Markdown

claude Bot commented Apr 14, 2026

Code Review

Overview

This PR fixes a real, well-documented crash-recovery bug: the old max(file_records.jid) watermark could race ahead of lower-jid files from the same batch, making those files permanently invisible to subsequent polls. The new dedicated sync_state table + single-transaction commit is the correct fix. The root-cause analysis and PR description are excellent.


Correctness / Logic

The transaction is correct. Disk writes happen first (lines 315–334), then a single conn.transaction atomically persists all file_records rows and the advanced watermark. A crash before the transaction leaves the watermark untouched; the next poll re-requests the same batch and chunk cache hits make it cheap. This is the right design.

max_id is correct. The watermark advances to max(d.id) across both created and deleted entries, which matches what the server's list(jid) filter expects.


Issues

1. unwrap_or(0) silently swallows errors (syncer.rs:263)

let latest_local = registry::get_download_watermark(conn, namespace_id).unwrap_or(0);

get_download_watermark already returns Ok(0) when no row exists, so the unwrap_or only fires on a real DB error — in which case it silently triggers a full re-sync instead of propagating the error. Every other registry call in this function uses ?. This should be:

let latest_local = registry::get_download_watermark(conn, namespace_id)?;

2. latest_jid is now dead code (registry.rs:76)

The only caller was the old check_download_once line that this PR replaces. latest_jid is no longer called anywhere. It will likely generate a dead_code warning (or Clippy will flag it). Either remove it or add #[allow(dead_code)] with a comment explaining why it's kept.

3. allow_tables_to_appear_in_same_query! is speculative (schema.rs)

diesel::allow_tables_to_appear_in_same_query!(file_records, sync_state,);

sync_state is only ever accessed through its own registry functions — there are no joins between these two tables in the codebase. This macro isn't needed yet. Add it when a cross-table query actually exists; adding it preemptively muddies the schema.


Minor Suggestions

set_download_watermark return type — It returns Result<usize> (Diesel's execute return), but the upsert always touches exactly one row and callers never inspect the count. Returning Result<()> (via .map(|_| ())) would give a cleaner interface and make accidental misuse (e.g. checking the count) impossible.

down.sql — Dropping the table is correct for rollback. Consider whether a migration down should also handle re-seeding file_records or if "re-sync from jid=0 on re-upgrade" is acceptable (it is, given the cheap chunk-cache re-download, but worth a brief comment).


Tests

The four new unit tests are well-structured (Given/When/Then), cover the important cases (missing row → 0, round-trip, upsert-not-duplicate, namespace isolation), and use real SQLite rather than mocks. Good coverage for the new registry layer.

Missing: an integration-level test for check_download_once that verifies the watermark is not advanced when the transaction fails mid-batch. This is hard to test without fault injection, but worth noting as a gap.


Summary

Bug fix correctness
Transaction atomicity
Tests ✅ (unit), ⚠️ no failure-path integration test
unwrap_or(0) swallows errors ❌ should be ?
latest_jid dead code ⚠️ remove or annotate
Speculative allow_tables_to_appear_in_same_query! ⚠️ remove until needed

The core fix is solid. The unwrap_or(0) issue is the only one worth blocking on — the others are cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant