Skip to content

perf(api): owner-token compare-and-clear for last_collection_started_at to handle overlapping cron + async runs #261

@cristim

Description

@cristim

Summary

The async-refresh handler (#257, PR #260) and the scheduler use last_collection_started_at as a single in-flight flag. Two callers can race and clear each other's marker:

T+0   cron-scheduler: ClearCollectionStarted (cleans up its previous run)
T+1   user clicks Refresh → handler: MarkCollectionStarted (ok=true, started_at = T+1)
T+2   cron-scheduler: enters NEW run, MarkCollectionStarted (ok=false, no-op)
T+3   cron-scheduler completes earlier batch (?), defer fires:
        ClearCollectionStarted ← wipes the user-triggered run's marker
T+4   user-triggered run still runs in the background, frontend stops
      polling because last_collection_started_at is null

The current safeguard is the 5-min auto-recovery window in MarkCollectionStarted: even if a marker is wiped, a fresh refresh after 5 min works again. So the practical impact is bounded — frontend banner clears too early, but the underlying state self-heals.

Proposal

Replace the single boolean with an owner token (UUID) per collection:

  1. MarkCollectionStarted returns (token uuid.UUID, ok bool, err error).
  2. ClearCollectionStarted(ctx, token) does a compare-and-clear: only clears the marker if the persisted owner matches token. Mismatches are silent no-ops.
  3. Defer-clear in scheduler.CollectRecommendations and the rollback in handler_recommendations_refresh.go::asyncInvokeSelf failure path both pass their own token.

Scope

  • New column last_collection_owner_id UUID on recommendations_state (migration 000048).
  • StoreInterface signature change: MarkCollectionStarted and ClearCollectionStarted both gain a token param.
  • 5 mock impls in internal/{analytics,api,purchase,scheduler,server} packages need the new signature.
  • Callers in internal/api/handler_recommendations_refresh.go and internal/scheduler/scheduler.go thread the token through.

Why deferred from #260

CR pass-1 raised this as priority/major on PR #260. The bookkeeping refactor is bigger than a CR-pass-1 fix should be (touches a migration, the StoreInterface, all 5 mocks, and both call sites), and the practical impact is bounded by the existing 5-min recovery window. Tracking here so it's not lost.

Why this matters

When CUDly grows beyond a single user / single account, concurrent refresh clicks become more common. The current behaviour silently confuses the polling banner; an owner-token model makes it deterministic.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions