[2.x] Bound retry and unique-lock lifetimes on UpdateApps and ProcessApps #1559
Open
JoshSalway wants to merge 6 commits into linuxserver:2.x from
Conversation
Both jobs implement `ShouldBeUnique` without a `$uniqueFor` value, which on
the Redis and database drivers produces a lock that never expires. If the
worker is killed mid-fire (OOM, SIGKILL, server crash), the lock
persists and blocks all future dispatches of `UpdateApps` or `ProcessApps`
until the cache entry is manually cleared.
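The failure mode above can be sketched without Laravel at all. The snippet below is a minimal stand-in for a unique-job lock store (the `acquireLock` helper and array store are illustrative, not framework code): with no TTL, a lock left behind by a killed worker blocks every later dispatch; with a TTL, it self-heals.

```php
<?php
// Minimal sketch (plain PHP, no Laravel) of why a unique-job lock needs
// a TTL. expiresAt = 0 mirrors a missing $uniqueFor ("never expires").

function acquireLock(array &$store, string $key, int $ttl, int $now): bool
{
    if (isset($store[$key])) {
        $expiresAt = $store[$key];
        if ($expiresAt === 0 || $expiresAt > $now) {
            return false; // lock still held, dispatch blocked
        }
    }
    $store[$key] = $ttl === 0 ? 0 : $now + $ttl;
    return true;
}

$store = [];

// No TTL: worker crashes after acquiring, release never runs.
assert(acquireLock($store, 'updateApps', 0, 100) === true);
assert(acquireLock($store, 'updateApps', 0, 999_999) === false); // blocked forever

// With a 600 s TTL the same crash self-heals after 10 minutes.
$store = [];
assert(acquireLock($store, 'updateApps', 600, 100) === true);
assert(acquireLock($store, 'updateApps', 600, 400) === false); // inside TTL
assert(acquireLock($store, 'updateApps', 600, 800) === true);  // expired, recovered
```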
Neither job sets `$tries`, `$backoff`, or `$timeout`, so they inherit the
worker command's defaults (1 attempt for `queue:work`, 0 for `vapor:work`),
which vary by platform and are brittle to rely on.
This change adds:
- `$uniqueFor = 3600`: lock expires after 1 hour
- `$tries = 3`: hard cap across worker restarts
- `$backoff = [30, 60, 120]`: pace retries to reduce GitHub API load
- `$timeout = 60`: bound per-attempt wall-clock time
- `failed(Throwable)`: log permanent failures (and preserve the existing
  `Cache::lock('updateApps')->forceRelease()` on `UpdateApps`)
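The properties listed above can be sketched on a minimal stand-in class. This is a sketch only: the interfaces are stubbed so it runs outside Laravel (in the real job they come from `Illuminate\Contracts\Queue`), and note that later commits in this PR revise some of these values.

```php
<?php
// Stubbed so this runs without Laravel; illustrative, not Heimdall code.
interface ShouldQueue {}
interface ShouldBeUnique {}

class UpdateApps implements ShouldQueue, ShouldBeUnique
{
    public int $uniqueFor = 3600;          // unique lock expires after 1 hour
    public int $tries = 3;                 // hard cap across worker restarts
    public array $backoff = [30, 60, 120]; // seconds to wait between retries
    public int $timeout = 60;              // per-attempt wall-clock bound

    public function failed(Throwable $exception): void
    {
        // The real job also calls Cache::lock('updateApps')->forceRelease()
        // and Log::error(); omitted here to stay framework-free.
    }
}

$job = new UpdateApps();
assert($job->backoff === [30, 60, 120]);
```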
A new test (`tests/Feature/QueueSafetyTest.php`) asserts the retry
properties are present.
All existing tests still pass.
The test only asserted property values and `method_exists`. The diff itself shows the property values, so the test added zero confidence beyond reading the diff. Heimdall is an app, not a package; its existing tests are HTTP feature tests against user behavior, not class-property assertions. Removing this file keeps the PR aligned with the existing test conventions.
…koff

Most failures for these jobs are GitHub API rate-limit responses; retries inside the same window do not help, so one attempt is enough and `$backoff` has nothing to pace.

`$timeout` would clip the intentionally throttled `handle()` loop (`sleep(1)` per app) below realistic workloads; heavy users with 60+ apps would lose updates mid-cycle. The original code left `$timeout` unset, and `UpdateApps` is dispatched via `dispatchAfterResponse()`, which doesn't go through `queue:work` at all (the queue `$timeout` is irrelevant in that path). Letting the operator's worker config govern is more honest.

`$uniqueFor` is reduced to 600 (10 min): with `$tries = 1` and a worker-governed timeout there is no long retry chain to outlive. The lock self-heals 10 minutes after a crashed worker.
Previously logged only the exception message. Adds:
- `exception_class`: distinguishes `ClientException` vs `ConnectException` vs other Guzzle/PHP failure types at a glance
- `file`: file path and line where the exception was raised, useful for distinguishing "failed inside Guzzle" from "failed inside our code path"

The exception message itself often contains the GitHub API URL, which encodes the app identifier. Capturing the specific appid at the moment of failure would require touching `handle()` to track the current iteration; left as a follow-up.
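The context assembly above, plus the per-app follow-up it mentions, can be sketched in plain PHP. Everything here is illustrative (`failureContext`, `AppUpdateLoop`, and `$currentAppId` are invented names, not Heimdall code): the loop remembers the app currently being processed so a failure handler could log it.

```php
<?php
// Sketch only: build the log context from a caught exception, and track
// the current app on the instance so failed() could reference it.
function failureContext(Throwable $e, ?string $appId = null): array
{
    return [
        'exception_class'   => $e::class,
        'exception_message' => $e->getMessage(),
        'file'              => $e->getFile() . ':' . $e->getLine(),
        'appid'             => $appId, // null when tracking isn't wired up
    ];
}

class AppUpdateLoop
{
    public ?string $currentAppId = null;

    /** @param array<string, callable(): void> $fetchers keyed by app id */
    public function run(array $fetchers): void
    {
        foreach ($fetchers as $appId => $fetch) {
            $this->currentAppId = $appId; // remember before the risky call
            $fetch();                     // e.g. a GitHub API request
        }
        $this->currentAppId = null;       // completed cleanly
    }
}

// Simulate a failure on the second app and capture the context.
$loop = new AppUpdateLoop();
try {
    $loop->run([
        'app-1' => fn () => null,
        'app-2' => fn () => throw new RuntimeException('403 rate limit exceeded'),
    ]);
} catch (Throwable $e) {
    $ctx = failureContext($e, $loop->currentAppId);
}
```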
Tests that `UpdateApps::failed()` and `ProcessApps::failed()`:
- call `Log::error` with the 'permanently failed' message
- include `exception_class`, `exception_message`, and `file` context keys
- for `UpdateApps`, still call `Cache::lock('updateApps')->forceRelease()`
Uses `Log::spy()` and Mockery to capture facade calls. 2 tests, 4
assertions, green on PHP 8.4.20. See follow-up commit for why this
lands in history but not in the shipped suite.
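A framework-free analogue of the spy pattern described above shows what those assertions checked. This is a sketch: `FakeLog` and `runFailedHook` stand in for the `Log` facade and the job's `failed()` hook; neither is Laravel or Heimdall code.

```php
<?php
// Sketch only: capture "facade" calls with a hand-rolled spy and assert
// on the message and context keys, mirroring the removed test's checks.
class FakeLog
{
    /** @var array<int, array{string, array}> */
    public array $errors = [];

    public function error(string $message, array $context = []): void
    {
        $this->errors[] = [$message, $context];
    }
}

// A failed() hook shaped like the one in this PR, writing to the fake.
function runFailedHook(FakeLog $log, Throwable $e): void
{
    $log->error('UpdateApps permanently failed', [
        'exception_class'   => $e::class,
        'exception_message' => $e->getMessage(),
        'file'              => $e->getFile() . ':' . $e->getLine(),
    ]);
}

$log = new FakeLog();
runFailedHook($log, new RuntimeException('boom'));

[$message, $context] = $log->errors[0];
assert(str_contains($message, 'permanently failed'));
assert(isset($context['exception_class'], $context['exception_message'], $context['file']));
```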
Reproduction-style behavior test, not a regression test, so it does not belong in the main suite. The full test remains in this branch's commit history at 56c53ab for reviewers:

`git checkout 56c53ab -- tests/Feature/QueueFailedHandlerTest.php`
`vendor/bin/phpunit --filter QueueFailedHandlerTest`

This keeps the PR aligned with Heimdall's existing test conventions (HTTP-feature tests, no facade-mocking unit tests).
Problem
Both `UpdateApps` and `ProcessApps` implement `ShouldBeUnique` but do not set `$uniqueFor`, which on Redis and database cache drivers produces a lock with no TTL. If the worker is killed mid-`handle()` (OOM, SIGKILL, server crash, container restart), the lock persists and blocks every future dispatch of that job class until the cache entry is manually cleared.

For Heimdall specifically: one low-memory container restart during an `UpdateApps` run silently disables background app-update checks. The symptom visible to the user is "my dashboard stopped noticing app updates", with no obvious connection to the queue lock.

Neither job sets `$tries`, so they inherit whatever the worker command was started with. On `queue:work` that is 1 attempt (safe). Homelab supervisors vary widely; relying on the inherited default is fragile.

Proposed change
Add three lines on each job:

```diff
 class UpdateApps implements ShouldQueue, ShouldBeUnique
 {
     use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

+    public int $tries = 1;
+    public int $uniqueFor = 600;

     public function handle(): void { /* unchanged */ }

     public function failed(Throwable $exception): void
     {
         Cache::lock('updateApps')->forceRelease();
+        Log::error(static::class . ' permanently failed', [
+            'exception_class' => $exception::class,
+            'exception_message' => $exception->getMessage(),
+            'file' => $exception->getFile() . ':' . $exception->getLine(),
+        ]);
     }
 }
```

Identical shape on `ProcessApps`, plus a new `failed()` method (it doesn't have one).

Why these values
- `$tries = 1`: Most failures here are GitHub API rate-limit responses; retries inside the same window do not help. One attempt is enough. (`$backoff` would have nothing to pace.)
- `$timeout`: `handle()` is intentionally throttled at 1 second per app (`sleep(1)` in the loop). A 60-app user takes 60+ seconds, so setting `$timeout = 60` would clip mid-loop, and setting it higher than 90 conflicts with the default `retry_after`. The original code left `$timeout` unset, and that is correct for this design. `UpdateApps` is also dispatched via `dispatchAfterResponse()`, which doesn't go through `queue:work` at all, so the queue `$timeout` is irrelevant in that path.
- `$uniqueFor = 600` (10 minutes): Long enough that the lock outlives any plausible single attempt, short enough that a dead-worker scenario self-heals in 10 minutes without operator intervention. With `$tries = 1` there is no retry chain to cover.
- `failed()`: Logs the permanent failure event with the exception class, message, and file:line so an operator can distinguish a Guzzle `ClientException` from a `ConnectException` at a glance. For `UpdateApps` the existing `Cache::lock('updateApps')->forceRelease()` is preserved; the log call is added on top.

The log does not capture which specific app was being processed at the moment of failure, since `handle()` iterates over all apps in a single call and the spec for this PR avoids touching `handle()`. The Guzzle exception message often contains the GitHub API URL, which encodes the appid, so the appid is usually recoverable from the message itself. If maintainers want explicit per-app context, that is a small follow-up: track the current app on the instance in the loop and reference it in `failed()`.

Test plan
No new tests ship with the `failed()` method extension. Behaviour around `$uniqueFor`, `$tries`, and the `failed()` lifecycle is already covered by the Laravel framework's own tests.
This fix comes from a queue-safety audit of public Laravel projects: https://joshsalway.com/articles/improving-queue-safety-in-laravel.

Laravel releases the `ShouldBeUnique` lock when the job completes or finally fails. If neither happens (worker killed mid-fire), the lock has no TTL fallback unless `$uniqueFor` is set.