Fix intermittent test `Test_Client_Maintenance/JobRescuer` by brandur · Pull Request #869 · riverqueue/river

brandur · 2025-04-28T07:34:50Z

Fix an intermittent test for the client maintenance services observed
from time to time [1].

--- FAIL: Test_Client_Maintenance (0.00s)
    --- FAIL: Test_Client_Maintenance/JobRescuer (4.56s)
        client_test.go:3866: Generated schema "river_test_2025_04_26t22_05_48_schema_21" with migrations [1 2 3 4 5 6] on line "main" in 1.349501729s [21 generated] [14 reused]
        logger.go:256: time=2025-04-26T22:05:52.463Z level=INFO msg="River client started" client_id=fv-az2241-783_2025_04_26T22_05_52_429607
        logger.go:256: time=2025-04-26T22:05:52.638Z level=ERROR msg="maintenance.JobRescuer: Attempted to rescue unhandled job kind, discarding" job_kind=fake_job job_id=6
        logger.go:256: time=2025-04-26T22:05:53.228Z level=INFO msg="maintenance.JobRescuer: Ran successfully" num_jobs_discarded=1 num_jobs_retry_scheduled=2
        logger.go:256: time=2025-04-26T22:05:53.253Z level=INFO msg="maintenance.JobScheduler: Ran successfully" num_jobs_scheduled=2
        client_test.go:3912:
                Error Trace:	/home/runner/work/river/river/client_test.go:3903
                                                        /home/runner/work/river/river/client_test.go:3912
                Error:      	Not equal:
                                expected: "retryable"
                                actual  : "completed"

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,2 +1,2 @@
                                -(rivertype.JobState) (len=9) "retryable"
                                +(rivertype.JobState) (len=9) "completed"

                Test:       	Test_Client_Maintenance/JobRescuer
        logger.go:256: time=2025-04-26T22:05:55.631Z level=INFO msg="River client stopped" client_id=fv-az2241-783_2025_04_26T22_05_52_429607
        riverdbtest.go:277: Checked in schema "river_test_2025_04_26t22_05_48_schema_21"; 1 idle schema(s) [22 generated] [15 reused]
FAIL
FAIL	github.com/riverqueue/river	32.635s

This one's similar to #852 in that it's possible for test case goroutine
to be paused long enough (really only happens in GitHub Actions CI) that
a job it moved to retryable actually becomes eligible to be worked,
and may actually be worked in the interim before the test case picks up
again to assert on it.

The solution is to use the new RetryPolicySlow introduced in #852 to
make sure that retryable jobs are scheduled far out in the future and
never at risk of really being retried in the test case.

We also make the slow retry interval's default a little longer. The test
sets job's attempted_at (which its next scheduled_at will be based
on) an hour in the past which is exactly what the slow retry interval
was, making jobs still potentially immediately retryable. Make the slow
interval two hours so contention is never possible.

The good news is that this is the last intermittent test I've been
tracking. I'm hoping that it along with the fixes from Saturday might
be enough to take care of our intermittency problems for a while.

[1] https://github.com/riverqueue/river/actions/runs/14685519857/job/41213612537

brandur · 2025-04-28T07:39:47Z

Sigh. Didn't fix it.

Fix an intermittent test for the client maintenance services observed from time to time [1]. --- FAIL: Test_Client_Maintenance (0.00s) --- FAIL: Test_Client_Maintenance/JobRescuer (4.56s) client_test.go:3866: Generated schema "river_test_2025_04_26t22_05_48_schema_21" with migrations [1 2 3 4 5 6] on line "main" in 1.349501729s [21 generated] [14 reused] logger.go:256: time=2025-04-26T22:05:52.463Z level=INFO msg="River client started" client_id=fv-az2241-783_2025_04_26T22_05_52_429607 logger.go:256: time=2025-04-26T22:05:52.638Z level=ERROR msg="maintenance.JobRescuer: Attempted to rescue unhandled job kind, discarding" job_kind=fake_job job_id=6 logger.go:256: time=2025-04-26T22:05:53.228Z level=INFO msg="maintenance.JobRescuer: Ran successfully" num_jobs_discarded=1 num_jobs_retry_scheduled=2 logger.go:256: time=2025-04-26T22:05:53.253Z level=INFO msg="maintenance.JobScheduler: Ran successfully" num_jobs_scheduled=2 client_test.go:3912: Error Trace: /home/runner/work/river/river/client_test.go:3903 /home/runner/work/river/river/client_test.go:3912 Error: Not equal: expected: "retryable" actual : "completed" Diff: --- Expected +++ Actual @@ -1,2 +1,2 @@ -(rivertype.JobState) (len=9) "retryable" +(rivertype.JobState) (len=9) "completed" Test: Test_Client_Maintenance/JobRescuer logger.go:256: time=2025-04-26T22:05:55.631Z level=INFO msg="River client stopped" client_id=fv-az2241-783_2025_04_26T22_05_52_429607 riverdbtest.go:277: Checked in schema "river_test_2025_04_26t22_05_48_schema_21"; 1 idle schema(s) [22 generated] [15 reused] FAIL FAIL github.com/riverqueue/river 32.635s This one's similar to #852 in that it's possible for test case goroutine to be paused long enough (really only happens in GitHub Actions CI) that a job it moved to `retryable` actually becomes eligible to be worked, and may actually be worked in the interim before the test case picks up again to assert on it. The solution is to use the new `RetryPolicySlow` introduced in #852 to make sure that retryable jobs are scheduled far out in the future and never at risk of really being retried in the test case. We also make the slow retry interval's default a little longer. The test sets job's `attempted_at` (which its next `scheduled_at` will be based on) an hour in the past which is exactly what the slow retry interval was, making jobs still potentially immediately retryable. Make the slow interval two hours so contention is never possible. The good news is that this is the last intermittent test I've been tracking. I'm hoping that it along with the fixes from Saturday might be enough to take care of our intermittency problems for a while. [1] https://github.com/riverqueue/river/actions/runs/14685519857/job/41213612537

brandur · 2025-04-29T00:59:40Z

@bgentry Okay, problem turned out to be because the test was setting attempted_at an hour in the past, so adding the slow retry policy's 1 hour interval was bringing the next retry back the current time, so the job would still have a chance of running immediately and failing the test. I changed the slow retry policy's interval to two hours and that should fix it. Seem okay?

bgentry

hah, nice work again 🕵️‍♂️

brandur · 2025-04-29T15:57:19Z

ty ty. Hopefully the last one!

brandur marked this pull request as draft April 28, 2025 07:39

brandur force-pushed the brandur-fix-maintenance-job-rescuer branch from 9a0390c to 860adeb Compare April 29, 2025 00:51

brandur marked this pull request as ready for review April 29, 2025 00:51

brandur force-pushed the brandur-fix-maintenance-job-rescuer branch from 860adeb to 307bbdb Compare April 29, 2025 00:54

brandur force-pushed the brandur-fix-maintenance-job-rescuer branch from 307bbdb to faafe6a Compare April 29, 2025 00:56

brandur requested a review from bgentry April 29, 2025 00:58

bgentry approved these changes Apr 29, 2025

View reviewed changes

brandur merged commit 68d6625 into master Apr 29, 2025
10 checks passed

brandur deleted the brandur-fix-maintenance-job-rescuer branch April 29, 2025 15:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix intermittent test `Test_Client_Maintenance/JobRescuer`#869

Fix intermittent test `Test_Client_Maintenance/JobRescuer`#869
brandur merged 1 commit intomasterfrom
brandur-fix-maintenance-job-rescuer

brandur commented Apr 28, 2025 •

edited

Loading

Uh oh!

brandur commented Apr 28, 2025

Uh oh!

brandur commented Apr 29, 2025

Uh oh!

bgentry left a comment

Uh oh!

brandur commented Apr 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brandur commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brandur commented Apr 28, 2025

Uh oh!

brandur commented Apr 29, 2025

Uh oh!

bgentry left a comment

Choose a reason for hiding this comment

Uh oh!

brandur commented Apr 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

brandur commented Apr 28, 2025 •

edited

Loading