Skip to content

Fix intermittent test Test_Client_Maintenance/JobRescuer#869

Merged
brandur merged 1 commit intomasterfrom
brandur-fix-maintenance-job-rescuer
Apr 29, 2025
Merged

Fix intermittent test Test_Client_Maintenance/JobRescuer#869
brandur merged 1 commit intomasterfrom
brandur-fix-maintenance-job-rescuer

Conversation

@brandur
Copy link
Contributor

@brandur brandur commented Apr 28, 2025

Fix an intermittent test for the client maintenance services observed
from time to time [1].

--- FAIL: Test_Client_Maintenance (0.00s)
    --- FAIL: Test_Client_Maintenance/JobRescuer (4.56s)
        client_test.go:3866: Generated schema "river_test_2025_04_26t22_05_48_schema_21" with migrations [1 2 3 4 5 6] on line "main" in 1.349501729s [21 generated] [14 reused]
        logger.go:256: time=2025-04-26T22:05:52.463Z level=INFO msg="River client started" client_id=fv-az2241-783_2025_04_26T22_05_52_429607
        logger.go:256: time=2025-04-26T22:05:52.638Z level=ERROR msg="maintenance.JobRescuer: Attempted to rescue unhandled job kind, discarding" job_kind=fake_job job_id=6
        logger.go:256: time=2025-04-26T22:05:53.228Z level=INFO msg="maintenance.JobRescuer: Ran successfully" num_jobs_discarded=1 num_jobs_retry_scheduled=2
        logger.go:256: time=2025-04-26T22:05:53.253Z level=INFO msg="maintenance.JobScheduler: Ran successfully" num_jobs_scheduled=2
        client_test.go:3912:
                Error Trace:	/home/runner/work/river/river/client_test.go:3903
                                                        /home/runner/work/river/river/client_test.go:3912
                Error:      	Not equal:
                                expected: "retryable"
                                actual  : "completed"

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,2 +1,2 @@
                                -(rivertype.JobState) (len=9) "retryable"
                                +(rivertype.JobState) (len=9) "completed"

                Test:       	Test_Client_Maintenance/JobRescuer
        logger.go:256: time=2025-04-26T22:05:55.631Z level=INFO msg="River client stopped" client_id=fv-az2241-783_2025_04_26T22_05_52_429607
        riverdbtest.go:277: Checked in schema "river_test_2025_04_26t22_05_48_schema_21"; 1 idle schema(s) [22 generated] [15 reused]
FAIL
FAIL	github.com/riverqueue/river	32.635s

This one's similar to #852 in that it's possible for test case goroutine
to be paused long enough (really only happens in GitHub Actions CI) that
a job it moved to retryable actually becomes eligible to be worked,
and may actually be worked in the interim before the test case picks up
again to assert on it.

The solution is to use the new RetryPolicySlow introduced in #852 to
make sure that retryable jobs are scheduled far out in the future and
never at risk of really being retried in the test case.

We also make the slow retry interval's default a little longer. The test
sets job's attempted_at (which its next scheduled_at will be based
on) an hour in the past which is exactly what the slow retry interval
was, making jobs still potentially immediately retryable. Make the slow
interval two hours so contention is never possible.

The good news is that this is the last intermittent test I've been
tracking. I'm hoping that it along with the fixes from Saturday might
be enough to take care of our intermittency problems for a while.

[1] https://github.com/riverqueue/river/actions/runs/14685519857/job/41213612537

@brandur brandur marked this pull request as draft April 28, 2025 07:39
@brandur
Copy link
Contributor Author

brandur commented Apr 28, 2025

Sigh. Didn't fix it.

@brandur brandur force-pushed the brandur-fix-maintenance-job-rescuer branch from 9a0390c to 860adeb Compare April 29, 2025 00:51
@brandur brandur marked this pull request as ready for review April 29, 2025 00:51
@brandur brandur force-pushed the brandur-fix-maintenance-job-rescuer branch from 860adeb to 307bbdb Compare April 29, 2025 00:54
Fix an intermittent test for the client maintenance services observed
from time to time [1].

    --- FAIL: Test_Client_Maintenance (0.00s)
        --- FAIL: Test_Client_Maintenance/JobRescuer (4.56s)
            client_test.go:3866: Generated schema "river_test_2025_04_26t22_05_48_schema_21" with migrations [1 2 3 4 5 6] on line "main" in 1.349501729s [21 generated] [14 reused]
            logger.go:256: time=2025-04-26T22:05:52.463Z level=INFO msg="River client started" client_id=fv-az2241-783_2025_04_26T22_05_52_429607
            logger.go:256: time=2025-04-26T22:05:52.638Z level=ERROR msg="maintenance.JobRescuer: Attempted to rescue unhandled job kind, discarding" job_kind=fake_job job_id=6
            logger.go:256: time=2025-04-26T22:05:53.228Z level=INFO msg="maintenance.JobRescuer: Ran successfully" num_jobs_discarded=1 num_jobs_retry_scheduled=2
            logger.go:256: time=2025-04-26T22:05:53.253Z level=INFO msg="maintenance.JobScheduler: Ran successfully" num_jobs_scheduled=2
            client_test.go:3912:
                    Error Trace:	/home/runner/work/river/river/client_test.go:3903
                                                            /home/runner/work/river/river/client_test.go:3912
                    Error:      	Not equal:
                                    expected: "retryable"
                                    actual  : "completed"

                                    Diff:
                                    --- Expected
                                    +++ Actual
                                    @@ -1,2 +1,2 @@
                                    -(rivertype.JobState) (len=9) "retryable"
                                    +(rivertype.JobState) (len=9) "completed"

                    Test:       	Test_Client_Maintenance/JobRescuer
            logger.go:256: time=2025-04-26T22:05:55.631Z level=INFO msg="River client stopped" client_id=fv-az2241-783_2025_04_26T22_05_52_429607
            riverdbtest.go:277: Checked in schema "river_test_2025_04_26t22_05_48_schema_21"; 1 idle schema(s) [22 generated] [15 reused]
    FAIL
    FAIL	github.com/riverqueue/river	32.635s

This one's similar to #852 in that it's possible for test case goroutine
to be paused long enough (really only happens in GitHub Actions CI) that
a job it moved to `retryable` actually becomes eligible to be worked,
and may actually be worked in the interim before the test case picks up
again to assert on it.

The solution is to use the new `RetryPolicySlow` introduced in #852 to
make sure that retryable jobs are scheduled far out in the future and
never at risk of really being retried in the test case.

We also make the slow retry interval's default a little longer. The test
sets job's `attempted_at` (which its next `scheduled_at` will be based
on) an hour in the past which is exactly what the slow retry interval
was, making jobs still potentially immediately retryable. Make the slow
interval two hours so contention is never possible.

The good news is that this is the last intermittent test I've been
tracking. I'm hoping that it along with the fixes from Saturday might
be enough to take care of our intermittency problems for a while.

[1] https://github.com/riverqueue/river/actions/runs/14685519857/job/41213612537
@brandur brandur force-pushed the brandur-fix-maintenance-job-rescuer branch from 307bbdb to faafe6a Compare April 29, 2025 00:56
@brandur brandur requested a review from bgentry April 29, 2025 00:58
@brandur
Copy link
Contributor Author

brandur commented Apr 29, 2025

@bgentry Okay, problem turned out to be because the test was setting attempted_at an hour in the past, so adding the slow retry policy's 1 hour interval was bringing the next retry back the current time, so the job would still have a chance of running immediately and failing the test. I changed the slow retry policy's interval to two hours and that should fix it. Seem okay?

Copy link
Contributor

@bgentry bgentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, nice work again 🕵️‍♂️

@brandur
Copy link
Contributor Author

brandur commented Apr 29, 2025

ty ty. Hopefully the last one!

@brandur brandur merged commit 68d6625 into master Apr 29, 2025
10 checks passed
@brandur brandur deleted the brandur-fix-maintenance-job-rescuer branch April 29, 2025 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants