out_s3: fix retry_limit semantics and default #11661
singholt wants to merge 2 commits into fluent:master
Conversation
📝 Walkthrough

Changed S3 output plugin retry/discard semantics (comparison from >= to >) across upload flows, added an environment-variable-backed per-API mock call counter, conditioned timer_ms clamping when in test mode, removed a unit-test-only flush path, and added a runtime test that verifies PutObject retry counting.

Sequence diagram(s): skipped — changes are localized plugin logic and tests; no new multi-component sequential flow requiring visualization.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c7614b4c0
The S3 plugin used >= to compare chunk failures against retry_limit, which meant retry_limit=1 resulted in 0 retries (1 total attempt). This is inconsistent with the engine's retry semantics, where retry_limit=N means N retries after the initial attempt. Change all five failure/error comparisons from >= to > so that retry_limit=N correctly allows N retries (N+1 total attempts).

Remove the unit_test_flush bypass so tests exercise the real flush path. Add a mock call counter and skip the 6s timer floor in test mode for faster test execution.

Co-authored-by: Thean Lim <theanlim@amazon.com>
Signed-off-by: Anuj Singh <singholt@amazon.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)
3843-3848: ⚠️ Potential issue | 🔴 Critical — The default ordered-upload path still drops after the first failure.

This hunk fixes the direct discard check, but cb_s3_flush() still routes ready files into s3_upload_queue() when preserve_data_ordering is enabled. Since that option defaults to true at line 4127, this is still the common path. add_to_queue() starts retry_counter at 0 (line 1792), line 1946 increments it before the limit check, and line 1947 still uses >=. With retry_limit=1, the first failed upload sets the counter to 1 and the queue entry is discarded immediately, so ordered uploads still get zero retries. The new runtime test in tests/runtime/out_s3.c:457-519 only exercises the timer-driven cb_s3_upload() path, so it will not catch this.

Suggested follow-up in s3_upload_queue():

```diff
-    if (upload_contents->retry_counter >= ctx->ins->retry_limit) {
+    if (upload_contents->retry_counter > ctx->ins->retry_limit) {
```

A regression test that forces the queued path would help keep this fixed.

As per coding guidelines: "Before patching: trace one full path for affected signals (input -> chunk -> task -> output -> engine completion) ... verify metrics/counters for success, retry, and drop paths."
📒 Files selected for processing (2)
- plugins/out_s3/s3.c
- tests/runtime/out_s3.c
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/runtime/out_s3.c
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/runtime/out_s3.c (1)
38-50: ⚠️ Potential issue | 🟠 Major — These waits are still shorter than the configured timeout.

plugins/out_s3/s3.c only uploads timed-out buffered files after the check at line 3320 passes. With upload_timeout="6s", the first timer-driven attempt cannot happen until after roughly 6 seconds, so the sleep(2) cases never reach the mocked S3 path, and the new retry-limit test's sleep(4) is still too short to observe two background PutObject attempts. Right now these cases can pass without exercising the intended flush/retry logic, and putobject_retry_limit_semantics may only be observing shutdown behavior from cb_s3_exit(). Please wait past the timeout/retry window or add a shorter test-only timeout path. As per coding guidelines, "Use tests/runtime tests for validating plugin-level behavior and end-to-end semantics including encoder/decoder paths".

Also applies to: 76-88, 115-127, 154-165, 191-202, 228-239, 265-276, 303-314, 339-350, 377-388, 413-424, 451-462, 495-526
plugins/out_s3/s3.c (1)
3345-3350: ⚠️ Potential issue | 🟠 Major — The ordered-queue path still uses the old off-by-one.

These > fixes cover the direct/timer-driven paths, but s3_upload_queue() still discards on upload_contents->retry_counter >= ctx->ins->retry_limit at line 1944. When a timed-out/full file goes through that preserve-order queue, Retry_Limit=1 still means "drop after the first failed upload" instead of "allow one retry." The new single-chunk runtime test never enters this path, so it stays uncovered.

🐛 Minimal follow-up:

```diff
-    if (upload_contents->retry_counter >= ctx->ins->retry_limit) {
+    if (upload_contents->retry_counter > ctx->ins->retry_limit) {
```

Also applies to: 3840-3845
📒 Files selected for processing (2)
- plugins/out_s3/s3.c
- tests/runtime/out_s3.c
Verify that retry_limit=1 results in exactly 2 PutObject attempts (1 initial + 1 retry) using the plugin's internal retry path. Add unsetenv for FLB_S3_PLUGIN_UNDER_TEST in all existing tests to prevent the env var from leaking across tests in the same process. Set upload_timeout to 6s in all tests for consistent 1s timer ticks in test mode.

Co-authored-by: Thean Lim <theanlim@amazon.com>
Signed-off-by: Anuj Singh <singholt@amazon.com>
Superseded by #11669
Summary
The S3 plugin's internal retry tracking uses >= to compare chunk->failures against retry_limit, while the engine uses > semantics (where retry_limit=N means N retries after the initial attempt). This mismatch means retry_limit=1 (the engine default) results in 0 retries — chunks are discarded after a single failure with no retry.
This was introduced in f4108db which changed the hardcoded MAX_UPLOAD_ERRORS (5) to use ctx->ins->retry_limit. Since the engine default for retry_limit is 1 (not -1), the S3 plugin's override to 5 never kicks in, and the >= comparison means chunks get only 1 total attempt.
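Under the corrected semantics, a configuration like the following sketch (bucket and region values are hypothetical) would give each chunk retry_limit + 1 total upload attempts:

```
[OUTPUT]
    name         s3
    match        *
    bucket       my-bucket       # hypothetical
    region       us-east-1       # hypothetical
    retry_limit  2               # 2 retries after the initial attempt = 3 total
```

Leaving retry_limit unset falls back to the engine default of 1, i.e. one retry after the initial attempt.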
Details
Testing
Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.