
out_s3: fix retry_limit semantics and default #11661

Closed
singholt wants to merge 2 commits into fluent:master from singholt:fix/s3-retry-limit-off-by-one

Conversation

@singholt
Contributor

@singholt singholt commented Apr 2, 2026

Summary

The S3 plugin's internal retry tracking uses >= to compare chunk->failures against retry_limit, while the engine uses > semantics (where retry_limit=N means N retries after the initial attempt). This mismatch means retry_limit=1 (the engine default) results in 0 retries — chunks are discarded after a single failure with no retry.

This was introduced in f4108db which changed the hardcoded MAX_UPLOAD_ERRORS (5) to use ctx->ins->retry_limit. Since the engine default for retry_limit is 1 (not -1), the S3 plugin's override to 5 never kicks in, and the >= comparison means chunks get only 1 total attempt.
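The off-by-one can be illustrated with a small standalone sketch. This is illustrative only: total_attempts() and its loop are hypothetical, not the plugin's actual code, which tracks chunk->failures across timer-driven upload attempts; the sketch only models the comparison semantics described above.

```c
/* Hypothetical model of the retry comparison. With engine semantics,
 * retry_limit = N means N retries after the initial attempt, i.e.
 * N + 1 total attempts. */
static int total_attempts(int retry_limit, int use_strict_greater)
{
    int failures = 0;
    int attempts = 0;

    for (;;) {
        attempts++;      /* every attempt fails in this sketch */
        failures++;
        if (use_strict_greater ? (failures > retry_limit)
                               : (failures >= retry_limit)) {
            break;       /* chunk is discarded */
        }
    }
    return attempts;
}
```

With retry_limit=1, the old >= check breaks out after a single attempt (0 retries), while the > check allows a second attempt (1 retry), matching the engine's semantics.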

Details

  • Change all five >= comparisons to > in put_all_chunks, get_upload, cb_s3_upload, and cb_s3_flush so retry_limit=N allows N+1 total attempts, matching engine semantics
  • Add s3_plugin_use_mocks() to decouple mock HTTP responses from the unit_test_flush bypass, enabling tests that exercise the real flush path with mock S3 calls
  • Add mock call counter (mock_s3_call_increment_counter) for test observability
  • Skip the 6-second timer floor in mock mode for faster test execution
  • Add putobject_retry_limit_semantics test that asserts exactly 2 PutObject attempts with retry_limit=1
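The env-var-backed mock counter can be sketched as follows. This is a minimal illustration under assumptions: only the TEST_<API>_CALL_COUNT variable name comes from this PR, and mock_call_increment() here is a stand-in for the actual mock_s3_call_increment_counter() in plugins/out_s3/s3.c, whose body may differ. setenv() is POSIX, matching the plugin's target platforms.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read TEST_<api>_CALL_COUNT from the environment, increment it, and
 * write it back so a runtime test can assert on the final count. */
static void mock_call_increment(const char *api)
{
    char key[128];
    char val[16];
    const char *cur;
    int count = 0;

    snprintf(key, sizeof(key), "TEST_%s_CALL_COUNT", api);
    cur = getenv(key);
    if (cur != NULL) {
        count = atoi(cur);
    }
    snprintf(val, sizeof(val), "%d", count + 1);
    setenv(key, val, 1);
}
```

Storing the counter in the environment (rather than in plugin state) lets the test harness observe attempt counts from outside the plugin, which is what the new runtime test relies on.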

Testing

  • Example configuration: any S3 output with default Retry_Limit 1 (engine default)
  • New unit test putobject_retry_limit_semantics verifies exact retry count
  • Existing S3 tests continue to pass (they use FLB_S3_PLUGIN_UNDER_TEST which implies s3_plugin_use_mocks())
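A minimal configuration exercising the fixed path might look like the fragment below. The bucket name and region are placeholders; Retry_Limit 1 is the engine default, so omitting the line gives the same behavior.

```
[OUTPUT]
    Name           s3
    Match          *
    bucket         my-example-bucket
    region         us-east-1
    Retry_Limit    1
    upload_timeout 6s
```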

Documentation

  • [N/A]

Backporting

  • [N/A]

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Tightened S3 retry/discard thresholds so uploads are retried and given up at more precise points.
    • Adjusted initialization timer clamping so test-mode no longer overrides configured timings.
    • Removed a test-only flush path to standardize upload/flush behavior.
  • Tests

    • Added runtime test validating S3 put-object retry-limit semantics; enabled upload timeout variants and ensured test-mode env cleanup.
    • Added a test-visible counter to observe mocked S3 call attempts.

@coderabbitai

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Changed S3 output plugin retry/discard semantics (comparison from >= to >) across upload flows, added an environment-variable-backed per-API mock call counter, conditioned timer_ms clamping when in test mode, removed a unit-test-only flush path, and added a runtime test that verifies PutObject retry counting.

Changes

Cohort / File(s) Summary
S3 Plugin Core
plugins/out_s3/s3.c
Added mock_s3_call_increment_counter() to update TEST_<API>_CALL_COUNT; made ctx->timer_ms min/max clamping conditional when s3_plugin_under_test() is false; changed retry/discard checks from >= retry_limit to > retry_limit; removed unit_test_flush() and its special-case branch.
Runtime Tests
tests/runtime/out_s3.c
Added flb_test_s3_putobject_retry_limit_semantics() (mock mode + forced PutObject errors, Retry_Limit=1, upload_timeout=6s) asserting TEST_PutObject_CALL_COUNT == 2; added teardowns unsetting FLB_S3_PLUGIN_UNDER_TEST/TEST_* env vars; registered new test in TEST_LIST.

Sequence Diagram(s)

(Skipped — changes are localized plugin logic and tests; no new multi-component sequential flow requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

  • #10825: Modifies out_s3 retry/discard comparison logic and per-instance retry behavior — touches similar retry thresholds in plugins/out_s3/s3.c.

Suggested reviewers

  • PettitWesley
  • cosmo0920
  • sparrc

Poem

"I nibble counters one by one, 🥕
Mocked PutObject until it's done,
From >= to > I hop with glee,
Retries counted — two, we see!
A rabbit cheers: 'Tests all green for me!'" 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 11.11%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'out_s3: fix retry_limit semantics and default' directly addresses the main objective: fixing the off-by-one error in retry_limit comparisons by changing >= to > operators, which aligns the plugin behavior with engine semantics.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from ff256a6 to deb8f18 Compare April 2, 2026 23:28
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from deb8f18 to a2e917e Compare April 2, 2026 23:29
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from a2e917e to 980b66d Compare April 2, 2026 23:30
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 980b66d to 776e7e3 Compare April 2, 2026 23:32
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 776e7e3 to e8566d6 Compare April 2, 2026 23:35
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from e8566d6 to 1d35352 Compare April 2, 2026 23:38
@singholt singholt changed the title from "Fix/s3 retry limit off by one" to "out_s3: fix off-by-one in retry_limit comparison" Apr 2, 2026
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 1d35352 to fb42b46 Compare April 2, 2026 23:46
@singholt singholt marked this pull request as ready for review April 3, 2026 01:50
@singholt singholt requested a review from a team as a code owner April 3, 2026 01:50

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c7614b4c0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 9c7614b to df38a2f Compare April 3, 2026 15:59
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from df38a2f to dc935ee Compare April 3, 2026 16:02
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch 2 times, most recently from 0b19300 to c8fccb6 Compare April 3, 2026 16:04
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from c8fccb6 to f2ffe3c Compare April 3, 2026 16:07
The S3 plugin used >= to compare chunk failures against retry_limit,
which meant retry_limit=1 resulted in 0 retries (1 total attempt).
This is inconsistent with the engine's retry semantics where
retry_limit=N means N retries after the initial attempt.

Change all five failure/error comparisons from >= to > so that
retry_limit=N correctly allows N retries (N+1 total attempts).

Remove the unit_test_flush bypass so tests exercise the real
flush path. Add a mock call counter and skip the 6s timer floor
in test mode for faster test execution.

Co-authored-by: Thean Lim <theanlim@amazon.com>
Signed-off-by: Anuj Singh <singholt@amazon.com>
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from f2ffe3c to d740fef Compare April 3, 2026 16:12

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)

3843-3848: ⚠️ Potential issue | 🔴 Critical

The default ordered-upload path still drops after the first failure.

This hunk fixes the direct discard check, but cb_s3_flush() still routes ready files into s3_upload_queue() when preserve_data_ordering is enabled. Since that option defaults to true at Line 4127, this is still the common path. add_to_queue() starts retry_counter at 0 (Line 1792), Line 1946 increments it before the limit check, and Line 1947 still uses >=. With retry_limit=1, the first failed upload sets the counter to 1 and the queue entry is discarded immediately, so ordered uploads still get zero retries. The new runtime test in tests/runtime/out_s3.c:457-519 only exercises the timer-driven cb_s3_upload() path, so it will not catch this.

Suggested follow-up in s3_upload_queue()
-            if (upload_contents->retry_counter >= ctx->ins->retry_limit) {
+            if (upload_contents->retry_counter > ctx->ins->retry_limit) {

A regression test that forces the queued path would help keep this fixed.

As per coding guidelines, "Before patching: trace one full path for affected signals (input -> chunk -> task -> output -> engine completion) ... verify metrics/counters for success, retry, and drop paths."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/out_s3/s3.c` around lines 3843 - 3848, The ordered-upload path still
discards queued entries after the first failure because add_to_queue()
initializes retry_counter to 0, s3_upload_queue()/cb_s3_flush() route files into
the queue when preserve_data_ordering is true, and s3_upload_queue() increments
retry_counter before checking against retry_limit using >=; update
s3_upload_queue() (and/or add_to_queue() logic) so that retry counting matches
the intended semantics: initialize or treat retry_counter such that the first
retry is allowed (e.g., check > rather than >= or increment after the limit
check), ensure the check uses retry_limit consistently, and adjust
cb_s3_flush()/cb_s3_upload() to route failed files to s3_upload_queue() without
causing an immediate drop; also add/update a regression test that forces the
queued path when preserve_data_ordering is true to verify success/retry/drop
metrics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e19e405c-9eba-438f-9837-e1cba2e7cc9d

📥 Commits

Reviewing files that changed from the base of the PR and between df38a2f and f2ffe3c.

📒 Files selected for processing (2)
  • plugins/out_s3/s3.c
  • tests/runtime/out_s3.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/runtime/out_s3.c


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/runtime/out_s3.c (1)

38-50: ⚠️ Potential issue | 🟠 Major

These waits are still shorter than the configured timeout.

plugins/out_s3/s3.c only uploads timed-out buffered files after the check at Line 3320 passes. With upload_timeout="6s", the first timer-driven attempt cannot happen until after roughly 6 seconds, so the sleep(2) cases never reach the mocked S3 path, and the new retry-limit test's sleep(4) is still too short to observe two background PutObject attempts. Right now these cases can pass without exercising the intended flush/retry logic, and putobject_retry_limit_semantics may only be observing shutdown behavior from cb_s3_exit(). Please wait past the timeout/retry window or add a shorter test-only timeout path. As per coding guidelines, "Use tests/runtime tests for validating plugin-level behavior and end-to-end semantics including encoder/decoder paths".

Also applies to: 76-88, 115-127, 154-165, 191-202, 228-239, 265-276, 303-314, 339-350, 377-388, 413-424, 451-462, 495-526

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/runtime/out_s3.c` around lines 38 - 50, The test waits in
tests/runtime/out_s3.c are shorter than the configured upload_timeout ("6s") so
the S3 timer-driven upload/retry logic never runs; update the test to wait long
enough to observe the timer and retry window (e.g., sleep past 6s plus retry
intervals) or add a test-only shorter timeout path; specifically adjust the
sleeps after flb_lib_push (and the analogous sleep() calls in the other cases)
so the sequence starting from flb_output_set(...,"upload_timeout","6s",...)
triggers the plugin's timer-driven upload and the putobject retry attempts
assessed by the putobject_retry_limit_semantics test, rather than relying on
shutdown behavior in cb_s3_exit().
plugins/out_s3/s3.c (1)

3345-3350: ⚠️ Potential issue | 🟠 Major

The ordered-queue path still uses the old off-by-one.

These > fixes cover the direct/timer-driven paths, but s3_upload_queue() still discards on upload_contents->retry_counter >= ctx->ins->retry_limit at Line 1944. When a timed-out/full file goes through that preserve-order queue, Retry_Limit=1 still means “drop after the first failed upload” instead of “allow one retry.” The new single-chunk runtime test never enters this path, so it stays uncovered.

🐛 Minimal follow-up
-            if (upload_contents->retry_counter >= ctx->ins->retry_limit) {
+            if (upload_contents->retry_counter > ctx->ins->retry_limit) {

Also applies to: 3840-3845

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/out_s3/s3.c` around lines 3345 - 3350, The ordered-queue path in
s3_upload_queue() is using an off-by-one check (upload_contents->retry_counter
>= ctx->ins->retry_limit) which drops uploads one attempt too early; change that
comparison to use '>' so it matches the other paths (e.g., chunk->failures >
ctx->ins->retry_limit) and thereby allow exactly ctx->ins->retry_limit retries.
Update the analogous checks in s3_upload_queue() and the other occurrence noted
around the 3840–3845 region to use '>' against ctx->ins->retry_limit (refer to
upload_contents->retry_counter, s3_upload_queue(), and ctx->ins->retry_limit to
locate the fixes).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a50eeeb6-2726-4cc9-bca0-18525bf1988d

📥 Commits

Reviewing files that changed from the base of the PR and between f2ffe3c and 4ae1b15.

📒 Files selected for processing (2)
  • plugins/out_s3/s3.c
  • tests/runtime/out_s3.c

Verify retry_limit=1 results in exactly 2 PutObject attempts
(1 initial + 1 retry) using the plugin's internal retry path.

Add unsetenv for FLB_S3_PLUGIN_UNDER_TEST in all existing tests
to prevent env var leaking across tests in the same process.

Set upload_timeout to 6s in all tests for consistent 1s timer
ticks in test mode.

Co-authored-by: Thean Lim <theanlim@amazon.com>
Signed-off-by: Anuj Singh <singholt@amazon.com>
@singholt
Contributor Author

singholt commented Apr 3, 2026

Superseded by #11669

