
out_s3: fix retry_limit semantics and default #11661

Closed
singholt wants to merge 2 commits into fluent:master from singholt:fix/s3-retry-limit-off-by-one

Conversation

@singholt
Contributor

@singholt singholt commented Apr 2, 2026

Summary

The S3 plugin's internal retry tracking uses >= to compare chunk->failures against retry_limit, while the engine uses > semantics (where retry_limit=N means N retries after the initial attempt). This mismatch means retry_limit=1 (the engine default) results in 0 retries — chunks are discarded after a single failure with no retry.

This was introduced in f4108db which changed the hardcoded MAX_UPLOAD_ERRORS (5) to use ctx->ins->retry_limit. Since the engine default for retry_limit is 1 (not -1), the S3 plugin's override to 5 never kicks in, and the >= comparison means chunks get only 1 total attempt.
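The off-by-one can be illustrated with a small standalone sketch. This is illustrative only: total_attempts() and its loop are hypothetical, not the plugin's actual code, which tracks chunk->failures across timer-driven upload attempts; the sketch only models the comparison semantics described above.

```c
/* Hypothetical model of the retry comparison. With engine semantics,
 * retry_limit = N means N retries after the initial attempt, i.e.
 * N + 1 total attempts. */
static int total_attempts(int retry_limit, int use_strict_greater)
{
    int failures = 0;
    int attempts = 0;

    for (;;) {
        attempts++;      /* every attempt fails in this sketch */
        failures++;
        if (use_strict_greater ? (failures > retry_limit)
                               : (failures >= retry_limit)) {
            break;       /* chunk is discarded */
        }
    }
    return attempts;
}
```

With retry_limit=1, the old >= check breaks out after a single attempt (0 retries), while the > check allows a second attempt (1 retry), matching the engine's semantics.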

Details

  • Change all five >= comparisons to > in put_all_chunks, get_upload, cb_s3_upload, and cb_s3_flush so retry_limit=N allows N+1 total attempts, matching engine semantics
  • Add s3_plugin_use_mocks() to decouple mock HTTP responses from the unit_test_flush bypass, enabling tests that exercise the real flush path with mock S3 calls
  • Add mock call counter (mock_s3_call_increment_counter) for test observability
  • Skip the 6-second timer floor in mock mode for faster test execution
  • Add putobject_retry_limit_semantics test that asserts exactly 2 PutObject attempts with retry_limit=1
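The env-var-backed mock counter can be sketched as follows. This is a minimal illustration under assumptions: only the TEST_<API>_CALL_COUNT variable name comes from this PR, and mock_call_increment() here is a stand-in for the actual mock_s3_call_increment_counter() in plugins/out_s3/s3.c, whose body may differ. setenv() is POSIX, matching the plugin's target platforms.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read TEST_<api>_CALL_COUNT from the environment, increment it, and
 * write it back so a runtime test can assert on the final count. */
static void mock_call_increment(const char *api)
{
    char key[128];
    char val[16];
    const char *cur;
    int count = 0;

    snprintf(key, sizeof(key), "TEST_%s_CALL_COUNT", api);
    cur = getenv(key);
    if (cur != NULL) {
        count = atoi(cur);
    }
    snprintf(val, sizeof(val), "%d", count + 1);
    setenv(key, val, 1);
}
```

Storing the counter in the environment (rather than in plugin state) lets the test harness observe attempt counts from outside the plugin, which is what the new runtime test relies on.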

Testing

  • Example configuration: any S3 output with default Retry_Limit 1 (engine default)
  • New unit test putobject_retry_limit_semantics verifies exact retry count
  • Existing S3 tests continue to pass (they use FLB_S3_PLUGIN_UNDER_TEST which implies s3_plugin_use_mocks())
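A minimal configuration exercising the fixed path might look like the fragment below. The bucket name and region are placeholders; Retry_Limit 1 is the engine default, so omitting the line gives the same behavior.

```
[OUTPUT]
    Name           s3
    Match          *
    bucket         my-example-bucket
    region         us-east-1
    Retry_Limit    1
    upload_timeout 6s
```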

Documentation

  • [N/A]

Backporting

  • [N/A]

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes

    • Tightened S3 retry/discard thresholds so uploads are retried and given up at more precise points.
    • Adjusted initialization timer clamping so test-mode no longer overrides configured timings.
    • Removed a test-only flush path to standardize upload/flush behavior.
  • Tests

    • Added runtime test validating S3 put-object retry-limit semantics; enabled upload timeout variants and ensured test-mode env cleanup.
    • Added a test-visible counter to observe mocked S3 call attempts.

@coderabbitai

coderabbitai bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Changed S3 output plugin retry/discard semantics (comparison from >= to >) across upload flows, added an environment-variable-backed per-API mock call counter, conditioned timer_ms clamping when in test mode, removed a unit-test-only flush path, and added a runtime test that verifies PutObject retry counting.

Changes

Cohort / File(s) Summary
S3 Plugin Core
plugins/out_s3/s3.c
Added mock_s3_call_increment_counter() to update TEST_<API>_CALL_COUNT; made ctx->timer_ms min/max clamping conditional when s3_plugin_under_test() is false; changed retry/discard checks from >= retry_limit to > retry_limit; removed unit_test_flush() and its special-case branch.
Runtime Tests
tests/runtime/out_s3.c
Added flb_test_s3_putobject_retry_limit_semantics() (mock mode + forced PutObject errors, Retry_Limit=1, upload_timeout=6s) asserting TEST_PutObject_CALL_COUNT == 2; added teardowns unsetting FLB_S3_PLUGIN_UNDER_TEST/TEST_* env vars; registered new test in TEST_LIST.

Sequence Diagram(s)

(Skipped — changes are localized plugin logic and tests; no new multi-component sequential flow requiring visualization.)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

  • #10825: Modifies out_s3 retry/discard comparison logic and per-instance retry behavior — touches similar retry thresholds in plugins/out_s3/s3.c.

Suggested reviewers

  • PettitWesley
  • cosmo0920
  • sparrc

Poem

"I nibble counters one by one, 🥕
Mocked PutObject until it's done,
From >= to > I hop with glee,
Retries counted — two, we see!
A rabbit cheers: 'Tests all green for me!'" 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 11.11%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'out_s3: fix retry_limit semantics and default' directly addresses the main objective: fixing the off-by-one error in retry_limit comparisons by changing >= to > operators, which aligns the plugin behavior with engine semantics.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from ff256a6 to deb8f18 Compare April 2, 2026 23:28
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from deb8f18 to a2e917e Compare April 2, 2026 23:29
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from a2e917e to 980b66d Compare April 2, 2026 23:30
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 980b66d to 776e7e3 Compare April 2, 2026 23:32
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 776e7e3 to e8566d6 Compare April 2, 2026 23:35
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from e8566d6 to 1d35352 Compare April 2, 2026 23:38
@singholt singholt changed the title from "Fix/s3 retry limit off by one" to "out_s3: fix off-by-one in retry_limit comparison" Apr 2, 2026
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 1d35352 to fb42b46 Compare April 2, 2026 23:46
@singholt singholt marked this pull request as ready for review April 3, 2026 01:50
@singholt singholt requested a review from a team as a code owner April 3, 2026 01:50

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c7614b4c0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from 9c7614b to df38a2f Compare April 3, 2026 15:59
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from df38a2f to dc935ee Compare April 3, 2026 16:02
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch 2 times, most recently from 0b19300 to c8fccb6 Compare April 3, 2026 16:04
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from c8fccb6 to f2ffe3c Compare April 3, 2026 16:07
The S3 plugin used >= to compare chunk failures against retry_limit,
which meant retry_limit=1 resulted in 0 retries (1 total attempt).
This is inconsistent with the engine's retry semantics where
retry_limit=N means N retries after the initial attempt.

Change all five failure/error comparisons from >= to > so that
retry_limit=N correctly allows N retries (N+1 total attempts).

Remove the unit_test_flush bypass so tests exercise the real
flush path. Add a mock call counter and skip the 6s timer floor
in test mode for faster test execution.

Co-authored-by: Thean Lim <theanlim@amazon.com>
Signed-off-by: Anuj Singh <singholt@amazon.com>
@singholt singholt force-pushed the fix/s3-retry-limit-off-by-one branch from f2ffe3c to d740fef Compare April 3, 2026 16:12

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/out_s3/s3.c (1)

3843-3848: ⚠️ Potential issue | 🔴 Critical

The default ordered-upload path still drops after the first failure.

This hunk fixes the direct discard check, but cb_s3_flush() still routes ready files into s3_upload_queue() when preserve_data_ordering is enabled. Since that option defaults to true at Line 4127, this is still the common path. add_to_queue() starts retry_counter at 0 (Line 1792), Line 1946 increments it before the limit check, and Line 1947 still uses >=. With retry_limit=1, the first failed upload sets the counter to 1 and the queue entry is discarded immediately, so ordered uploads still get zero retries. The new runtime test in tests/runtime/out_s3.c:457-519 only exercises the timer-driven cb_s3_upload() path, so it will not catch this.

Suggested follow-up in s3_upload_queue()
-            if (upload_contents->retry_counter >= ctx->ins->retry_limit) {
+            if (upload_contents->retry_counter > ctx->ins->retry_limit) {

A regression test that forces the queued path would help keep this fixed.

As per coding guidelines, "Before patching: trace one full path for affected signals (input -> chunk -> task -> output -> engine completion) ... verify metrics/counters for success, retry, and drop paths."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/out_s3/s3.c` around lines 3843 - 3848, The ordered-upload path still
discards queued entries after the first failure because add_to_queue()
initializes retry_counter to 0, s3_upload_queue()/cb_s3_flush() route files into
the queue when preserve_data_ordering is true, and s3_upload_queue() increments
retry_counter before checking against retry_limit using >=; update
s3_upload_queue() (and/or add_to_queue() logic) so that retry counting matches
the intended semantics: initialize or treat retry_counter such that the first
retry is allowed (e.g., check > rather than >= or increment after the limit
check), ensure the check uses retry_limit consistently, and adjust
cb_s3_flush()/cb_s3_upload() to route failed files to s3_upload_queue() without
causing an immediate drop; also add/update a regression test that forces the
queued path when preserve_data_ordering is true to verify success/retry/drop
metrics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e19e405c-9eba-438f-9837-e1cba2e7cc9d

📥 Commits

Reviewing files that changed from the base of the PR and between df38a2f and f2ffe3c.

📒 Files selected for processing (2)
  • plugins/out_s3/s3.c
  • tests/runtime/out_s3.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/runtime/out_s3.c


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/runtime/out_s3.c (1)

38-50: ⚠️ Potential issue | 🟠 Major

These waits are still shorter than the configured timeout.

plugins/out_s3/s3.c only uploads timed-out buffered files after the check at Line 3320 passes. With upload_timeout="6s", the first timer-driven attempt cannot happen until after roughly 6 seconds, so the sleep(2) cases never reach the mocked S3 path, and the new retry-limit test's sleep(4) is still too short to observe two background PutObject attempts. Right now these cases can pass without exercising the intended flush/retry logic, and putobject_retry_limit_semantics may only be observing shutdown behavior from cb_s3_exit(). Please wait past the timeout/retry window or add a shorter test-only timeout path. As per coding guidelines, "Use tests/runtime tests for validating plugin-level behavior and end-to-end semantics including encoder/decoder paths".

Also applies to: 76-88, 115-127, 154-165, 191-202, 228-239, 265-276, 303-314, 339-350, 377-388, 413-424, 451-462, 495-526

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/runtime/out_s3.c` around lines 38 - 50, The test waits in
tests/runtime/out_s3.c are shorter than the configured upload_timeout ("6s") so
the S3 timer-driven upload/retry logic never runs; update the test to wait long
enough to observe the timer and retry window (e.g., sleep past 6s plus retry
intervals) or add a test-only shorter timeout path; specifically adjust the
sleeps after flb_lib_push (and the analogous sleep() calls in the other cases)
so the sequence starting from flb_output_set(...,"upload_timeout","6s",...)
triggers the plugin's timer-driven upload and the putobject retry attempts
assessed by the putobject_retry_limit_semantics test, rather than relying on
shutdown behavior in cb_s3_exit().
plugins/out_s3/s3.c (1)

3345-3350: ⚠️ Potential issue | 🟠 Major

The ordered-queue path still uses the old off-by-one.

These > fixes cover the direct/timer-driven paths, but s3_upload_queue() still discards on upload_contents->retry_counter >= ctx->ins->retry_limit at Line 1944. When a timed-out/full file goes through that preserve-order queue, Retry_Limit=1 still means “drop after the first failed upload” instead of “allow one retry.” The new single-chunk runtime test never enters this path, so it stays uncovered.

🐛 Minimal follow-up
-            if (upload_contents->retry_counter >= ctx->ins->retry_limit) {
+            if (upload_contents->retry_counter > ctx->ins->retry_limit) {

Also applies to: 3840-3845

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/out_s3/s3.c` around lines 3345 - 3350, The ordered-queue path in
s3_upload_queue() is using an off-by-one check (upload_contents->retry_counter
>= ctx->ins->retry_limit) which drops uploads one attempt too early; change that
comparison to use '>' so it matches the other paths (e.g., chunk->failures >
ctx->ins->retry_limit) and thereby allow exactly ctx->ins->retry_limit retries.
Update the analogous checks in s3_upload_queue() and the other occurrence noted
around the 3840–3845 region to use '>' against ctx->ins->retry_limit (refer to
upload_contents->retry_counter, s3_upload_queue(), and ctx->ins->retry_limit to
locate the fixes).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a50eeeb6-2726-4cc9-bca0-18525bf1988d

📥 Commits

Reviewing files that changed from the base of the PR and between f2ffe3c and 4ae1b15.

📒 Files selected for processing (2)
  • plugins/out_s3/s3.c
  • tests/runtime/out_s3.c

Verify retry_limit=1 results in exactly 2 PutObject attempts
(1 initial + 1 retry) using the plugin's internal retry path.

Add unsetenv for FLB_S3_PLUGIN_UNDER_TEST in all existing tests
to prevent env var leaking across tests in the same process.

Set upload_timeout to 6s in all tests for consistent 1s timer
ticks in test mode.

Co-authored-by: Thean Lim <theanlim@amazon.com>
Signed-off-by: Anuj Singh <singholt@amazon.com>
@singholt
Contributor Author

singholt commented Apr 3, 2026

Superseded by #11669

