-
Notifications
You must be signed in to change notification settings - Fork 8k
retry network errors when s3 library parse xml response #90216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Workflow [PR], commit [d04dc50] Summary: ⏳
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds retry logic for network errors that occur when the S3 library parses XML responses, specifically addressing timeout scenarios during S3 listing operations. The changes enable the system to recover from partial XML parsing failures by retrying the request instead of failing immediately.
Key Changes:
- Extended S3 client error handling to catch and retry
Poco::TimeoutExceptionin addition to existingPoco::Net::NetException - Added support for simulating timeout errors during S3 listing operations in the test infrastructure
- Implemented comprehensive integration test to verify retry behavior when timeouts occur during S3 listing
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/IO/S3/Client.cpp |
Refactored network error handling into a reusable lambda and added timeout exception handling |
tests/integration/helpers/s3_mocks/broken_s3.py |
Added TimeoutAction class and at_listing support to simulate timeouts during S3 list operations |
tests/integration/test_checking_s3_blobs_paranoid/test.py |
Added new test case and cluster instance configuration to verify timeout retry behavior |
tests/integration/test_checking_s3_blobs_paranoid/configs/s3_retries_with_adaptive_timeout.xml |
Added configuration file enabling S3 retries with adaptive timeouts |
| if self.count: | ||
| self.count -= 1 | ||
| return True | ||
| elif self.count: |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactored condition at line 402 changes behavior: previously, the count would decrement only when after == 0. Now it decrements when after is any falsy value (0, None, False). This could cause unintended behavior if after is explicitly set to None or False, as the count will decrement immediately instead of waiting for after to reach 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that robot sees such changes.
It is intentionally, previous behavior is not correct
|
|
||
| if (!outcome.IsSuccess() | ||
| /// AWS SDK's built-in per-thread retry logic is disabled. | ||
| && client_configuration.s3_slow_all_threads_after_retryable_error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If seems like when s3_slow_all_threads_after_retryable_error is `false, we do not do any retries at all here.
| std::invoke_result_t<RequestFn, RequestType &> | ||
| Client::doRequestWithRetryNetworkErrors(RequestType & request, RequestFn request_fn) const | ||
| { | ||
| /// S3 does retries network errors actually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made more changes in that function, just refactoring in a way that I like more.
f64c601 to
7acceb2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| outcome = Aws::Client::AWSError<Aws::Client::CoreErrors>( | ||
| Aws::Client::CoreErrors::NETWORK_CONNECTION, | ||
| /*name*/ "", | ||
| /*message*/ fmt::format("All {} retry attempts failed. Last exception: {}", max_attempts, getCurrentExceptionMessage(false)), | ||
| /*retryable*/ true); |
Copilot
AI
Nov 20, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The outcome variable is being assigned an error outcome in the exception handler, but this assignment may be unnecessary if the lambda returns immediately after. Consider whether this assignment is used when returning from the exception handler, or if it's only used when continuing the retry loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It acts as last error, if we run out of attempt we return the error stored in that variable, that is the last error.
7acceb2 to
5604e0a
Compare
5604e0a to
f167748
Compare
8fdce17 to
d04dc50
Compare
|
|
Cherry pick #90216 to 25.10: retry network errors when s3 library parse xml response
Cherry pick #90216 to 25.11: retry network errors when s3 library parse xml response
Backport #90216 to 25.10: retry network errors when s3 library parse xml response
Backport #90216 to 25.11: retry network errors when s3 library parse xml response
Cherry pick #90216 to 25.9: retry network errors when s3 library parse xml response
Cherry pick #90216 to 25.8: retry network errors when s3 library parse xml response
Backport #90216 to 25.9: retry network errors when s3 library parse xml response
Backport #90216 to 25.8: retry network errors when s3 library parse xml response
Now errors with such stacks would be retried by Clickhouse
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Retry network errors when S3 library parses XML response.
Documentation entry for user-facing changes