@CheSema
Member

@CheSema CheSema commented Nov 17, 2025

With this change, ClickHouse retries errors that produce stack traces like the one below:

[28948] 1f2b1ae5-92cc-45d1-90aa-418d085c3454 <Error>: DynamicQueryHandler Poco::Exception. Code: 1000, e.code() = 0, Timeout, Stack trace (when copying this message, always include the lines below):

0. ./ci/tmp/build/./base/poco/NetSSL_OpenSSL/src/SecureSocketImpl.cpp:0: Poco::Net::SecureSocketImpl::mustRetry(int, Poco::Timespan&) @ 0x00000000196f048e
1. ./ci/tmp/build/./base/poco/NetSSL_OpenSSL/src/SecureSocketImpl.cpp:357: Poco::Net::SecureSocketImpl::receiveBytes(void*, int, int) @ 0x00000000196f1631
2. ./base/poco/Net/src/StreamSocket.cpp:135: Poco::Net::HTTPSession::receive(char*, int) @ 0x00000000196c4291
3. ./base/poco/Net/src/HTTPSession.cpp:161: Poco::Net::HTTPSClientSession::read(char*, long) @ 0x00000000196e345e
4. ./ci/tmp/build/./base/poco/Net/src/HTTPChunkedStream.cpp:122: Poco::Net::HTTPChunkedStreamBuf::readFromDevice(char*, long) @ 0x00000000196b693b
5. ./base/poco/Foundation/include/Poco/BufferedStreamBuf.h:102: Poco::BasicBufferedStreamBuf<char, std::char_traits<char>, Poco::BufferAllocator<char>>::underflow() @ 0x00000000195be4d1
6. ./contrib/llvm-project/libcxx/include/streambuf:194: void String::__init_with_sentinel[abi:ne190107]<std::istreambuf_iterator<char, std::char_traits<char>>, std::istreambuf_iterator<char, std::char_traits<char>>>(std::istreambuf_iterator<char, std::char_traits<char>>, std::istreambuf_iterator<char, std::char_traits<char>>) @ 0x0000000012b12209
7. Aws::Utils::Xml::XmlDocument::CreateFromXmlStream(std::basic_iostream<char, std::char_traits<char>>&) @ 0x000000001983f383
8. Aws::Utils::Outcome<Aws::AmazonWebServiceResult<Aws::Utils::Xml::XmlDocument>, Aws::Client::AWSError<Aws::Client::CoreErrors>> std::__function::__policy_invoker<Aws::Utils::Outcome<Aws::AmazonWebServiceResult<Aws::Utils::Xml::XmlDocument>, Aws::Client::AWSError<Aws::Client::CoreErrors>> ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const::$_1, Aws::Utils::Outcome<Aws::AmazonWebServiceResult<Aws::Utils::Xml::XmlDocument>, Aws::Client::AWSError<Aws::Client::CoreErrors>> ()>>(std::__function::__policy_storage const*) @ 0x00000000197dad10
9. Aws::Utils::Outcome<Aws::AmazonWebServiceResult<Aws::Utils::Xml::XmlDocument>, Aws::Client::AWSError<Aws::Client::CoreErrors>> smithy::components::tracing::TracingUtils::MakeCallWithTiming<Aws::Utils::Outcome<Aws::AmazonWebServiceResult<Aws::Utils::Xml::XmlDocument>, Aws::Client::AWSError<Aws::Client::CoreErrors>>>(std::function<Aws::Utils::Outcome<Aws::AmazonWebServiceResult<Aws::Utils::Xml::XmlDocument>, Aws::Client::AWSError<Aws::Client::CoreErrors>> ()>, String const&, smithy::components::tracing::Meter const&, std::map<String, String, std::less<String>, std::allocator<std::pair<String const, String>>>&&, String const&) @ 0x00000000197d991c
10. Aws::Client::AWSXMLClient::MakeRequest(Aws::Http::URI const&, Aws::AmazonWebServiceRequest const&, Aws::Http::HttpMethod, char const*, char const*, char const*) const @ 0x00000000197d812e
11. Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> std::__function::__policy_invoker<Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<Aws::S3::S3Client::ListObjectsV2(Aws::S3::Model::ListObjectsV2Request const&) const::$_0, Aws::Utils::Outcome<Aws::S3::Model::ListObjectsV2Result, Aws::S3::S3Error> ()>>(std::__function::__policy_storage const*) @ 0x000000001996e86a
12. Aws::S3::S3Client::ListObjectsV2(Aws::S3::Model::ListObjectsV2Request const&) const @ 0x00000000198e7fdd
13. ./ci/tmp/build/./src/IO/S3/Client.cpp:453: DB::S3::Client::ListObjectsV2(DB::S3::ExtendedRequest<Aws::S3::Model::ListObjectsV2Request>&) const @ 0x0000000012abe385
14. ./ci/tmp/build/./src/Disks/ObjectStorages/S3/S3ObjectStorage.cpp:132: DB::(anonymous namespace)::S3IteratorAsync::getBatchAndCheckNext(std::vector<std::shared_ptr<DB::RelativePathWithMetadata>, std::allocator<std::shared_ptr<DB::RelativePathWithMetadata>>>&) (.3387b9197a4356d21ff7f364428ca5ad) @ 0x0000000012cb4ea3
15. ./ci/tmp/build/./src/Disks/ObjectStorages/ObjectStorageIteratorAsync.cpp:100: DB::IObjectStorageIteratorAsync::BatchAndHasNext std::__function::__policy_invoker<DB::IObjectStorageIteratorAsync::BatchAndHasNext ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::IObjectStorageIteratorAsync::scheduleBatch()::$_0, DB::IObjectStorageIteratorAsync::BatchAndHasNext ()>>(std::__function::__policy_storage const*) @ 0x0000000012b822e1
16. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x0000000012b81aed
17. ./contrib/llvm-project/libcxx/include/future:1589: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<std::function<std::future<DB::IObjectStorageIteratorAsync::BatchAndHasNext> (std::function<DB::IObjectStorageIteratorAsync::BatchAndHasNext ()>&&, Priority)> DB::threadPoolCallbackRunnerUnsafe<DB::IObjectStorageIteratorAsync::BatchAndHasNext, std::function<DB::IObjectStorageIteratorAsync::BatchAndHasNext ()>>(ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>&, String const&)::'lambda'(std::function<DB::IObjectStorageIteratorAsync::BatchAndHasNext ()>&&, Priority)::operator()(std::function<DB::IObjectStorageIteratorAsync::BatchAndHasNext ()>&&, Priority)::'lambda0'(), void ()>>(std::__function::__policy_storage const*) @ 0x0000000012b81dc0
18. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x000000000ff486eb
19. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000ff4f2fd
20. ./contrib/llvm-project/libcxx/include/__functional/function.h:716: ? @ 0x000000000ff45912
21. ./contrib/llvm-project/libcxx/include/__type_traits/invoke.h:117: void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x000000000ff4cdda
22. ? @ 0x0000000000094ac3
23. ? @ 0x00000000001268c0
 (version 25.6.2.6414 (official build))

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Retry network errors that occur while the S3 library parses an XML response.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh

clickhouse-gh bot commented Nov 17, 2025

Workflow [PR], commit [d04dc50]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Build (amd_compat) | | failure | | |
| | Cmake configuration | failure | cidb | |
| Stateless tests (arm_asan, targeted) | | failure | | |
| | 03032_dynamically_resize_filesystem_cache_2 | FAIL | cidb | |
| BuzzHouse (amd_debug) | | failure | | |
| | Logical error: 'Inconsistent AST formatting: the query: | FAIL | cidb | |
| BuzzHouse (amd_ubsan) | | failure | | |
| | /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/src/IO/VarInt.h:32:5: runtime error: store to null pointer of type 'char' | FAIL | cidb | |

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Nov 17, 2025
@CheSema CheSema requested a review from Copilot November 17, 2025 17:16
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds retry logic for network errors that occur when the S3 library parses XML responses, specifically addressing timeout scenarios during S3 listing operations. The changes enable the system to recover from partial XML parsing failures by retrying the request instead of failing immediately.

Key Changes:

  • Extended S3 client error handling to catch and retry Poco::TimeoutException in addition to existing Poco::Net::NetException
  • Added support for simulating timeout errors during S3 listing operations in the test infrastructure
  • Implemented comprehensive integration test to verify retry behavior when timeouts occur during S3 listing

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/IO/S3/Client.cpp | Refactored network error handling into a reusable lambda and added timeout exception handling |
| tests/integration/helpers/s3_mocks/broken_s3.py | Added TimeoutAction class and at_listing support to simulate timeouts during S3 list operations |
| tests/integration/test_checking_s3_blobs_paranoid/test.py | Added new test case and cluster instance configuration to verify timeout retry behavior |
| tests/integration/test_checking_s3_blobs_paranoid/configs/s3_retries_with_adaptive_timeout.xml | Added configuration file enabling S3 retries with adaptive timeouts |
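
To illustrate the idea of simulating timeouts during listing, here is a minimal Python sketch of a mock action that fails the first few listing requests and then lets the rest through. The names `FakeTimeoutAction`, `handle_listing`, and `real_handler` are hypothetical and are not taken from `broken_s3.py`; a real mock would more likely stall the HTTP response, which this sketch collapses into raising `TimeoutError`.

```python
from typing import Callable, List


class FakeTimeoutAction:
    """Hypothetical stand-in for a mock action; not the code from broken_s3.py."""

    def __init__(self, count: int) -> None:
        # Number of listing requests that should still time out.
        self.count = count

    def should_trigger(self) -> bool:
        # Consume one simulated failure per call while any remain.
        if self.count > 0:
            self.count -= 1
            return True
        return False

    def handle_listing(self, real_handler: Callable[[], List[str]]) -> List[str]:
        # Assumption: raising TimeoutError stands in for a stalled response.
        if self.should_trigger():
            raise TimeoutError("simulated timeout during listing")
        return real_handler()


def list_keys() -> List[str]:
    # Hypothetical "real" listing handler used only for this sketch.
    return ["a.bin", "b.bin"]
```

With `FakeTimeoutAction(count=2)`, the first two calls to `handle_listing(list_keys)` raise and the third returns the keys, which is the kind of pattern a retry test can assert on.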

if self.count:
self.count -= 1
return True
elif self.count:

Copilot AI Nov 17, 2025

The refactored condition at line 402 changes behavior: previously, the count would decrement only when after == 0. Now it decrements when after is any falsy value (0, None, False). This could cause unintended behavior if after is explicitly set to None or False, as the count will decrement immediately instead of waiting for after to reach 0.

Member Author

@CheSema CheSema Nov 17, 2025

Nice that the robot notices such changes.
It is intentional; the previous behavior was not correct.
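
The difference discussed in this thread can be shown with a small Python sketch. The two functions below are hypothetical and are not the code from `broken_s3.py`; they only contrast an explicit `after == 0` comparison with a truthiness check, assuming `after` may be `None` when it is not configured.

```python
from typing import Optional


def fires_with_equality(after: Optional[int], count: int) -> bool:
    # Stricter form: the action is armed only when `after` is exactly 0.
    return after == 0 and count > 0


def fires_with_truthiness(after: Optional[int], count: int) -> bool:
    # Looser form: any falsy `after` (0, None, False) arms the action.
    return not after and count > 0
```

The two checks agree for integer values of `after` and differ only when `after` is `None`: the equality form never fires in that case, while the truthiness form fires as soon as `count` is positive.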

@jkartseva jkartseva self-assigned this Nov 18, 2025

if (!outcome.IsSuccess()
/// AWS SDK's built-in per-thread retry logic is disabled.
&& client_configuration.s3_slow_all_threads_after_retryable_error
Member Author

It seems that when `s3_slow_all_threads_after_retryable_error` is `false`, we do not do any retries at all here.

std::invoke_result_t<RequestFn, RequestType &>
Client::doRequestWithRetryNetworkErrors(RequestType & request, RequestFn request_fn) const
{
/// S3 does retries network errors actually.
Member Author

I have made more changes in that function, just refactoring in a way that I like more.

@CheSema CheSema requested a review from Copilot November 20, 2025 10:47
@CheSema CheSema force-pushed the chesema-retry-list-s3 branch from f64c601 to 7acceb2 Compare November 20, 2025 10:47
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment on lines +777 to +781
outcome = Aws::Client::AWSError<Aws::Client::CoreErrors>(
Aws::Client::CoreErrors::NETWORK_CONNECTION,
/*name*/ "",
/*message*/ fmt::format("All {} retry attempts failed. Last exception: {}", max_attempts, getCurrentExceptionMessage(false)),
/*retryable*/ true);

Copilot AI Nov 20, 2025

The outcome variable is being assigned an error outcome in the exception handler, but this assignment may be unnecessary if the lambda returns immediately after. Consider whether this assignment is used when returning from the exception handler, or if it's only used when continuing the retry loop.

Member Author

It acts as the last error: if we run out of attempts, we return the error stored in that variable, which is the last error.
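
The "last error" pattern described here can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ code in `Client.cpp`: the function name `request_with_retries`, the `(result, error)` return shape, and the choice of `TimeoutError` and `ConnectionError` as the retryable exceptions are all hypothetical.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def request_with_retries(
    request_fn: Callable[[], T], max_attempts: int
) -> Tuple[Optional[T], Optional[str]]:
    """Return (result, None) on success or (None, last_error) when all attempts fail."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        try:
            return request_fn(), None
        except (TimeoutError, ConnectionError) as exc:
            # Remember the most recent failure; it is returned only if
            # no later attempt succeeds.
            last_error = (
                f"All {max_attempts} retry attempts failed. Last exception: {exc}"
            )
    return None, last_error
```

The stored error is overwritten on every failed attempt and is only observed by the caller once the loop ends without a successful return, which is why the assignment inside the handler is not redundant.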

@CheSema CheSema force-pushed the chesema-retry-list-s3 branch from 7acceb2 to 5604e0a Compare November 20, 2025 10:51
@CheSema CheSema force-pushed the chesema-retry-list-s3 branch from 5604e0a to f167748 Compare November 20, 2025 11:24
@CheSema CheSema force-pushed the chesema-retry-list-s3 branch from 8fdce17 to d04dc50 Compare November 21, 2025 14:33
@CheSema
Member Author

CheSema commented Nov 21, 2025

03276_database_backup_merge_tree_table_file_engine -- should not be run twice in the job.

@CheSema CheSema added this pull request to the merge queue Nov 21, 2025
Merged via the queue into master with commit 10225a3 Nov 21, 2025
248 of 259 checks passed
@CheSema CheSema deleted the chesema-retry-list-s3 branch November 21, 2025 22:02
@CheSema CheSema added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label Nov 21, 2025
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 21, 2025
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Nov 21, 2025
robot-ch-test-poll2 added a commit that referenced this pull request Nov 21, 2025
Cherry pick #90216 to 25.10: retry network errors when s3 library parse xml response
robot-clickhouse added a commit that referenced this pull request Nov 21, 2025
robot-ch-test-poll2 added a commit that referenced this pull request Nov 21, 2025
Cherry pick #90216 to 25.11: retry network errors when s3 library parse xml response
robot-clickhouse added a commit that referenced this pull request Nov 21, 2025
clickhouse-gh bot added a commit that referenced this pull request Nov 22, 2025
Backport #90216 to 25.10: retry network errors when s3 library parse xml response
CheSema added a commit that referenced this pull request Nov 24, 2025
Backport #90216 to 25.11: retry network errors when s3 library parse xml response
CheSema added a commit that referenced this pull request Nov 24, 2025
Cherry pick #90216 to 25.9: retry network errors when s3 library parse xml response
CheSema added a commit that referenced this pull request Nov 24, 2025
Cherry pick #90216 to 25.8: retry network errors when s3 library parse xml response
robot-clickhouse added a commit that referenced this pull request Nov 24, 2025
robot-clickhouse added a commit that referenced this pull request Nov 24, 2025
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Nov 24, 2025
clickhouse-gh bot added a commit that referenced this pull request Nov 24, 2025
Backport #90216 to 25.9: retry network errors when s3 library parse xml response
CheSema added a commit that referenced this pull request Nov 25, 2025
Backport #90216 to 25.8: retry network errors when s3 library parse xml response