Description
✅ I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.
Type of problem
Bug report - something's broken
Describe the situation
A regression was introduced in 25.8.15 affecting ReplicatedMergeTree tables using S3 / MinIO with zero-copy replication.
When a replica is dropped and recreated during concurrent inserts, the recreated replica fails to fetch one data part from another replica.
As a result, the replica remains permanently out of sync, with fewer rows than expected.
This issue is:
- reproducible with Altinity Stable 25.8.15
- reproducible with official ClickHouse Docker image 25.8.15
- not reproducible on 25.8.14
Yes, the issue can be reproduced using an official ClickHouse build of the same version.
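As a convenience only (not part of the original test), the server version and the server-wide zero-copy setting can be confirmed on each node with standard system queries; the setting name matches the one used in the table DDL below:
SELECT version();
-- server-wide default; the reproduction below also sets it per table
SELECT name, value
FROM system.merge_tree_settings
WHERE name = 'allow_remote_fs_zero_copy_replication';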
How to reproduce the behavior
The issue can be reproduced in two ways: via automation or manually.
Option 1: Reproduce using automation
Altinity Stable Build:
- ❌ 25.8.15 → fails
- ✅ 25.8.14 → passes
Command (example):
python3 -u s3/regression.py \
--clickhouse https://altinity-build-artifacts.s3.amazonaws.com/PRs/1331/80a50080a2dddad4ef2fc02d90e0ef1d2d5182d5/build_amd_binary/clickhouse \
--storage minio \
--only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
--log log.log
Official ClickHouse Docker image:
- ❌ 25.8.15.35 → fails
- ✅ 25.8.14.17 → passes
Command (example):
python3 -u s3/regression.py \
--local \
--clickhouse docker://clickhouse/clickhouse-server:25.8.15.35 \
--clickhouse-version 25.8.15.35 \
--storage minio \
--only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
--log log3.log
The test add remove one replica consistently fails on 25.8.15 and passes on 25.8.14.
Option 2: Manual reproduction (Docker environment)
A ZIP file is attached containing:
- docker-compose.yml
- ClickHouse config files (cluster.xml, storage.xml, macros*.xml)
Start the environment:
docker-compose up -d
Steps
1. Open a ClickHouse client on each node:
docker exec -it s3_env-clickhouse1-1 clickhouse-client
docker exec -it s3_env-clickhouse2-1 clickhouse-client
docker exec -it s3_env-clickhouse3-1 clickhouse-client
2. On all three nodes, create the database and replicated table:
DROP DATABASE IF EXISTS s3test SYNC;
CREATE DATABASE s3test;
DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
CREATE TABLE s3test.add_remove_one_replica (d UInt64)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/s3test.add_remove_one_replica', '{replica}')
ORDER BY d
SETTINGS storage_policy = 'external', allow_remote_fs_zero_copy_replication = 1;
3. On node2, insert the first batch of data:
INSERT INTO s3test.add_remove_one_replica SELECT number FROM numbers(1000000);
4. On node3, delete the replica:
DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
5. On node3, recreate the replicated table:
CREATE TABLE s3test.add_remove_one_replica (d UInt64)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/s3test.add_remove_one_replica', '{replica}')
ORDER BY d
SETTINGS storage_policy = 'external', allow_remote_fs_zero_copy_replication = 1;
6. On node1, insert the second batch of data:
INSERT INTO s3test.add_remove_one_replica SELECT number + 1000000 FROM numbers(1000000);
7. Verify row count on each node:
SELECT count(*) FROM s3test.add_remove_one_replica;
Expected behavior
All replicas should eventually converge and return:
2000000
The recreated replica should fetch all missing parts and fully synchronize.
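To rule out simple replication lag on node3, a minimal check (standard ClickHouse commands, not part of the original test) is to force a sync before counting:
-- on node3: block until the replication queue is drained (errors/times out if it cannot be)
SYSTEM SYNC REPLICA s3test.add_remove_one_replica;
-- should return 2000000 once the replica has caught up
SELECT count(*) FROM s3test.add_remove_one_replica;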
Actual behavior
The third replica remains permanently out of sync:
1000000
Replication does not recover even after waiting.
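For context, the stuck state on node3 can be observed via the standard system.replicas table; the column selection below is just what seemed most relevant here:
SELECT is_readonly, queue_size, inserts_in_queue, parts_to_check, absolute_delay
FROM system.replicas
WHERE database = 's3test' AND table = 'add_remove_one_replica';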
Logs, error messages, stacktraces
Replication queue error (node3)
Query executed:
SELECT
count() AS queue_items,
anyIf(last_exception, last_exception != '') AS last_exception
FROM system.replication_queue
WHERE database='s3test' AND table='add_remove_one_replica';
Result:
queue_items:    1
last_exception: Poco::Exception. Code: 1000, e.code() = 0, Malformed message: Unexpected EOF, Stack trace (when copying this message, always include the lines below):

0. Poco::Net::HTTPChunkedStreamBuf::readFromDevice(char*, long) @ 0x000000001f174834
1. DB::ReadBufferFromIStream::nextImpl() @ 0x00000000158d5a10
2. DB::ReadBuffer::next() @ 0x0000000013684bed
3. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::ReadWriteBufferFromHTTP::nextImpl()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x00000000158d457d
4. DB::ReadWriteBufferFromHTTP::doWithRetries(std::function<void ()>&&, std::function<void ()>, bool) const @ 0x00000000158cae3b
5. DB::ReadWriteBufferFromHTTP::nextImpl() @ 0x00000000158cf7ff
6. DB::ReadBuffer::next() @ 0x0000000013684bed
7. DB::BuilderRWBufferFromHTTP::create(Poco::Net::HTTPBasicCredentials const&) @ 0x00000000158d1821
8. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x000000001924a018
9. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x00000000192513de
10. std::shared_ptr<DB::IMergeTreeDataPart> std::__function::__policy_invoker<std::shared_ptr<DB::IMergeTreeDataPart> ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool)::$_4, std::shared_ptr<DB::IMergeTreeDataPart> ()>>(std::__function::__policy_storage const*) @ 0x0000000018f1dd09
11. DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool) @ 0x0000000018de290f
12. DB::StorageReplicatedMergeTree::executeFetch(DB::ReplicatedMergeTreeLogEntry&, bool) @ 0x0000000018dce760
13. DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&) @ 0x0000000018dba56a
14. bool std::__function::__policy_invoker<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_1, bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>>(std::__function::__policy_storage const*, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x0000000018f1a71b
15. DB::ReplicatedMergeTreeQueue::processEntry(std::function<std::shared_ptr<zkutil::ZooKeeper> ()>, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&, std::function<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>) @ 0x00000000197f7968
16. DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>) @ 0x0000000018e1381c
17. DB::ExecutableLambdaAdapter::executeStep() @ 0x00000000192a6c12
18. DB::TaskRuntimeData::executeStep() const @ 0x0000000019389a0c
19. DB::MergeTreeBackgroundExecutor<DB::RoundRobinRuntimeQueue>::threadFunction() @ 0x000000001938bc0d
20. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::worker() @ 0x00000000136f5c2b
21. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x00000000136fcfa6
22. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x00000000136f2c12
23. void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x00000000136fa6da
24. ? @ 0x0000000000094ac3
25. ? @ 0x0000000000125a74
(version 25.8.15.35 (official build))
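The failing queue entry itself can also be inspected for retry counts and postpone reasons, e.g. with a query like the following (standard system.replication_queue columns):
SELECT type, new_part_name, num_tries, num_postponed, postpone_reason, last_attempt_time
FROM system.replication_queue
WHERE database = 's3test' AND table = 'add_remove_one_replica';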
Additional context
Data parts comparison between replicas
After the reproduction steps, the active parts differ between replicas.
Query executed:
SELECT name, rows
FROM system.parts
WHERE database='s3test'
AND table='add_remove_one_replica'
AND active
ORDER BY name;
Node 1 (source replica)
Result:
┌─name──────┬────rows─┐
1. │ all_0_0_0 │ 1000000 │
2. │ all_1_1_0 │ 1000000 │
└───────────┴─────────┘
Node 3 (affected replica)
Result:
┌─name──────┬────rows─┐
1. │ all_1_1_0 │ 1000000 │
└───────────┴─────────┘
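Assuming cluster.xml defines a cluster spanning all three nodes (the cluster name 'replicated_cluster' below is a placeholder; substitute the name from the attached config), the same comparison can be made from a single node:
-- list active parts on every replica of the cluster in one result set
SELECT hostName() AS host, name, rows
FROM clusterAllReplicas('replicated_cluster', system.parts)
WHERE database = 's3test' AND table = 'add_remove_one_replica' AND active
ORDER BY host, name;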