
Regression: S3 zero-copy replication leaves replica out of sync in 25.8.15 #1338

@CarlosFelipeOR

Description


I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.

Type of problem

Bug report - something's broken

Describe the situation

A regression was introduced in 25.8.15 affecting ReplicatedMergeTree tables using S3 / MinIO with zero-copy replication.

When a replica is dropped and recreated during concurrent inserts, the recreated replica fails to fetch one data part from another replica.
As a result, the replica remains permanently out of sync, with fewer rows than expected.

This issue is:

  • reproducible with Altinity Stable 25.8.15
  • reproducible with official ClickHouse Docker image 25.8.15
  • not reproducible on 25.8.14

Yes, the issue can be reproduced using an official ClickHouse build of the same version.


How to reproduce the behavior

The issue can be reproduced in two ways: via automation or manually.


Option 1: Reproduce using automation

Altinity Stable Build:

  • ❌ 25.8.15 → fails
  • ✅ 25.8.14 → passes

Command (example):

python3 -u s3/regression.py \
  --clickhouse https://altinity-build-artifacts.s3.amazonaws.com/PRs/1331/80a50080a2dddad4ef2fc02d90e0ef1d2d5182d5/build_amd_binary/clickhouse \
  --storage minio \
  --only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
  --log log.log

Official ClickHouse Docker image:

  • ❌ 25.8.15.35 → fails
  • ✅ 25.8.14.17 → passes

Command (example):

python3 -u s3/regression.py \
  --local \
  --clickhouse docker://clickhouse/clickhouse-server:25.8.15.35 \
  --clickhouse-version 25.8.15.35 \
  --storage minio \
  --only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
  --log log3.log

The add remove one replica test consistently fails on 25.8.15 and passes on 25.8.14.


Option 2: Manual reproduction (Docker environment)

repro_env.zip

A ZIP file is attached containing:

  • docker-compose.yml
  • ClickHouse config files (cluster.xml, storage.xml, macros*.xml)

Start the environment:

docker-compose up -d

Steps

  1. Open a ClickHouse client on each node:

    docker exec -it s3_env-clickhouse1-1 clickhouse-client
    docker exec -it s3_env-clickhouse2-1 clickhouse-client
    docker exec -it s3_env-clickhouse3-1 clickhouse-client
  2. On all three nodes, create the database and replicated table:

    DROP DATABASE IF EXISTS s3test SYNC;
    CREATE DATABASE s3test;
    
    DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
    
    CREATE TABLE s3test.add_remove_one_replica
    (
        d UInt64
    )
    ENGINE = ReplicatedMergeTree(
        '/clickhouse/tables/{shard}/s3test.add_remove_one_replica',
        '{replica}'
    )
    ORDER BY d
    SETTINGS
        storage_policy = 'external',
        allow_remote_fs_zero_copy_replication = 1;
  3. On node2, insert the first batch of data:

    INSERT INTO s3test.add_remove_one_replica
    SELECT number FROM numbers(1000000);
  4. On node3, delete the replica:

    DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
  5. On node3, recreate the replicated table:

    CREATE TABLE s3test.add_remove_one_replica
    (
        d UInt64
    )
    ENGINE = ReplicatedMergeTree(
        '/clickhouse/tables/{shard}/s3test.add_remove_one_replica',
        '{replica}'
    )
    ORDER BY d
    SETTINGS
        storage_policy = 'external',
        allow_remote_fs_zero_copy_replication = 1;
  6. On node1, insert the second batch of data:

    INSERT INTO s3test.add_remove_one_replica
    SELECT number + 1000000 FROM numbers(1000000);
  7. Verify row count on each node:

    SELECT count(*)
    FROM s3test.add_remove_one_replica;
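
If node3 still reports 1000000, its replica state can be inspected directly. A quick optional check, not required for the reproduction (run it on node3):

SELECT is_readonly, queue_size, inserts_in_queue, absolute_delay
FROM system.replicas
WHERE database = 's3test' AND table = 'add_remove_one_replica';

On the affected build, queue_size stays non-zero because the fetch of the missing part keeps failing (see the replication queue error below).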

Expected behavior

All replicas should eventually converge and return:

2000000

The recreated replica should fetch all missing parts and fully synchronize.
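
Convergence can also be checked explicitly by making node3 wait for its replication queue to drain (a standard check, assuming default timeouts):

SYSTEM SYNC REPLICA s3test.add_remove_one_replica;

On a healthy build this returns once the missing part has been fetched; on 25.8.15 it is expected to time out instead, since the queued fetch never succeeds.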


Actual behavior

The third replica remains permanently out of sync:

1000000

Replication does not recover even after waiting.


Logs, error messages, stacktraces

Replication queue error (node3)

Query executed:

SELECT
    count() AS queue_items,
    anyIf(last_exception, last_exception != '') AS last_exception
FROM system.replication_queue
WHERE database='s3test' AND table='add_remove_one_replica';

Result:

   ┌─queue_items─┬─last_exception─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │           1 │ Poco::Exception. Code: 1000, e.code() = 0, Malformed message: Unexpected EOF, Stack trace (when copying this message, always include the lines below):                                                                                                    ↴│
   │             │↳                                                                                                                                                                                                                                                          ↴│
   │             │↳0. Poco::Net::HTTPChunkedStreamBuf::readFromDevice(char*, long) @ 0x000000001f174834                                                                                                                                                                      ↴│
   │             │↳1. DB::ReadBufferFromIStream::nextImpl() @ 0x00000000158d5a10                                                                                                                                                                                             ↴│
   │             │↳2. DB::ReadBuffer::next() @ 0x0000000013684bed                                                                                                                                                                                                            ↴│
   │             │↳3. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::ReadWriteBufferFromHTTP::nextImpl()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x00000000158d457d                ↴│
   │             │↳4. DB::ReadWriteBufferFromHTTP::doWithRetries(std::function<void ()>&&, std::function<void ()>, bool) const @ 0x00000000158cae3b                                                                                                                          ↴│
   │             │↳5. DB::ReadWriteBufferFromHTTP::nextImpl() @ 0x00000000158cf7ff                                                                                                                                                                                           ↴│
   │             │↳6. DB::ReadBuffer::next() @ 0x0000000013684bed                                                                                                                                                                                                            ↴│
   │             │↳7. DB::BuilderRWBufferFromHTTP::create(Poco::Net::HTTPBasicCredentials const&) @ 0x00000000158d1821                                                                                                                                                       ↴│
   │             │↳8. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x000000001924a018↴│
   │             │↳9. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x00000000192513de↴│
   │             │↳10. std::shared_ptr<DB::IMergeTreeDataPart> std::__function::__policy_invoker<std::shared_ptr<DB::IMergeTreeDataPart> ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool)::$_4, std::shared_ptr<DB::IMergeTreeDataPart> ()>>(std::__function::__policy_storage const*) @ 0x0000000018f1dd09↴│
   │             │↳11. DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool) @ 0x0000000018de290f                 ↴│
   │             │↳12. DB::StorageReplicatedMergeTree::executeFetch(DB::ReplicatedMergeTreeLogEntry&, bool) @ 0x0000000018dce760                                                                                                                                             ↴│
   │             │↳13. DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&) @ 0x0000000018dba56a                                                                                                                                                ↴│
   │             │↳14. bool std::__function::__policy_invoker<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_1, bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>>(std::__function::__policy_storage const*, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x0000000018f1a71b↴│
   │             │↳15. DB::ReplicatedMergeTreeQueue::processEntry(std::function<std::shared_ptr<zkutil::ZooKeeper> ()>, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&, std::function<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>) @ 0x00000000197f7968     ↴│
   │             │↳16. DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>) @ 0x0000000018e1381c                                                                                                                  ↴│
   │             │↳17. DB::ExecutableLambdaAdapter::executeStep() @ 0x00000000192a6c12                                                                                                                                                                                       ↴│
   │             │↳18. DB::TaskRuntimeData::executeStep() const @ 0x0000000019389a0c                                                                                                                                                                                         ↴│
   │             │↳19. DB::MergeTreeBackgroundExecutor<DB::RoundRobinRuntimeQueue>::threadFunction() @ 0x000000001938bc0d                                                                                                                                                    ↴│
   │             │↳20. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::worker() @ 0x00000000136f5c2b                                                                                                                                            ↴│
   │             │↳21. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x00000000136fcfa6↴│
   │             │↳22. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x00000000136f2c12                                                                                                                                                                      ↴│
   │             │↳23. void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x00000000136fa6da↴│
   │             │↳24. ? @ 0x0000000000094ac3                                                                                                                                                                                                                                ↴│
   │             │↳25. ? @ 0x0000000000125a74                                                                                                                                                                                                                                ↴│
   │             │↳ (version 25.8.15.35 (official build))                                                                                                                                                                                                                     │
   └─────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
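
If the server log is needed for further triage, the same failure can be pulled via SQL (an optional query; it assumes text_log is enabled in the server config):

SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE message ILIKE '%Unexpected EOF%' OR logger_name LIKE '%Fetcher%'
ORDER BY event_time DESC
LIMIT 20;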

Additional context

Data parts comparison between replicas

After the reproduction steps, the active parts differ between replicas.


Query executed:

SELECT name, rows
FROM system.parts
WHERE database='s3test'
  AND table='add_remove_one_replica'
  AND active
ORDER BY name;

Node 1 (source replica)

Result:

   ┌─name──────┬────rows─┐
1. │ all_0_0_0 │ 1000000 │
2. │ all_1_1_0 │ 1000000 │
   └───────────┴─────────┘

Node 3 (affected replica)

Result:

   ┌─name──────┬────rows─┐
1. │ all_1_1_0 │ 1000000 │
   └───────────┴─────────┘
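
A further check that may help with triage (a suggestion, not part of the original run): list detached parts on node3 to see whether a partially fetched copy of all_0_0_0 was left behind.

SELECT name, disk, reason
FROM system.detached_parts
WHERE database = 's3test' AND table = 'add_remove_one_replica';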
