
Regression: S3 zero-copy replication leaves replica out of sync in 25.8.15 #1338

@CarlosFelipeOR

Description


I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.

Type of problem

Bug report - something's broken

Describe the situation

A regression was introduced in 25.8.15 affecting ReplicatedMergeTree tables using S3 / MinIO with zero-copy replication.

When a replica is dropped and recreated during concurrent inserts, the recreated replica fails to fetch one data part from another replica.
As a result, the replica remains permanently out of sync, with fewer rows than expected.

This issue is:

  • reproducible with Altinity Stable 25.8.15
  • reproducible with official ClickHouse Docker image 25.8.15
  • not reproducible on 25.8.14

Yes, the issue can be reproduced using an official ClickHouse build of the same version.


How to reproduce the behavior

The issue can be reproduced in two ways: via automation or manually.


Option 1: Reproduce using automation

Altinity Stable Build:

  • ❌ 25.8.15 → fails
  • ✅ 25.8.14 → passes

Command (example):

python3 -u s3/regression.py \
  --clickhouse https://altinity-build-artifacts.s3.amazonaws.com/PRs/1331/80a50080a2dddad4ef2fc02d90e0ef1d2d5182d5/build_amd_binary/clickhouse \
  --storage minio \
  --only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
  --log log.log

Official ClickHouse Docker image:

  • ❌ 25.8.15.35 → fails
  • ✅ 25.8.14.17 → passes

Command (example):

python3 -u s3/regression.py \
  --local \
  --clickhouse docker://clickhouse/clickhouse-server:25.8.15.35 \
  --clickhouse-version 25.8.15.35 \
  --storage minio \
  --only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
  --log log3.log

The add remove one replica test consistently fails on 25.8.15 and passes on 25.8.14.


Option 2: Manual reproduction (Docker environment)

repro_env.zip

A ZIP file is attached containing:

  • docker-compose.yml
  • ClickHouse config files (cluster.xml, storage.xml, macros*.xml)

Start the environment:

docker-compose up -d

Steps

  1. Open a ClickHouse client on each node:

    docker exec -it s3_env-clickhouse1-1 clickhouse-client
    docker exec -it s3_env-clickhouse2-1 clickhouse-client
    docker exec -it s3_env-clickhouse3-1 clickhouse-client
  2. On all three nodes, create the database and replicated table:

    DROP DATABASE IF EXISTS s3test SYNC;
    CREATE DATABASE s3test;
    
    DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
    
    CREATE TABLE s3test.add_remove_one_replica
    (
        d UInt64
    )
    ENGINE = ReplicatedMergeTree(
        '/clickhouse/tables/{shard}/s3test.add_remove_one_replica',
        '{replica}'
    )
    ORDER BY d
    SETTINGS
        storage_policy = 'external',
        allow_remote_fs_zero_copy_replication = 1;
  3. On node2, insert the first batch of data:

    INSERT INTO s3test.add_remove_one_replica
    SELECT number FROM numbers(1000000);
  4. On node3, delete the replica:

    DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
  5. On node3, recreate the replicated table:

    CREATE TABLE s3test.add_remove_one_replica
    (
        d UInt64
    )
    ENGINE = ReplicatedMergeTree(
        '/clickhouse/tables/{shard}/s3test.add_remove_one_replica',
        '{replica}'
    )
    ORDER BY d
    SETTINGS
        storage_policy = 'external',
        allow_remote_fs_zero_copy_replication = 1;
  6. On node1, insert the second batch of data:

    INSERT INTO s3test.add_remove_one_replica
    SELECT number + 1000000 FROM numbers(1000000);
  7. Verify row count on each node:

    SELECT count(*)
    FROM s3test.add_remove_one_replica;
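
If node3 still reports 1000000, its replica state can be inspected directly. A quick optional check, not required for the reproduction (run it on node3):

SELECT is_readonly, queue_size, inserts_in_queue, absolute_delay
FROM system.replicas
WHERE database = 's3test' AND table = 'add_remove_one_replica';

On the affected build, queue_size stays non-zero because the fetch of the missing part keeps failing (see the replication queue error below).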

Expected behavior

All replicas should eventually converge and return:

2000000

The recreated replica should fetch all missing parts and fully synchronize.
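
Convergence can also be checked explicitly by making node3 wait for its replication queue to drain (a standard check, assuming default timeouts):

SYSTEM SYNC REPLICA s3test.add_remove_one_replica;

On a healthy build this returns once the missing part has been fetched; on 25.8.15 it is expected to time out instead, since the queued fetch never succeeds.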


Actual behavior

The third replica remains permanently out of sync:

1000000

Replication does not recover even after waiting.


Logs, error messages, stacktraces

Replication queue error (node3)

Query executed:

SELECT
    count() AS queue_items,
    anyIf(last_exception, last_exception != '') AS last_exception
FROM system.replication_queue
WHERE database='s3test' AND table='add_remove_one_replica';

Result:

   ┌─queue_items─┬─last_exception─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │           1 │ Poco::Exception. Code: 1000, e.code() = 0, Malformed message: Unexpected EOF, Stack trace (when copying this message, always include the lines below):                                                                                                    ↴│
   │             │↳                                                                                                                                                                                                                                                          ↴│
   │             │↳0. Poco::Net::HTTPChunkedStreamBuf::readFromDevice(char*, long) @ 0x000000001f174834                                                                                                                                                                      ↴│
   │             │↳1. DB::ReadBufferFromIStream::nextImpl() @ 0x00000000158d5a10                                                                                                                                                                                             ↴│
   │             │↳2. DB::ReadBuffer::next() @ 0x0000000013684bed                                                                                                                                                                                                            ↴│
   │             │↳3. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::ReadWriteBufferFromHTTP::nextImpl()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x00000000158d457d                ↴│
   │             │↳4. DB::ReadWriteBufferFromHTTP::doWithRetries(std::function<void ()>&&, std::function<void ()>, bool) const @ 0x00000000158cae3b                                                                                                                          ↴│
   │             │↳5. DB::ReadWriteBufferFromHTTP::nextImpl() @ 0x00000000158cf7ff                                                                                                                                                                                           ↴│
   │             │↳6. DB::ReadBuffer::next() @ 0x0000000013684bed                                                                                                                                                                                                            ↴│
   │             │↳7. DB::BuilderRWBufferFromHTTP::create(Poco::Net::HTTPBasicCredentials const&) @ 0x00000000158d1821                                                                                                                                                       ↴│
   │             │↳8. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x000000001924a018↴│
   │             │↳9. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x00000000192513de↴│
   │             │↳10. std::shared_ptr<DB::IMergeTreeDataPart> std::__function::__policy_invoker<std::shared_ptr<DB::IMergeTreeDataPart> ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool)::$_4, std::shared_ptr<DB::IMergeTreeDataPart> ()>>(std::__function::__policy_storage const*) @ 0x0000000018f1dd09↴│
   │             │↳11. DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool) @ 0x0000000018de290f                 ↴│
   │             │↳12. DB::StorageReplicatedMergeTree::executeFetch(DB::ReplicatedMergeTreeLogEntry&, bool) @ 0x0000000018dce760                                                                                                                                             ↴│
   │             │↳13. DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&) @ 0x0000000018dba56a                                                                                                                                                ↴│
   │             │↳14. bool std::__function::__policy_invoker<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_1, bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>>(std::__function::__policy_storage const*, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x0000000018f1a71b↴│
   │             │↳15. DB::ReplicatedMergeTreeQueue::processEntry(std::function<std::shared_ptr<zkutil::ZooKeeper> ()>, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&, std::function<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>) @ 0x00000000197f7968     ↴│
   │             │↳16. DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>) @ 0x0000000018e1381c                                                                                                                  ↴│
   │             │↳17. DB::ExecutableLambdaAdapter::executeStep() @ 0x00000000192a6c12                                                                                                                                                                                       ↴│
   │             │↳18. DB::TaskRuntimeData::executeStep() const @ 0x0000000019389a0c                                                                                                                                                                                         ↴│
   │             │↳19. DB::MergeTreeBackgroundExecutor<DB::RoundRobinRuntimeQueue>::threadFunction() @ 0x000000001938bc0d                                                                                                                                                    ↴│
   │             │↳20. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::worker() @ 0x00000000136f5c2b                                                                                                                                            ↴│
   │             │↳21. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x00000000136fcfa6↴│
   │             │↳22. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x00000000136f2c12                                                                                                                                                                      ↴│
   │             │↳23. void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x00000000136fa6da↴│
   │             │↳24. ? @ 0x0000000000094ac3                                                                                                                                                                                                                                ↴│
   │             │↳25. ? @ 0x0000000000125a74                                                                                                                                                                                                                                ↴│
   │             │↳ (version 25.8.15.35 (official build))                                                                                                                                                                                                                     │
   └─────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
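
If the server log is needed for further triage, the same failure can be pulled via SQL (an optional query; it assumes text_log is enabled in the server config):

SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE message ILIKE '%Unexpected EOF%' OR logger_name LIKE '%Fetcher%'
ORDER BY event_time DESC
LIMIT 20;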

Additional context

Data parts comparison between replicas

After the reproduction steps, the active parts differ between replicas.


Query executed:

SELECT name, rows
FROM system.parts
WHERE database='s3test'
  AND table='add_remove_one_replica'
  AND active
ORDER BY name;

Node 1 (source replica)

Result:

   ┌─name──────┬────rows─┐
1. │ all_0_0_0 │ 1000000 │
2. │ all_1_1_0 │ 1000000 │
   └───────────┴─────────┘

Node 3 (affected replica)

Result:

   ┌─name──────┬────rows─┐
1. │ all_1_1_0 │ 1000000 │
   └───────────┴─────────┘
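
A further check that may help with triage (a suggestion, not part of the original run): list detached parts on node3 to see whether a partially fetched copy of all_0_0_0 was left behind.

SELECT name, disk, reason
FROM system.detached_parts
WHERE database = 's3test' AND table = 'add_remove_one_replica';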
