Skip to content

Conversation

@BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Jun 13, 2022

Motivation

When a partitioned topic was created concurrently, especially when
automatically created by many producers. This case can be reproduced
easily by configuring allowAutoTopicCreationType=non-partitioned and
starting a Pulsar standalone. Then, run the following code:

try (PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650").build()) {
    for (int i = 0; i < 10; i++) {
        client.newProducer().topic("topic").createAsync();
    }
    Thread.sleep(1000);
}

We can see a lot of "Could not get connection while
getPartitionedTopicMetadata" warning logs at client side, while there
were more warning logs with full stack traces at broker side:

2022-06-14T02:04:20,522+0800 [metadata-store-22-1] WARN  org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:64846] persistent://public/default/topic: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
org.apache.pulsar.metadata.api.MetadataStoreException$AlreadyExistsException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic

It's because when broker handles the partitioned metadata command, it
calls fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync and
will try creating a partitioned topic if it doesn't exist. It's a race
condition that if many connections are established during a short time
interval and one of them created successfully, the following will fail
with the AlreadyExistsException.

Modifications

Handles the MetadataStoreException.AlreadyExistsException in
unsafeGetPartitionedTopicMetadataAsync. In this case, invoke
fetchPartitionedTopicMetadataAsync to get the partitioned metadata
again.

Verifying this change

Even if without this patch, the creation of producers could also succeed
because they will reconnect to broker again after 100 ms because broker
will return a ServiceNotReady error in thiss case. The only way to
verify this fix is reproducing the bug again with this patch, we can
see no reconnection will happen from the logs.

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

…ated concurrently

### Motivation

When a partitioned topic was created concurrently, especially when
automatically created by many producers. This case can be reproduced
easily by configuring `allowAutoTopicCreationType=non-partitioned` and
starting a Pulsar standalone. Then, run the following code:

```java
try (PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650").build()) {
    for (int i = 0; i < 10; i++) {
        client.newProducer().topic("topic").createAsync();
    }
    Thread.sleep(1000);
}
```

We can see a lot of "Could not get connection while
getPartitionedTopicMetadata" warning logs at client side, while there
were more warning logs with full stack traces at broker side:

```
2022-06-14T02:04:20,522+0800 [metadata-store-22-1] WARN  org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:64846] persistent://public/default/topic: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
org.apache.pulsar.metadata.api.MetadataStoreException$AlreadyExistsException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
```

It's because when broker handles the partitioned metadata command, it
calls `fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync` and
will try creating a partitioned topic if it doesn't exist. It's a race
condition that if many connections are established during a short time
interval and one of them created successfully, the following will fail
with the `AlreadyExistsException`.

### Modifications

Handles the `MetadataStoreException.AlreadyExistsException` in
`unsafeGetPartitionedTopicMetadataAsync`. In this case, invoke
`fetchPartitionedTopicMetadataAsync` to get the partitioned metadata
again.

### Verifying this change

Even if without this patch, the creation of producers could also succeed
because they will reconnect to broker again after 100 ms because broker
will return a `ServiceNotReady` error in thiss case. The only way to
verify this fix is reproducing the bug again with this patch, we can
see no reconnection will happen from the logs.
@BewareMyPower BewareMyPower added type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages release/2.8.4 release/2.10.2 release/2.9.4 labels Jun 13, 2022
@BewareMyPower BewareMyPower added this to the 2.11.0 milestone Jun 13, 2022
@BewareMyPower BewareMyPower self-assigned this Jun 13, 2022
@BewareMyPower BewareMyPower changed the title [improve][broker] Avoid reconnection when a partitioned topic was cre… [improve][broker] Avoid reconnection when a partitioned topic was created concurrently Jun 13, 2022
@dave2wave
Copy link
Member

To me it would help if this PR were tested using OMB which creates many topic-partitions at once from several clients.

@Anonymitaet Anonymitaet added the doc-not-needed Your PR changes do not impact docs label Jun 14, 2022
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good catch @BewareMyPower

@BewareMyPower BewareMyPower merged commit 2a7a855 into apache:master Jun 14, 2022
@BewareMyPower BewareMyPower deleted the bewaremypower/reduce-warn-logs branch June 14, 2022 09:03
codelipenghui pushed a commit that referenced this pull request Jun 28, 2022
…ated concurrently (#16043)

* [improve][broker] Avoid reconnection when a partitioned topic was created concurrently

### Motivation

When a partitioned topic was created concurrently, especially when
automatically created by many producers. This case can be reproduced
easily by configuring `allowAutoTopicCreationType=non-partitioned` and
starting a Pulsar standalone. Then, run the following code:

```java
try (PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650").build()) {
    for (int i = 0; i < 10; i++) {
        client.newProducer().topic("topic").createAsync();
    }
    Thread.sleep(1000);
}
```

We can see a lot of "Could not get connection while
getPartitionedTopicMetadata" warning logs at client side, while there
were more warning logs with full stack traces at broker side:

```
2022-06-14T02:04:20,522+0800 [metadata-store-22-1] WARN  org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:64846] persistent://public/default/topic: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
org.apache.pulsar.metadata.api.MetadataStoreException$AlreadyExistsException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
```

It's because when broker handles the partitioned metadata command, it
calls `fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync` and
will try creating a partitioned topic if it doesn't exist. It's a race
condition that if many connections are established during a short time
interval and one of them created successfully, the following will fail
with the `AlreadyExistsException`.

### Modifications

Handles the `MetadataStoreException.AlreadyExistsException` in
`unsafeGetPartitionedTopicMetadataAsync`. In this case, invoke
`fetchPartitionedTopicMetadataAsync` to get the partitioned metadata
again.

### Verifying this change

Even if without this patch, the creation of producers could also succeed
because they will reconnect to broker again after 100 ms because broker
will return a `ServiceNotReady` error in thiss case. The only way to
verify this fix is reproducing the bug again with this patch, we can
see no reconnection will happen from the logs.

* Revert "[improve][broker] Avoid reconnection when a partitioned topic was created concurrently"

This reverts commit c259c0f.

* Handle AlreadyExistsException in fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync

(cherry picked from commit 2a7a855)
mattisonchao pushed a commit that referenced this pull request Jul 2, 2022
…ated concurrently (#16043)

* [improve][broker] Avoid reconnection when a partitioned topic was created concurrently

### Motivation

When a partitioned topic was created concurrently, especially when
automatically created by many producers. This case can be reproduced
easily by configuring `allowAutoTopicCreationType=non-partitioned` and
starting a Pulsar standalone. Then, run the following code:

```java
try (PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650").build()) {
    for (int i = 0; i < 10; i++) {
        client.newProducer().topic("topic").createAsync();
    }
    Thread.sleep(1000);
}
```

We can see a lot of "Could not get connection while
getPartitionedTopicMetadata" warning logs at client side, while there
were more warning logs with full stack traces at broker side:

```
2022-06-14T02:04:20,522+0800 [metadata-store-22-1] WARN  org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:64846] persistent://public/default/topic: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
org.apache.pulsar.metadata.api.MetadataStoreException$AlreadyExistsException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
```

It's because when broker handles the partitioned metadata command, it
calls `fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync` and
will try creating a partitioned topic if it doesn't exist. It's a race
condition that if many connections are established during a short time
interval and one of them created successfully, the following will fail
with the `AlreadyExistsException`.

### Modifications

Handles the `MetadataStoreException.AlreadyExistsException` in
`unsafeGetPartitionedTopicMetadataAsync`. In this case, invoke
`fetchPartitionedTopicMetadataAsync` to get the partitioned metadata
again.

### Verifying this change

Even if without this patch, the creation of producers could also succeed
because they will reconnect to broker again after 100 ms because broker
will return a `ServiceNotReady` error in thiss case. The only way to
verify this fix is reproducing the bug again with this patch, we can
see no reconnection will happen from the logs.

* Revert "[improve][broker] Avoid reconnection when a partitioned topic was created concurrently"

This reverts commit c259c0f.

* Handle AlreadyExistsException in fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync

(cherry picked from commit 2a7a855)
@mattisonchao mattisonchao added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Jul 2, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 4, 2022
…ated concurrently (apache#16043)

* [improve][broker] Avoid reconnection when a partitioned topic was created concurrently

### Motivation

When a partitioned topic was created concurrently, especially when
automatically created by many producers. This case can be reproduced
easily by configuring `allowAutoTopicCreationType=non-partitioned` and
starting a Pulsar standalone. Then, run the following code:

```java
try (PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650").build()) {
    for (int i = 0; i < 10; i++) {
        client.newProducer().topic("topic").createAsync();
    }
    Thread.sleep(1000);
}
```

We can see a lot of "Could not get connection while
getPartitionedTopicMetadata" warning logs at client side, while there
were more warning logs with full stack traces at broker side:

```
2022-06-14T02:04:20,522+0800 [metadata-store-22-1] WARN  org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:64846] persistent://public/default/topic: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
org.apache.pulsar.metadata.api.MetadataStoreException$AlreadyExistsException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
```

It's because when broker handles the partitioned metadata command, it
calls `fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync` and
will try creating a partitioned topic if it doesn't exist. It's a race
condition that if many connections are established during a short time
interval and one of them created successfully, the following will fail
with the `AlreadyExistsException`.

### Modifications

Handles the `MetadataStoreException.AlreadyExistsException` in
`unsafeGetPartitionedTopicMetadataAsync`. In this case, invoke
`fetchPartitionedTopicMetadataAsync` to get the partitioned metadata
again.

### Verifying this change

Even if without this patch, the creation of producers could also succeed
because they will reconnect to broker again after 100 ms because broker
will return a `ServiceNotReady` error in thiss case. The only way to
verify this fix is reproducing the bug again with this patch, we can
see no reconnection will happen from the logs.

* Revert "[improve][broker] Avoid reconnection when a partitioned topic was created concurrently"

This reverts commit c259c0f.

* Handle AlreadyExistsException in fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync

(cherry picked from commit 2a7a855)
(cherry picked from commit cc9ff59)
BewareMyPower added a commit that referenced this pull request Jul 27, 2022
…ated concurrently (#16043)

* [improve][broker] Avoid reconnection when a partitioned topic was created concurrently

### Motivation

When a partitioned topic was created concurrently, especially when
automatically created by many producers. This case can be reproduced
easily by configuring `allowAutoTopicCreationType=non-partitioned` and
starting a Pulsar standalone. Then, run the following code:

```java
try (PulsarClient client = PulsarClient.builder()
        .serviceUrl("pulsar://localhost:6650").build()) {
    for (int i = 0; i < 10; i++) {
        client.newProducer().topic("topic").createAsync();
    }
    Thread.sleep(1000);
}
```

We can see a lot of "Could not get connection while
getPartitionedTopicMetadata" warning logs at client side, while there
were more warning logs with full stack traces at broker side:

```
2022-06-14T02:04:20,522+0800 [metadata-store-22-1] WARN  org.apache.pulsar.broker.service.ServerCnx - Failed to get Partitioned Metadata [/127.0.0.1:64846] persistent://public/default/topic: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
org.apache.pulsar.metadata.api.MetadataStoreException$AlreadyExistsException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/partitioned-topics/public/default/persistent/topic
```

It's because when broker handles the partitioned metadata command, it
calls `fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync` and
will try creating a partitioned topic if it doesn't exist. It's a race
condition that if many connections are established during a short time
interval and one of them created successfully, the following will fail
with the `AlreadyExistsException`.

### Modifications

Handles the `MetadataStoreException.AlreadyExistsException` in
`unsafeGetPartitionedTopicMetadataAsync`. In this case, invoke
`fetchPartitionedTopicMetadataAsync` to get the partitioned metadata
again.

### Verifying this change

Even if without this patch, the creation of producers could also succeed
because they will reconnect to broker again after 100 ms because broker
will return a `ServiceNotReady` error in thiss case. The only way to
verify this fix is reproducing the bug again with this patch, we can
see no reconnection will happen from the logs.

* Revert "[improve][broker] Avoid reconnection when a partitioned topic was created concurrently"

This reverts commit c259c0f.

* Handle AlreadyExistsException in fetchPartitionedTopicMetadataCheckAllowAutoCreationAsync

(cherry picked from commit 2a7a855)
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Jul 27, 2022
BewareMyPower added a commit that referenced this pull request Jul 29, 2022
@BewareMyPower BewareMyPower removed the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Jul 29, 2022
BewareMyPower added a commit that referenced this pull request Jul 29, 2022
…ated concurrently (#16043)

(cherry picked from commit 2a7a855)

In addition to #16043, this PR fixes #16861
@BewareMyPower
Copy link
Contributor Author

Move the release/2.8.4 label to #16867

BewareMyPower added a commit that referenced this pull request Aug 1, 2022
…ated concurrently (#16043) (#16867)

(cherry picked from commit 2a7a855)

In addition to #16043, this PR fixes #16861
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs release/2.9.4 release/2.10.2 type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants