Skip to content

Conversation

@BewareMyPower
Copy link
Contributor

Fixes #8813

Motivation

Actually #8818 didn't fix the issue. The reason is when PulsarAdmin tries to get partition metadata, the checkAllowAutoCreation query param is false, so the topic existence check will never be performed.

Modifications

  • Check if the topic exists when the partition's metadata is 0 and checkAllowAutoCreation is false. See comments for detail explanations.
  • Change the current tests to make sure the check works.

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as AdminApiTest#partitionedTopics.

@BewareMyPower BewareMyPower changed the title Fix getting partition metadata of a nonexistent topic returns 0 [WIP] Fix getting partition metadata of a nonexistent topic returns 0 May 16, 2021
@BewareMyPower
Copy link
Contributor Author

Mark it as WIP first because there may be still some failed tests.

@BewareMyPower BewareMyPower changed the title [WIP] Fix getting partition metadata of a nonexistent topic returns 0 Fix getting partition metadata of a nonexistent topic returns 0 May 17, 2021
@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented May 17, 2021

It looks like there're still two flaky tests that can always pass in my local env.

  • LoadBalancerTest.testBrokerRanking
  • MessagePublishBufferThrottleTest.testBlockByPublishRateLimiting

@BewareMyPower
Copy link
Contributor Author

/pulsarbot run-failure-checks

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome work.
can you please also verify the behaviour for non-persistent topics ?
it happens quite often that we make a fix for persistent topics but we let non-persistent topics behave in a inconsistent way

@BewareMyPower
Copy link
Contributor Author

Thanks @eolivelli for your advice. I tried to add tests for non-persistent topics and found some difference between persistent topics and non-persistent topics. In this PR, I'll only fix the getPartitionedTopicMetadata implementation of non-persistent topics and mark the difference as comments.

@eolivelli
Copy link
Contributor

@BewareMyPower can you please follow up with a new PR for non persistent topics ?
I am fine with merging this patch (but please check my comments)

@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented May 17, 2021

@eolivelli I'll push a commit to fix the getPartitionedTopicMetadata for non-persistent topics in this PR soon. The other problems are subscription related admin operations. I may not have time to fix them recently, maybe we can open an issue to track it first.

@BewareMyPower
Copy link
Contributor Author

@codelipenghui Please include this fix in 2.8.0 if it's merged

@BewareMyPower
Copy link
Contributor Author

@eolivelli PTAL again

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work

@eolivelli eolivelli merged commit 403b57a into apache:master May 18, 2021
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-get-partition-metadata branch May 18, 2021 09:32
BewareMyPower added a commit to streamnative/kop that referenced this pull request May 22, 2021
Upgrade pulsar dependency to a newer version (2.8.0-rc-202105182205) that contains apache/pulsar#10601, which corrected the wrong semantics of `getPartitionedTopicMetadataAsync` that returns 0 for a non-existed topic.
@codelipenghui codelipenghui modified the milestones: 2.9.0, 2.8.0 Jul 1, 2021
BewareMyPower added a commit that referenced this pull request Aug 8, 2021
…11557)

Fixes #11551 

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from #10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from #11029 and #11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed. 

Since the CI test of C++ client would never fail after #10309 (will be fixed by #11575), all PRs about C++ or Python client are not verified even if CI passed. Before #11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.
LeBW pushed a commit to LeBW/pulsar that referenced this pull request Aug 9, 2021
…pache#11557)

Fixes apache#11551 

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from apache#10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from apache#11029 and apache#11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed. 

Since the CI test of C++ client would never fail after apache#10309 (will be fixed by apache#11575), all PRs about C++ or Python client are not verified even if CI passed. Before apache#11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.
hangc0276 pushed a commit that referenced this pull request Aug 12, 2021
…11557)

Fixes #11551

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from #10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from #11029 and #11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed.

Since the CI test of C++ client would never fail after #10309 (will be fixed by #11575), all PRs about C++ or Python client are not verified even if CI passed. Before #11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.

(cherry picked from commit 4919a82)
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
…pache#11557)

Fixes apache#11551 

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from apache#10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from apache#11029 and apache#11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed. 

Since the CI test of C++ client would never fail after apache#10309 (will be fixed by apache#11575), all PRs about C++ or Python client are not verified even if CI passed. Before apache#11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.
@rdhabalia
Copy link
Contributor

This PR breaks all previous clients who would like to use HTTP lookup. HTTP lookup at client side expects partitioned metadata with 0 partition and this PR gives TopicNotFoundException to client and old client couldn't handle it and won't be able to create a new topic which breaks topic creation semantics of Pulsar.
I have seen multiple such instances in past where we broke the compatibility and it's not acceptable.

@eolivelli
Copy link
Contributor

@rdhabalia
I agree that this is a behaviour change in the API (although we introduced it in a major release, 2.8.0)

I think that this is fixing a bug, if the topic does not exist we should return "not found".

I understand that this is painful for clients who expected "0" instead of "Not Found", we should have advertised this change better or dealt with it in a different way.

Actually at the time of 2.8.0 we didn't have good tools/processes to deal with this kind of breaking changes (generally speaking breaking changes shouldn't be allowed at all!) .

Now the community evolved and before doing this kind of change we have to follow a more strict process (PIP) and that's should allow us to not fall anymore in this kind of problems for existing users.

@rdhabalia
Copy link
Contributor

I am going to revert this change as old clients are just failing. let me know if anyone has any objections. I have experienced such intentional changes in past as well and I would like to check if anyone thinks that it's fine to do destructive changes just because their PRs will be merged then it's not a good practice. So, this time I want to set the message clear for these intentionally breaking changes.

@eolivelli
Copy link
Contributor

This change has been released with 2.8 that now is a very old release (very close to be considered obsolete). I am aware of many new users since 2
8 and for them we will introduce another breaking change.

I think that you could send out PR and start a discussion on dev@

@rdhabalia
Copy link
Contributor

before this change, the previous behavior was there for 8 years so, I definitely reject the claim that it's been there for the last 2 years. And again: this change is breaking the lookup for an old clients. it's a P0 and it's not acceptable with any justification.

@rdhabalia
Copy link
Contributor

also I don't understand why do we defend breaking changes.? what's the main motivation here? does anyone want me to find out all past breaking change list and we have to live with those changes by patching on top of open source?

@BewareMyPower
Copy link
Contributor Author

This PR was included in 2.8.0 and didn't break the compatibility with 2.7.x because #4963 introduced a query param checkAllowAutoCreation in HTTP client. Actually #4963 introduced a breaking change (allowAutoTopicCreationType=partitioned) so that the query param was added. The breaking change has been reverted in #5308 (allowAutoTopicCreationType=non-partitioned), but the client side changes are not reverted.

The C++ client tests should expose the breaking change because there was no checkAllowAutoCreation in C++ client before #11557. Unfortunately, the CI in master was accidentally broken at that time (see #11575) so the C++ client tests passed.

I agree that it's a breaking change. But this new behavior is also needed in some cases to differ a non-partitioned topic and a non-existent topic.

I prefer to add a new config to keep the compatibility for old version clients. Reverting this PR would also bring many breaking changes for those external systems that depend on Pulsar 2.8.0 or later. @rdhabalia @eolivelli

I will send an mail to dev and reference the discussion here soon.

@BewareMyPower
Copy link
Contributor Author

Let's continue the discussion here: https://lists.apache.org/thread/88t1xxf68j092k09srdwyzj1tk4ml5n9

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Nov 24, 2022
…of a nonexistent topic

### Motivation

apache#10601 (comment)

apache#10601 changes the behavior when querying partitions of a topic that is
not created. Before apache#10601, 0 is returned. After apache#10601, an exception
will be thrown to indicate the topic does not exist. It leads to the
incompatibility with some old Pulsar clients that do not add the
"checkAllowAutoCreation=true" query param. If they use HTTP service URL
like "http://localhost:8080", when accessing a topic that does not
exist, the client will fail.

The affected Pulsar clients include Java client <= 2.4.2 and C++/Python
client <= 2.8.0.

### Modifications

Add an option `checkTopicExistsWhenQueryPartitions` (default: true) to
determine the behavior. Disable this option to keep the original
behavior that 0 will be returned when querying partitions of a
nonexistent topic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Get partition metadata for a non-existed topic returns 0

5 participants