Fix getting partition metadata of a nonexistent topic returns 0 #10601
Mark it as WIP first because there may still be some failed tests.

It looks like there are still two flaky tests, which always pass in my local env.

/pulsarbot run-failure-checks
eolivelli
left a comment
Awesome work.
Can you please also verify the behaviour for non-persistent topics?
It happens quite often that we make a fix for persistent topics but let non-persistent topics behave in an inconsistent way.

Thanks @eolivelli for your advice. I tried to add tests for non-persistent topics and found some differences between persistent topics and non-persistent topics. In this PR, I'll only fix the

@BewareMyPower can you please follow up with a new PR for non-persistent topics?

@eolivelli I'll push a commit to fix the

@codelipenghui Please include this fix in 2.8.0 if it's merged.

@eolivelli PTAL again
eolivelli
left a comment
great work
Upgrade the pulsar dependency to a newer version (2.8.0-rc-202105182205) that contains apache/pulsar#10601, which corrected the wrong semantics of `getPartitionedTopicMetadataAsync`, which used to return 0 for a nonexistent topic.
…11557)

Fixes #11551

### Motivation

Currently there are some bugs in the C++ client and some tests cannot pass:

1. Introduced by #10601, because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
   - AuthPluginTest.testTlsDetectHttps
   - AuthPluginToken.testTokenWithHttpUrl
   - BasicEndToEndTest.testHandlerReconnectionLogic
   - BasicEndToEndTest.testV2TopicHttp
   - ClientDeduplicationTest.testProducerDeduplication
2. Introduced by #11029 and #11486: the implementation will iterate more than once even when there's only one valid resolved IP address.
   - ClientTest.testConnectTimeout

In addition, there is a long-standing flaky test: ClientTest.testLookupThrottling.

Python tests are also broken. Because they only run after all C++ tests have passed, these failures were not exposed either.

1. Some tests in `pulsar_test.py` might encounter a `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed.

Since the CI test of the C++ client would never fail after #10309 (will be fixed by #11575), all PRs about the C++ or Python client are not verified even if CI passed. Before #11575 is merged, we need to fix all existing bugs of the C++ client.

### Modifications

Corresponding to the test groups above, this PR makes the following modifications:

1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect to the next IP.

Regarding the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of the test and fixes the `ClientImpl::close` implementation. Before this PR, if there were no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close the connection pool and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking the callback directly. In addition, the log level of this test is changed to debug.

This PR also fixes the failing Python timeout tests: some were caused by incorrect imports of classes, some by `client` not being closed.

Regarding the Python schema tests, in Python 2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overridden. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called (see https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139), but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implements `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison works in Python 2.

### Verifying this change

We can only check the workflow output to verify this change.
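
The `__ne__` fix described in the commit message can be illustrated with a minimal sketch. The `Record` class below is a hypothetical stand-in written for this example, not the actual `pulsar.schema` class; it only shows the general technique of delegating `__ne__` to a custom `__eq__`, which Python 2 does not do automatically.

```python
class Record(object):
    """Hypothetical stand-in for a schema record (not the pulsar.schema class)."""

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __eq__(self, other):
        # Field-by-field comparison; nested Record values are compared with ==.
        if not isinstance(other, Record):
            return NotImplemented
        return self.__dict__ == other.__dict__

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so delegate explicitly.
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result


# Two distinct objects with equal fields, one level of nesting.
outer_a = Record(name="a", nested=Record(x=1))
outer_b = Record(name="a", nested=Record(x=1))
```

With this definition, `outer_a != outer_b` evaluates through the custom `__eq__` instead of falling back to an identity comparison. Python 3 already derives `!=` from `__eq__`, so the explicit method mainly matters for Python 2.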
The same commit message was later referenced again with the trailer `(cherry picked from commit 4919a82)`.

This PR breaks all previous clients that use HTTP lookup. HTTP lookup on the client side expects partitioned metadata with 0 partitions and this PR gives

@rdhabalia I think that this is fixing a bug: if the topic does not exist, we should return "not found". I understand that this is painful for clients who expected "0" instead of "Not Found"; we should have advertised this change better or dealt with it in a different way. At the time of 2.8.0 we didn't have good tools or processes to deal with this kind of breaking change (generally speaking, breaking changes shouldn't be allowed at all!). Now the community has evolved, and before making this kind of change we have to follow a stricter process (PIP), which should keep existing users from running into this kind of problem again.

I am going to revert this change as old clients are just failing. Let me know if anyone has any objections. I have experienced such intentional changes in the past as well, and I would like to check whether anyone thinks it's fine to make destructive changes just because their PRs will be merged; it's not a good practice. So this time I want to make the message clear about these intentionally breaking changes.

This change was released with 2.8, which is now a very old release (very close to being considered obsolete). I am aware of many new users since 2 I think that you could send out a PR and start a discussion on dev@

Before this change, the previous behavior had been there for 8 years, so I definitely reject the claim that it's been there for the last 2 years. And again: this change breaks the lookup for old clients. It's a P0 and it's not acceptable with any justification.

Also, I don't understand why we defend breaking changes. What's the main motivation here? Does anyone want me to dig up the list of all past breaking changes that we have to live with by patching on top of open source?

This PR was included in 2.8.0 and didn't break the compatibility with 2.7.x because #4963 introduced a query param The C++ client tests should expose the breaking change because there was no I agree that it's a breaking change. But this new behavior is also needed in some cases to distinguish a non-partitioned topic from a nonexistent topic. I prefer to add a new config to keep the compatibility for old-version clients. Reverting this PR would also bring many breaking changes for those external systems that depend on Pulsar 2.8.0 or later. @rdhabalia @eolivelli I will send an email to dev and reference the discussion here soon.

Let's continue the discussion here: https://lists.apache.org/thread/88t1xxf68j092k09srdwyzj1tk4ml5n9
…of a nonexistent topic

### Motivation

apache#10601 (comment)

apache#10601 changes the behavior when querying the partitions of a topic that has not been created. Before apache#10601, 0 is returned. After apache#10601, an exception is thrown to indicate that the topic does not exist.

This leads to an incompatibility with some old Pulsar clients that do not add the "checkAllowAutoCreation=true" query param. If they use an HTTP service URL like "http://localhost:8080", the client will fail when accessing a topic that does not exist.

The affected Pulsar clients include Java client <= 2.4.2 and C++/Python client <= 2.8.0.

### Modifications

Add an option `checkTopicExistsWhenQueryPartitions` (default: true) to determine the behavior. Disable this option to keep the original behavior, in which 0 is returned when querying the partitions of a nonexistent topic.
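
The two behaviors selected by this option can be pictured with a small model. This is only a sketch under stated assumptions: `query_partitions`, `TopicNotFoundError`, and the dictionary of known topics are hypothetical names invented for illustration, and the function ignores auto-creation and everything else the real broker does.

```python
class TopicNotFoundError(Exception):
    """Hypothetical error standing in for the broker's "not found" response."""


def query_partitions(known_topics, topic, check_topic_exists=True):
    """Return the partition count of `topic` in this simplified model.

    known_topics maps a topic name to its partition count
    (0 means a non-partitioned topic).
    check_topic_exists mirrors the intent of checkTopicExistsWhenQueryPartitions.
    """
    if topic in known_topics:
        return known_topics[topic]
    if check_topic_exists:
        # Behavior after #10601: a missing topic is reported as an error.
        raise TopicNotFoundError(topic)
    # Original behavior: a missing topic looks like a non-partitioned one.
    return 0
```

The model also shows why the newer behavior is useful: with the check enabled, a caller can tell a non-partitioned topic (count 0) apart from a topic that does not exist, which the original behavior cannot express.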
Fixes #8813
Motivation
Actually #8818 didn't fix the issue. The reason is that when `PulsarAdmin` tries to get partition metadata, the `checkAllowAutoCreation` query param is false, so the topic existence check will never be performed.
Modifications
`checkAllowAutoCreation` is false. See comments for detailed explanations.
Verifying this change
This change is already covered by existing tests, such as AdminApiTest#partitionedTopics.
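
For readers unfamiliar with the `checkAllowAutoCreation` query param, the sketch below shows how a partition metadata request URL might be assembled with and without it. The helper name `partitions_url` is hypothetical, and the `/admin/v2/persistent/.../partitions` path is an assumption about the admin REST layout rather than something stated in this thread.

```python
from urllib.parse import urlencode


def partitions_url(service_url, tenant, namespace, topic,
                   check_allow_auto_creation=False):
    """Build a partition metadata URL (hypothetical helper, assumed REST path)."""
    base = "{}/admin/v2/persistent/{}/{}/{}/partitions".format(
        service_url.rstrip("/"), tenant, namespace, topic)
    if check_allow_auto_creation:
        # Append the query param only when the caller opts in.
        return base + "?" + urlencode({"checkAllowAutoCreation": "true"})
    return base


# Example values; the tenant, namespace, and topic names are placeholders.
plain = partitions_url("http://localhost:8080", "public", "default", "my-topic")
with_flag = partitions_url("http://localhost:8080", "public", "default",
                           "my-topic", check_allow_auto_creation=True)
```

A client that omits the param sends the `plain` form, which is the case the motivation above describes.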