Skip to content

Conversation

@Vanlightly
Copy link
Contributor

@Vanlightly Vanlightly commented Jul 28, 2021

Fixes #11485

Motivation

Multiple endpoints, for example "localhost", "127.0.0.1", may need to be iterated over before a connection succeeds. In each iteration the connectTimeoutTask (PeriodicTask) is stopped and started. Stopping it cancels the timer, which fires the handler (handleTimeout) which in turn calls the connect timeout callback which closes the socket. This causes the createProducerAsync to return ResultConnectError. This causes a large number of tests to fail.

Modifications

This change checks the error_code in the PeriodicTask timer handler and returns if it is operation_cancelled, avoiding erroneously calling the callback.

Verifying this change

All existing tests pass.

Does this pull request potentially affect one of the following parts:

None

Documentation

For contributor

For this PR, do we need to update docs?

  • No, because this is a minor fix to an internal operation.

Master issue: apache#11485

When the PeriodicTask timer is cancelled it calls the handler
which in turn ends up invoking the callback which closes
the socket. This prevents the iteration over multiple endpoints
until the connection succeeds. This change checks the error_code
in the handler and returns if it is operation_cancelled,
avoiding erroneously calling the callback.
@merlimat merlimat added the type/bug The PR fixed a bug or issue reported a bug label Jul 30, 2021
@merlimat merlimat merged commit 8070a82 into apache:master Jul 30, 2021
BewareMyPower added a commit that referenced this pull request Aug 8, 2021
…11557)

Fixes #11551 

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from #10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from #11029 and #11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed. 

Since the CI test of C++ client would never fail after #10309 (will be fixed by #11575), all PRs about C++ or Python client are not verified even if CI passed. Before #11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.
LeBW pushed a commit to LeBW/pulsar that referenced this pull request Aug 9, 2021
…pache#11557)

Fixes apache#11551 

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from apache#10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from apache#11029 and apache#11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed. 

Since the CI test of C++ client would never fail after apache#10309 (will be fixed by apache#11575), all PRs about C++ or Python client are not verified even if CI passed. Before apache#11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.
hangc0276 pushed a commit that referenced this pull request Aug 12, 2021
…11557)

Fixes #11551

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from #10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from #11029 and #11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed.

Since the CI test of C++ client would never fail after #10309 (will be fixed by #11575), all PRs about C++ or Python client are not verified even if CI passed. Before #11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.

(cherry picked from commit 4919a82)
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
…he#11486)

Master issue: apache#11485

When the PeriodicTask timer is cancelled it calls the handler
which in turn ends up invoking the callback which closes
the socket. This prevents the iteration over multiple endpoints
until the connection succeeds. This change checks the error_code
in the handler and returns if it is operation_cancelled,
avoiding erroneously calling the callback.

Co-authored-by: Jack Vanlightly <jvanlightly@splunk.com>
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
…pache#11557)

Fixes apache#11551 

### Motivation

Currently there're some bugs of C++ client and some tests cannot pass:

1. Introduced from apache#10601 because it changed the behavior of the admin API to get partition metadata while the C++ implementation relies on the original behavior to create topics automatically. So any test that uses HTTP lookup will fail.
    - AuthPluginTest.testTlsDetectHttps
    - AuthPluginToken.testTokenWithHttpUrl
    - BasicEndToEndTest.testHandlerReconnectionLogic
    - BasicEndToEndTest.testV2TopicHttp
    - ClientDeduplicationTest.testProducerDeduplication
2. Introduced from apache#11029 and apache#11486 , the implementation will iterate more than once even there's only one valid resolved IP address.
    - ClientTest.testConnectTimeout

In addition, there's an existed flaky test from very early time: ClientTest.testLookupThrottling.

Python tests are also broken. Because it must run after all C++ tests passed, they're also not exposed.
1. Some tests in `pulsar_test.py` might encounter `Timeout` error when creating producers or consumers.
2. Some tests in `schema_test.py` failed because some comparisons between two `ComplexRecord`s failed. 

Since the CI test of C++ client would never fail after apache#10309 (will be fixed by apache#11575), all PRs about C++ or Python client are not verified even if CI passed. Before apache#11575 is merged, we need to fix all existed bugs of C++ client.

### Modifications

Corresponding to the above tests group, this PR adds following modifications:
1. Add the `?checkAllowAutoCreation=true` URL suffix to allow HTTP lookup to create topics automatically.
2. When iterating through a resolved IP list, increase the iterator first, then run the connection timer and try to connect the next IP.

Regarding to the flaky `testLookupThrottling`, this PR adds a `client.close()` at the end of test and fix the `ClientImpl::close` implementation. Before this PR, if there're no producers or consumers in a client, the `close()` method wouldn't call `shutdown()` to close connection poll and executors. Only after the `Client` instance was destructed would the `shutdown()` method be called. In this case, this PR calls `handleClose` instead of invoking callback directly. In addition, change the log level of this test to debug.

This PR also fixes the failed timeout Python tests, some are caused by incorrect import of classes, some are caused by `client` was not closed.

Regarding to Python schema tests, in Python2, `self.__ne__(other)` is not equivalent to `not self.__eq__(other)` when the default `__eq__` implementation is overwritten. If a `Record` object has a field whose type is also `Record`, the `Record.__ne__` method will be called, see

https://github.com/apache/pulsar/blob/ddb5fb0e062c2fe0967efce2a443a31f9cd12c07/pulsar-client-cpp/python/pulsar/schema/definition.py#L138-L139

but it just uses the default implementation to check whether they're not equal. The custom `__eq__` method won't be called. Therefore, this PR implement `Record.__ne__` explicitly to call `Record.__eq__` so that the comparison will work for Python2.

### Verifying this change

We can only check the workflow output to verify this change.
BewareMyPower pushed a commit that referenced this pull request Sep 16, 2022
Master issue: #11485

When the PeriodicTask timer is cancelled it calls the handler
which in turn ends up invoking the callback which closes
the socket. This prevents the iteration over multiple endpoints
until the connection succeeds. This change checks the error_code
in the handler and returns if it is operation_cancelled,
avoiding erroneously calling the callback.

Co-authored-by: Jack Vanlightly <jvanlightly@splunk.com>
(cherry picked from commit 8070a82)
@github-actions
Copy link

@Vanlightly Please provide a correct documentation label for your PR.
Instructions see Pulsar Documentation Label Guide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-label-missing release/2.8.5 type/bug The PR fixed a bug or issue reported a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

C++ client connect timeout timer cancellation not handled correctly

5 participants