@lhotari
Member

@lhotari lhotari commented Jan 22, 2021

Motivation

The MessageIdTest class contains two of the flaky test cases in the Pulsar code base:

  • org.apache.pulsar.client.impl.MessageIdTest.testChecksumVersionComptability and
  • org.apache.pulsar.client.impl.MessageIdTest.testChecksumReconnection

These test cases have little to do with MessageId. They validate message checksum handling in a mixed broker environment where some brokers run a pre-1.15 version and others run a post-1.15 version. The tests might not be very relevant any more. However, fixing them was treated as a learning experiment: the tests were refactored so that the flakiness in the test code is eliminated. Similar patterns might be needed in other tests to eliminate flakiness.

Modifications

The changes aren't limited to fixing MessageIdTest. Most of them could help reduce the flakiness of other tests as well.

Improve shutdown of the broker and related services to reduce test flakiness

  • await termination of executors
  • close the listen channel synchronously
  • use shutdown instead of shutdownNow in AbstractMetadataStore.close
    so that in-flight tasks get processed
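
The difference between shutdown and shutdownNow can be illustrated with a minimal sketch that uses only the JDK. The class and helper names below are hypothetical and are not taken from the Pulsar code base; the sketch only shows the general pattern of letting queued tasks finish and waiting for termination before moving on.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GracefulShutdownSketch {

    // Hypothetical helper: stop accepting new tasks, let already submitted
    // tasks finish, and fall back to shutdownNow() only if the timeout expires.
    static boolean shutdownAndAwait(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown();
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the caller.
            Thread.currentThread().interrupt();
        }
        executor.shutdownNow();
        return false;
    }

    // Submits the given number of tasks, shuts down gracefully,
    // and returns how many tasks actually ran.
    static int runAndShutdown(int tasks) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            executor.execute(completed::incrementAndGet);
        }
        shutdownAndAwait(executor, 10, TimeUnit.SECONDS);
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runAndShutdown(5));
    }
}
```

With shutdownNow, tasks still waiting in the queue would be returned to the caller instead of being run, which is the kind of lost in-flight work the change above is meant to avoid.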

Handle the special case where the executor rejects the task and the callback is never called
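
As a rough illustration of the rejected-task case, the sketch below completes a future exceptionally when the executor refuses the task, so a caller waiting on the callback does not hang. The names are hypothetical, and this is not the code changed in this pull request; it assumes the executor uses the JDK default rejection policy, which throws RejectedExecutionException.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedTaskSketch {

    // Hypothetical helper: the task completes the future when it runs. If the
    // executor rejects the task, the future is failed immediately instead of
    // being left incomplete forever.
    static CompletableFuture<String> submitWithCallback(ExecutorService executor, String value) {
        CompletableFuture<String> result = new CompletableFuture<>();
        try {
            executor.execute(() -> result.complete(value));
        } catch (RejectedExecutionException e) {
            result.completeExceptionally(e);
        }
        return result;
    }

    // An executor that has been shut down rejects new tasks, which makes it
    // a convenient way to trigger the rejection path.
    static boolean failsFastAfterShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.shutdown();
        return submitWithCallback(executor, "ignored").isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("failed fast: " + failsFastAfterShutdown());
    }
}
```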

Improve logging in MockedPulsarServiceBaseTest related to stopping and starting

Refactor PulsarClient initialization and lifecycle management in tests

Add getter and setter to access remoteEndpointProtocolVersion field

  • it makes it easier to override for tests

Add hooks for overriding the producer implementation in PulsarClientImpl

  • useful for tests. Instead of relying on Mockito, there's a pure Java
    way to inject behavior into producer implementations for testing purposes
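
The general idea of such a hook can be sketched with an overridable factory method. Every type below is a simplified stand-in invented for illustration; none of them is the actual PulsarClientImpl or producer API, whose real signatures are not shown in this description.

```java
import java.util.ArrayList;
import java.util.List;

public class ProducerHookSketch {

    // Stand-in for a producer; NOT the Pulsar Producer interface.
    interface SketchProducer {
        String send(String message);
    }

    // Stand-in for a client that exposes an overridable factory method (the hook).
    static class SketchClient {
        protected SketchProducer newProducer() {
            return message -> "sent:" + message;
        }

        final String publish(String message) {
            return newProducer().send(message);
        }
    }

    // Test-only subclass: overrides the hook to record messages and
    // simulate a failing producer, without any mocking library.
    static class FailingTestClient extends SketchClient {
        final List<String> attempted = new ArrayList<>();

        @Override
        protected SketchProducer newProducer() {
            return message -> {
                attempted.add(message);
                return "failed:" + message;
            };
        }
    }

    public static void main(String[] args) {
        FailingTestClient client = new FailingTestClient();
        System.out.println(client.publish("hello"));
        System.out.println(client.attempted);
    }
}
```

Because the injected behavior is ordinary Java, the test controls exactly when and how the producer misbehaves, which tends to be easier to reason about than stubbing internals.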

Introduce PulsarTestClient that contains ways to prevent race conditions and test flakiness

  • provides features for simulating failure conditions, for example
    the case of the broker connection disconnecting

Add a solution for using Enum classes as the source for a TestNG DataProvider
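
A TestNG @DataProvider method typically returns an Object[][], with one row per test invocation. The sketch below shows one possible way to build that array from an enum using only the JDK. The helper name and the example enum are hypothetical and may differ from the utility added in this pull request.

```java
import java.util.Arrays;

public class EnumDataProviderSketch {

    // Hypothetical enum used only for this example.
    enum ChecksumMode { ENABLED, DISABLED }

    // Converts every constant of the enum into a single-element row, which is
    // the shape a TestNG data provider method would return.
    static <E extends Enum<E>> Object[][] toDataProviderArray(Class<E> enumClass) {
        return Arrays.stream(enumClass.getEnumConstants())
                .map(value -> new Object[] { value })
                .toArray(Object[][]::new);
    }

    public static void main(String[] args) {
        for (Object[] row : toDataProviderArray(ChecksumMode.class)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a test class, a method annotated with @DataProvider could return the result of this helper, so that each enum constant becomes one parameterized test run.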

Fix flaky MessageIdTest and move checksum related tests to new class

Contributor

@eolivelli eolivelli left a comment

great work !

@eolivelli
Contributor

@codelipenghui @zymap @sijie @rdhabalia this patch is going to fix the flakiest test in the suite.

please take a look

thank you @lhotari for contributing this work

@lhotari lhotari force-pushed the lh-fix-flaky-messageidtest branch from 44ccd34 to 06f8229 Compare January 22, 2021 16:16
Contributor

@merlimat merlimat left a comment

Very nice work, just a couple of comments on executors

@merlimat merlimat added area/test type/flaky-tests type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages labels Jan 22, 2021
@merlimat merlimat added this to the 2.8.0 milestone Jan 22, 2021
Member

@zymap zymap left a comment

nice work!

@lhotari lhotari force-pushed the lh-fix-flaky-messageidtest branch from 06f8229 to 59d29f0 Compare January 25, 2021 10:13
@lhotari lhotari requested review from merlimat and sijie January 25, 2021 10:56
@merlimat
Contributor

@lhotari It seems there's a genuine test failure:

Error:  Failures: 
Error:  org.apache.pulsar.client.impl.PartitionedProducerImplTest.testCustomMessageRouterInstance(org.apache.pulsar.client.impl.PartitionedProducerImplTest)
[INFO]   Run 1: PASS
Error:    Run 2: PartitionedProducerImplTest.testCustomMessageRouterInstance:99->getMessageRouter:105 ? NullPointer
[INFO] 
Error:  org.apache.pulsar.client.impl.PartitionedProducerImplTest.testRoundRobinPartitionMessageRouterImplInstance(org.apache.pulsar.client.impl.PartitionedProducerImplTest)
[INFO]   Run 1: PASS
Error:    Run 2: PartitionedProducerImplTest.testRoundRobinPartitionMessageRouterImplInstance:89->getMessageRouter:105 ? NullPointer
[INFO] 
Error:  org.apache.pulsar.client.impl.PartitionedProducerImplTest.testSinglePartitionMessageRouterImplInstance(org.apache.pulsar.client.impl.PartitionedProducerImplTest)
[INFO]   Run 1: PASS
Error:    Run 2: PartitionedProducerImplTest.testSinglePartitionMessageRouterImplInstance:80->getMessageRouter:105 ? NullPointer
[INFO] 

@lhotari
Member Author

lhotari commented Jan 25, 2021

It seems there's a genuine test failure:

@merlimat Thanks for the heads up. I'll address it tomorrow.

Contributor

@eolivelli eolivelli left a comment

LGTM

@lhotari
Member Author

lhotari commented Jan 26, 2021

/pulsarbot run-failure-checks

@merlimat merlimat merged commit 19e6546 into apache:master Jan 26, 2021
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021
…he#9286)

* Refactor PulsarClient initialization and lifecycle management in tests

* Add getter and setter to access remoteEndpointProtocolVersion field

- it makes it easier to override for tests

* Add hooks for overriding the producer implementation in PulsarClientImpl

- useful for tests. Instead of relying on Mockito, there's a pure Java
  way to inject behavior to producer implementations for testing purposes

* Introduce PulsarTestClient that contains ways to prevent race conditions and test flakiness

- provides features for simulating failure conditions, for example
  the case of the broker connection disconnecting

* Add solution for using Enums classes as source for TestNG DataProvider

* Fix flaky MessageIdTest and move checksum related tests to new class

* Fix NPE in PartitionedProducerImplTest
