KAFKA-10274; Consistent timeouts in transactions_test by hachikuji · Pull Request #9026 · apache/kafka

hachikuji · 2020-07-15T00:02:55Z

KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at least as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.
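
The relationship between the timeouts described above can be sketched as a small consistency check. This is an illustration only, not code from the PR: the function name `check_timeouts` is hypothetical, the 40000 ms and 60 s values are the ones set in this PR's diff, and the 30000 ms request timeout assumes the producer's default `request.timeout.ms`.

```python
def check_timeouts(transaction_timeout_ms, progress_timeout_sec, request_timeout_ms=30000):
    """Return a list of problems with a timeout combination (empty if consistent).

    Hypothetical helper for illustration; it is not part of kafkatest.
    """
    problems = []
    # The producer may wait up to the request timeout for a Produce response.
    # If the transaction timeout is not longer than that, the coordinator can
    # time out the transaction while the response is still pending (KAFKA-9802).
    if transaction_timeout_ms <= request_timeout_ms:
        problems.append("transaction timeout should exceed the request timeout")
    # The test relies on transaction timeouts to make progress, so it has to
    # wait at least one full transaction timeout (KAFKA-10235).
    if progress_timeout_sec * 1000 < transaction_timeout_ms:
        problems.append("progress timeout should be at least the transaction timeout")
    return problems


# Values chosen in this PR, checked against the assumed default request timeout.
print(check_timeouts(transaction_timeout_ms=40000, progress_timeout_sec=60))
```

With the values from this PR the check reports no problems; a transaction timeout at or below the request timeout, or a progress timeout shorter than the transaction timeout, would each be flagged.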

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

abbccdda

LGTM

abbccdda · 2020-07-15T02:08:02Z

Could you also kick off a system test?

chia7712

@hachikuji Thanks for this PR! I overlooked the hard_bounce of the broker before. My bad :(

chia7712 · 2020-07-15T08:04:24Z

+        # long as the request timeout to get a `Produce` response and we do not
+        # want the coordinator timing out the transaction.
+        self.transaction_timeout = 40000
+        self.progress_timeout_sec = 60


How about using a different transaction_timeout for each mode? For example, a lower timeout for a hard_bounce of the clients and a higher timeout for the brokers. I would like to avoid a longer wait (progress_timeout_sec) when the test hits some other error.

What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases. Unfortunately we didn't give this class an easy way to override producer configurations, so we would need another argument. I decided to hold off on this, but I can reconsider it if you think it's worthwhile. This service (as well as VerifiableConsumer and VerifiableProducer) is a bit of a grey area as far as whether it is a public API or not, but I have tended to take the position that it is not 🙂 .
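
To make the alternative above concrete: if the request timeout could be overridden, the other two timeouts could be derived from it rather than hard-coded. The sketch below is purely hypothetical. `derive_timeouts` and its `margin_ms` parameter are invented for illustration, and nothing here reflects an existing argument of TransactionalMessageCopier.

```python
def derive_timeouts(request_timeout_ms, margin_ms=10000):
    """Derive test timeouts from a (hypothetical) overridable request timeout."""
    # Keep the transaction timeout above the request timeout so the coordinator
    # does not time out a transaction while a Produce response is pending.
    transaction_timeout_ms = request_timeout_ms + margin_ms
    # Wait at least one full transaction timeout plus the margin for progress,
    # rounded up to whole seconds.
    progress_timeout_sec = (transaction_timeout_ms + margin_ms + 999) // 1000
    return {
        "request_timeout_ms": request_timeout_ms,
        "transaction_timeout_ms": transaction_timeout_ms,
        "progress_timeout_sec": progress_timeout_sec,
    }


# A lower request timeout would allow lower values in every bounce mode.
print(derive_timeouts(5000))
```

Under these assumptions, lowering the request timeout shrinks both the transaction timeout and the time the test waits before declaring a lack of progress.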

> I decided to hold off on this, but I can reconsider it if you think it's worthwhile.

That is fine with me, since your approach is simpler :)

> What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases.

The root cause I observed is different from https://issues.apache.org/jira/browse/KAFKA-9802. In my local runs, TransactionalMessageCopier fails with ProducerFencedException, which is raised because the broker bumps the producer epoch when it aborts the transaction.

    } catch (ProducerFencedException | OutOfOrderSequenceException e) {
        // We cannot recover from these errors, so just rethrow them and let the process fail
        throw e;
    } catch (KafkaException e) {
        producer.abortTransaction();
        resetToLastCommittedPositions(consumer);
    }

Perhaps we should make TransactionalMessageCopier recoverable from transaction timeout before KIP-558 is addressed.
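
The error handling quoted above can be modeled roughly as follows. This is a Python stand-in written for illustration, not the Java code of TransactionalMessageCopier: the exception classes are local placeholders that only mirror the Java names, and `handle_copy_error` with its two callbacks is a hypothetical helper.

```python
class KafkaException(Exception):
    """Placeholder mirroring the Java exception of the same name."""


class ProducerFencedException(KafkaException):
    """Placeholder: the producer epoch was bumped, so this producer is fenced."""


class OutOfOrderSequenceException(KafkaException):
    """Placeholder: the broker saw an unexpected sequence number."""


# Errors the quoted snippet treats as unrecoverable.
FATAL_ERRORS = (ProducerFencedException, OutOfOrderSequenceException)


def handle_copy_error(error, abort_transaction, reset_positions):
    """Mirror the quoted catch blocks: rethrow fatal errors, otherwise abort and rewind."""
    if isinstance(error, FATAL_ERRORS):
        # Same as the first catch block: let the process fail.
        raise error
    if isinstance(error, KafkaException):
        # Same as the second catch block: abort and resume from the last commit.
        abort_transaction()
        reset_positions()
        return "aborted"
    # Anything else is not handled by the quoted snippet.
    raise error


calls = []
result = handle_copy_error(
    KafkaException("transient failure"),
    abort_transaction=lambda: calls.append("abort"),
    reset_positions=lambda: calls.append("reset"),
)
print(result, calls)
```

In this model, a transaction that the coordinator timed out surfaces as a fenced producer and lands in the fatal branch, which is why the copier process fails instead of retrying.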

BTW, group_mode_transactions_test.py has a similar issue. Could you fix it as well? Alternatively, we can apply your approach to group_mode_transactions_test.py in another PR.

hachikuji · 2020-07-16T01:06:42Z

@abbccdda Here is a link to the test results: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-16--001.1594861014--hachikuji--KAFKA-10274--dea96a4ff/report.html. There was one failure, but it was due to an unrelated environmental issue:

00:48:36 [INFO:2020-07-16 00:48:36,487]: RunnerClient: kafkatest.tests.core.transactions_test.TransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=clients.check_order=False.use_group_metadata=True: Summary: Unable to open channel.
00:48:36 Traceback (most recent call last):
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 132, in run
00:48:36     self.setup_test()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 183, in setup_test
00:48:36     self.test.setup()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/test.py", line 91, in setup
00:48:36     self.setUp()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/core/transactions_test.py", line 67, in setUp
00:48:36     self.zk.start()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/services/service.py", line 234, in start
00:48:36     self.start_node(node)
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/services/zookeeper.py", line 97, in start_node
00:48:36     node.account.ssh("echo %d > %s/myid" % (idx, ZookeeperService.DATA))
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/cluster/remoteaccount.py", line 266, in ssh
00:48:36     stdin, stdout, stderr = client.exec_command(cmd)
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/paramiko-2.6.0-py2.7.egg/paramiko/client.py", line 508, in exec_command
00:48:36     chan = self._transport.open_session(timeout=timeout)
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/paramiko-2.6.0-py2.7.egg/paramiko/transport.py", line 879, in open_session
00:48:36     timeout=timeout,
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/paramiko-2.6.0-py2.7.egg/paramiko/transport.py", line 1006, in open_channel
00:48:36     raise e
00:48:36 SSHException: Unable to open channel.

ijuma · 2020-07-21T13:35:57Z

Is this ready to be merged?

junrao

@hachikuji : Thanks for the PR. LGTM

KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at least as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>

The root cause is the same as in #9026, so I copied the approach of #9026 to fix core/group_mode_transactions_test.py. Reviewers: Jun Rao <junrao@gmail.com>


KAFKA-10274; Consistent timeouts in transactions_test

dea96a4

abbccdda approved these changes Jul 15, 2020


chia7712 approved these changes Jul 15, 2020


chia7712 mentioned this pull request Jul 18, 2020

KAFKA-8334 Make sure the thread which tries to complete delayed reque… #8657

Merged


junrao approved these changes Jul 22, 2020


junrao merged commit 67f5b5d into apache:trunk Jul 22, 2020

chia7712 mentioned this pull request Jul 23, 2020

KAFKA-10300 fix flaky core/group_mode_transactions_test.py #9059

Merged


junrao merged 1 commit into apache:trunk from hachikuji:KAFKA-10274
