
KAFKA-10274; Consistent timeouts in transactions_test #9026

Merged
junrao merged 1 commit into apache:trunk from hachikuji:KAFKA-10274
Jul 22, 2020

Conversation

@hachikuji
Contributor

KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at least as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.
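
The two constraints described above can be sketched as a small check. This is an illustration only, not code from the test: `check_timeouts` is a hypothetical helper, the 40000 ms and 60 s values are the ones set in this PR's diff, and the 30000 ms request timeout assumes the producer's default `request.timeout.ms`.

```python
# Illustrative only: check_timeouts is a hypothetical helper, not part of kafkatest.
DEFAULT_REQUEST_TIMEOUT_MS = 30000  # assumed: producer default for request.timeout.ms


def check_timeouts(transaction_timeout_ms, progress_timeout_sec,
                   request_timeout_ms=DEFAULT_REQUEST_TIMEOUT_MS):
    """Return a list of problems with the given timeout combination."""
    problems = []
    # The coordinator should not time out a transaction while the producer
    # may still be waiting for a Produce response (KAFKA-9802).
    if transaction_timeout_ms <= request_timeout_ms:
        problems.append("transaction timeout does not exceed request timeout")
    # The test relies on transaction timeouts to make progress, so it has to
    # wait at least that long before giving up (KAFKA-10235).
    if progress_timeout_sec * 1000 < transaction_timeout_ms:
        problems.append("progress timeout is shorter than transaction timeout")
    return problems


# Values chosen in this PR.
print(check_timeouts(transaction_timeout_ms=40000, progress_timeout_sec=60))
# A low transaction timeout is exposed to the KAFKA-9802 race.
print(check_timeouts(transaction_timeout_ms=5000, progress_timeout_sec=60))
```

With the PR's values both checks pass, while a low transaction timeout trips the first one.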

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)


@abbccdda abbccdda left a comment


LGTM

@abbccdda

Could you also kick off a system test?

Member

@chia7712 chia7712 left a comment


@hachikuji Thanks for this PR! I neglected the hard_bounce of the broker before. My bad :(

# long as the request timeout to get a `Produce` response and we do not
# want the coordinator timing out the transaction.
self.transaction_timeout = 40000
self.progress_timeout_sec = 60
Member


How about using a different transaction_timeout for each mode? For example, a lower timeout for a hard_bounce of the clients and a higher timeout for the brokers. I would like to avoid a longer wait (progress_timeout_sec) when the test fails because of some other error.
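
A rough sketch of this suggestion, which is not what was merged: `timeouts_for` is a hypothetical helper, the `"brokers"` and `"clients"` target names are assumed to match the test's `bounce_target` parameter, and the lower values are made up for illustration.

```python
# Hypothetical sketch of the per-mode suggestion; not what this PR merged.
def timeouts_for(bounce_target):
    """Pick (transaction_timeout_ms, progress_timeout_sec) for a bounce target."""
    if bounce_target == "brokers":
        # After a broker bounce the producer may wait a full request timeout
        # for a Produce response, so keep the values used in this PR.
        return 40000, 60
    # Illustrative lower values for client bounces (assumed, untested).
    return 5000, 20


for target in ("brokers", "clients"):
    transaction_timeout_ms, progress_timeout_sec = timeouts_for(target)
    print(target, transaction_timeout_ms, progress_timeout_sec)
```

Either way, the progress timeout stays at least as long as the transaction timeout.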

Contributor Author

@hachikuji hachikuji Jul 16, 2020


What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases. Unfortunately we didn't give this class an easy way to override producer configurations, so we would need another argument. I decided to hold off on this, but I can reconsider it if you think it's worthwhile. This service (as well as VerifiableConsumer and VerifiableProducer) is a bit of a grey area as far as whether it is a public API or not, but I have tended to take the position that it is not 🙂 .
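
One possible shape for such an override, sketched under assumptions: `copier_producer_configs` and its `overrides` argument are hypothetical and do not exist in kafkatest; only the two config names are real producer settings, and the values are illustrative.

```python
# Hypothetical sketch: this helper and its `overrides` argument are not part
# of kafkatest; only the two config names are real producer settings.
def copier_producer_configs(transaction_timeout_ms, overrides=None):
    """Merge per-test producer overrides on top of the copier's own settings."""
    configs = {"transaction.timeout.ms": str(transaction_timeout_ms)}
    for name, value in (overrides or {}).items():
        configs[name] = str(value)
    return configs


# A test could then lower both timeouts together (illustrative values).
configs = copier_producer_configs(10000, overrides={"request.timeout.ms": 5000})
for name in sorted(configs):
    print("%s=%s" % (name, configs[name]))
```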

Member


> I decided to hold off on this, but I can reconsider it if you think it's worthwhile.

It is fine with me, as your approach is simpler :)

> What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases.

The root cause I observed is different from https://issues.apache.org/jira/browse/KAFKA-9802. In my local environment, TransactionalMessageCopier fails with a ProducerFencedException, which is raised because the broker bumps the producer epoch when it aborts the transaction.

                    } catch (ProducerFencedException | OutOfOrderSequenceException e) {
                        // We cannot recover from these errors, so just rethrow them and let the process fail
                        throw e;
                    } catch (KafkaException e) {
                        producer.abortTransaction();
                        resetToLastCommittedPositions(consumer);
                    }

Perhaps we should make TransactionalMessageCopier able to recover from a transaction timeout before KIP-588 is addressed.

BTW, group_mode_transactions_test.py has a similar issue. Could you fix it as well? Alternatively, we can apply your approach to group_mode_transactions_test.py in another PR.

@hachikuji
Contributor Author

@abbccdda Here is a link to the test results: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-16--001.1594861014--hachikuji--KAFKA-10274--dea96a4ff/report.html. There was one failure, but it was due to an unrelated environmental issue:

00:48:36 [INFO:2020-07-16 00:48:36,487]: RunnerClient: kafkatest.tests.core.transactions_test.TransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=clients.check_order=False.use_group_metadata=True: Summary: Unable to open channel.
00:48:36 Traceback (most recent call last):
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 132, in run
00:48:36     self.setup_test()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/runner_client.py", line 183, in setup_test
00:48:36     self.test.setup()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/tests/test.py", line 91, in setup
00:48:36     self.setUp()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/tests/core/transactions_test.py", line 67, in setUp
00:48:36     self.zk.start()
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/services/service.py", line 234, in start
00:48:36     self.start_node(node)
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/tests/kafkatest/services/zookeeper.py", line 97, in start_node
00:48:36     node.account.ssh("echo %d > %s/myid" % (idx, ZookeeperService.DATA))
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/ducktape-0.7.8-py2.7.egg/ducktape/cluster/remoteaccount.py", line 266, in ssh
00:48:36     stdin, stdout, stderr = client.exec_command(cmd)
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/paramiko-2.6.0-py2.7.egg/paramiko/client.py", line 508, in exec_command
00:48:36     chan = self._transport.open_session(timeout=timeout)
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/paramiko-2.6.0-py2.7.egg/paramiko/transport.py", line 879, in open_session
00:48:36     timeout=timeout,
00:48:36   File "/home/jenkins/workspace/system-test-kafka-branch-builder/kafka/venv/lib/python2.7/site-packages/paramiko-2.6.0-py2.7.egg/paramiko/transport.py", line 1006, in open_channel
00:48:36     raise e
00:48:36 SSHException: Unable to open channel.

@ijuma
Member

ijuma commented Jul 21, 2020

Is this ready to be merged?

Contributor

@junrao junrao left a comment


@hachikuji : Thanks for the PR. LGTM

@junrao junrao merged commit 67f5b5d into apache:trunk Jul 22, 2020
junrao pushed a commit that referenced this pull request Jul 22, 2020
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at least as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>
junrao pushed a commit that referenced this pull request Jul 23, 2020
The root cause is the same as in #9026, so I copied the approach of #9026 to fix core/group_mode_transactions_test.py

Reviewers: Jun Rao <junrao@gmail.com>
hachikuji added a commit that referenced this pull request Dec 16, 2020
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at least as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>