KAFKA-10274; Consistent timeouts in transactions_test#9026
KAFKA-10274; Consistent timeouts in transactions_test#9026junrao merged 1 commit intoapache:trunkfrom
Conversation
|
Could you also kick off a system test? |
chia7712
left a comment
There was a problem hiding this comment.
@hachikuji thanks for this PR! I neglected the hard_bounce of broker before. my bad :(
| # long as the request timeout to get a `Produce` response and we do not | ||
| # want the coordinator timing out the transaction. | ||
| self.transaction_timeout = 40000 | ||
| self.progress_timeout_sec = 60 |
There was a problem hiding this comment.
How about using different transaction_timeout for different mode? For example, lower timeout for hard_bounce of client and higher timeout for broker. I try to avoid higher waiting time (progress_timeout_sec) when encountering other error.
There was a problem hiding this comment.
What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases. Unfortunately we didn't give this class an easy way to override producer configurations, so we would need another argument. I decided to hold off on this, but I can reconsider it if you think it's worthwhile. This service (as well as VerifiableConsumer and VerifiableProducer) are a bit of a grey area as far as whether they are public APIs or not, but I have tended to take the position that they are not 🙂 .
There was a problem hiding this comment.
I decided to hold off on this, but I can reconsider it if you think it's worthwhile.
it is fine to me as your approach is more simple :)
What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases.
the root cause I observed is different to https://issues.apache.org/jira/browse/KAFKA-9802. On my local, TransactionalMessageCopier fails due to ProducerFencedException which is caused by that broker increases the producer epoch when aborting transaction.
} catch (ProducerFencedException | OutOfOrderSequenceException e) {
// We cannot recover from these errors, so just rethrow them and let the process fail
throw e;
} catch (KafkaException e) {
producer.abortTransaction();
resetToLastCommittedPositions(consumer);
}
Perhaps we should make TransactionalMessageCopier recoverable from transaction timeout before KIP-558 is addressed.
BTW, group_mode_transactions_test.py has similar issue. Could you fix it also? Or we can apply your approach to fix group_mode_transactions_test.py in another PR.
|
@abbccdda Here is a link to the test results: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-16--001.1594861014--hachikuji--KAFKA-10274--dea96a4ff/report.html. There was one failure, but it was due to an unrelated environmental issue: |
|
Is this ready to be merged? |
junrao
left a comment
There was a problem hiding this comment.
@hachikuji : Thanks for the PR. LGTM
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at last as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at last as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a Produce response. Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Boyang Chen <boyang@confluent.io>, Jun Rao <junrao@gmail.com>
KAFKA-10235 fixed a consistency issue with the transaction timeout and the progress timeout. Since the test case relies on transaction timeouts, we need to wait at last as long as the timeout in order to ensure progress. However, having a low transaction timeout makes the test prone to the issue identified in KAFKA-9802, in which the coordinator timed out the transaction while the producer was awaiting a
Produceresponse.Committer Checklist (excluded from commit message)