KAFKA-2752: Add VerifiableSource/Sink connectors and rolling bounce Copycat system tests. #432
ewencp wants to merge 14 commits into apache:trunk
Conversation
This requires #431 before this test will run reliably. You'll probably also notice a bunch of logging-related changes. These were helpful in tracking down issues in the 4 or 5 JIRAs that came out of writing this test, so I think they're worth keeping.
Nice. This is a cool way to test distributed mode.
Is this correct? `currentTimeNs` is never updated and hence always the same as `sleepStartNs`.
No, it was not correct. Good catch! I've fixed it and validated correctness by running ProducerPerformance manually.
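The bug caught above is easy to reintroduce: if the current timestamp is read once and cached, the computed elapsed time never changes and the throttle never adapts. Below is a minimal Python sketch of a rate-limited send loop illustrating the fix; the function and parameter names are hypothetical, not the Java `ProducerPerformance` code from this PR.

```python
import time

def send_throttled(send_fn, total_records, target_rate):
    """Send total_records at roughly target_rate records/sec.

    Illustrative sketch only. The key detail (the bug caught in review):
    the clock must be re-read on every iteration -- caching one timestamp
    before the loop makes the computed elapsed time constant, so the
    throttle sleeps either never or forever.
    """
    start_ns = time.monotonic_ns()
    for sent in range(1, total_records + 1):
        send_fn(sent)
        # Re-read the clock each iteration; do not reuse a stale timestamp.
        current_ns = time.monotonic_ns()
        elapsed_s = (current_ns - start_ns) / 1e9
        if elapsed_s > 0 and sent / elapsed_s > target_rate:
            # Sleep just long enough to fall back to the target rate.
            time.sleep(sent / target_rate - elapsed_s)
```

With a target of 100 records/sec, sending 20 records should take roughly 0.2 seconds, which is an easy property to sanity-check manually.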
This looks reasonable to me.
Just note that the PR title is wrong: it's KAFKA-2752 (not 2572). We don't want to close the wrong JIRA accidentally :)
…opycat system tests.
…ilure and logging a bit more info during the test run.
@gwenshap Good catch, fixed in the commit msg and the PR title.

@guozhangwang This is effectively complete wrt the actual test (and I added a bit more code to dump the contents of the topic in the case of failure, which can aid in debugging). However, we'll still hit failures sometimes. On the source side, we can see duplicates sometimes, sort of due to https://issues.apache.org/jira/browse/KAFKA-2713, where a sink task can block during join for long enough that it hits the worker group timeout and the rest of the group moves on, but the source task then processes data even after it's out of the group (and can commit offsets since it thinks it still owns that data). I've seen missing sink data, but haven't tracked down the issue yet; I haven't managed to catch it with good enough logging + a dump of the topic. We can also see some unhandled exceptions currently because of wakeup exceptions during consumer.close(), an issue that was only introduced recently. (I'm coordinating w/ @hachikuji to sort that issue out.)

So, probably a question for the community more generally and for @guozhangwang and @gwenshap immediately: how do we want to handle tests that we're confident are set up properly and useful, but are going to fail while some existing issues are sorted out? I kind of don't want this to sit in limbo (I already did that with the first 5 patches that writing this test resulted in...). On the other hand, committing failing tests is not helpful. Thoughts about what to do with this patch in the meantime until those issues are addressed?
@ewencp It would be a bug if a consumer kicked out of the group can still commit offsets: its commit should be rejected with ILLEGAL_GENERATION. If you see this happening we should investigate asap, I think.

About testing: I would personally prefer only adding test cases that pass to the code base, and piggy-backing tests that fail due to known issues onto the fix for that issue. But I also understand that keeping track of those tests is sort of a pain: the more painful it is, the more motivated we are to resolve the dangling problem. So probably we should just bite the bullet and fix the related existing issues right away?
@guozhangwang The issue is with source connectors, which have to handle offset commits outside the normal consumer group offset commit functionality, since they do not use topic partitions or integer offsets. So this issue is definitely due to the fact that we have not yet generalized the group coordinator to manage these types of write/commit operations. I thought about routing all these writes through the leader, but that only reduces, not removes, the issue, since the lagging node could be the leader.

Re: tests, yes, I am looking to fix them immediately, so in this case we can defer for a while at least. Part of my point is that in general system tests are a lot more finicky than unit tests, so even if a test has been reviewed and passes regularly locally, we can still see cases where it fails (especially for new functionality like this!). So even when we think a test is stable, we might still want a mechanism to put it into "trial mode" for a while. But that's a longer-term discussion; we don't have to address it right now.
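The ILLEGAL_GENERATION fencing discussed above can be illustrated with a toy model. This is not Kafka's broker code: the class and method names are invented for illustration. The point is that a member whose generation id is stale (because the group rebalanced without it) has its commit rejected instead of silently overwriting offsets, which is exactly the protection source-connector offset writes currently lack since they bypass the group coordinator.

```python
class ToyGroupCoordinator:
    """Toy sketch (not Kafka code) of generation fencing for offset commits.

    Every rebalance bumps the generation; a member commits with the
    generation id it joined under, and stale generations are rejected
    with ILLEGAL_GENERATION rather than allowed to overwrite state.
    """

    def __init__(self):
        self.generation = 0
        self.committed = {}

    def join(self):
        # A (re)join triggers a rebalance and a new generation id.
        self.generation += 1
        return self.generation

    def commit(self, member_generation, partition, offset):
        if member_generation != self.generation:
            # The group moved on without this member; fence the write.
            raise RuntimeError("ILLEGAL_GENERATION")
        self.committed[partition] = offset
```

In the duplicate-records scenario described above, a source task that blocked past the group timeout would hold a stale generation id, and this check would reject its commit instead of letting it claim data it no longer owns.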
…l be required to debug the distributed test failures.
… execute in parallel and cannot block the herder thread.
@gwenshap @guozhangwang I've updated this with trunk and it's now stable as far as I can tell (passed many times in a row).
One minor question, otherwise LGTM!
pid files are good :) However, clean_node probably should not rely on the pid file.
Fair point, replaced with a kill_process version.
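The suggestion here is that cleanup should not trust a pid file, which can be stale or missing after a hard crash; killing by command-line pattern is more robust. A minimal Python sketch of such a kill_process-style helper follows. The `ssh` callable and the `CopycatDistributed` process pattern are assumptions for illustration (ducktape exposes a similar per-node command runner), not the PR's actual code.

```python
import signal

def kill_worker_processes(ssh, sig=signal.SIGKILL):
    """Hypothetical ducktape-style clean_node replacement: kill every
    distributed worker on the node by matching its command line with
    pkill -f, instead of reading a pid file that may be stale.

    `ssh` is assumed to run a shell command on the remote node.
    """
    # `|| true` keeps cleanup from failing on a node with no workers left.
    ssh("pkill -%d -f CopycatDistributed || true" % int(sig))
```

A clean bounce could pass `signal.SIGTERM` for graceful shutdown and fall back to the default `SIGKILL` for the hard-bounce phase of the test.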
… that kills all connect processes rather than using pid files.
See #492 for a checkstyle quick fix.
…opycat system tests.

Author: Ewen Cheslack-Postava <me@ewencp.org>
Reviewers: Ben Stopford, Geoff Anderson, Guozhang Wang

Closes #432 from ewencp/kafka-2752-copycat-clean-bounce-test
TICKET = N/A EXIT_CRITERIA = When upstream also log similar info