KAFKA-2752: Add VerifiableSource/Sink connectors and rolling bounce Copycat system tests. #432
ewencp wants to merge 14 commits into apache:trunk
Conversation
This requires #431 before this test will run reliably. You'll probably also notice a bunch of logging-related changes. These were helpful in tracking down issues in the 4 or 5 JIRAs that came out of writing this test, so I think they're worth keeping.
Nice. This is a cool way to test distributed mode.
Is this correct? `currentTimeNs` is never updated and hence always the same as `sleepStartNs`.
No, it was not correct. Good catch! I've fixed it and validated correctness by running ProducerPerformance manually.
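The bug caught above is easy to reintroduce: if the current timestamp is read once and cached, the computed elapsed time never changes and the throttle never adapts. Below is a minimal Python sketch of a rate-limited send loop illustrating the fix; the function and parameter names are hypothetical, not the Java `ProducerPerformance` code from this PR.

```python
import time

def send_throttled(send_fn, total_records, target_rate):
    """Send total_records at roughly target_rate records/sec.

    Illustrative sketch only. The key detail (the bug caught in review):
    the clock must be re-read on every iteration -- caching one timestamp
    before the loop makes the computed elapsed time constant, so the
    throttle sleeps either never or forever.
    """
    start_ns = time.monotonic_ns()
    for sent in range(1, total_records + 1):
        send_fn(sent)
        # Re-read the clock each iteration; do not reuse a stale timestamp.
        current_ns = time.monotonic_ns()
        elapsed_s = (current_ns - start_ns) / 1e9
        if elapsed_s > 0 and sent / elapsed_s > target_rate:
            # Sleep just long enough to fall back to the target rate.
            time.sleep(sent / target_rate - elapsed_s)
```

With a target of 100 records/sec, sending 20 records should take roughly 0.2 seconds, which is an easy property to sanity-check manually.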
This looks reasonable to me.
Just note that the PR title is wrong: it's KAFKA-2752 (not 2572). We don't want to close the wrong JIRA accidentally :)
…opycat system tests.
…ilure and logging a bit more info during the test run.
@gwenshap Good catch, fixed in the commit msg and the PR title.

@guozhangwang This is effectively complete wrt the actual test (and I added a bit more code to dump the contents of the topic in the case of failure, which can aid in debugging). However, we'll still hit failures sometimes. On the source side, we can see duplicates sometimes, sort of due to https://issues.apache.org/jira/browse/KAFKA-2713, where a sink task can block during join for long enough that it hits the worker group timeout and the rest of the group moves on, but the source task then processes data even after it's out of the group (and can commit offsets since it thinks it still owns that data). I've seen missing sink data, but haven't tracked down the issue yet; I haven't managed to catch it with good enough logging + a dump of the topic. We can also see some unhandled exceptions currently because of wakeup exceptions during consumer.close(), an issue that was only introduced recently. (I'm coordinating w/ @hachikuji to sort that issue out.)

So, probably a question for the community more generally and for @guozhangwang and @gwenshap immediately: how do we want to handle tests that we're confident are set up properly and useful, but are going to fail while some existing issues are sorted out? I kind of don't want this to sit in limbo (I already did that with the first 5 patches that writing this test resulted in...). On the other hand, committing failing tests is not helpful. Thoughts about what to do with this patch in the meantime until those issues are addressed?
@ewencp It would be a bug if a consumer kicked out of the group can still commit offsets: its commit should be rejected with ILLEGAL_GENERATION. If you see this happening we should investigate asap, I think.

About testing: I would personally prefer only adding test cases that pass to the code base, and piggy-backing tests that fail due to known issues onto the fix for that issue. But I also understand that keeping track of those tests is sort of a pain: the more painful it is, the more motivated we are to resolve the dangling problem. So probably we should just bite the bullet and fix the related existing issues right away?
@guozhangwang The issue is with source connectors, which have to handle offset commits outside the normal consumer group offset commit functionality, since they do not use topic partitions or integer offsets. So this issue is definitely due to the fact that we have not yet generalized the group coordinator to manage these types of write/commit operations. I thought about routing all these writes through the leader, but that only reduces, not removes, the issue, since the lagging node could be the leader.

Re: tests, yes, I am looking to fix them immediately, so in this case we can defer for a while at least. Part of my point is that in general system tests are a lot more finicky than unit tests, so even if a test has been reviewed and passes regularly locally, we can still see cases where it fails (especially for new functionality like this!). So even when we think a test is stable, we might still want a mechanism to put it into "trial mode" for a while. But that's a longer-term discussion; we don't have to address it right now.
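The ILLEGAL_GENERATION fencing discussed above can be illustrated with a toy model. This is not Kafka's broker code: the class and method names are invented for illustration. The point is that a member whose generation id is stale (because the group rebalanced without it) has its commit rejected instead of silently overwriting offsets, which is exactly the protection source-connector offset writes currently lack since they bypass the group coordinator.

```python
class ToyGroupCoordinator:
    """Toy sketch (not Kafka code) of generation fencing for offset commits.

    Every rebalance bumps the generation; a member commits with the
    generation id it joined under, and stale generations are rejected
    with ILLEGAL_GENERATION rather than allowed to overwrite state.
    """

    def __init__(self):
        self.generation = 0
        self.committed = {}

    def join(self):
        # A (re)join triggers a rebalance and a new generation id.
        self.generation += 1
        return self.generation

    def commit(self, member_generation, partition, offset):
        if member_generation != self.generation:
            # The group moved on without this member; fence the write.
            raise RuntimeError("ILLEGAL_GENERATION")
        self.committed[partition] = offset
```

In the duplicate-records scenario described above, a source task that blocked past the group timeout would hold a stale generation id, and this check would reject its commit instead of letting it claim data it no longer owns.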
…l be required to debug the distributed test failures.
… execute in parallel and cannot block the herder thread.
@gwenshap @guozhangwang I've updated this with trunk and it's now stable as far as I can tell (passed many times in a row).
One minor question, otherwise LGTM!
pid files are good :) However, clean_node probably should not rely on the pid file.
Fair point, replaced with a kill_process version.
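The suggestion here is that cleanup should not trust a pid file, which can be stale or missing after a hard crash; killing by command-line pattern is more robust. A minimal Python sketch of such a kill_process-style helper follows. The `ssh` callable and the `CopycatDistributed` process pattern are assumptions for illustration (ducktape exposes a similar per-node command runner), not the PR's actual code.

```python
import signal

def kill_worker_processes(ssh, sig=signal.SIGKILL):
    """Hypothetical ducktape-style clean_node replacement: kill every
    distributed worker on the node by matching its command line with
    pkill -f, instead of reading a pid file that may be stale.

    `ssh` is assumed to run a shell command on the remote node.
    """
    # `|| true` keeps cleanup from failing on a node with no workers left.
    ssh("pkill -%d -f CopycatDistributed || true" % int(sig))
```

A clean bounce could pass `signal.SIGTERM` for graceful shutdown and fall back to the default `SIGKILL` for the hard-bounce phase of the test.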
… that kills all connect processes rather than using pid files.
See #492 for a checkstyle quick fix.
…opycat system tests.

Author: Ewen Cheslack-Postava <me@ewencp.org>
Reviewers: Ben Stopford, Geoff Anderson, Guozhang Wang

Closes #432 from ewencp/kafka-2752-copycat-clean-bounce-test
TICKET = N/A EXIT_CRITERIA = When upstream also log similar info