Skip to content

KAFKA-2752: Add VerifiableSource/Sink connectors and rolling bounce Copycat system tests.#432

Closed
ewencp wants to merge 14 commits intoapache:trunkfrom
ewencp:kafka-2752-copycat-clean-bounce-test
Closed

KAFKA-2752: Add VerifiableSource/Sink connectors and rolling bounce Copycat system tests.#432
ewencp wants to merge 14 commits intoapache:trunkfrom
ewencp:kafka-2752-copycat-clean-bounce-test

Conversation

@ewencp
Copy link
Copy Markdown
Contributor

@ewencp ewencp commented Nov 5, 2015

No description provided.

@ewencp
Copy link
Copy Markdown
Contributor Author

ewencp commented Nov 5, 2015

This requires #431 before this test will run reliably.

You'll probably also notice a bunch of logging related changes. These were some changes that were helpful in tracking down issues in the 4 or 5 JIRAs that came out of writing this test, so I think they're worth keeping.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. This is a cool way test distributed mode.

@ewencp ewencp force-pushed the kafka-2752-copycat-clean-bounce-test branch from 1adc1f8 to e6e54c1 Compare November 5, 2015 18:17
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? currentTimeNs is never updated and hence always the same as sleepStartNs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was not correct. Good catch! I've fixed it and validated correctness by running ProducerPerformance manually.

@benstopford
Copy link
Copy Markdown
Contributor

This looks reasonable to me

@gwenshap
Copy link
Copy Markdown
Contributor

gwenshap commented Nov 6, 2015

Just note that the PR title is wrong - its KAFKA-2752 (not 2572).

We don't want to close the wrong JIRA accidentally :)

@ewencp ewencp changed the title KAFKA-2572: Add VerifiableSource/Sink connectors and rolling bounce Copycat system tests. KAFKA-2752: Add VerifiableSource/Sink connectors and rolling bounce Copycat system tests. Nov 6, 2015
@ewencp ewencp force-pushed the kafka-2752-copycat-clean-bounce-test branch from e6e54c1 to 00016d8 Compare November 6, 2015 02:47
@ewencp
Copy link
Copy Markdown
Contributor Author

ewencp commented Nov 6, 2015

@gwenshap Good catch, fixed in the commit msg and the PR title.

@guozhangwang This is effectively complete wrt the actual test (and I added a bit more code to dump the contents of the topic in the case of failure, which can aid in debugging). However, we'll still hit failures somtimes. On the source side, we can see duplicates sometimes, sort of due to https://issues.apache.org/jira/browse/KAFKA-2713 where a sink task can block during join for long enough that it hits the worker group timeout and the rest of the group moves on, but the source task then processes data even after its out of the group (and can commit offsets since it thinks it still owns that data).

I've seen missing sink data, but haven't tracked down the issue yet. I haven't manage to catch it with good enough logging + a dump of the topic yet.

We can also see some unhandled exceptions currently because of wakeup exceptions during consumer.close(), an issue that was only introduced recently. (I'm coordinating w/ @hachikuji to sort that issue out).

So, probably a question for the community more generally and @guozhangwang and @gwenshap immediately: how do we want to handle tests that we're confident are setup properly and useful, but are going to fail while some existing issues are sorted out? I kind of don't want this to sit in limbo (I already did that with the first 5 patches that writing this test resulted in....). On the other hand, committing failing tests is not helpful. We have an @ignore decorator, but that disables running the test entirely. @granders and I were discussing an @unstable annotation, but I'm not sure how quickly we can get that because making the semantics useful (i.e. actually notifies you of failures, but in a nice way if only unstable tests fail, doesn't require proactively checking test results, doesn't require a bunch of 1-liner commits to remove unstable annotations once the test is stabilized, etc).

Thoughts about what to do with this patch in the mean time until those issues are addressed?

@guozhangwang
Copy link
Copy Markdown
Contributor

@ewencp It will be a bug if the kick-out-of-group consumer can still commit offsets: their commit should be rejected with ILLEGAL_GENERATION. If you see this happening we should investigate this case asap I think.

About unit testing: I would personally prefer only adding test cases that are passing to the code base, and piggy-back tests that are failing due to some issues to the fix of that issue. But I also understands keeping track of those tests is sort of a pain: more painful it is, more motivated are we to resolve the dandling problem. So probably we should just eat the bullet and fix the related existing issue right away?

@ewencp
Copy link
Copy Markdown
Contributor Author

ewencp commented Nov 6, 2015

@guozhangwang The issue is with source connectors, which have to handle offset commits outside the normal consumer group offset commit functionality since they do not use topic partitions or integer offsets. So this issue is definitely due to the fact that we have not generalized a mechanism for having the group coordinator manage these types of write/commit operations in a more general manner. I thought about routing all these writes through the leader, but that only reduces, not removes, the issue since the lagging node could be the leader.

Re: tests, yes, I am looking to fix them immediately, so in this case we can defer for a while at least. I guess part of my point is that in general system tests are a lot more finicky than unit tests, so even if it's been reviewed and passes regularly locally, we can still see cases where it fails (especially for new functionality like this!). So even when we think the test is stable, we still might want a mechanism to put it into "trial mode" for awhile. But that's a longer term discussion, we don't have to address it right now.

…l be required to debug the distributed test failures.
… execute in parallel and cannot block the herder thread.
@ewencp
Copy link
Copy Markdown
Contributor Author

ewencp commented Nov 10, 2015

I think I've got this stabilized via a number of other bug fixes that are now committed. Two issues remain: #476 and #480. When those are merged, I can update this branch and we can verify one final time that this is now stable.

@ewencp
Copy link
Copy Markdown
Contributor Author

ewencp commented Nov 10, 2015

@gwenshap @guozhangwang I've updated this with trunk and it's now stable as far as I can tell (passed many times in a row).

@guozhangwang
Copy link
Copy Markdown
Contributor

One minor question, otherwise LGTM!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pid files are good :) However clean_node probably should not rely on the pid file

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, replaced with a kill_process version.

@asfgit asfgit closed this in 8db5561 Nov 10, 2015
@granthenke
Copy link
Copy Markdown
Member

See #492 for checkstyle quick fix

ewencp referenced this pull request in confluentinc/kafka Nov 18, 2015
…opycat system tests.

Author: Ewen Cheslack-Postava <me@ewencp.org>

Reviewers: Ben Stopford, Geoff Anderson, Guozhang Wang

Closes #432 from ewencp/kafka-2752-copycat-clean-bounce-test
granders referenced this pull request in confluentinc/kafka Mar 1, 2016
…opycat system tests.

Author: Ewen Cheslack-Postava <me@ewencp.org>

Reviewers: Ben Stopford, Geoff Anderson, Guozhang Wang

Closes #432 from ewencp/kafka-2752-copycat-clean-bounce-test
sutambe pushed a commit to sutambe/kafka that referenced this pull request Feb 23, 2023
TICKET = N/A
EXIT_CRITERIA = When upstream also log similar info
davide-armand pushed a commit to aiven/kafka that referenced this pull request Dec 1, 2025
jeqo pushed a commit to aiven/kafka that referenced this pull request Jan 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants