Skip to content

KAFKA-13370: add errors when commit offsets failed and add tests#11413

Closed
showuon wants to merge 1 commit intoapache:trunkfrom
showuon:KAFKA-13370
Closed

KAFKA-13370: add errors when commit offsets failed and add tests#11413
showuon wants to merge 1 commit intoapache:trunkfrom
showuon:KAFKA-13370

Conversation

@showuon
Copy link
Copy Markdown
Member

@showuon showuon commented Oct 19, 2021

In #9642, we removed the unnecessary success parameter, and use the error as the key to identify if the commit successfully or failed. That's a good improvement. However, there are some cases we passed success with false, but without error value. I think we should always pass the error value when failed. Fix it and add tests. After this fix, all recordCommitFailure call will always pass with an error.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side fix: the error could be timed out or "cancelled". Add in the log.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes the worker task stops need more than 1 sec. Increasing the timeout to 10 secs to make it reliable.

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Oct 19, 2021

@chia7712 , could you please take a look? Thanks.

Comment on lines +1663 to +1671
double failurePercentage = metrics.currentMetricValueAsDouble(taskGroup, "offset-commit-failure-percentage");
double successPercentage = metrics.currentMetricValueAsDouble(taskGroup, "offset-commit-success-percentage");

if (!isCommitSucceeded) {
assertTrue(failurePercentage > 0);
assertTrue(successPercentage == 0);
} else {
assertTrue(failurePercentage == 0);
assertTrue(successPercentage > 0);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add offset-commit-failure-percentage and offset-commit-success-percentage metrics verification.

Comment on lines +555 to +558
// Very rare case: offsets were unserializable, and unable to store any data
log.error("{} Failed to flush offsets to storage: ", WorkerSourceTask.this, error);
finishFailedFlush();
recordCommitFailure(time.milliseconds() - started, error);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doFlush will return null after calling the callback with error attached. Handle the failed flush here, since we can know which error is thrown.

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Oct 27, 2021

@UnityLung @chia7712 @rhauch , please help review this PR when available. Thank you.

@showuon
Copy link
Copy Markdown
Member Author

showuon commented Nov 5, 2021

@UnityLung @chia7712 @rhauch , please help review this PR. thank you.

@rhauch rhauch added the connect label Nov 16, 2021
@rhauch
Copy link
Copy Markdown
Contributor

rhauch commented Nov 16, 2021

@showuon, thanks for trying to fix this issue. But I think the best course of action here is actually to revert #9642, for a few reasons.

  1. It was actually useful to have that boolean success parameter within the protected TaskMetricsGroup.recordCommit(...) helper method, since it allowed the WorkerTask.recordCommitSuccess(...) and recordCommitFailure(...) methods to call that one method with desired behavior.
  2. Rolling back is also more encapsulated and significantly easier to backport all the way back to the 2.8 branch, especially since we've significantly refactored the source connector offset commit logic only in 3.0 and later branches.

So I think I'm going to revert #9642, but your additional unit tests here are also very useful. Would you mind creating a new PR with those unit test improvements? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants