Skip to content

KAFKA-9274: Mark retries config as deprecated and add new task.timeout.ms config#8864

Merged
mjsax merged 2 commits intoapache:trunkfrom
mjsax:kafka-9274-kip-572
Jul 21, 2020
Merged

KAFKA-9274: Mark retries config as deprecated and add new task.timeout.ms config#8864
mjsax merged 2 commits intoapache:trunkfrom
mjsax:kafka-9274-kip-572

Conversation

@mjsax
Copy link
Copy Markdown
Member

@mjsax mjsax commented Jun 13, 2020

First PR for KIP-572 for public changes only. Producer/Admin client won't have any functional change as they still will respect retries. For KS, we will get rid or the usage or retries in follow up PRs.

Call for review: @cmccabe @hachikuji @abbccdda @vvcephei

@mjsax mjsax added streams producer kip Requires or implements a KIP labels Jun 13, 2020
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmccabe I think we can keep most of those test as-is until retries is removed -- the ones with @SuppressWarning should be rewritten if retries is removed IMHO, while the ones with @Deprecated should be remove. Let me know if this makes sense or if you think we should rewrite those tests right away.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mjsax. I agree we should keep testing the deprecated option. It does kind of seem like we should rewrite the tests you want to rewrite, now, though. I don't think this scheme of deprecating some tests and suppressing the warning on others is going to be remembered for very long after this is merged.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that. Wanted to get input from @hachikuji and @cmccabe about it. Frankly, I am not always 100% sure atm how to exactly rewrite the tests and what test coverage is desired?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into the test and I think I found ways to get rid for retries for the methods that suppress the deprecation warning. Let me know what you think.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhauch @kkonstantine -- We deprecate retries via KIP-572. Not sure if we should rewrite this test right away so it does not use retries any longer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should rewrite this test right away

First of all, this is not test code. It's actually part of the Connect runtime.

Second, I'd have to look into what we would do differently and how we would set deliver.timeout.ms to work with the existing logic that handles retries. Since it's not clear, I suggest we keep the current behavior of setting retries=0 as long as that still function. We're using the ProducerConfig.RETRIES constant, so that can't be removed without changing this code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhauch @kkonstantine As discussed in person, I create a ticket that is marked as blocker for 2.7 release to address this issue: https://issues.apache.org/jira/browse/KAFKA-10297

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw: @hachikuji raise a concern about this issue, too: #8864 (review)

I just had one question about the proposal. Using retries=0 in the producer allows the user to have "at-most-once" delivery. This allows the application to implement its own retry logic for example. Do we still have a way to do this once this configuration is gone?

So maybe we need to do some follow up work in the Producer to make it work for Connect. But I would defer this to the follow up work.

My original though was, that setting max.deliver.timeout.ms := request .timeout.ms might prevent internal retries. But we need to verify this. It was also brought to my attention, that this might not work if the network disconnects -- only retries=0 would prevent to re-open the connection but a low timeout would not prevent retries.

In KIP-572, we proposed for Kafka Streams itself, to treat task.timeout.ms = 0 as "no retries" -- maybe we can do a similar thing for the producer?

There is also max.block.ms that we should consider. Unfortunately, I am not an expert on the Producer. But I would like to move forward with KIP-572 for now and are happy to help to resolve those questions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hachikuji @cmccabe @vvcephei -- I think we should deprecate this flag (guess we should update the KIP accordingly) -- Thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the KIP and sent a mail to the VOTE thread on the dev list.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some side cleanup. Method does not throw anything

@mjsax mjsax force-pushed the kafka-9274-kip-572 branch from aa74cfa to 7d82538 Compare June 13, 2020 00:18
@mjsax mjsax changed the title KAFKA-9274: Mark retries config as deprecated KAFKA-9274: Mark retries config as deprecated and add new task.timeout.ms config Jun 13, 2020
Copy link
Copy Markdown

@abbccdda abbccdda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the only remaining question is the deprecation path, and no actual production logic change? If so, I could rely on reviews from other tech leads to unblock.

Copy link
Copy Markdown
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had one question about the proposal. Using retries=0 in the producer allows the user to have "at-most-once" delivery. This allows the application to implement its own retry logic for example. Do we still have a way to do this once this configuration is gone?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: feels like overkill to deprecate test cases. Since they reference RETRIES directly, I don't think we need to worry about them not getting removed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the deprecation warning was part of a scheme to mark which tests should be rewritten: https://github.com/apache/kafka/pull/8864/files#r439685696

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. To make the build pass, we either need to add a Suppress annotation of just deprecate the test itself. Deprecating the test itself is "cleaner" IMHO.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mjsax , sorry I was so slow on the review. It LGTM overall, just a few comments. Thanks for the PR!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mjsax. I agree we should keep testing the deprecated option. It does kind of seem like we should rewrite the tests you want to rewrite, now, though. I don't think this scheme of deprecating some tests and suppressing the warning on others is going to be remembered for very long after this is merged.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the deprecation warning was part of a scheme to mark which tests should be rewritten: https://github.com/apache/kafka/pull/8864/files#r439685696

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

Comment thread docs/streams/upgrade-guide.html Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to fill this in as part of this PR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry if this is picky, but are these also to be removed when we remove the config? If so, maybe we should just say it on every suppression.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 100% sure yet how the test will be rewritten. My expectation is, that we remove the retries config and also remove the Suppress tag -- however, we can only do this, after streams code was changes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9060

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the comment!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes :) Added it as a reminder only first. Was expecting to get review comment anyway.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9047

@mjsax mjsax force-pushed the kafka-9274-kip-572 branch from b58d7a1 to e6c10d8 Compare July 21, 2020 00:12
@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented Jul 21, 2020

Updates this PR.

Copy link
Copy Markdown
Contributor

@vvcephei vvcephei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, @mjsax !

There are still multiple comments like TODO revisit in follow up PR. Should we file a blocker to actually follow up on this? Otherwise, I'm afraid we'll just forget about them. If I understand the change correctly, these tests are all using the deprecated retries parameter, which means we're lacking coverage on the non-deprecated path.

Otherwise, it LGTM!

@mjsax
Copy link
Copy Markdown
Member Author

mjsax commented Jul 21, 2020

Thanks @vvcephei -- I don't think we need tickets. I will use this PR as reference itself to make sure we address all TODO that are added.

@mjsax mjsax merged commit 194c56f into apache:trunk Jul 21, 2020
@mjsax mjsax deleted the kafka-9274-kip-572 branch July 21, 2020 19:34
}
}

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9047

}
}

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9047

private final int retries;
private final long retryBackOffMs;

@SuppressWarnings("deprecation") // TODO: remove in follow up PR when `RETRIES` is removed
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9060

assertEquals(100, returnedProps.get(StreamsConfig.topicPrefix(TopicConfig.SEGMENT_BYTES_CONFIG)));
}

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9060

private MockAdminClient mockAdminClient;
private InternalTopicManager internalTopicManager;

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed via #9060

hachikuji pushed a commit that referenced this pull request Sep 10, 2020
Fixes flakiness in `KafkaAdminClientTest` as a result of #8864. Addresses the following flaky tests:

- testAlterReplicaLogDirsPartialFailure
- testDescribeLogDirsPartialFailure
- testMetadataRetries

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>

private static final String SINK_TOPIC = "streamsResilienceSink";

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted via #9333

});
}

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted via #9333

public class StreamsOptimizedTest {


@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted via #9333


public class StreamsStandByReplicaTest {

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted via #9333

streams.close(Duration.ofSeconds(10));
}

@SuppressWarnings("deprecation") // TODO revisit in follow up PR
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted via #9333

abbccdda pushed a commit to abbccdda/kafka that referenced this pull request Sep 28, 2020
Fixes flakiness in `KafkaAdminClientTest` as a result of apache#8864. Addresses the following flaky tests:

- testAlterReplicaLogDirsPartialFailure
- testDescribeLogDirsPartialFailure
- testMetadataRetries

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>, Jason Gustafson <jason@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kip Requires or implements a KIP producer streams

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants