MINOR: Producers should set delivery timeout instead of retries #5425
hachikuji merged 8 commits into apache:trunk
Conversation

ijuma left a comment

Do we want to mention this in the upgrade notes?

Also, what if people have set retries explicitly? Maybe we should not set this if that's set.
Also, the docs say:

    @note For MirrorMaker, the following settings are set by default to make sure there is no data loss:
    1. Use a producer with the following settings:
       acks=all
       retries=max integer
       max.block.ms=max long
       max.in.flight.requests.per.connection=1
    2. Consumer setting:
       enable.auto.commit=false
    3. MirrorMaker setting:
       abort.on.send.failure=true
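For reference, the no-data-loss producer defaults quoted above can be sketched as plain properties. This is a minimal sketch using raw string keys; the actual MirrorMaker code applies these through its own config classes:

```java
import java.util.Properties;

public class MirrorMakerNoLossDefaults {
    // Sketch of the no-data-loss producer defaults quoted from the docs above.
    static Properties producerDefaults() {
        Properties props = new Properties();
        props.setProperty("acks", "all");
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("max.block.ms", String.valueOf(Long.MAX_VALUE));
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties p = producerDefaults();
        System.out.println(p.getProperty("acks"));
    }
}
```

With max.in.flight.requests.per.connection=1 a failed batch cannot be overtaken by a later one, which is why retries can safely be unbounded here without risking reordering.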
(force-pushed 27d6371 to f7494dd)
Maybe we should log a warning if retries is set. Or does the producer do that already?

I'm going to go ahead and change the other overrides of
ijuma left a comment

Thanks for the updates. Overall this makes sense, just a few questions and comments.
    + "greater than " + REQUEST_TIMEOUT_MS_CONFIG + " + " + LINGER_MS_CONFIG;
    private static final String DELIVERY_TIMEOUT_MS_DOC = "An upper bound on the time to report success or failure after "
        + "Producer.send() returns. The producer may report failure to send a record earlier than this config if all "
        + "the retries have been exhausted or a record is added to a batch nearing expiration. "
What config is used for "batch expiration"?

Urmm, delivery timeout. I found this wording a little confusing as well. I will try to rephrase.
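The doc string above implies the constraint that delivery.timeout.ms must be at least request.timeout.ms + linger.ms, since one full request attempt has to fit inside the delivery deadline. A minimal sketch of that sanity check (the method name validateDeliveryTimeout is hypothetical; the real producer enforces the constraint inside its config validation):

```java
public class DeliveryTimeoutCheck {
    // Hypothetical sketch: delivery.timeout.ms must be equal to or larger
    // than request.timeout.ms + linger.ms, otherwise not even a single
    // request attempt can complete within the delivery deadline.
    static void validateDeliveryTimeout(int deliveryTimeoutMs, int requestTimeoutMs, int lingerMs) {
        if (deliveryTimeoutMs < requestTimeoutMs + lingerMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms should be equal to or larger than request.timeout.ms + linger.ms");
        }
    }

    public static void main(String[] args) {
        // Defaults: delivery.timeout.ms=120000, request.timeout.ms=30000, linger.ms=0
        validateDeliveryTimeout(120_000, 30_000, 0); // passes
    }
}
```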
    producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
-   producerProps.put(ProducerConfig.RETRIES_CONFIG, 0); // we handle retries in this class
+   producerProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 0); // we handle retries in this class
Do we really want to do this? Do we ensure that we send the record at least once, or could it fail in the accumulator?

Actually, we probably just want to leave this as it was.
    producerProps.setProperty(ProducerConfig.LINGER_MS_CONFIG, Int.MaxValue.toString)
    val producer = TestUtils.createProducer(brokerList, securityProtocol = securityProtocol, trustStoreFile = trustStoreFile,
-     saslProperties = clientSaslProperties, retries = 0, lingerMs = Int.MaxValue, props = Some(producerProps))
+     saslProperties = clientSaslProperties, lingerMs = Int.MaxValue, props = Some(producerProps))
Do you know why we were setting this to 0 previously?

There was no obvious reason. As far as I can tell, we are just using this function to populate some data in order to check consumer operations.
    private case class ProducerBuilder() extends ClientBuilder[KafkaProducer[String, String]] {
-     private var _retries = 0
+     private var _retries = Int.MaxValue
It is being overridden in some cases.
    bufferSize: Long = 1024L * 1024L,
-   retries: Int = 0,
+   retries: Int = Int.MaxValue,
+   deliveryTimeoutMs: Int = 20000,
I think we're not using this variable. Also, it should be higher than the request timeout, which is 30 seconds by default.

The default value for delivery timeout is already 120 secs; why do we want to set its default value here to be smaller than that?
    final Map<String, Object> tempProducerDefaultOverrides = new HashMap<>();
    tempProducerDefaultOverrides.put(ProducerConfig.LINGER_MS_CONFIG, "100");
-   tempProducerDefaultOverrides.put(ProducerConfig.RETRIES_CONFIG, 10);
+   tempProducerDefaultOverrides.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
The default is already 120000, which I think is reasonable: 2 minutes is quite long already.
    fullProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
    fullProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
+   fullProps.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
+   fullProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);
It seems like it would be better to avoid infinite so that the tests don't hang forever.

Sure, I think the defaults are good enough.

Yes, both the smoke and EOS test clients would not expect the brokers to be down for a very long time in system tests, so a reasonable value like the default 2 min should be fine. Ditto below.
    producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
    // the next 2 config values make sure that all records are produced with no loss and no duplicates
+   producerProps.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
+   producerProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);
Same as the other comment about tests not using infinite, if possible. Although we should probably use it in one unit test to verify that we don't overflow.
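The overflow concern above is real if a delivery deadline is ever computed in int arithmetic: now + Integer.MAX_VALUE wraps around. A minimal sketch of the issue and the fix (the helper deadlineMs is hypothetical; the producer computes deadlines in long milliseconds):

```java
public class DeadlineOverflow {
    // Computing a delivery deadline with delivery.timeout.ms = Integer.MAX_VALUE
    // overflows in int arithmetic; widening to long before adding avoids it.
    static long deadlineMs(long nowMs, int deliveryTimeoutMs) {
        return nowMs + (long) deliveryTimeoutMs; // widen before adding
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        int overflowed = (int) now + Integer.MAX_VALUE; // wraps to a negative value
        long safe = deadlineMs(now, Integer.MAX_VALUE);
        System.out.println(overflowed < 0); // true: the int sum wrapped
        System.out.println(safe > now);     // true: the long sum did not
    }
}
```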
    producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
+   producerProps.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
+   producerProps.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);
    lingerMs: Int = 0,
    props: Option[Properties] = None): KafkaProducer[Array[Byte],Array[Byte]] = {
    val producer = TestUtils.createProducer(brokerList, securityProtocol = securityProtocol, trustStoreFile = trustStoreFile,
      saslProperties = clientSaslProperties, retries = retries, lingerMs = lingerMs, props = props)
I was thinking that it would be good to have at least one test with retries = 0. Do we have such a test?

Are the Jenkins failures related?
    + " Note that this retry is no different than if the client resent the record upon receiving the error."
    + " Allowing retries without setting <code>" + MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION + "</code> to 1 will potentially change the"
    + " ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second"
    + " succeeds, then the records in the second batch may appear first. Note that the produce requests will be"
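The reordering scenario in that doc string can be illustrated with a toy simulation. This is not Kafka code: batches are plain strings, a failed batch is simply re-enqueued at the back of the pipeline, and the batch that fails once is a parameter:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RetryReorderingDemo {
    // Toy model of a partition's send pipeline with max.in.flight > 1:
    // batches go out in order, but a failed batch is re-enqueued at the
    // back, so a later batch that succeeded on its first try lands first.
    static List<String> deliver(List<String> batches, String batchThatFailsOnce) {
        Deque<String> queue = new ArrayDeque<>(batches);
        List<String> committedLog = new ArrayList<>();
        boolean failedAlready = false;
        while (!queue.isEmpty()) {
            String batch = queue.poll();
            if (batch.equals(batchThatFailsOnce) && !failedAlready) {
                failedAlready = true;
                queue.add(batch); // the retry goes to the back of the queue
            } else {
                committedLog.add(batch);
            }
        }
        return committedLog;
    }

    public static void main(String[] args) {
        // Batch A fails once and is retried: the log order flips to [B, A].
        System.out.println(deliver(List.of("A", "B"), "A"));
        // With max.in.flight = 1 the producer would not send B until A
        // succeeded, preserving [A, B]; that case is not modeled here.
    }
}
```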
    * increase {@link ConsumerConfig#MAX_POLL_INTERVAL_MS_CONFIG} using the following guidance:
    * <pre>
-   * max.poll.interval.ms > min ( max.block.ms, (retries +1) * request.timeout.ms )
+   * max.poll.interval.ms > max.block.ms
(force-pushed dbc944a to 86129cf, then 86129cf to e02c5ec)
@hachikuji Could we trigger a Streams system test to validate that the related test cases are not broken? Otherwise LGTM.

@guozhangwang Ran the Streams system tests and they all passed. Merging to trunk.
As part of apache#5425 the streams default override for producer retries was removed. The documentation was not updated to reflect that change. Reviewers: Matthias J. Sax <mjsax@apache.org>, Sophie Blee-Goldman <sophie@confluent.io>, Bill Bejeck <bbejeck@gmail.com>
MirrorMaker should set delivery.timeout.ms instead of retries now that we have KIP-91.