KAFKA-6254: Incremental fetch requests #4418
Conversation
Force-pushed from cad085b to c2d258b
Should the return type of the method be Unit?
Has the return type been changed to Unit yet?
It's a bit unintuitive to use 0 maxBytes as an indication for removal.
This seems similar to setting everything to 0 when there is a partition error, right? It would be clearer if our RPC layer supported a more advanced type system.
Since all the callers are already synchronizing on the session object, do we need to synchronize here?
It's not technically needed, but it makes the code much clearer because the locking is consistent. It should also have very small overhead.
For consistency, perhaps it's better to either add local to all offsets or leave it out for all.
The main reason it's on this one is to distinguish from fetcherLogStartOffset (the LSO of the follower, which is different from ours). Maybe I should add "local" to all of them, though?
Should we test for !verifyFullFetchResponseParts() here and in the check on line 289?
FetchType could also be SESSIONLESS. Should we check that?
SESSIONLESS should be handled the same way as FULL. Let me fix this.
It seems that we can get here if FetchType is SESSIONLESS. In this case, it seems that we want to use the ordering of partitions in next to achieve fairness when there is more data to give than the max fetch response size?
The ordering should be maintained, since FetchSessionHandler#Builder#next is a LinkedHashMap. I guess there should be a comment about this in the code, so that it's documented.
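Since the answer above leans on LinkedHashMap's insertion order, here is a small illustrative sketch (class and partition names are made up; this is not the actual FetchSessionHandler code) showing that the insertion order survives even when the map is wrapped in an unmodifiable view:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: insertion order is preserved by LinkedHashMap iteration and is not
// disturbed by Collections.unmodifiableMap, which is why a builder can expose
// the plain Map type while callers still get a deterministic partition order.
public class FetchOrderSketch {
    public static Map<String, Long> buildNext() {
        Map<String, Long> next = new LinkedHashMap<>();
        next.put("topicA-0", 100L);   // hypothetical partition -> fetch offset
        next.put("topicB-1", 200L);
        next.put("topicA-1", 300L);
        // The wrapper delegates iteration to the underlying LinkedHashMap.
        return Collections.unmodifiableMap(next);
    }

    public static void main(String[] args) {
        Map<String, Long> next = buildNext();
        // Iteration visits entries in the order they were inserted above.
        System.out.println("first partition in fetch order: "
                + next.keySet().iterator().next());
    }
}
```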
This probably needs to be reverted?
Rebased on changes to
Force-pushed from 50f06aa to 33cea3b
Force-pushed from 5dbcd59 to e5310fc
Rebased on trunk
Force-pushed from a9ec0e0 to 5af21fc
nit: These are the same descriptions as above. How about creating static Field instances, or at least extracting the message?
max_bytes does have different doc strings in different message versions, though. I started looking at adding more constants for this, but it got a bit messy; maybe a good follow-on change?
@cmccabe : Thanks for the patch. I only had time to review part of it; the following are my comments so far.
hachikuji left a comment:
Did a quick pass over the client code and had a few questions/comments.
Should this be retriable? Same question for FetchSessionIdNotFoundException.
OK, let's make it retriable.
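As a rough sketch of what making the error retriable looks like (using a stand-in for Kafka's actual org.apache.kafka.common.errors.RetriableException base class; the nesting and constructors here are illustrative only):

```java
// Sketch: an exception type becomes "retriable" simply by extending the
// RetriableException marker class, which callers check to decide whether
// to retry. Stand-in classes, not the real Kafka hierarchy.
public class RetriableSketch {
    static class RetriableException extends RuntimeException {
        RetriableException(String message) { super(message); }
    }

    // Marking session-not-found as retriable lets the consumer fall back to
    // a full fetch request on the next poll instead of surfacing a failure.
    static class FetchSessionIdNotFoundException extends RetriableException {
        FetchSessionIdNotFoundException(String message) { super(message); }
    }

    public static void main(String[] args) {
        Exception e = new FetchSessionIdNotFoundException("session 123 not found");
        // A caller can test the marker type to decide whether to retry.
        System.out.println(e instanceof RetriableException);
    }
}
```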
Perhaps we may as well list out all the partitions?
As mentioned in the comment above, though, there are going to be a huge number of them, so it's not really practical (except maybe at TRACE level).
Same as above. It will probably be particularly useful for incremental fetches to have the partitions explicitly in the log message.
nit: can we spell out partitions?
I'm not sure this is a good idea. If we're unlucky, the partition we're interested in may not be listed. Since this is an exceptional case anyway, I would suggest using the more verbose message.
OK. If there is an error, we can log all partitions, to make it easier to debug.
Maybe the name can be more explicit? For example, forgetPartitions?
I wanted a name that indicated that we want to forget the partitions, but that it hasn't been done yet. I'm open to suggestions, but toForget seemed nice and simple.
This message should refer to all the partitions in the fetch session, right?
I am wondering if this can be lowered to DEBUG since it is handled internally.
I think it makes sense to log since it's a pretty rare occurrence. And if it does start happening a lot, that could indicate a problem.
We lost the comment we had before, but it seemed useful. Maybe you can update it to be relevant to the new logic.
Good point. I will add a log message to FetchSessionHandler which will spell out this information.
I was expecting to see some logic to remove a partition from the session following a NOT_LEADER error. Maybe I'm missing it somewhere?
I think the optimization of using array indices instead of pointers is a bit questionable without some benchmarks. Heaps larger than 32 GB are rarely (or never) used in Kafka. And having to go via the array has some cost as well.
There are other benefits besides reducing the pointer size. When you use array indices rather than pointers, the garbage collector needs to do less work chasing pointers. See https://issues.apache.org/jira/secure/attachment/12701400/BlocksMap%20redesign.pdf .
Excerpt:
According to an Oracle engineer, large heaps with reference dense objects in old gen with frequently mutating references is brutally hard on GC. When a reference in an old gen object is mutated, the object’s “card page” is marked as dirty. During young gen collection all references in dirty old gen card pages are used as roots for determining reachability of young gen objects.
The [HDFS] block data-structure mutates by necessity, but it does so in a non-GC friendly manner. Report processing inserts a delimiter into the storage’s doubly linked list, moves reported blocks to the head of the storage’s list, then uses the delimiter to determine excess blocks for invalidation. The updating of so many references creates intense pressure on GC.
One reason is young gen maintains a tenuring threshold equating to how conservatively it will promote young gen objects into old gen. The threshold drops relative to the rate of garbage creation and dirtying of old gen cards. The young collector may resort to prematurely promoting objects into old gen when it becomes overrun by spending too much time collecting. CMS is forced to cleanup when the old gen occupancy threshold is exceeded. The prematurely promoted objects lead to excessive fragmentation of old gen.
We can reduce abusive GC behavior by reducing the mutation of references in old gen.
Unlike references, updating primitives (ints, longs, etc) does not mark an old gen page dirty. It does not incur a penalty to young gen collection.
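The excerpt's point can be sketched with a toy index-linked list (illustrative only, not the actual ImplicitLinkedHashSet implementation in the patch): linkage lives in primitive int arrays rather than in reference fields, so relinking writes only primitives and never dirties an old-gen card page.

```java
// Toy sketch of linking via array indices instead of object references.
public class IndexListSketch {
    private final int[] next;  // next[i] = index of the element after i, -1 if none
    private final int[] prev;  // prev[i] = index of the element before i, -1 if none
    private int head = -1;     // index of the first element, -1 if the list is empty

    IndexListSketch(int capacity) {
        next = new int[capacity];
        prev = new int[capacity];
    }

    // Insert slot i at the head of the list using only primitive int writes,
    // which do not mark old-gen card pages dirty the way reference writes do.
    void addFirst(int i) {
        next[i] = head;
        prev[i] = -1;
        if (head != -1) {
            prev[head] = i;
        }
        head = i;
    }

    int head() { return head; }

    int nextOf(int i) { return next[i]; }

    public static void main(String[] args) {
        IndexListSketch list = new IndexListSketch(4);
        list.addFirst(2);
        list.addFirst(0);
        System.out.println("head slot: " + list.head());
    }
}
```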
The fact that we are allocating an array and potentially an Integer to compute the hashCode is suboptimal given that these elements are meant to be added to the ImplicitLinkedHashSet, which doesn't seem to cache hash codes.
Note that Hashtable uses 11 (a prime number) as the default.
I will change this to 5, so that we also get 11 as the default number of slots.
Force-pushed from 3140801 to 1c1697c
Refer to this link for build results (access rights to CI server needed):
log doesn't seem to be used.
Do we need to store toSend here?
From the KIP wiki, it seems that legacy request should use 0 as the epoch, not -1?
Yeah. I posted a correction about this. The correct way is now id = 0, epoch = -1 (previously it was documented as id = -1, epoch = 0).
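A minimal sketch of the corrected convention from the reply above (class, field, and constant names here are illustrative, not necessarily those in the final patch):

```java
// Sketch: a legacy (sessionless) fetch request carries session id 0 and
// epoch -1, per the corrected KIP-227 convention discussed above.
public class FetchMetadataSketch {
    static final int INVALID_SESSION_ID = 0;  // "no session" sentinel
    static final int FINAL_EPOCH = -1;        // epoch used by legacy requests

    final int sessionId;
    final int epoch;

    FetchMetadataSketch(int sessionId, int epoch) {
        this.sessionId = sessionId;
        this.epoch = epoch;
    }

    /** Metadata for a legacy request: id = 0, epoch = -1. */
    static FetchMetadataSketch legacy() {
        return new FetchMetadataSketch(INVALID_SESSION_ID, FINAL_EPOCH);
    }

    boolean isLegacy() {
        return sessionId == INVALID_SESSION_ID && epoch == FINAL_EPOCH;
    }

    public static void main(String[] args) {
        System.out.println(legacy().isLegacy());
    }
}
```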
Hmm, the caller of this method doesn't seem to be synchronized on the CachedPartition object. Do we have a visibility issue across threads?
Oops, the comment is wrong. That should read "the appropriate session lock." Fixed.
Since topicPartition doesn't exist in next if we get here, there is no need to remove it.
Since the caller always passes in a LinkedHashMap, is there a reason to change this to Map?
The rationale is that FetchSessionHandler wraps the LinkedHashMap in an UnmodifiableMap, so the type is technically UnmodifiableMap rather than LinkedHashMap at that point. Also, there are things like using Collections.emptyMap in a unit test, which you can't use if a LinkedHashMap is required.
I will add a comment saying that iteration order is significant, though.
Could we just test set equality instead of string equality?
Hmm, are we supposed to test data? Should we build a new request?
In this case, it was intentional to skip building a new request. I'll add a comment to make it clearer (also, we don't need to test data#toSend again).
The failed test on the jdk7 run is
junrao left a comment:
@cmccabe : Thanks for the patch. Looks good to me. Just a couple of minor comments.
@ijuma and @hachikuji : Do you want to take another look?
On the server side, we have moved to the s string-interpolation convention for building strings, instead of format.
OK. I will change it over to the 's' convention.
I thought we agreed in the KIP that this would be a constant and not configurable?
Good catch, will make this a constant.
Implement incremental fetch requests as described by KIP-227.
retest this please
@cmccabe : Thanks for the update. The latest code LGTM. Do you have any performance results? It would be useful to see (1) the consumption improvement when there are idle topics, and (2) no degradation when caching is disabled.
@junrao I've been testing @cmccabe's patches. One thing that was important to us was the consumption latency, which we define as (time to consume a series of 100 small messages + time to commit an offset). With Kafka 1.0 and trunk, we'd see that latency exceed our SLA of 50 ms after 40-46k 3x-replicated partitions. With Colin's patch, at fa01cf98 (before rebase), we were able to get to 68k 3x-replicated partitions with a latency of 35 ms. Generally, the offset commit latency is far higher than the consume latency: 33 ms for the latter and 46 ms for the former. I couldn't push past 68k replicated partitions due to https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-6469?filter=allopenissues Let me know if you'd like me to get more results and share additional metrics. We're very excited about this patch!
@afalko : Thanks for sharing the results. Very helpful. Just to clarify, are you saying the offset commit latency is 33 ms without this patch and 46 ms with the patch?
@afalko : That's interesting. This patch doesn't really optimize the offset commit protocol, so I am wondering why there is an improvement on offset commit.
The test failure on JDK9 seems to be related to some ZooKeeper issues going on, judging from the logs. I don't think this is related to the patch at all. I will re-run the tests to see if we can get a clean run this time.
retest this please
@afalko: thanks again for your great work testing this. @junrao wrote:
Yeah, that is interesting. After all, we are handling the same number of partitions on the broker; we are just not serializing them into every RPC like we did before. So I would expect the offset commit improvement to come from better-behaved garbage collection or better network utilization. Probably network utilization, since the patch doesn't make many special efforts to optimize GC (although I made one here and there, for example using iterators instead of copying a map in one place).
Offset commits depend on replication, so any improvement to fetch overhead could reduce offset commit latency. If the result is actually meaningful, I would expect to see a similar improvement in produce latency.
@hachikuji : Great point. That makes sense.
Thanks @hachikuji, @junrao, @cmccabe. Fresh off the open-source presses, I've been able to open-source the test I wrote that measured the results I mentioned: https://github.com/salesforce/kafka-partition-availability-benchmark I plan to expand it to have another mode where it produces continuously without resetting offsets. That will be able to measure produce latency.
@afalko : Thanks. You may want to link that to the JIRA so that other people know how your tests were done.
Author: Colin P. Mccabe <cmccabe@confluent.io>
Reviewers: Jason Gustafson <jason@confluent.io>, Ismael Juma <ismael@juma.me.uk>, Jun Rao <junrao@gmail.com>
Closes #4418 from cmccabe/KAFKA-6254