KAFKA-4196: Improved Test Stability #2570
original-brownbear wants to merge 2 commits into apache:trunk from
…fkaAPI Error Response
|
Thanks for the PR. Can you do a separate PR for the KafkaApi change? Good find regarding the ZooKeeper fsync taking so long. I'd like to get some additional opinions before merging that one. |
|
Refer to this link for build results (access rights to CI server needed): |
|
@ijuma sure will add the separate PR in a few minutes :) |
|
@original-brownbear, another option would be to increase the request timeout, right? @hachikuji, @junrao do you think it's OK to disable ZK fsyncs in tests? |
Yeah, sure. I guess that would be "Plan B": just go up to 60s to be safe. If it were up to me, I'd rather go with more predictable/portable tests than increase the timeout. |
|
This is the full description of the config: So, it's been there for a long time (3.4.6 was released in 2014). I can see the argument for disabling it although it is a bit concerning that pause times can be that high in tests (supposedly it could happen in production too). |
|
@ijuma my bad sorry :) I would've sworn it came in with ZK 3.4.9 ... :) That said in production this is an issue for dev ops imo. Either set up ext4 without global blocking sync or make ZK run from a separate disk. |
|
@original-brownbear, that config states that it's just about observers, not participants. Have you verified that it does actually fix the issue? I think the option you were looking for is: |
|
@ijuma you're right ... yikes, so sorry for wasting time on this. Just tried it out with the debugger ... that setting is never hit in the ZooKeeper code in any relevant way. Tried to redeem myself by at least putting the effort into testing your solution too, but bad news here as well. You eventually keep running into a test that times out (see below example of another one in that suite failing in a similar manner) while others run at a constant pace for hundreds of runs: => closing here, def on the wrong track looking at ZooKeeper syncing here. |

This addresses https://issues.apache.org/jira/browse/KAFKA-4196
What I found was the warning below accompanying all failures I was seeing from this test (I reproduced the instability by putting the system under load):
[2017-02-18 16:17:42,892] WARN fsync-ing the write ahead log in SyncThread:0 took 20632ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog:338)
ZK at times keeps locking for multiple seconds in tests (not only this one, but it's very frequent in this one for some reason). In this case the ZK locking (20s) lasted longer than the test timeout, which waits only 15s (org.apache.kafka.test.TestUtils#DEFAULT_MAX_WAIT_MS) for the path /admin/delete_topic/topic to be deleted.
The only way to really fix this in a portable manner (it should mainly hit ext3 users) is to turn off ZK fsyncing (not really needed in UTs anyway), as far as I know.
Did that here as described in (https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html) by setting
This should also help general test performance in my opinion.
Edit: Adjustment to KafkaApi merged separately already :)
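The timing mismatch described above (a roughly 20s fsync stall against a 15s wait) can be checked directly from the warning line. The following is only a minimal sketch: the class `FsyncStallCheck` and its methods are hypothetical helpers written for illustration and are not part of Kafka or ZooKeeper, and the 15s constant is simply the value quoted above for `TestUtils#DEFAULT_MAX_WAIT_MS`.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Kafka or ZooKeeper): extracts the fsync
// duration from a ZooKeeper FileTxnLog warning and compares it to a wait budget.
public class FsyncStallCheck {

    // Assumption: 15s, the value quoted in the description for
    // org.apache.kafka.test.TestUtils#DEFAULT_MAX_WAIT_MS.
    public static final long MAX_WAIT_MS = 15_000L;

    // Matches e.g. "fsync-ing the write ahead log in SyncThread:0 took 20632ms".
    private static final Pattern FSYNC_WARN =
        Pattern.compile("fsync-ing the write ahead log in \\S+ took (\\d+)ms");

    // Returns the reported fsync duration in milliseconds, if the line is such a warning.
    public static OptionalLong parseFsyncMillis(String logLine) {
        Matcher m = FSYNC_WARN.matcher(logLine);
        return m.find()
            ? OptionalLong.of(Long.parseLong(m.group(1)))
            : OptionalLong.empty();
    }

    // True if the reported stall alone is longer than the given wait budget.
    public static boolean outlastsWait(String logLine, long maxWaitMs) {
        OptionalLong stall = parseFsyncMillis(logLine);
        return stall.isPresent() && stall.getAsLong() > maxWaitMs;
    }

    public static void main(String[] args) {
        String line = "[2017-02-18 16:17:42,892] WARN fsync-ing the write ahead log "
            + "in SyncThread:0 took 20632ms which will adversely effect operation latency.";
        System.out.println(outlastsWait(line, MAX_WAIT_MS)); // stall vs. the 15s wait
        System.out.println(outlastsWait(line, 60_000L));     // stall vs. the 60s "Plan B"
    }
}
```

With the log line from the description, the stall exceeds the 15s wait but not the 60s "Plan B" timeout mentioned in the conversation. This only illustrates the arithmetic; as the thread concludes, the fsync stall turned out not to be the root cause of the flaky tests.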