
KAFKA-4196: Improved Test Stability #2570

Closed

original-brownbear wants to merge 2 commits into apache:trunk from original-brownbear:KAFKA-4196

Conversation

@original-brownbear (Member) commented Feb 18, 2017

This addresses https://issues.apache.org/jira/browse/KAFKA-4196

What I found was the warning below accompanying all the failures I was seeing from this test (I reproduced the instability by putting the system under load):

[2017-02-18 16:17:42,892] WARN fsync-ing the write ahead log in SyncThread:0 took 20632ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog:338)

ZK at times blocks for multiple seconds in tests (not only this one, but it's very frequent in this one for some reason). In this case (20s) the ZK fsync lasted longer than the test timeout, which waits only 15s (org.apache.kafka.test.TestUtils#DEFAULT_MAX_WAIT_MS) for the path /admin/delete_topic/topic to be deleted.
The only way to really fix this in a portable manner (it should mainly hit ext3 users), as far as I know, is to turn off ZK fsyncing (not really needed in unit tests anyway).
I did that here, as described in https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html, by setting:

  sys.props.put("zookeeper.observer.syncEnabled", "false")

This should also help general test performance in my opinion.
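
For illustration, a minimal sketch of where such a toggle would have to live (the harness and method names are hypothetical, not the actual Kafka test classes; the key point is that the property must be set before the embedded ZooKeeper server is constructed, or it is never read):

  // Hypothetical test-harness setup: JVM system properties are read when
  // the embedded ZooKeeper server is constructed, so set the toggle first.
  trait EmbeddedZkHarness {
    def setUp(): Unit = {
      sys.props.put("zookeeper.observer.syncEnabled", "false") // disable fsync in tests
      startEmbeddedZooKeeper() // hypothetical: start ZK only after the property is set
    }
    def startEmbeddedZooKeeper(): Unit // provided by the concrete harness
  }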

Edit: The KafkaApis adjustment was already merged separately :)

@ijuma (Member) commented Feb 18, 2017

Thanks for the PR.

Can you do a separate PR for the KafkaApis issue? I'd like to merge that ASAP as it's a regression from a PR that was merged yesterday.

Good find regarding the ZooKeeper fsync taking long. I'd like to get some additional opinions before merging that one.

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/1761/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/1758/
Test FAILed (JDK 7 and Scala 2.10).

@original-brownbear (Member, Author) commented Feb 18, 2017

@ijuma sure, will add the separate PR in a few minutes :)

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/1760/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1761/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/1763/
Test FAILed (JDK 8 and Scala 2.11).

@original-brownbear changed the title from "KAFKA-4196: Improved Test Stability by Disabling ZK Fsync and Fixed KafkaAPI Error Response" to "KAFKA-4196: Improved Test Stability" on Feb 18, 2017
@ijuma (Member) commented Feb 21, 2017

@original-brownbear, another option would be to increase the request timeout, right?

@hachikuji, @junrao do you think it's OK to disable ZK fsyncs in tests?

@original-brownbear (Member, Author) commented Feb 21, 2017
@ijuma

another option would be to increase the request timeout, right?

Yeah sure, I guess that would be "Plan B": just go up to 60s to be safe. If it were up to me, I'd rather go with more predictable/portable tests than with increasing the timeout.
IMO a good argument for just dropping the fsync is that this is a new feature in ZK anyhow (it just came in with the latest version) and doesn't really have any functional impact on testing beyond the odd pause every now and then (I mean, whether or not ZK survives a power outage cleanly isn't really relevant here :D).
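
For concreteness, a sketch of what "Plan B" would look like (a self-contained, hypothetical polling helper rather than the actual TestUtils API; 60s is the value suggested above):

  // Hypothetical polling helper illustrating "Plan B": raise the wait
  // ceiling from the default 15s (DEFAULT_MAX_WAIT_MS) to 60s.
  def waitUntil(condition: () => Boolean,
                timeoutMs: Long = 60000L,
                pollMs: Long = 100L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      if (condition()) return true // condition met before the deadline
      Thread.sleep(pollMs)         // back off briefly between checks
    }
    condition()                    // one final check at the deadline
  }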

@ijuma (Member) commented Feb 21, 2017

This is the full description of the config:

(Java system property: zookeeper.observer.syncEnabled)

New in 3.4.6, 3.5.0: The observers now log transaction and write snapshot to disk by default like the participants. This reduces the recovery time of the observers on restart. Set to "false" to disable this feature. Default is "true"

So, it's been there for a long time (3.4.6 was released in 2014). I can see the argument for disabling it, although it's a bit concerning that pause times can be that high in tests (presumably it could happen in production too).

@original-brownbear (Member, Author)
@ijuma my bad sorry :) I would've sworn it came in with ZK 3.4.9 ... :)

That said, in production this is a devops issue imo: either set up ext4 (which doesn't block globally on fsync the way ext3 does) or run ZK from a separate disk.
Btw ... this even seems to hit ZK's own tests: https://issues.apache.org/jira/browse/ZOOKEEPER-1800?focusedCommentId=13807654&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13807654

@ijuma (Member) commented Feb 24, 2017

@original-brownbear, that config states that it's just about observers, not participants. Have you verified that it does actually fix the issue? I think the option you were looking for is:

forceSync
(Java system property: zookeeper.forceSync)

Requires updates to be synced to media of the transaction log before finishing processing the update. If this option is set to no, ZooKeeper will not require updates to be synced to the media.
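
In code, disabling it would amount to something like this (a sketch; zookeeper.forceSync is the documented property name and, like the other toggle, it would have to be set before the embedded server starts to have any effect):

  // Sketch: disable ZooKeeper's transaction-log fsync for an embedded
  // test server. "no" is the documented off value; the property is read
  // at server startup, so it has to be set beforehand.
  sys.props.put("zookeeper.forceSync", "no")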

@original-brownbear (Member, Author) commented Feb 25, 2017

@ijuma you're right ... yikes, so sorry for wasting time on this.
Apparently this was a coincidence (there is obviously some correlation between fsyncing and heavy load in general, but not within an individual test) and I was way too optimistic about sample size when retesting.

I just tried it out with the debugger ... that setting is never hit in the ZooKeeper code in any relevant way.

I tried to redeem myself by at least putting in the effort to test your solution too, but bad news there as well:
the same problem arises eventually.
The only thing I learned is that this is probably not a question of load. After running 12 parallel loops of the suite for 30 min, the loops kept dying off one by one (each loop stops on failure) until only 4 were left.
It looks more like some timing issue to me, and it also happens in other tests of that suite.

Eventually you keep running into one test that times out (see below for an example of another test in that suite failing in a similar manner) while others run at a constant pace for hundreds of runs:
[screenshot from 2017-02-25 11-18-09]

=> Closing here; I was definitely on the wrong track looking at ZooKeeper syncing.
