
KAFKA-4196: Improved Test Stability #2570

Closed

original-brownbear wants to merge 2 commits into apache:trunk from original-brownbear:KAFKA-4196

Conversation

@original-brownbear (Member) commented Feb 18, 2017

This addresses https://issues.apache.org/jira/browse/KAFKA-4196

What I found was the warning below accompanying all the failures I was seeing from this test (I reproduced the instability by putting the system under load):

[2017-02-18 16:17:42,892] WARN fsync-ing the write ahead log in SyncThread:0 took 20632ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog:338)

ZK at times blocks for multiple seconds in tests (not only this one, but it's very frequent in this one for some reason). In this case (20s) the ZK fsync lasted longer than the test timeout, which waits only 15s (org.apache.kafka.test.TestUtils#DEFAULT_MAX_WAIT_MS) for the path /admin/delete_topic/topic to be deleted.
The only way to really fix this in a portable manner (it should mainly hit ext3 users), as far as I know, is to turn off ZK fsyncing (not really needed in unit tests anyway).
I did that here, as described in https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html, by setting:

  sys.props.put("zookeeper.observer.syncEnabled", "false")

This should also help general test performance in my opinion.
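
For illustration, a minimal sketch of where such a toggle would have to live (the harness and method names are hypothetical, not the actual Kafka test classes; the key point is that the property must be set before the embedded ZooKeeper server is constructed, or it is never read):

  // Hypothetical test-harness setup: JVM system properties are read when
  // the embedded ZooKeeper server is constructed, so set the toggle first.
  trait EmbeddedZkHarness {
    def setUp(): Unit = {
      sys.props.put("zookeeper.observer.syncEnabled", "false") // disable fsync in tests
      startEmbeddedZooKeeper() // hypothetical: start ZK only after the property is set
    }
    def startEmbeddedZooKeeper(): Unit // provided by the concrete harness
  }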

Edit: The KafkaApis adjustment was already merged separately :)

@ijuma (Member) commented Feb 18, 2017

Thanks for the PR.

Can you do a separate PR for the KafkaApis issue? I'd like to merge that ASAP as it's a regression from a PR that was merged yesterday.

Good find regarding the ZooKeeper fsync taking long. I'd like to get some additional opinions before merging that one.

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/1761/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/1758/
Test FAILed (JDK 7 and Scala 2.10).

@original-brownbear (Member, Author) commented Feb 18, 2017

@ijuma sure, will add the separate PR in a few minutes :)

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/1760/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1761/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/1763/
Test FAILed (JDK 8 and Scala 2.11).

@original-brownbear changed the title from "KAFKA-4196: Improved Test Stability by Disabling ZK Fsync and Fixed KafkaAPI Error Response" to "KAFKA-4196: Improved Test Stability" on Feb 18, 2017
@ijuma (Member) commented Feb 21, 2017

@original-brownbear, another option would be to increase the request timeout, right?

@hachikuji, @junrao do you think it's OK to disable ZK fsyncs in tests?

@original-brownbear (Member, Author) commented Feb 21, 2017
@ijuma

another option would be to increase the request timeout, right?

Yeah sure, I guess that would be "Plan B": just go up to 60s to be safe. If it were up to me, I'd rather go with more predictable/portable tests than with increasing the timeout.
IMO a good argument for just dropping the fsync is that this is a new feature in ZK anyhow (it just came in with the latest version) and doesn't really have any functional impact on testing beyond the odd pause every now and then (I mean, whether or not ZK survives a power outage cleanly isn't really relevant here :D).
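
For concreteness, a sketch of what "Plan B" would look like (a self-contained, hypothetical polling helper rather than the actual TestUtils API; 60s is the value suggested above):

  // Hypothetical polling helper illustrating "Plan B": raise the wait
  // ceiling from the default 15s (DEFAULT_MAX_WAIT_MS) to 60s.
  def waitUntil(condition: () => Boolean,
                timeoutMs: Long = 60000L,
                pollMs: Long = 100L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
      if (condition()) return true // condition met before the deadline
      Thread.sleep(pollMs)         // back off briefly between checks
    }
    condition()                    // one final check at the deadline
  }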

@ijuma (Member) commented Feb 21, 2017

This is the full description of the config:

(Java system property: zookeeper.observer.syncEnabled)

New in 3.4.6, 3.5.0: The observers now log transaction and write snapshot to disk by default like the participants. This reduces the recovery time of the observers on restart. Set to "false" to disable this feature. Default is "true"

So, it's been there for a long time (3.4.6 was released in 2014). I can see the argument for disabling it, although it's a bit concerning that pause times can be that high in tests (presumably it could happen in production too).

@original-brownbear (Member, Author)
@ijuma my bad sorry :) I would've sworn it came in with ZK 3.4.9 ... :)

That said, in production this is a devops issue imo: either set up ext4 (which doesn't block globally on fsync the way ext3 does) or run ZK from a separate disk.
Btw ... this even seems to hit ZK's own tests: https://issues.apache.org/jira/browse/ZOOKEEPER-1800?focusedCommentId=13807654&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13807654

@ijuma (Member) commented Feb 24, 2017

@original-brownbear, that config states that it's just about observers, not participants. Have you verified that it does actually fix the issue? I think the option you were looking for is:

forceSync
(Java system property: zookeeper.forceSync)

Requires updates to be synced to media of the transaction log before finishing processing the update. If this option is set to no, ZooKeeper will not require updates to be synced to the media.
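
In code, disabling it would amount to something like this (a sketch; zookeeper.forceSync is the documented property name and, like the other toggle, it would have to be set before the embedded server starts to have any effect):

  // Sketch: disable ZooKeeper's transaction-log fsync for an embedded
  // test server. "no" is the documented off value; the property is read
  // at server startup, so it has to be set beforehand.
  sys.props.put("zookeeper.forceSync", "no")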

@original-brownbear (Member, Author) commented Feb 25, 2017

@ijuma you're right ... yikes, so sorry for wasting time on this.
Apparently this was a coincidence (there is obviously some correlation between fsyncing and heavy load in general, but not within an individual test) and I was way too optimistic about sample size when retesting.

I just tried it out with the debugger ... that setting is never hit in the ZooKeeper code in any relevant way.

I tried to redeem myself by at least putting in the effort to test your solution too, but bad news there as well:
the same problem arises eventually.
The only thing I learned is that this is probably not a question of load. After running 12 parallel loops of the suite for 30 min, the loops kept dying off one by one (each loop stops on failure) until only 4 were left.
It looks more like some timing issue to me, and it also happens in other tests of that suite.

Eventually you keep running into one test that times out (see below for an example of another test in that suite failing in a similar manner) while others run at a constant pace for hundreds of runs:
[screenshot from 2017-02-25 11-18-09]

=> Closing here; I was definitely on the wrong track looking at ZooKeeper syncing.
