Skip to content

Flaky-test: CLITest (and possibly a lot of other integration tests because of the same reason) #9622

@lhotari

Description

@lhotari

CLITest is flaky. It fails when test retries are disabled. This problem can be reproduced by running CLITest in IntelliJ.
It seems that the same root cause is causing test failures in other integration tests.

Bookies die in the test after this kind of error message failed to allocate 16777216 byte(s) of direct memory (used: 536870912, max: 536870912).

To investigate the issue I set the environment variable TESTCONTAINERS_RYUK_DISABLED=true to disable TestContainers automatic container cleanup. I also locally temporarily added these lines to PulsarCluster.stop method:

     public synchronized void stop() {
+        boolean leaveContainersRunning = Boolean.parseBoolean(System.getenv("TESTCONTAINERS_RYUK_DISABLED"));
+        if (leaveContainersRunning) {
+            log.warn("Pulsar cluster is left running since TESTCONTAINERS_RYUK_DISABLED=true.");
+            return;
+        }

After this, it's possible to use docker exec -it CLITest-euiia-pulsar-bookie-0 bash and get a shell to view the /var/log/pulsar/bookie.log file which is created by the tests/docker-images/latest-version-image/conf/bookie.conf config used to run bookies in the apachepulsar/pulsar-test-latest-version docker image. This file contained the following error message:

08:26:53.194 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.EntryLogManagerBase - Creating a new entry log fi
le because current active log channel has not initialized yet
08:26:53.195 [SyncThread-7-1] ERROR org.apache.bookkeeper.proto.BookieServer - Unable to allocate memory, exiting 
bookie
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 5368709
12, max: 536870912)

After restarting the bookie service with supervisorctl start bookie, the bookie starts again and I was able to check the java process command line:

/usr/local/openjdk-8/bin/java -cp /pulsar/conf:::/pulsar/lib/*: -Dlog4j.configurationFile=log4j2.yaml -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -Xmx128M -XX:MaxDirectMemorySize=512M -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024 -Dpulsar.log.appender=RoutingAppender -Dpulsar.log.dir=/pulsar/logs -Dpulsar.log.level=info -Dpulsar.log.root.level=info -Dpulsar.routing.appender.default=Console -Dlog4j2.is.webapp=false -Dpulsar.functions.process.container.log.dir=/pulsar/logs -Dpulsar.functions.java.instance.jar=/pulsar/instances/java-instance.jar -Dpulsar.functions.python.instance.file=/pulsar/instances/python-instance/python_instance_main.py -Dpulsar.functions.extra.dependencies.dir=/pulsar/instances/deps -Dpulsar.functions.instance.classpath=/pulsar/conf:::/pulsar/lib/*: -Dpulsar.log.file=bookkeeper.log org.apache.bookkeeper.server.Main --conf /pulsar/conf/bookkeeper.conf

I can see that -XX:MaxDirectMemorySize=512M is properly passed. I noticed that PULSAR_GC isn't used from tests/docker-images/latest-version-image/conf/bookie.conf so I created #9621 to fix that. However that's not the reason for the flakiness.

In tests/docker-images/latest-version-image/conf/bookie.conf there is configuration to set dbStorage_writeCacheMaxSizeMb="16",dbStorage_readAheadCacheMaxSizeMb="16" . However this is a no-op since apply-config-from-env.py script is called in
tests/docker-images/latest-version-image/scripts/run-bookie.sh . There is nothing that will put the environment variables defined in the supervisord config to conf/bookkeeper.conf. Therefore, the configuration for dbStorage_* should be directly in the run-bookie.sh script to fix the issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions