CLITest is flaky. It fails when test retries are disabled. This problem can be reproduced by running CLITest in IntelliJ.
It seems that the same root cause is causing test failures in other integration tests.
Bookies die in the test after this kind of error message failed to allocate 16777216 byte(s) of direct memory (used: 536870912, max: 536870912).
To investigate the issue I set the environment variable TESTCONTAINERS_RYUK_DISABLED=true to disable TestContainers automatic container cleanup. I also locally temporarily added these lines to PulsarCluster.stop method:
public synchronized void stop() {
+ boolean leaveContainersRunning = Boolean.parseBoolean(System.getenv("TESTCONTAINERS_RYUK_DISABLED"));
+ if (leaveContainersRunning) {
+ log.warn("Pulsar cluster is left running since TESTCONTAINERS_RYUK_DISABLED=true.");
+ return;
+ }
After this, it's possible to use docker exec -it CLITest-euiia-pulsar-bookie-0 bash and get a shell to view the /var/log/pulsar/bookie.log file which is created by the tests/docker-images/latest-version-image/conf/bookie.conf config used to run bookies in the apachepulsar/pulsar-test-latest-version docker image. This file contained the following error message:
08:26:53.194 [SyncThread-7-1] INFO org.apache.bookkeeper.bookie.EntryLogManagerBase - Creating a new entry log fi
le because current active log channel has not initialized yet
08:26:53.195 [SyncThread-7-1] ERROR org.apache.bookkeeper.proto.BookieServer - Unable to allocate memory, exiting
bookie
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 5368709
12, max: 536870912)
After restarting the bookie service with supervisorctl start bookie, the bookie starts again and I was able to check the java process command line:
/usr/local/openjdk-8/bin/java -cp /pulsar/conf:::/pulsar/lib/*: -Dlog4j.configurationFile=log4j2.yaml -Djute.maxbuffer=10485760 -Djava.net.preferIPv4Stack=true -Xmx128M -XX:MaxDirectMemorySize=512M -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=32 -XX:ConcGCThreads=32 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.maxCapacity.default=1000 -Dio.netty.recycler.linkCapacity=1024 -Dpulsar.log.appender=RoutingAppender -Dpulsar.log.dir=/pulsar/logs -Dpulsar.log.level=info -Dpulsar.log.root.level=info -Dpulsar.routing.appender.default=Console -Dlog4j2.is.webapp=false -Dpulsar.functions.process.container.log.dir=/pulsar/logs -Dpulsar.functions.java.instance.jar=/pulsar/instances/java-instance.jar -Dpulsar.functions.python.instance.file=/pulsar/instances/python-instance/python_instance_main.py -Dpulsar.functions.extra.dependencies.dir=/pulsar/instances/deps -Dpulsar.functions.instance.classpath=/pulsar/conf:::/pulsar/lib/*: -Dpulsar.log.file=bookkeeper.log org.apache.bookkeeper.server.Main --conf /pulsar/conf/bookkeeper.conf
I can see that -XX:MaxDirectMemorySize=512M is properly passed. I noticed that PULSAR_GC isn't used from tests/docker-images/latest-version-image/conf/bookie.conf so I created #9621 to fix that. However that's not the reason for the flakiness.
In tests/docker-images/latest-version-image/conf/bookie.conf there is configuration to set dbStorage_writeCacheMaxSizeMb="16",dbStorage_readAheadCacheMaxSizeMb="16" . However this is a no-op since apply-config-from-env.py script is called in
tests/docker-images/latest-version-image/scripts/run-bookie.sh . There is nothing that will put the environment variables defined in the supervisord config to conf/bookkeeper.conf. Therefore, the configuration for dbStorage_* should be directly in the run-bookie.sh script to fix the issue.
CLITest is flaky. It fails when test retries are disabled. This problem can be reproduced by running CLITest in IntelliJ.
It seems that the same root cause is causing test failures in other integration tests.
Bookies die in the test after this kind of error message
failed to allocate 16777216 byte(s) of direct memory (used: 536870912, max: 536870912).To investigate the issue I set the environment variable
TESTCONTAINERS_RYUK_DISABLED=trueto disable TestContainers automatic container cleanup. I also locally temporarily added these lines toPulsarCluster.stopmethod:After this, it's possible to use
docker exec -it CLITest-euiia-pulsar-bookie-0 bashand get a shell to view the/var/log/pulsar/bookie.logfile which is created by thetests/docker-images/latest-version-image/conf/bookie.confconfig used to run bookies in theapachepulsar/pulsar-test-latest-versiondocker image. This file contained the following error message:After restarting the bookie service with
supervisorctl start bookie, the bookie starts again and I was able to check thejavaprocess command line:I can see that
-XX:MaxDirectMemorySize=512Mis properly passed. I noticed thatPULSAR_GCisn't used fromtests/docker-images/latest-version-image/conf/bookie.confso I created #9621 to fix that. However that's not the reason for the flakiness.In
tests/docker-images/latest-version-image/conf/bookie.confthere is configuration to setdbStorage_writeCacheMaxSizeMb="16",dbStorage_readAheadCacheMaxSizeMb="16". However this is a no-op sinceapply-config-from-env.pyscript is called intests/docker-images/latest-version-image/scripts/run-bookie.sh. There is nothing that will put the environment variables defined in the supervisord config toconf/bookkeeper.conf. Therefore, the configuration fordbStorage_*should be directly in therun-bookie.shscript to fix the issue.