Hello, congratulations on your paper -- nice work, interesting idea! I was interested in reproducing your results and was looking into your artifact for that. During this, I noticed a few things that I would like clarified.
A note before my questions: since part of your artifact's code is not in your main GitHub repository and can only be found in the Docker image pulled during the reproduction setup, I created a GitHub repository containing a copy of the data located in /workspace/fuzzopt-eval in your Docker image adamstorek/fox:latest (sha256:11ac4f0ceb501d734af81aa4dc4d9ca7d4c87d55239ccf0218271d48bea8b78d). I'll use it to refer to specific parts of your code.
Coverage on standalone targets
You evaluated 15 standalone targets in your paper to compare against the state of the art. Table 4 displays the results of these experiments. I included the table below for convenience.

Reproduction steps I've performed
To check the setup used for evaluation, I followed the instructions you provided as part of your artifact. First, I pulled the provided Docker image via docker pull adamstorek/fox:latest (sha256:11ac4f0ceb501d734af81aa4dc4d9ca7d4c87d55239ccf0218271d48bea8b78d). Second, I spawned a Docker container using your image as described:
docker run --privileged --network='host' -d -it adamstorek/fox:latest
docker exec -it optfuzz_eval /bin/bash
Next, I switched to the /workspace/fuzzopt-eval/fuzzdeployment/targets folder and modified the set_all_targets.sh script to build libarchive (bsdtar) and ffmpeg by setting TARGETS="ffmpeg libarchive". I chose these targets because your paper reports coverage increases of 97.25% and 49.04% for them, respectively.
After executing set_all_targets.sh and once the build process terminated, I checked the resulting binaries.
To this end, I executed the following command, which prints the SanitizerCoverage guard section of each binary built for the libarchive and ffmpeg targets:
# ffmpeg binaries
as5827@19e7b435119d:/workspace/fuzzopt-eval/fuzzdeployment/targets$ find ffmpeg/binaries/ -type f -executable -print -exec bash -c 'readelf -S {} | grep guard' \;
ffmpeg/binaries/optfuzz_build/ffmpeg
[28] __sancov_guards PROGBITS 0000000008ddcd6c 08ddbd6c
ffmpeg/binaries/cmplog_build/ffmpeg
[28] __sancov_guards PROGBITS 0000000005d94494 05d93494
ffmpeg/binaries/aflpp_build/ffmpeg
[28] __sancov_guards PROGBITS 0000000005b29d54 05b28d54
# libarchive binaries
as5827@19e7b435119d:/workspace/fuzzopt-eval/fuzzdeployment/targets$ find libarchive/binaries/ -type f -executable -print -exec bash -c 'readelf -S {} | grep guard' \;
libarchive/binaries/optfuzz_build/bsdtar
[28] __sancov_guards PROGBITS 000000000072540c 0072440c
libarchive/binaries/cmplog_build/bsdtar
[28] __sancov_guards PROGBITS 00000000004636b4 004626b4
libarchive/binaries/aflpp_build/bsdtar
[28] __sancov_guards PROGBITS 00000000004243f4 004233f4
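For reference, the arithmetic behind the comparison can be sketched as follows. The sizes below are placeholders, not the real values from the binaries above; on a real binary one would take the Size field reported by readelf for the __sancov_guards section:

```python
# Each __sancov_guards entry is a 4-byte guard variable, so the number of
# instrumented edges is the section size in bytes divided by 4.

def edge_count(section_size_bytes: int) -> int:
    """Number of SanitizerCoverage guards in a __sancov_guards section."""
    assert section_size_bytes % 4 == 0, "guard section size should be a multiple of 4"
    return section_size_bytes // 4

# Hypothetical example sizes (NOT the real values from the binaries above):
optfuzz_size = 0x1800  # section size of the optfuzz build
aflpp_size = 0x1000    # section size of the aflpp build

print(edge_count(optfuzz_size))                           # number of edges in the optfuzz build
print(edge_count(optfuzz_size) / edge_count(aflpp_size))  # relative instrumentation density
```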
For each edge in the target, this map contains one entry that is 4 bytes in size. As we can see, the map sizes differ significantly. I believe that this is due to additional instrumentation added by your modified AFL++ pass (for example, here).
Computing the relative guard section size between your fuzzer (optfuzz), AFL++ (aflpp), and AFL++ cmplog (cmplog_build) yields the following results.
For ffmpeg, we are getting the following numbers:
optfuzz / aflpp = 0x0000000008ddcd6c / 0x0000000005b29d54 = 1.52
optfuzz / cmplog_build = 0x0000000008ddcd6c / 0x0000000005d94494 = 1.52
And for libarchive, we are getting the following ratios:
optfuzz / aflpp = 0x000000000072540c / 0x00000000004243f4 = 1.73
optfuzz / cmplog_build = 0x000000000072540c / 0x00000000004636b4 = 1.63
As we can see, the guard sections again differ considerably in size. Essentially, this means that the binaries compiled with your fuzzer contain a considerable number of additional edges compared to AFL++'s. This, in turn, has ramifications for computing coverage over time (at least in the way you do it to generate the plots in your paper): if your fuzzer's binary has significantly more edges (added by your instrumentation), it can trivially cover more edges than the baseline fuzzer, whose binary has fewer edges to begin with.
Looking at the script you are using for coverage computation (after the fuzzing runs), you are parsing AFL++'s fuzzing_stats or plot_data to calculate the coverage:
https://github.com/fuzz-evaluator/FOX-fuzzopt-eval-upstream/blob/23a3277c6604616157bb085dd26ba2f365780ff1/fuzzdeployment/process_results/parse_results.py#L305-L335
I don't think this comparison is fair: Since your binaries contain more than 50% additional edges for both targets (compared to the other fuzzer's binaries), this inflates your results artificially.
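To illustrate what I mean by parsing plot_data: below is a minimal sketch of how edge coverage could be read out of an AFL++ plot_data file. The column layout is the one I know from recent AFL++ versions (edges_found as the 13th column) and may differ from what your linked script assumes; the sample data is synthetic:

```python
import io

# Synthetic AFL++ plot_data; the header is taken from recent AFL++ versions.
PLOT_DATA = """\
# relative_time, cycles_done, cur_item, corpus_count, pending_total, pending_favs, map_size, saved_crashes, saved_hangs, max_depth, execs_per_sec, total_execs, edges_found
0, 0, 0, 1, 1, 1, 0.10%, 0, 0, 1, 1200.00, 1200, 150
60, 1, 3, 25, 10, 4, 0.45%, 0, 0, 3, 1100.00, 66000, 900
"""

def final_edges_found(plot_data_text: str) -> int:
    """Return edges_found from the last data row of an AFL++ plot_data file."""
    last = None
    for line in io.StringIO(plot_data_text):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        last = line
    fields = [f.strip() for f in last.split(",")]
    return int(fields[12])  # edges_found is the 13th column in this layout

edges = final_edges_found(PLOT_DATA)
total_edges = 0x1000 // 4  # hypothetical: this build's __sancov_guards size / 4
print(edges, edges / total_edges)
```

The key point is that edges_found is counted against each binary's own guard map. Comparing raw edges_found values across builds where one binary has more than 50% additional guards is, as far as I can tell, what inflates the numbers.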
Now, all of the above is based on my superficial understanding so far (I'll definitely take a closer look in the coming days), so please let me know if there's anything I missed or understood wrong. I'd appreciate it if you could outline why/how the issue I'm seeing is taken care of.
On another note, I'm curious if there is any place I can find your fuzzbench configuration so I can rerun your exact evaluation. Thanks in advance!