
Questions regarding your paper's evaluation #2

@nbars


Hello, congratulations on your paper -- nice work, interesting idea! I was interested in reproducing your results and was looking into your artifact for that. During this, I noticed a few things that I would like clarified.

A note before my questions: Since not all of your artifact's code is part of your main GitHub repository (some of it can only be found in the Docker image pulled as part of the reproduction setup), I created a GitHub repository that contains a copy of the data located in your Docker image adamstorek/fox:latest (sha256:11ac4f0ceb501d734af81aa4dc4d9ca7d4c87d55239ccf0218271d48bea8b78d) under /workspace/fuzzopt-eval. I'll use that repository to refer to specific parts of your code.

Coverage on standalone targets

You evaluated 15 standalone targets in your paper to compare against the state of the art. Table 4 displays the results of these experiments. I included the table below for convenience.
(Screenshot of Table 4 from the paper)

Reproduction steps I've performed

To check the setup used for evaluation, I followed the instructions you provided as part of your artifact. First, I pulled the provided Docker image via docker pull adamstorek/fox:latest (sha256:11ac4f0ceb501d734af81aa4dc4d9ca7d4c87d55239ccf0218271d48bea8b78d). Second, I spawned a Docker container using your image as described:

docker run --privileged --network='host' -d -it --name optfuzz_eval adamstorek/fox:latest
docker exec -it optfuzz_eval /bin/bash

Next, I switched to the /workspace/fuzzopt-eval/fuzzdeployment/targets folder and modified the set_all_targets.sh script to build libarchive (bsdtar) and ffmpeg by setting TARGETS="ffmpeg libarchive". I chose these targets because you reported coverage increases of up to 97.25% and 49.04% for them in your paper, respectively.
After executing set_all_targets.sh and waiting for the build process to finish, I inspected the resulting binaries.

To do so, I executed the following command, which prints the size of the SanitizerCoverage guard section of each binary built for the libarchive and ffmpeg targets:

# ffmpeg binaries
as5827@19e7b435119d:/workspace/fuzzopt-eval/fuzzdeployment/targets$ find ffmpeg/binaries/  -type f -executable -print -exec bash -c 'readelf -S {} | grep guard' \;
ffmpeg/binaries/optfuzz_build/ffmpeg
 [28] __sancov_guards   PROGBITS         0000000008ddcd6c  08ddbd6c
ffmpeg/binaries/cmplog_build/ffmpeg
 [28] __sancov_guards   PROGBITS         0000000005d94494  05d93494
ffmpeg/binaries/aflpp_build/ffmpeg
 [28] __sancov_guards   PROGBITS         0000000005b29d54  05b28d54

# libarchive binaries
as5827@19e7b435119d:/workspace/fuzzopt-eval/fuzzdeployment/targets$ find libarchive/binaries/  -type f -executable -print -exec bash -c 'readelf -S {} | grep guard' \;
libarchive/binaries/optfuzz_build/bsdtar
 [28] __sancov_guards   PROGBITS         000000000072540c  0072440c
libarchive/binaries/cmplog_build/bsdtar
 [28] __sancov_guards   PROGBITS         00000000004636b4  004626b4
libarchive/binaries/aflpp_build/bsdtar
 [28] __sancov_guards   PROGBITS         00000000004243f4  004233f4

This map contains one 4-byte entry per edge in the target. As we can see, the map sizes differ significantly. I believe this is due to additional instrumentation added by your modified AFL++ pass (for example, here).
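Given the 4-bytes-per-guard layout, an edge count can be estimated directly from the size of the guard section. A minimal sketch (the helper function is mine, not part of the artifact):

```python
# SanitizerCoverage's trace-pc-guard instrumentation allocates one
# 32-bit (4-byte) guard slot per instrumented edge, so the number of
# guards is the __sancov_guards section size divided by four.
GUARD_SIZE_BYTES = 4

def guard_count(section_size_bytes: int) -> int:
    """Estimate the number of instrumented edges from the guard section size."""
    return section_size_bytes // GUARD_SIZE_BYTES

# Example: a 4 KiB guard section corresponds to 1024 instrumented edges.
print(guard_count(0x1000))  # 1024
```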

Computing the relative guard section size between your fuzzer (optfuzz), AFL++ (aflpp), and AFL++ cmplog (cmplog_build) yields the following results.

For ffmpeg, we are getting the following numbers:
optfuzz / aflpp = 0x0000000008ddcd6c / 0x0000000005b29d54 = 1.56
optfuzz / cmplog_build = 0x0000000008ddcd6c / 0x0000000005d94494 = 1.52

And for libarchive, we are getting the following ratios:
optfuzz / aflpp = 0x000000000072540c / 0x00000000004243f4 = 1.73
optfuzz / cmplog_build = 0x000000000072540c / 0x00000000004636b4 = 1.63
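For completeness, these ratios can be re-derived from the hex values in a couple of lines of Python (values rounded to two decimals):

```python
# Hex values taken from the readelf output above.
sections = {
    "ffmpeg":     {"optfuzz": 0x08DDCD6C, "cmplog": 0x05D94494, "aflpp": 0x05B29D54},
    "libarchive": {"optfuzz": 0x0072540C, "cmplog": 0x004636B4, "aflpp": 0x004243F4},
}

for target, s in sections.items():
    print(f"{target}: optfuzz/aflpp = {s['optfuzz'] / s['aflpp']:.2f}, "
          f"optfuzz/cmplog = {s['optfuzz'] / s['cmplog']:.2f}")
```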

As we can see, the guard sections again differ considerably in size. Essentially, this means that the binaries compiled with your fuzzer contain a considerable number of additional edges compared to those of AFL++. This, in turn, has ramifications for computing coverage over time (at least the way you do it to generate the plots for your paper): if your fuzzer's binary has significantly more edges (added by your instrumentation), it is naturally easier to cover more edges than with the baseline fuzzer, whose binary has fewer edges.

Looking at the script you use for coverage computation (after the fuzzing runs), it appears that you parse AFL++'s fuzzer_stats or plot_data to calculate the coverage:

https://github.com/fuzz-evaluator/FOX-fuzzopt-eval-upstream/blob/23a3277c6604616157bb085dd26ba2f365780ff1/fuzzdeployment/process_results/parse_results.py#L305-L335
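For readers unfamiliar with those files, extracting an edge count from plot_data amounts to something like the following sketch (this is not the authors' parse_results.py; the sample column layout is an assumption based on recent AFL++ versions, which is why the code locates the edges_found column via the header instead of hard-coding an index):

```python
# Sketch: read the final edges_found value out of an AFL++ plot_data file.
# SAMPLE mimics the plot_data format; real files are comma-separated with
# a "#"-prefixed header line.
SAMPLE = """# relative_time, cycles_done, cur_item, corpus_count, pending_total, pending_favs, map_size, saved_crashes, saved_hangs, max_depth, execs_per_sec, total_execs, edges_found
0, 0, 0, 1, 1, 1, 0.01%, 0, 0, 1, 250.00, 1000, 812
60, 1, 3, 17, 9, 4, 0.02%, 0, 0, 2, 240.00, 15000, 1473
"""

def final_edges_found(plot_data: str) -> int:
    lines = [l.strip() for l in plot_data.splitlines() if l.strip()]
    header = [col.strip() for col in lines[0].lstrip("# ").split(",")]
    idx = header.index("edges_found")   # find the column by name
    return int(lines[-1].split(",")[idx].strip())

print(final_edges_found(SAMPLE))  # 1473
```

The key point is that edges_found is an absolute count over the fuzzer's own guard map, which is exactly why map-size differences between binaries matter.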

I don't think this comparison is fair: Since your binaries contain more than 50% additional edges for both targets (compared to the other fuzzer's binaries), this inflates your results artificially.
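To illustrate the inflation with purely hypothetical numbers (these are made up for the example, not measurements): a fuzzer whose binary carries 1.5x as many guards can report more raw edges while actually exercising a smaller fraction of its own map.

```python
# Hypothetical guard totals and results, for illustration only.
total_guards = {"optfuzz": 150_000, "aflpp": 100_000}
edges_found  = {"optfuzz":  30_000, "aflpp":  25_000}

for fuzzer, total in total_guards.items():
    found = edges_found[fuzzer]
    print(f"{fuzzer}: {found} raw edges, {found / total:.0%} of its own map")
# optfuzz "wins" on raw edges (30000 > 25000) yet covers a smaller
# fraction of its own map (20% vs 25%).
```

Normalizing by map size is itself only a rough correction; the usual way to sidestep the problem (as FuzzBench does) is to replay every fuzzer's corpus against a single, uniformly instrumented measurement binary.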

Now, all of the above is based on my still-superficial understanding (I'll definitely take a closer look in the coming days), so please let me know if I missed or misunderstood anything. I'd appreciate it if you could outline why/how the issue I'm describing is taken care of.

On another note, I'm curious whether there is any place where I can find your FuzzBench configuration so that I can rerun your exact evaluation. Thanks in advance!
