FuzzBench: SBFT'23 Final Evaluation report

Experiment Summary

Each benchmark contains one known bug. Fuzzers are evaluated on their ability to find an input that triggers the bug and causes a crash. We show two aggregate (cross-benchmark) rankings of the fuzzers. The first ranking is based on the average of per-benchmark scores, where a score is the fuzzer's median bug coverage expressed as a percentage of the highest median bug coverage reached on that benchmark (higher is better); ties are broken by the average time taken to find the input. Triggering the bug repeatedly earns no extra score. The second ranking shows the average rank of the fuzzers after ranking them on each benchmark by their median bug coverage (lower is better).

By avg. score

fuzzer	average normalized score	average extra time to find bugs (seconds)
pastis	53.33	0.0
aflrustrust	53.33	960.0
aflsmart_plusplus	50.00	1440.0
afl	46.67	4140.0
honggfuzz	46.67	5310.0
libafl_libfuzzer	46.67	5490.0
aflplusplusplus	40.00	7680.0
hastefuzz	40.00	8880.0
libfuzzer	40.00	9030.0
aflplusplus	40.00	9600.0
symsan	20.00	24720.0
learnperffuzz	6.67	32100.0
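
As a rough illustration of how the first ranking can be derived, the sketch below averages per-benchmark normalized scores and breaks ties by time. The benchmark and fuzzer names and all numbers are hypothetical toy data, and treating an unsupported benchmark as zero bug coverage is an assumption inferred from the averages in this report, not something the report states.

```python
from statistics import mean

# Hypothetical toy data: median bug coverage per fuzzer on each benchmark
# (1.0 = the median trial found the bug, None = benchmark not supported).
median_bug_coverage = {
    "bench_a": {"fuzz_x": 1.0, "fuzz_y": 1.0, "fuzz_z": 0.0},
    "bench_b": {"fuzz_x": 1.0, "fuzz_y": 1.0, "fuzz_z": None},
    "bench_c": {"fuzz_x": 0.0, "fuzz_y": 0.0, "fuzz_z": 0.0},
}
# Hypothetical average extra time (seconds) needed to find the bugs.
avg_extra_time = {"fuzz_x": 960.0, "fuzz_y": 0.0, "fuzz_z": 4140.0}


def normalized_scores(per_benchmark):
    """Average, over benchmarks, of the percentage of the best median reached."""
    per_fuzzer = {}
    for medians in per_benchmark.values():
        # Assumption: a missing result counts as zero bug coverage.
        filled = {f: (v or 0.0) for f, v in medians.items()}
        best = max(filled.values())
        for fuzzer, value in filled.items():
            # If nobody finds the bug, every fuzzer scores 0 on the benchmark.
            score = 100.0 * value / best if best > 0 else 0.0
            per_fuzzer.setdefault(fuzzer, []).append(score)
    return {fuzzer: mean(scores) for fuzzer, scores in per_fuzzer.items()}


scores = normalized_scores(median_bug_coverage)
# Higher score first; ties are broken by the shorter average time.
ranking = sorted(scores, key=lambda f: (-scores[f], avg_extra_time[f]))
print(ranking)
```

In the toy data fuzz_x and fuzz_y tie on score, so the faster fuzz_y is listed first, which mirrors how pastis and aflrustrust are ordered in the table above.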

By avg. rank

fuzzer	average rank
aflrustrust	1.40
pastis	1.47
honggfuzz	1.80
afl	1.80
aflsmart_plusplus	2.00
libfuzzer	2.13
libafl_libfuzzer	2.13
hastefuzz	2.13
aflplusplusplus	2.13
aflplusplus	2.13
symsan	3.67
learnperffuzz	5.07
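
The average rank can be sketched in the same spirit. The code below ranks fuzzers on each benchmark by median bug coverage, gives tied fuzzers the best rank of their group, and averages the ranks. Both the tie rule and counting unsupported benchmarks as zero coverage are assumptions inferred from the numbers: they reproduce, for example, the 1.40 of aflrustrust, which misses the bug only on ffmpeg where six fuzzers find it, giving (14 × 1 + 7) / 15 = 1.40. The data in the sketch is hypothetical.

```python
# Hypothetical toy data: median bug coverage (None = benchmark not supported).
median_bug_coverage = {
    "bench_a": {"fuzz_x": 1.0, "fuzz_y": 1.0, "fuzz_z": 0.0},
    "bench_b": {"fuzz_x": 1.0, "fuzz_y": 0.5, "fuzz_z": None},
    "bench_c": {"fuzz_x": 0.0, "fuzz_y": 0.0, "fuzz_z": 0.0},
}


def competition_ranks(values):
    """Rank descending; tied values share the best (lowest) rank of the group."""
    return {
        name: 1 + sum(1 for other in values.values() if other > value)
        for name, value in values.items()
    }


def average_ranks(per_benchmark):
    """Mean of the per-benchmark ranks for every fuzzer."""
    per_fuzzer = {}
    for medians in per_benchmark.values():
        # Assumption: a missing result is ranked as zero bug coverage.
        filled = {f: (v or 0.0) for f, v in medians.items()}
        for fuzzer, rank in competition_ranks(filled).items():
            per_fuzzer.setdefault(fuzzer, []).append(rank)
    return {f: sum(r) / len(r) for f, r in per_fuzzer.items()}


avg_rank = average_ranks(median_bug_coverage)
for fuzzer in sorted(avg_rank, key=avg_rank.get):
    print(f"{fuzzer}\t{avg_rank[fuzzer]:.2f}")
```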

Critical difference diagram

The diagram visualizes the average rank of the fuzzers (the second ranking above) together with the significance of the differences between them. The "critical difference" (CD) is determined by the Friedman test followed by the Nemenyi post-hoc test. See more in the documentation.
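
For orientation, the Nemenyi critical difference is commonly computed as CD = q_alpha * sqrt(k(k+1) / (6N)) for k methods compared on N data sets. The sketch below applies that textbook formula to 12 fuzzers and 15 benchmarks. The function names are illustrative, and the value q_alpha = 3.268 (12 methods, alpha = 0.05) is an assumption that should be checked against a published table of critical values; the tool that drew the diagram may differ in detail.

```python
import math


def critical_difference(num_fuzzers, num_benchmarks, q_alpha):
    """Nemenyi critical difference for average ranks (textbook formula)."""
    k, n = num_fuzzers, num_benchmarks
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


def significantly_different(avg_rank_a, avg_rank_b, cd):
    """Two fuzzers differ significantly if their average ranks differ by >= CD."""
    return abs(avg_rank_a - avg_rank_b) >= cd


# Assumed critical value for 12 methods at alpha = 0.05; verify before relying on it.
Q_ALPHA_12 = 3.268
cd = critical_difference(num_fuzzers=12, num_benchmarks=15, q_alpha=Q_ALPHA_12)
print(f"CD = {cd:.2f}")
```

With few benchmarks and many fuzzers the critical difference is large, so only wide gaps in average rank count as significant.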

Note: If a fuzzer does not support all benchmarks, its position in this diagram can be worse than it should be. Please check the list of supported benchmarks for the fuzzers you are interested in. The list can be specified in the fuzzer's README.md.

Median relative code-coverages on each benchmark

Note: The relative coverage summary table shows the median performance of each fuzzer relative to the experiment maximum. Because the maximum is taken over all trials while the table reports medians, the highest relative performance may be below 100%.
trial_relative_coverage = trial_coverage / experiment_max_coverage
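
The formula can be checked against the arrow_arrow-ipc-stream-fuzz_1a34a0 code coverage statistics given later in this report. The sketch below divides a few of the reported medians by the largest "max" value in that table (2566, reached by hastefuzz). Dividing every trial by the same constant and then taking the median gives the same result as dividing the median. The summary table entries for this benchmark match these ratios truncated to whole percent; that truncation is an inference from the numbers, not a documented rule.

```python
import math

# Median code coverage on arrow_arrow-ipc-stream-fuzz_1a34a0, copied from the
# code coverage sample statistics table of this report.
median_coverage = {
    "hastefuzz": 2473.5,
    "libafl_libfuzzer": 2439.0,
    "afl": 2438.0,
    "aflplusplus": 2309.0,
    "learnperffuzz": 1516.0,
}
# Largest "max" value in the same table (a hastefuzz trial).
experiment_max_coverage = 2566.0


def relative_coverage_percent(coverage, experiment_max):
    """trial_relative_coverage = trial_coverage / experiment_max_coverage, in %."""
    return 100.0 * coverage / experiment_max


relative = {
    fuzzer: relative_coverage_percent(cov, experiment_max_coverage)
    for fuzzer, cov in median_coverage.items()
}
for fuzzer, value in relative.items():
    # Truncation to whole percent is an assumption inferred from the table.
    print(f"{fuzzer}\t{value:.2f}\t{math.floor(value)}")
```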

	libafl_libfuzzer	hastefuzz	aflrustrust	aflplusplusplus	aflplusplus	afl	aflsmart_plusplus	libfuzzer	pastis	honggfuzz	symsan	learnperffuzz
FuzzerMedian	93.00	92.00	94.00	88.00	89.00	93.00	91.00	84.00	83.00	84.00	83.00	61.00
FuzzerMean	86.67	85.27	85.13	85.00	83.60	83.20	82.13	81.13	76.53	74.07	56.60	55.87
arrow_arrow-ipc-stream-fuzz_1a34a0	95.00	96.00	94.00	86.00	89.00	95.00	83.00	92.00	nan	nan	nan	59.00
aspell_aspell_fuzzer_e8eb74	80.00	81.00	82.00	81.00	81.00	80.00	80.00	78.00	80.00	84.00	83.00	74.00
assimp_assimp_fuzzer_4d451f	33.00	57.00	51.00	62.00	51.00	36.00	36.00	71.00	71.00	81.00	90.00	0.00
bloaty_fuzz_target_52948c	98.00	81.00	95.00	76.00	90.00	96.00	96.00	71.00	84.00	93.00	nan	71.00
ffmpeg_ffmpeg_demuxer_fuzzer_7adeef	81.00	61.00	56.00	85.00	72.00	58.00	59.00	65.00	83.00	82.00	nan	16.00
file_magic_fuzzer_2d5f85	93.00	98.00	99.00	89.00	88.00	93.00	92.00	92.00	73.00	nan	71.00	71.00
grok_grk_decompress_fuzzer_9cd001	87.00	96.00	94.00	97.00	97.00	95.00	95.00	93.00	97.00	96.00	96.00	86.00
harfbuzz_hb-shape-fuzzer_17863b	99.00	96.00	89.00	95.00	95.00	96.00	96.00	84.00	95.00	96.00	89.00	77.00
lcms_cms_transform_all_fuzzer_97d37d	84.00	82.00	76.00	70.00	59.00	59.00	56.00	67.00	68.00	66.00	nan	2.00
libaom_av1_dec_fuzzer_6e1848	98.00	97.00	97.00	94.00	94.00	94.00	97.00	91.00	97.00	98.00	93.00	84.00
libpcap_fuzz_filter_98b0a2	93.00	94.00	90.00	95.00	95.00	92.00	88.00	86.00	94.00	91.00	nan	54.00
libxml2_xml_e85b9b	97.00	92.00	95.00	93.00	94.00	98.00	98.00	76.00	98.00	85.00	83.00	61.00
mbedtls_fuzz_dtlsclient_7c6b0e	68.00	69.00	68.00	68.00	68.00	68.00	69.00	68.00	66.00	68.00	67.00	48.00
php_php-fuzz-parser_0dbedb	96.00	95.00	95.00	96.00	96.00	96.00	96.00	95.00	98.00	98.00	92.00	89.00
systemd_fuzz-network-parser_288baf	98.00	84.00	96.00	88.00	85.00	92.00	91.00	88.00	44.00	73.00	85.00	46.00

Fuzzers are sorted by "FuzzerMean" (average median relative coverage), highest on the left.
Green background = highest relative median coverage.
Blue gradient background = greater than 95% relative median coverage.

Median relative bug-coverages on each benchmark

Note: The relative coverage summary table shows the median performance of each fuzzer relative to the experiment maximum. Because the maximum is taken over all trials while the table reports medians, the highest relative performance may be below 100%.
trial_relative_coverage = trial_coverage / experiment_max_coverage

	aflrustrust	pastis	aflsmart_plusplus	afl	honggfuzz	libafl_libfuzzer	aflplusplus	aflplusplusplus	hastefuzz	libfuzzer	symsan	learnperffuzz
FuzzerMedian	100.00	100.00	50.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
FuzzerMean	53.33	53.33	50.00	46.67	46.67	46.67	40.00	40.00	40.00	40.00	20.00	6.67
arrow_arrow-ipc-stream-fuzz_1a34a0	0.00	nan	0.00	0.00	nan	0.00	0.00	0.00	0.00	0.00	nan	0.00
aspell_aspell_fuzzer_e8eb74	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00
assimp_assimp_fuzzer_4d451f	100.00	100.00	50.00	100.00	100.00	0.00	100.00	100.00	100.00	100.00	100.00	0.00
bloaty_fuzz_target_52948c	100.00	100.00	100.00	100.00	0.00	100.00	0.00	0.00	0.00	0.00	nan	0.00
ffmpeg_ffmpeg_demuxer_fuzzer_7adeef	0.00	100.00	0.00	0.00	100.00	100.00	100.00	100.00	0.00	100.00	nan	0.00
file_magic_fuzzer_2d5f85	100.00	0.00	100.00	100.00	nan	0.00	100.00	100.00	100.00	100.00	0.00	0.00
grok_grk_decompress_fuzzer_9cd001	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
harfbuzz_hb-shape-fuzzer_17863b	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	0.00
lcms_cms_transform_all_fuzzer_97d37d	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	nan	0.00
libaom_av1_dec_fuzzer_6e1848	100.00	100.00	100.00	0.00	100.00	100.00	0.00	0.00	100.00	0.00	0.00	0.00
libpcap_fuzz_filter_98b0a2	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	nan	0.00
libxml2_xml_e85b9b	100.00	100.00	100.00	100.00	100.00	100.00	0.00	0.00	0.00	0.00	0.00	0.00
mbedtls_fuzz_dtlsclient_7c6b0e	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
php_php-fuzz-parser_0dbedb	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
systemd_fuzz-network-parser_288baf	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00

Fuzzers are sorted by "FuzzerMean" (average median relative coverage), highest on the left.
Green background = highest relative median coverage.
Blue gradient background = greater than 95% relative median coverage.

Total unique crashes found on each benchmark
(Note that a unique crash does not imply a unique bug)

	Total	honggfuzz	pastis	aflplusplusplus	symsan	aflrustrust	hastefuzz	libafl_libfuzzer	aflplusplus	libfuzzer	aflsmart_plusplus	afl	learnperffuzz
FuzzerSum	229	133	94	79	76	70	55	53	48	39	37	29	5
arrow_arrow-ipc-stream-fuzz_1a34a0	0	nan	nan	0	nan	0	0	0	0	0	0	0	0
aspell_aspell_fuzzer_e8eb74	5	2	5	2	1	2	3	2	2	2	2	2	1
assimp_assimp_fuzzer_4d451f	149	83	55	49	68	30	27	5	22	24	2	3	2
bloaty_fuzz_target_52948c	1	1	1	1	nan	1	1	1	1	0	1	1	0
ffmpeg_ffmpeg_demuxer_fuzzer_7adeef	21	7	7	12	nan	3	5	12	6	2	4	2	0
file_magic_fuzzer_2d5f85	2	nan	0	1	0	1	1	1	1	1	1	2	0
grok_grk_decompress_fuzzer_9cd001	3	2	2	3	2	2	2	1	2	2	2	2	1
harfbuzz_hb-shape-fuzzer_17863b	8	5	4	4	2	5	6	8	4	3	6	6	1
lcms_cms_transform_all_fuzzer_97d37d	8	2	2	1	nan	1	3	8	0	3	0	2	0
libaom_av1_dec_fuzzer_6e1848	16	16	16	5	3	16	7	13	5	0	14	6	0
libpcap_fuzz_filter_98b0a2	0	0	0	0	nan	0	0	0	0	0	0	0	0
libxml2_xml_e85b9b	3	2	2	0	0	2	0	2	1	2	3	2	0
mbedtls_fuzz_dtlsclient_7c6b0e	0	0	0	0	0	0	0	0	0	0	0	0	0
php_php-fuzz-parser_0dbedb	3	3	0	1	0	0	0	0	1	0	2	1	0
systemd_fuzz-network-parser_288baf	10	10	0	0	0	7	0	0	3	0	0	0	0

Fuzzers are sorted by "FuzzerSum", highest on the left.
Green background = most unique crashes found.
* Note: This table counts unique crashes found across all trials.

arrow_arrow-ipc-stream-fuzz_1a34a0 summary

Discovered bug coverage distribution

Mean code coverage growth over time

Mean bug coverage growth over time

* The error bands show the 95% confidence interval around the mean code coverage.
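
The report does not say how the band is computed. As one plausible reading, the sketch below builds a 95% interval around a mean with the normal approximation, mean ± 1.96 · std / sqrt(count), using the final hastefuzz code coverage statistics from this report as input. The function name and the choice of the normal approximation are assumptions; plotting libraries often use bootstrapping instead, which gives slightly different bounds.

```python
import math


def mean_confidence_interval(mean, std, count, z=1.96):
    """Normal-approximation confidence interval around a sample mean."""
    half_width = z * std / math.sqrt(count)
    return mean - half_width, mean + half_width


# hastefuzz on arrow_arrow-ipc-stream-fuzz_1a34a0 at time 82800:
# mean 2459.55, std 69.794039, 20 trials (values from this report).
low, high = mean_confidence_interval(2459.55, 69.794039, 20)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```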

Sample statistics and statistical significance (bugs covered)

Bug coverage sample statistics

fuzzer	time	count	mean	std	min	25%	median	75%	max
afl	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
aflplusplus	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
aflplusplusplus	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
aflrustrust	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
aflsmart_plusplus	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
hastefuzz	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
learnperffuzz	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
libafl_libfuzzer	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
libfuzzer	82800	20.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

Vargha-Delaney A12 measure

The table summarizes the A12 values from the pairwise Vargha-Delaney A measure of effect size. Green cells indicate the probability the fuzzer in the row will outperform the fuzzer in the column.
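
A minimal sketch of the A12 measure, assuming the standard definition: the share of trial pairs in which the row fuzzer beats the column fuzzer, with ties counted as half. The function name and sample values are illustrative. A value of 0.5 means neither fuzzer tends to outperform the other, which is what every pair yields on this benchmark's bug coverage, where all trials report 0.

```python
def vargha_delaney_a12(row_samples, column_samples):
    """Probability that a row-fuzzer trial beats a column-fuzzer trial."""
    greater = sum(1 for x in row_samples for y in column_samples if x > y)
    equal = sum(1 for x in row_samples for y in column_samples if x == y)
    return (greater + 0.5 * equal) / (len(row_samples) * len(column_samples))


# All 20 trials of both fuzzers covered 0 bugs: no effect in either direction.
a12_all_zero = vargha_delaney_a12([0.0] * 20, [0.0] * 20)
# Hypothetical samples where the row fuzzer is mostly ahead.
a12_ahead = vargha_delaney_a12([3, 4, 5], [1, 2, 3])
print(a12_all_zero, round(a12_ahead, 3))
```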

Mann-Whitney U test

The table summarizes the p values of pairwise Mann-Whitney U tests. Green cells indicate that the reached coverage distribution of a given fuzzer pair is significantly different.
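
To make the test concrete, the sketch below computes the U statistic and a two-sided p value using the normal approximation. It is a simplified illustration: it applies neither a tie correction nor a continuity correction, so a statistics library should be preferred for real analysis. The function name and the second pair of samples are hypothetical.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U, two-sided p) using a plain normal approximation."""
    m, n = len(xs), len(ys)
    # U counts pairs where x beats y; ties count as half a win.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    mean_u = m * n / 2.0
    sd_u = math.sqrt(m * n * (m + n + 1) / 12.0)  # no tie correction
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


# Identical samples (as in this benchmark's bug coverage): no evidence of a difference.
u_same, p_same = mann_whitney_u([0.0] * 20, [0.0] * 20)
# Hypothetical, fully separated samples: the difference is significant.
u_diff, p_diff = mann_whitney_u(list(range(100, 120)), list(range(20)))
print(u_same, p_same, u_diff, p_diff < 0.05)
```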

Sample statistics and statistical significance (code coverage)

Code coverage sample statistics

fuzzer	time	count	mean	std	min	25%	median	75%	max
hastefuzz	82800	20.0	2459.55	69.794039	2316.0	2397.00	2473.5	2512.25	2566.0
libafl_libfuzzer	82800	20.0	2438.90	34.129629	2336.0	2427.75	2439.0	2455.50	2510.0
afl	82800	20.0	2433.90	30.755231	2369.0	2427.75	2438.0	2446.50	2504.0
aflrustrust	82800	20.0	2392.80	52.156243	2306.0	2334.50	2419.5	2438.50	2453.0
libfuzzer	82800	20.0	2373.65	45.375944	2296.0	2342.75	2380.5	2402.50	2444.0
aflplusplus	82800	20.0	2327.45	38.558875	2300.0	2305.75	2309.0	2332.50	2446.0
aflplusplusplus	82800	20.0	2210.05	49.650860	2025.0	2200.00	2220.5	2238.50	2255.0
aflsmart_plusplus	82800	20.0	2123.70	95.438103	1822.0	2121.50	2150.0	2171.25	2225.0
learnperffuzz	82800	20.0	1674.50	204.702223	1516.0	1516.00	1516.0	1872.00	2031.0

Vargha-Delaney A12 measure

The table summarizes the A12 values from the pairwise Vargha-Delaney A measure of effect size. Green cells indicate the probability the fuzzer in the row will outperform the fuzzer in the column.

Mann-Whitney U test

The table summarizes the p values of pairwise Mann-Whitney U tests. Green cells indicate that the reached coverage distribution of a given fuzzer pair is significantly different.