fuzzer: don't remove or modify byte of empty input by McSinyx · Pull Request #23180 · ziglang/zig

McSinyx · 2025-03-10T09:10:35Z

The actual patch is rather trivial, but the debugging process reveals more hidden problems. In today's episode of Who Fuzzes the Fuzzer?, I got a segfault with the following:

const std = @import("std");

fn findSecret(context: []const u8, input: []const u8) !void {
    if (std.mem.eql(u8, context, input))
        return error.FoundSecretString;
}

test "fuzz example" {
    try std.testing.fuzz(@as([]const u8, "canyoufindme"), findSecret, .{
        .corpus = &.{ "c" }
    });
}

Relevant log:

Segmentation fault at address 0x0
.../lib/fuzzer.zig:589:35: 0x11a2880 in start (fuzzer)
        @memcpy(l.items[old_len..][0..items.len], items);
                                  ^
.../lib/fuzzer.zig:460:17: 0x11a4e9f in fuzzer_start (fuzzer)
    fuzzer.start() catch |err| oom(err);

Yes, part of the trace is missing, but I managed to pinpoint the bug to the code below: when old_input.len == 0, omitted_index ends up as std.math.maxInt(usize):

zig/lib/fuzzer.zig

Lines 314 to 318 in 8e0a4ca

    .remove_byte => {
        const omitted_index = rng.uintLessThanBiased(usize, old_input.len);
        f.input.appendSliceAssumeCapacity(old_input[0..omitted_index]);
        f.input.appendSliceAssumeCapacity(old_input[omitted_index + 1 ..]);
    },

Mysteriously, assertions are evaded, unreachables are reached, bound checks are ignored and panics don't pan above the segfaulting statement. I took a look, and lib/fuzzer/web/main.zig's panic function does indeed call @trap:

zig/lib/fuzzer/web/main.zig

Lines 33 to 38 in 8e0a4ca

    pub fn panic(msg: []const u8, st: ?*std.builtin.StackTrace, addr: ?usize) noreturn {
        _ = st;
        _ = addr;
        log.err("panic: {s}", .{msg});
        @trap();
    }

McSinyx · 2025-03-30T14:37:43Z

🥺 May I have some eyes over the patch, pwetty pwease?

This PR significantly improves the capabilities of the fuzzer. For comparison, here is a ten-minute head-to-head between the old and new fuzzer implementations (with newly included fuzz tests):

-- Old --
```
Total Runs: 49020931
Unique Runs: 1044131 (2.1%)
Speed (Runs/Second): 81696
Coverage: 2069 / 15866 (13.0%)
```
(note: Unique Runs is highly inflated due to the inefficiency of the old implementation)

-- New --
```
Total Runs: 537039526
Unique Runs: 1511 (0.0%)
Speed (Runs/Second): 894950
Coverage: 3000 / 15719 (19.1%)
Examples:
`while(C)i(){}else|`
`{y:n()align(b)addrspace`
`switch(P){else=>`
`[:l]align(_:r:l)R`
`(if(b){defer{nosuspend`
`union(enum(I))`
```

NOTE: You have to rebuild the compiler due to new fuzzing instrumentation being enabled for memory loads.

The changes made to the fuzzer to accomplish this feat mostly include tracking memory reads from .rodata to determine new runs, new mutations (especially the ones that insert const values from .rodata reads and __sanitizer_cov_trace_const_cmp), and minimizing found inputs. Additionally, the runs per second have greatly increased due to generating smaller inputs and avoiding clearing the 8-bit pc counters. An additional feature is that the length of the input file is now stored and the old input file is rerun upon start, though this does not close ziglang#20803 since it does not output the input (though it can be very easily retrieved from the cache directory).

Other changes made to the fuzzer include more logical initialization, using one shared file `in` for inputs, creating corpus files with proper sizes, and using hexadecimal-numbered corpus files for simplicity. Additionally, volatile was removed from MemoryMappedList since all that is needed is a guarantee that the compiler has done the writes, which is already accomplished with atomic ordering.

Furthermore, I added several new fuzz tests to gauge the fuzzer's efficiency. I also tried to add a test for zstandard decompression, which crashed within 60,000 runs (less than a second).

Bug fixes include:
* Fixed a race condition when multiple fuzzer processes needed to use the same coverage file.
* Web interface stats now update even when unique runs is not changing.
* Fixed tokenizer.testPropertiesUpheld to allow stray carriage returns since they are valid whitespace.
* Closes ziglang#23180

POSSIBLE IMPROVEMENTS:
* Remove the 8-bit pc counting code in favor of a call to a sanitizer function that updates a flag if a new pc hit happened (similar to how the __sanitizer_cov_load functions already operate).
* Less basic input minimization function. It could also try splitting inputs into two between each byte to see if they both hit the same pcs. This is useful as smaller inputs are usually much more efficient.
* Deterministic mutations when a new input is found.
* Culling out corpus inputs that are redundant due to smaller inputs already hitting their pcs and memory addresses.
* Applying multiple mutations during dry spells.
* Prioritizing some corpus inputs.
* Creating a list of the most successful input splices (which would likely contain grammar keywords) and creating a custom mutation for adding them.
* Removing some less-efficient mutations.
* Store effective mutations to the disk for the benefit of future runs.
* Counting __sanitizer_cov `@returnAddress`es in determining unique runs.
* Optimize __sanitizer_cov_trace_const_cmp methods (the use of an ArrayHashMap is not too fast).
* Processor affinity
* Exclude fuzzer's .rodata

Nevertheless, I feel like the fuzzer is in a viable place to start being useful (as demonstrated in ziglang#23413)
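To make "minimizing found inputs" concrete, here is a rough sketch of greedy byte-removal minimization: drop one byte at a time and keep the removal only if the input is still interesting. It is not the fuzzer's implementation. The predicate `stillInteresting` is a hypothetical stand-in for "this input still hits the same pcs", and `minimize` is a name chosen for the example.

```zig
const std = @import("std");

/// Hypothetical stand-in for "the input still reaches the same coverage".
fn stillInteresting(input: []const u8) bool {
    return std.mem.indexOf(u8, input, "if(") != null;
}

/// Greedy in-place minimization: tries to delete each byte in turn and keeps
/// the deletion whenever `keep` still accepts the shorter input.
/// Returns the minimized prefix of `buf`.
fn minimize(buf: []u8, comptime keep: fn ([]const u8) bool) []u8 {
    var len = buf.len;
    var i: usize = 0;
    while (i < len) {
        const removed = buf[i];
        // Drop byte i by shifting the tail one position to the left.
        std.mem.copyForwards(u8, buf[i .. len - 1], buf[i + 1 .. len]);
        if (keep(buf[0 .. len - 1])) {
            // Still interesting: keep the shorter input and retry index i.
            len -= 1;
        } else {
            // Undo: shift the tail back and restore the removed byte.
            std.mem.copyBackwards(u8, buf[i + 1 .. len], buf[i .. len - 1]);
            buf[i] = removed;
            i += 1;
        }
    }
    return buf[0..len];
}
```

This calls the predicate once per byte per pass, which is the "basic" end of the spectrum; the splitting idea listed under possible improvements would be a separate strategy layered on top.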


McSinyx · 2025-09-22T01:37:36Z

Closing as superseded by GH-23416, which is merged.

fuzzer: don't remove or modify byte of empty input

bf30d62

gooncreeper mentioned this pull request Mar 31, 2025

greatly improve capabilities of the fuzzer #23416

Merged

McSinyx closed this Sep 22, 2025

McSinyx deleted the fuzz-bound branch September 22, 2025 01:37

McSinyx wants to merge 1 commit into ziglang:master from McSinyx:fuzz-bound
