Background
Stringy is designed to analyze binary files for string extraction, which often involves processing large executables, malware samples, system libraries, and packed binaries that can range from megabytes to gigabytes in size. Loading entire files into memory using traditional std::fs::read() is inefficient and can cause memory pressure, especially when analyzing multiple files or very large binaries.
Problem
The extraction pipeline is not yet implemented, but once it is, it will need to read binary files efficiently. The container parsing infrastructure already operates on byte slices (&[u8]), so it can consume memory-mapped data without API changes. Without memory mapping:
- Large files (>100MB) consume excessive RAM
- Memory allocation overhead impacts performance
- File contents are copied out of the OS page cache into a private buffer instead of being shared with it
- Potential OOM errors on systems with limited RAM
- Slower startup time for large binaries
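To illustrate why slice-based parsers are a good fit here, the following is a minimal sketch of a function that only sees `&[u8]` and therefore does not care whether the bytes come from a `Vec<u8>` or a mapped region. The names `ContainerKind` and `sniff_container` are hypothetical and are not taken from the existing container module.

```rust
/// Hypothetical container kinds, for illustration only; the real parser
/// types in src/container may look different.
#[derive(Debug, PartialEq, Eq)]
pub enum ContainerKind {
    Elf,
    Pe,
    MachO,
    Unknown,
}

/// Guess the container format from the leading magic bytes.
/// Takes a plain byte slice, so any backing storage that derefs to
/// `[u8]` (a Vec, a boxed slice, a memory map) can be passed in.
pub fn sniff_container(data: &[u8]) -> ContainerKind {
    match data {
        [0x7f, b'E', b'L', b'F', ..] => ContainerKind::Elf,
        [b'M', b'Z', ..] => ContainerKind::Pe,
        // On-disk little-endian bytes of 0xfeedface (32-bit) and 0xfeedfacf (64-bit).
        [0xce, 0xfa, 0xed, 0xfe, ..] | [0xcf, 0xfa, 0xed, 0xfe, ..] => ContainerKind::MachO,
        _ => ContainerKind::Unknown,
    }
}
```

Calling `sniff_container(&bytes)` works the same way for an owned buffer and for a mapping, which is the property the proposal relies on.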
Proposed Solution
Implement a file reading strategy using the memmap2 crate, in the following steps:
1. Add Dependencies
Add memmap2 to Cargo.toml:
[dependencies]
memmap2 = "0.9"
2. Create File Reader Module
Create src/io.rs or src/file_reader.rs with:
- FileReader trait or struct that abstracts file reading
- Memory-mapped reading for files > threshold (e.g., 10MB)
- Direct std::fs::read() for smaller files (avoids mmap overhead)
- Safe handling of memory mapping (read-only, error handling)
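The size-based choice described above could be isolated in a small pure function, which keeps the threshold logic testable without touching the filesystem. This is a sketch under assumptions: `ReadStrategy` and `choose_strategy` are hypothetical names, and the 10 MB value is only the example figure from this issue.

```rust
/// Example threshold from this proposal (10 MB); the final value is open.
pub const MMAP_THRESHOLD: u64 = 10 * 1024 * 1024;

/// Hypothetical enum naming the two read paths.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ReadStrategy {
    /// Read the whole file into a Vec<u8> with std::fs::read().
    Buffered,
    /// Map the file read-only into the address space.
    Mapped,
}

/// Pick a strategy from the file length alone.
/// Files strictly larger than `threshold` are mapped; everything else,
/// including empty files, is read into a buffer.
pub fn choose_strategy(file_len: u64, threshold: u64) -> ReadStrategy {
    if file_len > threshold {
        ReadStrategy::Mapped
    } else {
        ReadStrategy::Buffered
    }
}
```

Passing the threshold as a parameter, rather than reading the constant directly, would let tests exercise both paths with small files.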
3. Integration Points
- Update src/main.rs to use the new file reader
- Ensure src/container/mod.rs parsers work with memory-mapped data
- Handle edge cases (empty files, special files, pipes)
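One possible way to handle the edge cases listed above is to classify the input from its metadata before choosing a read path. The sketch below uses only the standard library; `InputKind` and `classify_input` are hypothetical names. It assumes that empty files and non-regular inputs (pipes, devices, directories) should never be mapped, since mapping a zero-length or non-seekable input may fail depending on the platform.

```rust
use std::fs::Metadata;

/// Hypothetical classification of an input path, for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum InputKind {
    /// Regular file with zero length: nothing to map or scan.
    Empty,
    /// Regular file with content: eligible for the size-based strategy.
    Regular,
    /// Pipe, device, directory, or similar: not a candidate for mapping.
    Special,
}

/// Classify an input from its metadata.
/// Note: some pseudo-files (for example under /proc on Linux) report a
/// length of zero although they have content, so `Empty` is a hint to
/// fall back to a buffered read rather than proof there is no data.
pub fn classify_input(meta: &Metadata) -> InputKind {
    if !meta.is_file() {
        InputKind::Special
    } else if meta.len() == 0 {
        InputKind::Empty
    } else {
        InputKind::Regular
    }
}
```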
4. Implementation Example
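use std::fs::File;
use std::io::Result;
use std::path::Path;

use memmap2::Mmap;

/// Files larger than this many bytes are memory-mapped (example value: 10 MB).
const MMAP_THRESHOLD: u64 = 10 * 1024 * 1024;

pub struct FileReader {
    mmap: Option<Mmap>,
    data: Vec<u8>,
}

impl FileReader {
    pub fn open<P: AsRef<Path>>(path: P) -> Result<Self> {
        // Borrow the path once so it can be used for both File::open and fs::read.
        let path = path.as_ref();
        let file = File::open(path)?;
        let metadata = file.metadata()?;
        if metadata.len() > MMAP_THRESHOLD {
            // Use memory mapping for large files.
            // SAFETY: the mapping is read-only, but the file must not be
            // truncated or modified by another process while it is mapped.
            let mmap = unsafe { Mmap::map(&file)? };
            Ok(Self { mmap: Some(mmap), data: Vec::new() })
        } else {
            // Read small files directly.
            let data = std::fs::read(path)?;
            Ok(Self { mmap: None, data })
        }
    }

    /// Return the file contents regardless of which path loaded them.
    pub fn as_slice(&self) -> &[u8] {
        match &self.mmap {
            Some(mmap) => &mmap[..],
            None => &self.data,
        }
    }
}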
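As a possible alternative to pairing an `Option` with a `Vec`, the two backing stores could be modelled as an enum so that exactly one of them exists at a time. The sketch below is generic over the mapped type `M` purely so that it compiles with the standard library alone; in the real module `M` would presumably be `memmap2::Mmap`, which dereferences to `[u8]`. `FileData` is a hypothetical name.

```rust
/// Hypothetical enum-based storage: either an owned buffer or a mapping.
/// `M` stands in for the mapped type (assumed to be memmap2::Mmap in
/// practice); any type that derefs to `[u8]` works.
pub enum FileData<M> {
    Buffered(Vec<u8>),
    Mapped(M),
}

impl<M: std::ops::Deref<Target = [u8]>> FileData<M> {
    /// Borrow the contents as a byte slice, whichever variant holds them.
    pub fn as_slice(&self) -> &[u8] {
        match self {
            FileData::Buffered(bytes) => bytes.as_slice(),
            FileData::Mapped(map) => &map[..],
        }
    }

    /// True when the contents live in a mapping rather than on the heap.
    pub fn is_mapped(&self) -> bool {
        matches!(self, FileData::Mapped(_))
    }
}
```

With this shape there is no unused empty `Vec` in the mapped case, and a caller cannot construct a value where both stores are populated.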
Benefits
- Performance: Avoids large heap allocations and copies for files above the threshold
- Scalability: Enables analysis of multi-gigabyte files without loading them fully into RAM
- Efficiency: OS handles paging and caching automatically
- User Experience: Faster startup and lower memory footprint
- Architecture: Clean abstraction that doesn't affect existing parser APIs
Testing Requirements
- Unit Tests:
  - Test small file reading (< threshold)
  - Test large file reading (> threshold)
  - Test empty files and edge cases
  - Verify correct byte content for both paths
- Integration Tests:
  - Test with real binary files (ELF, PE, Mach-O)
  - Verify parser compatibility with memory-mapped data
  - Test error handling for invalid/missing files
- Benchmarks:
  - Compare performance vs. std::fs::read() for various file sizes
  - Measure memory usage with large files
  - Use criterion (already in dev-dependencies)
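For the byte-content checks in the unit tests above, a deterministic fixture generator may be useful, since it lets a test verify a file of any size without keeping a second copy in memory. The helpers below are a sketch using only the standard library; `pattern_byte`, `write_fixture`, and `matches_pattern` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Byte expected at a given offset. The period 251 is prime, so the
/// pattern does not line up with page or block boundaries.
pub fn pattern_byte(offset: usize) -> u8 {
    (offset % 251) as u8
}

/// Write a fixture file of `len` bytes filled with the pattern and
/// return its path.
pub fn write_fixture(dir: &Path, name: &str, len: usize) -> io::Result<PathBuf> {
    let path = dir.join(name);
    let bytes: Vec<u8> = (0..len).map(pattern_byte).collect();
    fs::write(&path, bytes)?;
    Ok(path)
}

/// Check that `data` has the expected length and follows the pattern.
pub fn matches_pattern(data: &[u8], expected_len: usize) -> bool {
    data.len() == expected_len
        && data.iter().enumerate().all(|(i, &b)| b == pattern_byte(i))
}
```

A test could write one fixture just below and one just above the threshold, open each through the file reader, and assert `matches_pattern` on the returned slice. If the threshold is configurable, both paths can be covered with files of a few kilobytes.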
Acceptance Criteria
- memmap2 dependency added to Cargo.toml
Related
- Part of milestone v0.1 (Binary Analyzer MVP)
- Foundation for extraction pipeline implementation
- Enables efficient processing of large malware samples and system binaries
Task-ID
stringy-analyzer/memory-mapping-support