Overview
Implement comprehensive file path pattern matching and classification for the StringyMcStringFace binary analyzer. This feature will enable automatic detection and tagging of file system paths across different operating systems, as well as Windows registry paths.
Background
The classification system (documented in docs/src/classification.md) defines patterns for file paths but lacks the actual implementation in src/classification/mod.rs. This feature is critical for malware analysis, as file paths often indicate:
- Persistence mechanisms (startup folders, system directories)
- Data exfiltration targets
- Configuration file locations
- Temporary file usage patterns
Technical Requirements
1. POSIX File Path Detection
Pattern: /[^\0\n\r]*
Characteristics:
- Must start with forward slash /
- May contain multiple path components separated by /
- Valid characters: alphanumeric, dots, underscores, hyphens, spaces
- Should handle both absolute and relative paths
- Common prefixes: /usr/, /etc/, /var/, /home/, /tmp/
Examples:
/usr/bin/malware
/etc/passwd
/var/log/system.log
/home/user/.config/app
Validation Rules:
- Minimum length: 2 characters (e.g., /a)
- No null bytes, carriage returns, or line feeds
- Path components should not be empty (no //)
- Should detect common system directories for confidence boosting
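The validation rules above can also be checked without a regex. The following is a minimal sketch using only the standard library; the function names is_plausible_posix_path and has_common_posix_prefix are hypothetical and not part of the existing codebase.

```rust
/// Hypothetical helper: applies the POSIX validation rules listed above.
pub fn is_plausible_posix_path(text: &str) -> bool {
    // Must start with a forward slash and be at least 2 characters long ("/a").
    if !text.starts_with('/') || text.chars().count() < 2 {
        return false;
    }
    // No null bytes, carriage returns, or line feeds.
    if text.contains(|c: char| matches!(c, '\0' | '\n' | '\r')) {
        return false;
    }
    // No empty path components, i.e. no "//" anywhere in the string.
    !text.contains("//")
}

/// Hypothetical helper: true when the path starts with one of the common
/// prefixes listed above, which a classifier could use to boost confidence.
pub fn has_common_posix_prefix(text: &str) -> bool {
    const PREFIXES: [&str; 5] = ["/usr/", "/etc/", "/var/", "/home/", "/tmp/"];
    PREFIXES.iter().any(|prefix| text.starts_with(*prefix))
}
```

Keeping the prefix check separate from the structural check lets the structural result decide whether a string is tagged at all, while the prefix result only influences confidence.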
2. Windows File Path Detection
Pattern: [A-Za-z]:\\[^\0\n\r]*
Characteristics:
- Must start with drive letter followed by :\
- Backslashes as path separators
- Case-insensitive drive letters
- May contain spaces in path components
- Common prefixes: C:\Windows, C:\Program Files, C:\Users
Examples:
C:\Windows\System32\evil.dll
D:\Data\config.ini
C:\Program Files (x86)\App\binary.exe
UNC Path Support (stretch goal):
- Pattern: \\\\[a-zA-Z0-9.-]+\\[^\0\n\r]*
- Example: \\server\share\file.txt
Validation Rules:
- Minimum length: 3 characters (e.g., C:\)
- Drive letter must be A-Z (case insensitive)
- Must use backslashes, not forward slashes
- Should detect common system directories
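As a rough illustration of the Windows rules, the sketch below checks the drive-letter prefix and the UNC shape with standard-library string operations only. Both function names are hypothetical, and the UNC check is slightly stricter than the pattern above in that it rejects forbidden characters anywhere after the server name.

```rust
/// Hypothetical helper: drive letter, colon, backslash, and no NUL/CR/LF.
pub fn is_plausible_windows_path(text: &str) -> bool {
    let bytes = text.as_bytes();
    // Minimum length is 3 ("C:\"): drive letter, colon, backslash.
    if bytes.len() < 3 {
        return false;
    }
    // Drive letter may be upper or lower case; the separator must be a backslash.
    let has_drive_prefix =
        bytes[0].is_ascii_alphabetic() && bytes[1] == b':' && bytes[2] == b'\\';
    // Reject null bytes, carriage returns, and line feeds anywhere.
    let has_forbidden = bytes.iter().any(|&b| matches!(b, 0 | b'\n' | b'\r'));
    has_drive_prefix && !has_forbidden
}

/// Hypothetical helper for the UNC stretch goal: \\server\share\...
pub fn is_plausible_unc_path(text: &str) -> bool {
    // Strip the leading double backslash.
    let Some(rest) = text.strip_prefix("\\\\") else {
        return false;
    };
    // Split into the server name and everything after the next backslash.
    let Some((server, tail)) = rest.split_once('\\') else {
        return false;
    };
    // Server name: one or more of [a-zA-Z0-9.-], as in the pattern above.
    let server_ok = !server.is_empty()
        && server
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
    server_ok && !tail.contains(|c: char| matches!(c, '\0' | '\n' | '\r'))
}
```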
3. Windows Registry Path Detection
Pattern: HKEY_[A-Z_]+\\[^\0\n\r]*
Root Keys:
HKEY_LOCAL_MACHINE (HKLM)
HKEY_CURRENT_USER (HKCU)
HKEY_CLASSES_ROOT (HKCR)
HKEY_USERS (HKU)
HKEY_CURRENT_CONFIG (HKCC)
Examples:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run
HKEY_CURRENT_USER\Software\App\Settings
HKLM\System\CurrentControlSet\Services (abbreviated form)
Validation Rules:
- Must start with valid HKEY root
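- Support both full names and abbreviations (HKLM, HKCU, etc.); the HKEY_[A-Z_]+ pattern above matches only the full names, so abbreviations need their own alternative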
- Backslash path separator
- Keys associated with persistence should boost confidence score
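One possible way to support both root-key spellings is a lookup table that maps each abbreviation to its full name. The sketch below assumes this approach; split_registry_root and is_persistence_key are hypothetical names, and the persistence list only covers the two keys that appear in the examples above.

```rust
/// Root keys listed above, as (full name, abbreviation) pairs.
static ROOT_KEYS: [(&str, &str); 5] = [
    ("HKEY_LOCAL_MACHINE", "HKLM"),
    ("HKEY_CURRENT_USER", "HKCU"),
    ("HKEY_CLASSES_ROOT", "HKCR"),
    ("HKEY_USERS", "HKU"),
    ("HKEY_CURRENT_CONFIG", "HKCC"),
];

/// Hypothetical helper: splits "ROOT\subkey" and returns the canonical
/// (full) root name plus the subkey, or None if the root is not recognized.
pub fn split_registry_root(text: &str) -> Option<(&'static str, &str)> {
    // A backslash separator is required after the root key.
    let (root, subkey) = text.split_once('\\')?;
    ROOT_KEYS
        .iter()
        .find(|(full, abbr)| root == *full || root == *abbr)
        .map(|(full, _)| (*full, subkey))
}

/// Hypothetical helper: true for subkeys commonly tied to persistence.
/// The list is illustrative only and drawn from the examples above.
pub fn is_persistence_key(subkey: &str) -> bool {
    // Registry key names are case-insensitive, so compare in lower case.
    let lower = subkey.to_ascii_lowercase();
    lower.contains("\\currentversion\\run") || lower.contains("currentcontrolset\\services")
}
```

Normalizing to the full root name means downstream code only has to handle one spelling per root.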
Implementation Structure
Proposed Module Structure
// src/classification/mod.rs
mod file_path;
mod registry;
mod url;
mod network;
// ... other classifiers

pub use file_path::FilePathClassifier;
pub use registry::RegistryPathClassifier;

use crate::types::{ClassificationResult, StringContext};

// `Result` is assumed to be a crate-level alias whose error type
// converts from regex::Error; it is not defined in this snippet.
pub struct SemanticClassifier {
    file_path: FilePathClassifier,
    registry: RegistryPathClassifier,
    // ... other classifiers
}

impl SemanticClassifier {
    pub fn new() -> Result<Self> {
        Ok(Self {
            file_path: FilePathClassifier::new()?,
            registry: RegistryPathClassifier::new()?,
        })
    }

    /// Runs every classifier and collects all matching results.
    pub fn classify(&self, text: &str, context: &StringContext) -> Vec<ClassificationResult> {
        let mut results = Vec::new();
        // Try file path classification
        if let Some(result) = self.file_path.classify(text, context) {
            results.push(result);
        }
        // Try registry path classification
        if let Some(result) = self.registry.classify(text, context) {
            results.push(result);
        }
        results
    }
}
File Path Classifier
// src/classification/file_path.rs
use std::collections::HashSet;

use regex::Regex;

use crate::types::{Tag, ClassificationResult, StringContext, BinaryFormat};

pub struct FilePathClassifier {
    posix_regex: Regex,
    windows_regex: Regex,
    unc_regex: Regex,
    suspicious_posix_paths: HashSet<&'static str>,
    suspicious_windows_paths: HashSet<&'static str>,
}

// classify_windows, is_suspicious_posix_path, and the init_suspicious_*
// helpers are referenced below but not shown in this snippet.
impl FilePathClassifier {
    pub fn new() -> Result<Self> {
        // NUL is written as \x00 because the regex crate rejects \0
        // (it is parsed as octal/backreference syntax, disabled by default).
        Ok(Self {
            posix_regex: Regex::new(r"^/[^\x00\n\r]*")?,
            windows_regex: Regex::new(r"^[A-Za-z]:\\[^\x00\n\r]*")?,
            unc_regex: Regex::new(r"^\\\\[a-zA-Z0-9.-]+\\[^\x00\n\r]*")?,
            suspicious_posix_paths: Self::init_suspicious_posix_paths(),
            suspicious_windows_paths: Self::init_suspicious_windows_paths(),
        })
    }

    /// Returns the first matching classification, trying POSIX before Windows.
    pub fn classify(&self, text: &str, context: &StringContext) -> Option<ClassificationResult> {
        // Try POSIX first
        if let Some(result) = self.classify_posix(text, context) {
            return Some(result);
        }
        // Try Windows
        if let Some(result) = self.classify_windows(text, context) {
            return Some(result);
        }
        None
    }

    fn classify_posix(&self, text: &str, context: &StringContext) -> Option<ClassificationResult> {
        if !self.posix_regex.is_match(text) {
            return None;
        }
        let confidence = self.calculate_posix_confidence(text, context);
        Some(ClassificationResult {
            tag: Tag::FilePath,
            confidence,
            evidence: vec!["POSIX path pattern".to_string()],
        })
    }

    fn calculate_posix_confidence(&self, text: &str, context: &StringContext) -> f32 {
        // Explicit f32 so that the .min() call below resolves to a concrete type.
        let mut confidence: f32 = 0.6; // Base confidence
        // Boost for known system directories
        if text.starts_with("/usr/") || text.starts_with("/etc/") {
            confidence += 0.2;
        }
        // Boost for suspicious persistence locations
        if self.is_suspicious_posix_path(text) {
            confidence += 0.15;
        }
        // Context-based boosting
        if matches!(context.binary_format, BinaryFormat::Elf | BinaryFormat::MachO) {
            confidence += 0.1;
        }
        // Clamp so that stacked boosts never exceed 1.0
        confidence.min(1.0)
    }
}
Confidence Scoring Criteria
High Confidence (0.8-1.0)
- Matches pattern AND starts with known system directory
- Found in appropriate binary format (POSIX paths in ELF/Mach-O, Windows paths in PE)
- Contains file extension matching context
- Part of known persistence location
Medium Confidence (0.5-0.8)
- Matches pattern with valid structure
- Contains multiple path components
- Has reasonable length (not too short)
Low Confidence (0.3-0.5)
- Matches pattern but very short
- Found in unexpected context
- Potential false positive (e.g., C:\x could be format string)
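The three bands above can be expressed as a small mapping from score to band. This is only a sketch: ConfidenceBand and band_for are hypothetical names, a score that sits exactly on a shared boundary (0.8 or 0.5) is assumed to belong to the higher band, and scores under 0.3 are assumed to fall below the reporting threshold.

```rust
/// Hypothetical enum mirroring the confidence bands described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceBand {
    High,
    Medium,
    Low,
    /// Assumption: scores under 0.3 are not covered by any band above.
    BelowThreshold,
}

/// Maps a confidence score to its band. Boundary values go to the
/// higher band (an assumption, since the ranges above overlap).
pub fn band_for(score: f32) -> ConfidenceBand {
    if score >= 0.8 {
        ConfidenceBand::High
    } else if score >= 0.5 {
        ConfidenceBand::Medium
    } else if score >= 0.3 {
        ConfidenceBand::Low
    } else {
        ConfidenceBand::BelowThreshold
    }
}
```

For instance, with the boosts used in calculate_posix_confidence, /etc/passwd found in an ELF binary scores 0.6 + 0.2 + 0.1 = 0.9 and lands in the High band, while a POSIX path with no boosts stays at the 0.6 base and lands in Medium.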
Acceptance Criteria
- FilePathClassifier struct implemented with regex patterns
Test Cases
POSIX Tests
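// is_posix_path is assumed to be a helper on FilePathClassifier.
#[test]
fn test_posix_absolute_path() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_posix_path("/usr/bin/bash"));
    assert!(classifier.is_posix_path("/etc/passwd"));
}

#[test]
fn test_posix_home_directory() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_posix_path("/home/user/.bashrc"));
}

#[test]
fn test_posix_with_spaces() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_posix_path("/Users/John Doe/Documents/file.txt"));
}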
Windows Tests
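// is_windows_path is assumed to be a helper on FilePathClassifier.
#[test]
fn test_windows_absolute_path() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_windows_path("C:\\Windows\\System32\\cmd.exe"));
}

#[test]
fn test_windows_program_files() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_windows_path("C:\\Program Files (x86)\\App"));
}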
Registry Tests
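// is_registry_path is assumed to be a helper on RegistryPathClassifier.
#[test]
fn test_registry_run_key() {
    let classifier = RegistryPathClassifier::new().unwrap();
    assert!(classifier.is_registry_path("HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run"));
}

#[test]
fn test_registry_abbreviated() {
    let classifier = RegistryPathClassifier::new().unwrap();
    assert!(classifier.is_registry_path("HKLM\\System\\CurrentControlSet"));
}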
Dependencies
- SemanticClassifier structure is now implemented with IPv4/IPv6 and URL detection
Security Implications
File paths in binaries often indicate:
- Persistence mechanisms: Startup folders, cron jobs, LaunchDaemons
- Data exfiltration: Sensitive file locations
- Configuration: Where malware stores settings
- Logs: Forensic artifacts
High-confidence classification enables analysts to quickly identify these indicators.
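To make the persistence category concrete, the sketch below flags a few well-known locations for the mechanisms named above (cron jobs, LaunchDaemons, startup folders). The function name persistence_hint and the specific location list are illustrative assumptions, not an exhaustive or authoritative set.

```rust
/// Hypothetical helper: returns a short label when the path points at a
/// commonly abused persistence location, or None otherwise.
pub fn persistence_hint(path: &str) -> Option<&'static str> {
    // cron configuration and spool directories on Unix-like systems.
    if path.starts_with("/etc/cron") || path.starts_with("/var/spool/cron") {
        return Some("cron job");
    }
    // macOS launchd daemons (system-wide or per-volume).
    if path.contains("/LaunchDaemons/") {
        return Some("LaunchDaemon");
    }
    // Windows Startup folder; Windows paths are case-insensitive.
    if path
        .to_ascii_lowercase()
        .contains("\\start menu\\programs\\startup")
    {
        return Some("startup folder");
    }
    None
}
```

A label like this could be appended to the evidence list of a ClassificationResult so the analyst sees why a path was scored higher.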
Performance Considerations
- Compile regexes once at initialization (lazy_static or once_cell)
- Short-circuit on first match to avoid unnecessary regex evaluations
- Consider string length pre-filtering before regex matching
- Profile regex performance with criterion benchmarks
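The first three points can be sketched with the standard library alone. The example below swaps in std::sync::OnceLock (stable since Rust 1.70) as an alternative to lazy_static or once_cell, and uses a prefix table as a stand-in for the compiled regexes so that it stays dependency-free. The names known_prefixes, passes_length_prefilter, and starts_with_known_prefix, as well as the MAX_PATH_LEN bound, are assumptions made for illustration.

```rust
use std::sync::OnceLock;

/// Assumed bounds for the length pre-filter; 2 matches the POSIX minimum
/// above, while the upper bound is an arbitrary illustrative choice.
const MIN_PATH_LEN: usize = 2;
const MAX_PATH_LEN: usize = 4096;

/// Initialized on first use and then shared; a real classifier would
/// store its compiled regexes here instead of a prefix table.
static KNOWN_PREFIXES: OnceLock<Vec<&'static str>> = OnceLock::new();

pub fn known_prefixes() -> &'static [&'static str] {
    KNOWN_PREFIXES
        .get_or_init(|| vec!["/usr/", "/etc/", "/var/", "/home/", "/tmp/"])
        .as_slice()
}

/// Cheap length check that runs before any pattern matching.
pub fn passes_length_prefilter(text: &str) -> bool {
    (MIN_PATH_LEN..=MAX_PATH_LEN).contains(&text.len())
}

/// Short-circuits twice: the pre-filter rejects out-of-range strings
/// immediately, and `any` stops at the first matching prefix.
pub fn starts_with_known_prefix(text: &str) -> bool {
    passes_length_prefilter(text)
        && known_prefixes().iter().any(|prefix| text.starts_with(*prefix))
}
```

Because get_or_init runs its closure at most once, repeated calls to known_prefixes return the same allocation, which is the property the "compile once" recommendation relies on.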
Related Documentation
- Classification System: docs/src/classification.md
- Type Definitions: src/types.rs
- Examples: Requirements 3.4, 3.5
Task ID
stringy-analyzer/file-path-classification