Implement File Path Classification for POSIX, Windows, and Registry Paths #17

@unclesp1d3r

Overview

Implement comprehensive file path pattern matching and classification for the StringyMcStringFace binary analyzer. The feature will automatically detect and tag POSIX and Windows file system paths as well as Windows registry paths.

Background

The classification system (documented in docs/src/classification.md) defines patterns for file paths but lacks the actual implementation in src/classification/mod.rs. This feature is critical for malware analysis, as file paths often indicate:

  • Persistence mechanisms (startup folders, system directories)
  • Data exfiltration targets
  • Configuration file locations
  • Temporary file usage patterns

Technical Requirements

1. POSIX File Path Detection

Pattern: /[^\0\n\r]*

Characteristics:

  • Must start with forward slash /
  • May contain multiple path components separated by /
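  • Typical characters: alphanumerics, dots, underscores, hyphens, spaces (the pattern itself accepts any character except NUL, CR, and LF)
  • Should handle both absolute and relative paths; the pattern above matches only absolute paths, so relative paths (e.g., ./bin/tool) would need a separate pattern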
  • Common prefixes: /usr/, /etc/, /var/, /home/, /tmp/

Examples:

  • /usr/bin/malware
  • /etc/passwd
  • /var/log/system.log
  • /home/user/.config/app

Validation Rules:

  • Minimum length: 2 characters (e.g., /a)
  • No null bytes, carriage returns, or line feeds
  • Path components should not be empty (no //)
  • Should detect common system directories for confidence boosting
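
The validation rules above can be sketched without any dependencies. This is a minimal illustration, not the proposed implementation: the function names is_valid_posix_path and has_common_posix_prefix are hypothetical, and a hand-written check stands in for the regex.

```rust
/// Hypothetical helper: applies the POSIX validation rules to a candidate string.
pub fn is_valid_posix_path(text: &str) -> bool {
    // Must start with '/' and be at least 2 bytes long (e.g. "/a").
    if !text.starts_with('/') || text.len() < 2 {
        return false;
    }
    // No NUL bytes, carriage returns, or line feeds anywhere in the string.
    if text.chars().any(|c| matches!(c, '\0' | '\n' | '\r')) {
        return false;
    }
    // No empty path components ("//").
    !text.contains("//")
}

/// Hypothetical helper: true when the path starts with one of the common
/// prefixes listed in this issue, which a caller could use to boost confidence.
pub fn has_common_posix_prefix(text: &str) -> bool {
    const PREFIXES: [&str; 5] = ["/usr/", "/etc/", "/var/", "/home/", "/tmp/"];
    PREFIXES.iter().any(|p| text.starts_with(p))
}
```

Keeping the prefix check separate from validation lets the classifier reject malformed candidates first and only then decide how much confidence to assign.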

2. Windows File Path Detection

Pattern: [A-Za-z]:\\[^\0\n\r]*

Characteristics:

  • Must start with drive letter followed by :\
  • Backslashes as path separators
  • Case-insensitive drive letters
  • May contain spaces in path components
  • Common prefixes: C:\Windows, C:\Program Files, C:\Users

Examples:

  • C:\Windows\System32\evil.dll
  • D:\Data\config.ini
  • C:\Program Files (x86)\App\binary.exe

UNC Path Support (stretch goal):

  • Pattern: \\\\[a-zA-Z0-9.-]+\\[^\0\n\r]*
  • Example: \\server\share\file.txt

Validation Rules:

  • Minimum length: 3 characters (e.g., C:\)
  • Drive letter must be A-Z (case insensitive)
  • Must use backslashes, not forward slashes
  • Should detect common system directories
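
A dependency-free sketch of the drive-letter and UNC rules follows. The WindowsPathKind enum and classify_windows_path function are hypothetical names chosen for illustration, and the manual byte checks stand in for the regex patterns above.

```rust
/// Hypothetical type: which Windows path form was recognised.
#[derive(Debug, PartialEq, Eq)]
pub enum WindowsPathKind {
    Drive,
    Unc,
}

/// Hypothetical helper mirroring the drive-letter and UNC patterns.
pub fn classify_windows_path(text: &str) -> Option<WindowsPathKind> {
    // Reject NUL, CR, and LF anywhere in the candidate.
    if text.chars().any(|c| matches!(c, '\0' | '\n' | '\r')) {
        return None;
    }
    let bytes = text.as_bytes();
    // Drive path: ASCII letter, ':', then a backslash (minimum "C:\").
    if bytes.len() >= 3
        && bytes[0].is_ascii_alphabetic()
        && bytes[1] == b':'
        && bytes[2] == b'\\'
    {
        return Some(WindowsPathKind::Drive);
    }
    // UNC path: two backslashes, a server name of [A-Za-z0-9.-]+, a backslash, the rest.
    if let Some(rest) = text.strip_prefix("\\\\") {
        if let Some((server, _share)) = rest.split_once('\\') {
            let valid_server = !server.is_empty()
                && server
                    .bytes()
                    .all(|b| b.is_ascii_alphanumeric() || b == b'.' || b == b'-');
            if valid_server {
                return Some(WindowsPathKind::Unc);
            }
        }
    }
    None
}
```

Forward-slash forms such as C:/Windows return None here, matching the rule that only backslash separators are accepted.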

3. Windows Registry Path Detection

Pattern: (HKEY_[A-Z_]+|HKLM|HKCU|HKCR|HKU|HKCC)\\[^\0\n\r]*

Root Keys:

  • HKEY_LOCAL_MACHINE (HKLM)
  • HKEY_CURRENT_USER (HKCU)
  • HKEY_CLASSES_ROOT (HKCR)
  • HKEY_USERS (HKU)
  • HKEY_CURRENT_CONFIG (HKCC)

Examples:

  • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run
  • HKEY_CURRENT_USER\Software\App\Settings
  • HKLM\System\CurrentControlSet\Services (abbreviated form)

Validation Rules:

  • Must start with valid HKEY root
  • Support both full names and abbreviations (HKLM, HKCU, etc.)
  • Backslash path separator
  • Keys associated with persistence should boost confidence score
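
The root-key and persistence rules above might look like the following sketch. The helper names split_registry_path and is_persistence_key are hypothetical, and the list of persistence markers is an illustrative assumption drawn from the examples in this issue, not an exhaustive set.

```rust
const FULL_ROOTS: [&str; 5] = [
    "HKEY_LOCAL_MACHINE",
    "HKEY_CURRENT_USER",
    "HKEY_CLASSES_ROOT",
    "HKEY_USERS",
    "HKEY_CURRENT_CONFIG",
];
const SHORT_ROOTS: [&str; 5] = ["HKLM", "HKCU", "HKCR", "HKU", "HKCC"];

/// Hypothetical helper: splits "ROOT\sub\key" into (root, subkey) when ROOT
/// is one of the full or abbreviated root keys listed above.
pub fn split_registry_path(text: &str) -> Option<(&str, &str)> {
    if text.chars().any(|c| matches!(c, '\0' | '\n' | '\r')) {
        return None;
    }
    // A backslash separator after the root is required.
    let (root, subkey) = text.split_once('\\')?;
    if FULL_ROOTS.contains(&root) || SHORT_ROOTS.contains(&root) {
        Some((root, subkey))
    } else {
        None
    }
}

/// Hypothetical helper: true when the subkey looks like a persistence location.
/// The markers are assumptions for illustration only.
pub fn is_persistence_key(subkey: &str) -> bool {
    const MARKERS: [&str; 2] = ["\\currentversion\\run", "currentcontrolset\\services"];
    // Registry key names are case-insensitive, so compare in lowercase.
    let lower = subkey.to_ascii_lowercase();
    MARKERS.iter().any(|m| lower.contains(m))
}
```

A classifier could call split_registry_path first and add a confidence boost only when is_persistence_key returns true for the subkey.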

Implementation Structure

Proposed Module Structure

// src/classification/mod.rs

mod file_path;
mod registry;
mod url;
mod network;
// ... other classifiers

pub use file_path::FilePathClassifier;
pub use registry::RegistryPathClassifier;

use crate::types::{ClassificationResult, StringContext};

/// Runs every semantic classifier over a string and collects the matches.
pub struct SemanticClassifier {
    file_path: FilePathClassifier,
    registry: RegistryPathClassifier,
    // ... other classifiers
}

impl SemanticClassifier {
    /// Builds all classifiers up front so each regex is compiled only once.
    pub fn new() -> Result<Self, regex::Error> {
        Ok(Self {
            file_path: FilePathClassifier::new()?,
            registry: RegistryPathClassifier::new()?,
        })
    }

    pub fn classify(&self, text: &str, context: &StringContext) -> Vec<ClassificationResult> {
        let mut results = Vec::new();
        
        // Try file path classification
        if let Some(result) = self.file_path.classify(text, context) {
            results.push(result);
        }
        
        // Try registry path classification
        if let Some(result) = self.registry.classify(text, context) {
            results.push(result);
        }
        
        results
    }
}

File Path Classifier

// src/classification/file_path.rs

use std::collections::HashSet;

use regex::Regex;
use crate::types::{Tag, ClassificationResult, StringContext, BinaryFormat};

pub struct FilePathClassifier {
    posix_regex: Regex,
    windows_regex: Regex,
    unc_regex: Regex,
    suspicious_posix_paths: HashSet<&'static str>,
    suspicious_windows_paths: HashSet<&'static str>,
}

impl FilePathClassifier {
    pub fn new() -> Result<Self, regex::Error> {
        // NUL is written as \x00 because the regex crate rejects \0 unless octal
        // syntax is enabled. The trailing $ forces a whole-string match, so an
        // embedded NUL, CR, or LF rejects the candidate instead of being ignored.
        Ok(Self {
            posix_regex: Regex::new(r"^/[^\x00\n\r]*$")?,
            windows_regex: Regex::new(r"^[A-Za-z]:\\[^\x00\n\r]*$")?,
            unc_regex: Regex::new(r"^\\\\[a-zA-Z0-9.-]+\\[^\x00\n\r]*$")?,
            suspicious_posix_paths: Self::init_suspicious_posix_paths(),
            suspicious_windows_paths: Self::init_suspicious_windows_paths(),
        })
    }

    pub fn classify(&self, text: &str, context: &StringContext) -> Option<ClassificationResult> {
        // Try POSIX first
        if let Some(result) = self.classify_posix(text, context) {
            return Some(result);
        }
        
        // Try Windows
        if let Some(result) = self.classify_windows(text, context) {
            return Some(result);
        }
        
        None
    }

    fn classify_posix(&self, text: &str, context: &StringContext) -> Option<ClassificationResult> {
        if !self.posix_regex.is_match(text) {
            return None;
        }
        
        let confidence = self.calculate_posix_confidence(text, context);
        
        Some(ClassificationResult {
            tag: Tag::FilePath,
            confidence,
            evidence: vec!["POSIX path pattern".to_string()],
        })
    }

    fn calculate_posix_confidence(&self, text: &str, context: &StringContext) -> f32 {
        // Explicit f32 so the `.min(1.0)` call below resolves on a concrete type
        let mut confidence: f32 = 0.6; // Base confidence
        
        // Boost for known system directories
        if text.starts_with("/usr/") || text.starts_with("/etc/") {
            confidence += 0.2;
        }
        
        // Boost for suspicious persistence locations
        if self.is_suspicious_posix_path(text) {
            confidence += 0.15;
        }
        
        // Context-based boosting
        if matches!(context.binary_format, BinaryFormat::Elf | BinaryFormat::MachO) {
            confidence += 0.1;
        }
        
        confidence.min(1.0)
    }
}
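
The classify method above calls classify_windows, which this issue does not spell out. A possible confidence calculation for Windows paths, mirroring calculate_posix_confidence, is sketched below. The local BinaryFormat enum is a stand-in for the one in src/types.rs, the function name windows_confidence is hypothetical, and the numeric boosts and the short-path penalty are assumptions chosen to match the POSIX values.

```rust
/// Stand-in for the BinaryFormat type in src/types.rs (assumed variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinaryFormat {
    Elf,
    MachO,
    Pe,
}

/// Hypothetical counterpart to calculate_posix_confidence for Windows paths.
pub fn windows_confidence(text: &str, format: BinaryFormat) -> f32 {
    let mut confidence: f32 = 0.6; // Base confidence, same as the POSIX case

    // Boost for the common system directories listed in this issue.
    let lower = text.to_ascii_lowercase();
    let known = ["c:\\windows", "c:\\program files", "c:\\users"];
    if known.iter().any(|p| lower.starts_with(p)) {
        confidence += 0.2;
    }

    // Boost when the path appears in a PE binary.
    if format == BinaryFormat::Pe {
        confidence += 0.1;
    }

    // Penalty for very short matches such as "C:\x" (assumed threshold).
    if text.len() <= 4 {
        confidence -= 0.2;
    }

    confidence.clamp(0.0, 1.0)
}
```

Lowercasing before the prefix comparison reflects that Windows paths are case-insensitive, so c:\windows and C:\Windows receive the same boost.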

Confidence Scoring Criteria

High Confidence (0.8-1.0)

  • Matches pattern AND starts with known system directory
  • Found in appropriate binary format (POSIX paths in ELF/Mach-O, Windows paths in PE)
  • Contains file extension matching context
  • Part of known persistence location

Medium Confidence (0.5-0.8)

  • Matches pattern with valid structure
  • Contains multiple path components
  • Has reasonable length (not too short)

Low Confidence (0.3-0.5)

  • Matches pattern but very short
  • Found in unexpected context
  • Potential false positive (e.g., C:\x could be format string)
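
The three bands above can be expressed as a small mapping from score to tier. This is a sketch only: the ConfidenceTier enum and confidence_tier function are hypothetical, and because the listed ranges share their boundary values, the code assumes a boundary score belongs to the higher tier.

```rust
/// Hypothetical type naming the confidence bands from this issue.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfidenceTier {
    High,
    Medium,
    Low,
    /// Scores under 0.3 fall outside every band listed in the issue.
    BelowThreshold,
}

/// Maps a score to its band; boundary values go to the higher tier (assumption).
pub fn confidence_tier(score: f32) -> ConfidenceTier {
    if score >= 0.8 {
        ConfidenceTier::High
    } else if score >= 0.5 {
        ConfidenceTier::Medium
    } else if score >= 0.3 {
        ConfidenceTier::Low
    } else {
        ConfidenceTier::BelowThreshold
    }
}
```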

Acceptance Criteria

  • FilePathClassifier struct implemented with regex patterns
  • POSIX path detection with validation
  • Windows path detection with validation
  • Registry path detection with all root keys
  • UNC path detection (Windows network paths)
  • Confidence scoring based on context and known paths
  • Integration with SemanticClassifier
  • Comprehensive unit tests covering:
    • Valid POSIX paths
    • Valid Windows paths
    • Valid registry paths
    • Edge cases (short paths, special characters)
    • False positive prevention
    • Confidence score calculations
    • Context-aware classification
  • Documentation updates in classification.md
  • Performance benchmarks for regex matching

Test Cases

POSIX Tests

#[test]
fn test_posix_absolute_path() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_posix_path("/usr/bin/bash"));
    assert!(classifier.is_posix_path("/etc/passwd"));
}

#[test]
fn test_posix_home_directory() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_posix_path("/home/user/.bashrc"));
}

#[test]
fn test_posix_with_spaces() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_posix_path("/Users/John Doe/Documents/file.txt"));
}

Windows Tests

#[test]
fn test_windows_absolute_path() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_windows_path("C:\\Windows\\System32\\cmd.exe"));
}

#[test]
fn test_windows_program_files() {
    let classifier = FilePathClassifier::new().unwrap();
    assert!(classifier.is_windows_path("C:\\Program Files (x86)\\App"));
}

Registry Tests

#[test]
fn test_registry_run_key() {
    let classifier = RegistryPathClassifier::new().unwrap();
    assert!(classifier.is_registry_path("HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run"));
}

#[test]
fn test_registry_abbreviated() {
    let classifier = RegistryPathClassifier::new().unwrap();
    assert!(classifier.is_registry_path("HKLM\\System\\CurrentControlSet"));
}

Dependencies

Security Implications

File paths in binaries often indicate:

  • Persistence mechanisms: Startup folders, cron jobs, LaunchDaemons
  • Data exfiltration: Sensitive file locations
  • Configuration: Where malware stores settings
  • Logs: Forensic artifacts

High-confidence classification enables analysts to quickly identify these indicators.

Performance Considerations

  • Compile regexes once at initialization (lazy_static or once_cell)
  • Short-circuit on first match to avoid unnecessary regex evaluations
  • Consider string length pre-filtering before regex matching
  • Profile regex performance with criterion benchmarks
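
The first and third points can be illustrated with the standard library alone. In this sketch std::sync::OnceLock plays the role that lazy_static or once_cell would play for a compiled Regex: the value is built on first use and shared afterwards. The prefix table, the MAX_PATH_LEN limit, and the function names are assumptions made for illustration.

```rust
use std::sync::OnceLock;

/// Assumed upper bound for pre-filtering; tune against real samples.
const MAX_PATH_LEN: usize = 4096;

/// Built once on first call, then reused. A compiled Regex could be stored
/// in a OnceLock in the same way.
fn known_posix_prefixes() -> &'static Vec<&'static str> {
    static PREFIXES: OnceLock<Vec<&'static str>> = OnceLock::new();
    PREFIXES.get_or_init(|| vec!["/usr/", "/etc/", "/var/", "/home/", "/tmp/"])
}

/// Cheap length check that runs before any pattern matching.
pub fn quick_reject(text: &str) -> bool {
    text.len() < 2 || text.len() > MAX_PATH_LEN
}

/// Hypothetical helper combining the pre-filter with the shared prefix table.
pub fn starts_with_known_prefix(text: &str) -> bool {
    if quick_reject(text) {
        return false;
    }
    known_posix_prefixes().iter().any(|p| text.starts_with(p))
}
```

Because get_or_init returns a reference to the same stored value on every call, repeated classification does not rebuild the table.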

Related Documentation

  • Classification System: docs/src/classification.md
  • Type Definitions: src/types.rs
  • Examples: Requirements 3.4, 3.5

Task ID

stringy-analyzer/file-path-classification
