Performance: Implement Regex Compilation Caching for Semantic Classifier #33

@unclesp1d3r

Background

StringyMcStringFace's semantic classifier needs to identify and tag various string patterns extracted from binaries, including:

  • URLs and domains
  • IP addresses (IPv4/IPv6)
  • File paths (POSIX/Windows)
  • Registry keys
  • GUIDs
  • Email addresses
  • JWT tokens
  • Base64 sequences
  • Format strings (printf-style)
  • User agent strings

Each classification requires regex pattern matching, and the classifier will process thousands of strings per binary. Without caching, regex compilation overhead becomes a significant performance bottleneck.

Problem Statement

Currently, the classification module (src/classification/mod.rs) is minimal. When implemented, it will need to compile multiple complex regex patterns. Compiling regex patterns on-demand for each string classification would:

  • Add significant CPU overhead (regex compilation is expensive)
  • Create unnecessary memory allocations
  • Slow down the overall analysis pipeline
  • Make the tool less suitable for batch processing

Proposed Solution

Implement a regex caching strategy using lazy initialization:

1. Add Dependencies

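Add regex and once_cell to Cargo.toml. once_cell can be dropped in favor of std::sync::LazyLock, which is stable since Rust 1.80 and does not require the 2024 edition:

[dependencies]
regex = "1.11"
once_cell = "1.20"  # omit if using std::sync::LazyLock (Rust 1.80+)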

2. Implement Lazy-Initialized Regex Cache

Create a module with static, lazily-initialized regex patterns:

use once_cell::sync::Lazy;
use regex::Regex;

pub struct PatternCache {
    pub url: &'static Lazy<Regex>,
    pub domain: &'static Lazy<Regex>,
    pub ipv4: &'static Lazy<Regex>,
    pub ipv6: &'static Lazy<Regex>,
    pub filepath_posix: &'static Lazy<Regex>,
    pub filepath_windows: &'static Lazy<Regex>,
    pub registry_key: &'static Lazy<Regex>,
    pub guid: &'static Lazy<Regex>,
    pub email: &'static Lazy<Regex>,
    pub jwt: &'static Lazy<Regex>,
    pub base64: &'static Lazy<Regex>,
    pub format_string: &'static Lazy<Regex>,
    pub user_agent: &'static Lazy<Regex>,
}

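// http/https scheme, a host of word characters, dots, and hyphens,
// then an optional path and query string.
static URL_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"https?://[\w.-]+(?:/[\w./?%&=-]*)?").expect("URL pattern is valid")
});

// Case-insensitive 8-4-4-4-12 hex groups, optionally wrapped in braces.
static GUID_PATTERN: Lazy<Regex> = Lazy::new(|| {
    Regex::new(r"(?i)[{]?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}[}]?")
        .expect("GUID pattern is valid")
});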

// ... additional patterns
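
The compile-once behavior can also be sketched with std::sync::LazyLock and no external crates. In the sketch below, `CompiledPattern`, `compile`, `COMPILE_COUNT`, `HTTP_PREFIX`, `count_matches`, and `compile_count` are all hypothetical stand-ins: a literal prefix check plays the role of `Regex::new` and `is_match`, and the counter exists only to make the number of initializations observable. It is an illustration of the lazy-static pattern, not the classifier's actual code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the stand-in compile step has run.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for `regex::Regex`: matches on a literal prefix.
pub struct CompiledPattern {
    prefix: &'static str,
}

impl CompiledPattern {
    /// Stand-in for `Regex::new`; bumps the counter on every call.
    fn compile(prefix: &'static str) -> Self {
        COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
        CompiledPattern { prefix }
    }

    /// Stand-in for `Regex::is_match`.
    pub fn is_match(&self, text: &str) -> bool {
        text.starts_with(self.prefix)
    }
}

/// Initialized on first access, then shared by every later caller.
static HTTP_PREFIX: LazyLock<CompiledPattern> =
    LazyLock::new(|| CompiledPattern::compile("http://"));

/// Checks every input against the cached pattern.
pub fn count_matches(inputs: &[&str]) -> usize {
    inputs.iter().filter(|&&s| HTTP_PREFIX.is_match(s)).count()
}

/// Number of times the pattern has been compiled so far.
pub fn compile_count() -> usize {
    COMPILE_COUNT.load(Ordering::SeqCst)
}
```

However many strings pass through `count_matches`, the initializer closure runs a single time; swapping the stand-in for `Regex` gives the same guarantee for real patterns.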

3. Use Cached Patterns in Classifier

The semantic classifier should reference these static patterns:

pub fn classify_string(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    
    if URL_PATTERN.is_match(text) {
        tags.push(Tag::Url);
    }
    if GUID_PATTERN.is_match(text) {
        tags.push(Tag::Guid);
    }
    // ... additional classifications
    
    tags
}

4. Consider RegexSet for Multiple Patterns

When many patterns are checked against the same text, consider RegexSet, which tests all of its patterns in a single pass over the input and reports which ones matched:

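use once_cell::sync::Lazy;
use regex::RegexSet;

// Pattern order matters: matches are reported as indices into this list.
// The patterns mirror the individual statics so both paths classify alike.
static PATTERN_SET: Lazy<RegexSet> = Lazy::new(|| {
    RegexSet::new(&[
        r"https?://[\w.-]+(?:/[\w./?%&=-]*)?",  // 0: URL
        r"(?i)[{]?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}[}]?",  // 1: GUID
        // ... more patterns
    ]).unwrap()
});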
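
Because a RegexSet reports matches as pattern indices rather than tags, the classifier needs a table that maps each index back to its tag. A minimal std-only sketch of that mapping follows. The `Tag` enum, `TAG_BY_INDEX`, and `tags_for_matches` are assumptions for illustration (the issue uses `Tag::Url` and `Tag::Guid` without defining the type); the indices are the kind of values produced by iterating over `PATTERN_SET.matches(text)`.

```rust
/// Hypothetical tag type; only the two variants used in this issue are shown.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    Url,
    Guid,
}

/// Tags listed in the same order as the patterns passed to `RegexSet::new`.
pub const TAG_BY_INDEX: [Tag; 2] = [Tag::Url, Tag::Guid];

/// Converts matched pattern indices into tags, ignoring unknown indices.
pub fn tags_for_matches(indices: impl IntoIterator<Item = usize>) -> Vec<Tag> {
    indices
        .into_iter()
        .filter_map(|i| TAG_BY_INDEX.get(i).copied())
        .collect()
}
```

Keeping the tag table next to the pattern list makes it harder for the two to drift out of order when patterns are added.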

Implementation Tasks

  • Add regex dependency to Cargo.toml
  • Add once_cell or use std::sync::LazyLock for lazy initialization
  • Create src/classification/patterns.rs with cached regex patterns
  • Implement pattern definitions for all semantic tags
  • Create classify_string() function using cached patterns
  • Add unit tests for each pattern
  • Create benchmark suite to measure performance improvement
  • Document pattern syntax and matching behavior

Success Criteria

  • Each regex pattern is compiled exactly once, on first use
  • No regex compilation on the classification path after that first use
  • Measurable performance improvement over per-call compilation (target: >90% reduction in classification time)
  • Comprehensive test coverage for all patterns
  • Benchmarks demonstrating caching effectiveness

Performance Benchmarks

Create benchmarks in benches/regex_caching.rs to measure:

  1. Compilation overhead: Time to compile patterns with/without caching
  2. Classification throughput: Strings classified per second
  3. Memory usage: Compare cached vs. on-demand compilation
  4. Batch processing: Time to classify 10k, 100k, 1M strings

Expected improvement: 10-100x faster classification depending on pattern complexity and string volume.
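
The throughput and batch measurements could start from a small timing helper like the one below. This is a rough std-only sketch, not the proposed benches/regex_caching.rs: `time_batch` and `strings_per_second` are hypothetical names, and a dedicated benchmark harness would give more reliable numbers than a single wall-clock measurement.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `classify` over every input and returns (match count, elapsed time).
pub fn time_batch<F>(inputs: &[&str], mut classify: F) -> (usize, Duration)
where
    F: FnMut(&str) -> bool,
{
    let start = Instant::now();
    let mut hits = 0;
    for &s in inputs {
        // black_box keeps the optimizer from discarding the work being timed.
        if black_box(classify(black_box(s))) {
            hits += 1;
        }
    }
    (hits, start.elapsed())
}

/// Converts a batch measurement into strings classified per second.
pub fn strings_per_second(count: usize, elapsed: Duration) -> f64 {
    count as f64 / elapsed.as_secs_f64()
}
```

Passing a closure that calls a cached static (for example `|s| URL_PATTERN.is_match(s)`) and another that calls `Regex::new` on every string would give the cached and uncached numbers to compare.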

Related

  • Part of v0.1 MVP milestone
  • Blocks efficient implementation of semantic classification (requirement 8.3)
  • Required for achieving acceptable performance on large binaries
