Overview
Implement semantic pattern matching for the remaining high-value string classifications: GUID, email addresses, Base64 data, printf-style format strings, and user agent strings. These patterns are critical for identifying security indicators and code artifacts in binary analysis.
Current State
- Tag Enum: Already defined in src/types.rs with Guid, Email, Base64, FormatString, and UserAgent variants
- Classification Module: Empty stub at src/classification/mod.rs (only contains comment)
- Documentation: Comprehensive patterns and implementation examples exist in docs/src/classification.md
- Dependency Gap: Missing regex crate for pattern matching
- Blocker: Depends on Semantic Classification Framework implementation
Technical Requirements
Dependencies to Add
Add to Cargo.toml:
[dependencies]
regex = "1.10"
lazy_static = "1.4" # For regex compilation caching
Implementation Details
Create a SemanticClassifier struct in src/classification/mod.rs:
use regex::Regex;
use lazy_static::lazy_static;
use crate::types::{Tag, FoundString, StringContext};

pub struct SemanticClassifier {
    guid_regex: Regex,
    email_regex: Regex,
    base64_regex: Regex,
    format_regex: Regex,
    user_agent_regex: Regex,
}

impl SemanticClassifier {
    pub fn new() -> Self {
        // Initialize with compiled regex patterns
        todo!("compile the five patterns listed under Regex Patterns")
    }

    pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
        // Pattern matching logic
        todo!("run each pattern against `text` and collect the matching tags")
    }
}
Regex Patterns
Based on docs/src/classification.md:
- GUID/UUID
  - Pattern: \{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}
  - Example: {12345678-1234-1234-1234-123456789abc}
  - Validation: Format compliance, version field checking
- Email Address
  - Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  - Example: admin@malware.com
  - Validation: RFC compliance, domain validation
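- Base64
  - Pattern: [A-Za-z0-9+/]{20,}={0,2}
  - Example: SGVsbG8gV29ybGQ= (only 15 characters before the padding, so it falls below the pattern's 20-character minimum and would not match as written; either the example or the threshold needs adjusting)
  - Validation: Length divisibility by 4, padding correctness, minimum length threshold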
- Printf Format Strings
  - Pattern: %[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}
  - Examples: "Error: %s at line %d" and "User {0} logged in"
  - Context: Proximity to other format strings, common in .rodata
- User Agent
  - Pattern: Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+|AppleWebKit/[0-9.]+
  - Example: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
  - Validation: Known browser identifiers, version format
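The Validation bullets above describe checks that the regexes alone do not perform. As a minimal sketch, assuming validation runs as a second pass over strings that already matched a pattern, the GUID and Base64 checks might look like the following. The names is_braced_guid, guid_version, is_plausible_base64, and MIN_BASE64_BODY are hypothetical and not part of the existing codebase, and the sketch uses only the standard library so it can be read independently of the regex crate.

```rust
/// Assumed minimum number of Base64 alphabet characters before any padding,
/// mirroring the `{20,}` bound in the pattern.
const MIN_BASE64_BODY: usize = 20;

/// Returns true if `s` is a brace-wrapped GUID such as
/// `{12345678-1234-1234-1234-123456789abc}` (8-4-4-4-12 hex digits).
fn is_braced_guid(s: &str) -> bool {
    let bytes = s.as_bytes();
    // 36 characters for the GUID body plus the two braces.
    if bytes.len() != 38 || bytes[0] != b'{' || bytes[37] != b'}' {
        return false;
    }
    bytes[1..37].iter().enumerate().all(|(i, &b)| match i {
        // Hyphens sit at these offsets within the 36-character body.
        8 | 13 | 18 | 23 => b == b'-',
        _ => b.is_ascii_hexdigit(),
    })
}

/// Extracts the version nibble (first hex digit of the third group),
/// or None if `s` is not a well-formed braced GUID.
fn guid_version(s: &str) -> Option<u32> {
    if !is_braced_guid(s) {
        return None;
    }
    // Offset 15 = opening brace + 8 hex digits + '-' + 4 hex digits + '-'.
    (s.as_bytes()[15] as char).to_digit(16)
}

/// Checks the three Base64 properties listed above: total length divisible
/// by 4, at most two trailing '=' characters, and a minimum body length.
fn is_plausible_base64(s: &str) -> bool {
    let bytes = s.as_bytes();
    if bytes.len() % 4 != 0 {
        return false;
    }
    // Padding may only appear as one or two trailing '=' characters.
    let padding = bytes.iter().rev().take_while(|&&b| b == b'=').count();
    if padding > 2 {
        return false;
    }
    let body = &bytes[..bytes.len() - padding];
    body.len() >= MIN_BASE64_BODY
        && body
            .iter()
            .all(|&b| b.is_ascii_alphanumeric() || b == b'+' || b == b'/')
}
```

A classifier could call these after a regex match and only emit the tag when the check passes. Which GUID versions to accept is left open here, since the issue does not specify it.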
Acceptance Criteria
- SemanticClassifier struct implemented with all five pattern types
Test Coverage Requirements
Unit Tests (tests/classification_tests.rs)
#[test]
fn test_guid_detection() {
    // Valid GUID formats
    // Invalid GUID formats
    // Case sensitivity
}

#[test]
fn test_email_detection() {
    // Valid emails
    // Invalid emails (missing @, invalid TLD)
}

#[test]
fn test_base64_detection() {
    // Valid Base64 (with/without padding)
    // Invalid Base64 (wrong length, invalid characters)
    // Minimum length threshold
}

#[test]
fn test_format_string_detection() {
    // Printf-style: %s, %d, %x
    // Python-style: {0}, {1}
    // Mixed format strings
}

#[test]
fn test_user_agent_detection() {
    // Common browsers
    // Mobile user agents
    // Bot user agents
}

#[test]
fn test_false_positive_reduction() {
    // High-entropy binary data
    // Very short matches
    // Invalid context
}
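To make one of the test stubs above concrete without depending on the unfinished SemanticClassifier, here is a sketch of the format string case. It uses a hand-written scanner as a stand-in for the regex %[sdxofcpn]|%\d+[sdxofcpn]|\{\d+\}, so it is not the planned implementation. count_format_specifiers is a hypothetical helper, and it accepts only ASCII digits, whereas \d in the regex crate also matches other Unicode digits by default.

```rust
/// Counts format specifiers in `text`: `%s`-style conversions with an optional
/// numeric width, and `{0}`-style positional placeholders.
/// Hypothetical stand-in for the regex-based matcher.
fn count_format_specifiers(text: &str) -> usize {
    let bytes = text.as_bytes();
    let mut count = 0;
    let mut i = 0;
    while i < bytes.len() {
        let opener = bytes[i];
        i += 1;
        if opener != b'%' && opener != b'{' {
            continue;
        }
        // Both forms are an opener, then optional digits, then a closing byte.
        let mut j = i;
        while j < bytes.len() && bytes[j].is_ascii_digit() {
            j += 1;
        }
        let digits = j - i;
        let matched = match (opener, bytes.get(j).copied()) {
            // %s, %d, %5d, ...: the conversion character must be in the set.
            (b'%', Some(c)) => b"sdxofcpn".contains(&c),
            // {0}, {12}, ...: at least one digit between the braces.
            (b'{', Some(b'}')) => digits > 0,
            _ => false,
        };
        if matched {
            count += 1;
            i = j + 1; // resume scanning after the specifier
        }
    }
    count
}

#[test]
fn test_format_string_detection() {
    // Printf-style: %s, %d, %x
    assert_eq!(count_format_specifiers("Error: %s at line %d"), 2);
    assert_eq!(count_format_specifiers("offset 0x%x"), 1);
    // Python-style: {0}, {1}
    assert_eq!(count_format_specifiers("User {0} logged in from {1}"), 2);
    // Mixed format strings
    assert_eq!(count_format_specifiers("%5d items for {0}"), 2);
    // Non-matches: bare percent sign, empty braces, named braces
    assert_eq!(count_format_specifiers("100% done {} {name}"), 0);
}
```

Once the classifier exists, the same assertions could be rewritten against classify and the FormatString tag.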
Integration Tests
Test with real binaries that contain these patterns:
- ELF binaries with GUIDs in .rodata
- PE files with user agents in .rdata
- Mach-O binaries with format strings in __TEXT,__cstring
Performance Considerations
- Use lazy_static for one-time regex compilation
- Implement short-circuit evaluation (check simpler patterns first)
- Consider minimum string length before applying expensive regex
- Profile regex performance on large binaries
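The short-circuit and minimum-length points above can be sketched as a cheap pre-filter that decides which regexes are worth running at all. This is only an illustration under assumptions: Candidate is a local stand-in for the project's Tag enum, candidates is a hypothetical function, and the MIN_LEN value of 4 is an assumed threshold that would need tuning (it deliberately drops very short strings such as a lone "%s").

```rust
/// Local stand-in for the five `Tag` variants this issue covers
/// (hypothetical; the real enum lives in src/types.rs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Candidate {
    Guid,
    Email,
    Base64,
    FormatString,
    UserAgent,
}

/// Assumed floor below which no pattern is attempted at all.
const MIN_LEN: usize = 4;

/// Returns the patterns that could possibly match `text`, using only cheap
/// length and byte checks. Each check is a necessary condition for the
/// corresponding regex, so skipping a pattern here never loses a match
/// (apart from strings shorter than MIN_LEN, which are dropped on purpose).
fn candidates(text: &str) -> Vec<Candidate> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    if bytes.len() < MIN_LEN {
        return out; // short-circuit before any per-pattern work
    }
    let has = |b: u8| bytes.contains(&b);

    // Braced GUID: 36 characters plus two braces.
    if bytes.len() >= 38 && has(b'{') && has(b'}') {
        out.push(Candidate::Guid);
    }
    // Email: the pattern cannot match without an '@'.
    if has(b'@') {
        out.push(Candidate::Email);
    }
    // Base64: the pattern requires a run of at least 20 characters.
    if bytes.len() >= 20 {
        out.push(Candidate::Base64);
    }
    // Format strings: every alternative starts with '%' or '{'.
    if has(b'%') || has(b'{') {
        out.push(Candidate::FormatString);
    }
    // User agents: every alternative contains a '/'.
    if has(b'/') {
        out.push(Candidate::UserAgent);
    }
    out
}
```

For example, candidates("admin@malware.com") yields only Email, so four of the five regexes never run for that string. Whether this pre-filter pays off is something the profiling step would need to confirm.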
Related Issues
References
- Detailed patterns: docs/src/classification.md
- Type definitions: src/types.rs
- Tag enum: Lines 20-40 in src/types.rs
Implementation Notes
- Start with SemanticClassifier struct definition
- Implement each pattern matcher as a separate method
- Add validation logic for each pattern type
- Integrate with existing FoundString and Tag types
- Write comprehensive unit tests for each pattern
- Add integration tests with real binaries
- Optimize with benchmarking
- Update documentation with usage examples