Overview
This task implements URL and domain name pattern matching and validation as part of the semantic classification framework. This is a foundational component that enables Stringy to automatically identify and tag network-related strings extracted from binary files, which is critical for malware analysis and reverse engineering workflows.
Context
The classification system is designed to apply semantic analysis to extracted strings, identifying patterns that indicate specific types of data. URLs and domain names are high-priority network indicators that help analysts quickly identify C2 infrastructure, legitimate services, and other network communication endpoints.
The container parsing module (src/container/) is fully implemented with ELF, PE, and Mach-O support. Classification tag enums (Tag::Url, Tag::Domain, Tag::IPv4, Tag::IPv6) are already defined in src/types.rs. This task creates the actual pattern matching engine in src/classification/semantic.rs.
Requirements Mapping
- Requirement 3.1: Network Indicators - URL Pattern Matching
- Requirement 3.2: Network Indicators - Domain Name Detection
Implementation Details
1. Add Dependencies to Cargo.toml
[dependencies]
regex = "1.11"
lazy_static = "1.5" # For regex caching
2. Create src/classification/semantic.rs
Implement the SemanticClassifier struct with compiled regex patterns:
use lazy_static::lazy_static;
use regex::Regex;

use crate::types::{FoundString, Tag};

lazy_static! {
    // r#"..."# is required: the character class contains a double quote,
    // which would terminate a plain r"..." raw string.
    static ref URL_REGEX: Regex = Regex::new(
        r#"https?://[^\s<>"{}|\\\^\[\]`]+"#
    ).unwrap();

    static ref DOMAIN_REGEX: Regex = Regex::new(
        r"\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b"
    ).unwrap();
}

pub struct SemanticClassifier;

impl SemanticClassifier {
    pub fn new() -> Self {
        Self
    }

    /// Returns Tag::Url if the text contains an http(s) URL anywhere.
    pub fn classify_url(&self, text: &str) -> Option<Tag> {
        if URL_REGEX.is_match(text) {
            Some(Tag::Url)
        } else {
            None
        }
    }

    /// Returns Tag::Domain for domain-like text that is not already a URL.
    pub fn classify_domain(&self, text: &str) -> Option<Tag> {
        // Only tag as domain if it's not already a URL
        if !URL_REGEX.is_match(text)
            && DOMAIN_REGEX.is_match(text)
            && self.has_valid_tld(text)
        {
            Some(Tag::Domain)
        } else {
            None
        }
    }

    fn has_valid_tld(&self, domain: &str) -> bool {
        // Implement TLD validation
        // Consider common TLDs: .com, .net, .org, .io, etc.
        // Until this is implemented, any call that reaches it panics,
        // including the domain tests below.
        todo!("Implement TLD validation")
    }

    /// Collects every tag that applies to the extracted string.
    pub fn classify(&self, string: &FoundString) -> Vec<Tag> {
        let mut tags = Vec::new();
        if let Some(tag) = self.classify_url(&string.text) {
            tags.push(tag);
        }
        if let Some(tag) = self.classify_domain(&string.text) {
            tags.push(tag);
        }
        tags
    }
}
3. Update src/classification/mod.rs
pub mod semantic;
pub use semantic::SemanticClassifier;
4. Pattern Specifications
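Based on the classification documentation. The regexes below are the documented baseline forms; the implementation in step 2 uses stricter variants of both.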
URL Pattern:
- Regex: https?://[^\s]+
- Examples: https://api.example.com/v1/users, http://malware.com/payload
- Validation: Valid TLD, reasonable path structure
- Security relevance: High - indicates network communication
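The "Valid TLD" check for URLs needs the host portion of the URL first. The following is a minimal sketch of one way to isolate it using only the standard library; the function name url_host is hypothetical and not part of the existing codebase, and IPv6 literals in brackets are not handled (the step 2 regex excludes brackets anyway).

```rust
/// Hypothetical helper: returns the host of an http(s) URL without
/// userinfo or port, or None if the text has no such prefix or host.
fn url_host(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?', or '#'.
    let authority = rest
        .split(|c: char| matches!(c, '/' | '?' | '#'))
        .next()?;
    // Drop "user:pass@" if present.
    let host_port = authority.rsplit('@').next()?;
    // Drop ":port" if present.
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}
```

The returned host could then be passed to whatever TLD check the classifier ends up using, with IP-address hosts handled separately.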
Domain Pattern:
- Regex: [a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
- Examples: api.example.com, malware-c2.net
- Validation: TLD checking, DNS format compliance (RFC 1035)
- Security relevance: High - C2 domains, legitimate services
5. TLD Validation
Implement validation against a list of valid TLDs. Consider using:
- Hardcoded list of common TLDs (.com, .net, .org, .io, .gov, .edu, etc.)
- Or integrate a TLD list from IANA/public suffix list
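As a starting point, the hardcoded-list option might look like the sketch below. The COMMON_TLDS constant and its contents are assumptions for illustration, and the function is written as a free function rather than a method so it stands alone; it treats the whole input as a domain and checks only the last label.

```rust
/// Hypothetical allowlist; a fuller implementation might load the
/// IANA list or the Public Suffix List instead.
const COMMON_TLDS: &[&str] = &["com", "net", "org", "io", "gov", "edu"];

/// Returns true when the last dot-separated label of `domain`
/// is in the allowlist (case-insensitive).
fn has_valid_tld(domain: &str) -> bool {
    match domain.rsplit_once('.') {
        // Require something before the final dot, so ".com" is rejected.
        Some((rest, tld)) if !rest.is_empty() => {
            let tld = tld.to_ascii_lowercase();
            COMMON_TLDS.contains(&tld.as_str())
        }
        _ => false,
    }
}
```

With this list, example.com and api.service.io pass, while too.short.x and invalid are rejected, which matches the expectations in the unit tests.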
6. False Positive Reduction
- Minimum domain length: 4 characters (e.g., a.co)
- Reject domains with invalid characters
- Validate TLD is alphabetic and 2+ characters
- Consider context from section type (string data vs code)
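The structural checks in this list could be combined into a single pre-filter that runs before the TLD lookup. This is a sketch under assumptions: the name passes_domain_prefilter is hypothetical, the per-label rules follow the RFC 1035 style limits mentioned in the pattern specifications, and the section-type context check is left out because it depends on FoundString fields not shown here.

```rust
/// Hypothetical pre-filter covering length, character set, label
/// shape, and TLD shape. It does not consult any TLD list.
fn passes_domain_prefilter(candidate: &str) -> bool {
    // Minimum length: "a.co" is the shortest accepted form.
    if candidate.len() < 4 {
        return false;
    }
    // Only ASCII letters, digits, hyphens, and dots are allowed.
    if !candidate
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '.')
    {
        return false;
    }
    let labels: Vec<&str> = candidate.split('.').collect();
    if labels.len() < 2 {
        return false;
    }
    // Each label: 1 to 63 characters, no leading or trailing hyphen.
    let labels_ok = labels.iter().all(|l| {
        !l.is_empty() && l.len() <= 63 && !l.starts_with('-') && !l.ends_with('-')
    });
    // The TLD must be alphabetic and at least two characters long.
    let tld = labels[labels.len() - 1];
    let tld_ok = tld.len() >= 2 && tld.chars().all(|c| c.is_ascii_alphabetic());
    labels_ok && tld_ok
}
```

Because the TLD must be alphabetic, dotted IPv4 addresses such as 192.168.1.1 fail this filter and stay available for a separate Tag::IPv4 classifier.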
Testing Requirements
Create unit tests in src/classification/semantic.rs:
#[cfg(test)]
mod tests {
    use super::*;
    use crate::types::{FoundString, Encoding, StringSource};

    fn create_test_string(text: &str) -> FoundString {
        FoundString {
            text: text.to_string(),
            encoding: Encoding::Ascii,
            offset: 0,
            rva: None,
            section: Some("test".to_string()),
            length: text.len() as u32,
            tags: vec![],
            score: 0,
            source: StringSource::SectionData,
        }
    }

    #[test]
    fn test_url_detection() {
        let classifier = SemanticClassifier::new();
        // Valid URLs
        assert!(classifier.classify_url("https://example.com").is_some());
        assert!(classifier.classify_url("http://api.malware.com/v1/data").is_some());
        assert!(classifier.classify_url("https://192.168.1.1:8080/path").is_some());
        // Not URLs
        assert!(classifier.classify_url("example.com").is_none());
        assert!(classifier.classify_url("not a url").is_none());
    }

    #[test]
    fn test_domain_detection() {
        let classifier = SemanticClassifier::new();
        // Valid domains (not URLs)
        assert!(classifier.classify_domain("example.com").is_some());
        assert!(classifier.classify_domain("api.service.io").is_some());
        assert!(classifier.classify_domain("malware-c2.net").is_some());
        // Should not match URLs
        assert!(classifier.classify_domain("https://example.com").is_none());
        // Invalid domains
        assert!(classifier.classify_domain("invalid").is_none());
        assert!(classifier.classify_domain("too.short.x").is_none());
    }

    #[test]
    fn test_url_classification() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("https://api.example.com/endpoint");
        let tags = classifier.classify(&string);
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Url);
    }

    #[test]
    fn test_domain_classification() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("malware.example.com");
        let tags = classifier.classify(&string);
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Domain);
    }

    #[test]
    fn test_url_not_double_tagged() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("https://example.com");
        let tags = classifier.classify(&string);
        // Should only be tagged as URL, not both URL and Domain
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Url);
    }
}
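Acceptance Criteria
- regex dependency added to Cargo.toml
- src/classification/semantic.rs created with SemanticClassifier struct
- Unit tests pass under cargo test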
Dependencies
Security Considerations
This classifier is critical for malware analysis as it identifies:
- Command & Control (C2) server addresses
- Data exfiltration endpoints
- Update servers
- Legitimate service integrations
Proper validation reduces false positives that could waste analyst time.
References
Task-ID
stringy-analyzer/url-domain-classification