Implement URL and Domain Pattern Matching in Semantic Classification System #15

@unclesp1d3r

Overview

This task implements URL and domain name pattern matching and validation as part of the semantic classification framework. This is a foundational component that enables Stringy to automatically identify and tag network-related strings extracted from binary files, which is critical for malware analysis and reverse engineering workflows.

Context

The classification system is designed to apply semantic analysis to extracted strings, identifying patterns that indicate specific types of data. URLs and domain names are high-priority network indicators that help analysts quickly identify C2 infrastructure, legitimate services, and other network communication endpoints.

The container parsing module (src/container/) is fully implemented with ELF, PE, and Mach-O support. Classification tag enums (Tag::Url, Tag::Domain, Tag::IPv4, Tag::IPv6) are already defined in src/types.rs. This task creates the actual pattern matching engine in src/classification/semantic.rs.

Requirements Mapping

  • Requirement 3.1: Network Indicators - URL Pattern Matching
  • Requirement 3.2: Network Indicators - Domain Name Detection

Implementation Details

1. Add Dependencies to Cargo.toml

[dependencies]
regex = "1.11"
lazy_static = "1.5"  # For regex caching

2. Create src/classification/semantic.rs

Implement the SemanticClassifier struct with compiled regex patterns:

use regex::Regex;
use lazy_static::lazy_static;
use crate::types::{Tag, FoundString};

lazy_static! {
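    // A hash-delimited raw string (r#"..."#) is needed here: the character
    // class contains a double quote, which would end a plain r"..." literal.
    static ref URL_REGEX: Regex = Regex::new(
        r#"https?://[^\s<>"{}|\\\^\[\]`]+"#
    ).unwrap();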
    
    static ref DOMAIN_REGEX: Regex = Regex::new(
        r"\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b"
    ).unwrap();
}

pub struct SemanticClassifier;

impl SemanticClassifier {
    pub fn new() -> Self {
        Self
    }
    
    pub fn classify_url(&self, text: &str) -> Option<Tag> {
        if URL_REGEX.is_match(text) {
            Some(Tag::Url)
        } else {
            None
        }
    }
    
    pub fn classify_domain(&self, text: &str) -> Option<Tag> {
        // Only tag as domain if it's not already a URL
        if !URL_REGEX.is_match(text) && DOMAIN_REGEX.is_match(text) {
            if self.has_valid_tld(text) {
                Some(Tag::Domain)
            } else {
                None
            }
        } else {
            None
        }
    }
    
    fn has_valid_tld(&self, domain: &str) -> bool {
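        // Placeholder: validate the final label against a TLD list
        // (common TLDs such as .com, .net, .org, .io; see section 5).
        // Until this is implemented, todo!() panics, so classify_domain
        // and the domain tests fail at runtime.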
        todo!("Implement TLD validation")
    }
    
    pub fn classify(&self, string: &FoundString) -> Vec<Tag> {
        let mut tags = Vec::new();
        
        if let Some(tag) = self.classify_url(&string.text) {
            tags.push(tag);
        }
        
        if let Some(tag) = self.classify_domain(&string.text) {
            tags.push(tag);
        }
        
        tags
    }
}

3. Update src/classification/mod.rs

pub mod semantic;

pub use semantic::SemanticClassifier;

4. Pattern Specifications

Based on the classification documentation:

URL Pattern:

  • Regex: https?://[^\s]+ (base form; the URL_REGEX in step 2 additionally stops at <, >, ", {, }, |, \, ^, [, ], and `)
  • Examples: https://api.example.com/v1/users, http://malware.com/payload
  • Validation: Valid TLD, reasonable path structure
  • Security relevance: High - indicates network communication

Domain Pattern:

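  • Regex: [a-zA-Z0-9.-]+\.[a-zA-Z]{2,} (base form; the DOMAIN_REGEX in step 2 additionally limits each label to 63 characters and requires it to start and end with a letter or digit)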
  • Examples: api.example.com, malware-c2.net
  • Validation: TLD checking, DNS format compliance (RFC 1035)
  • Security relevance: High - C2 domains, legitimate services

5. TLD Validation

Implement validation against a list of valid TLDs. Consider using:

  • Hardcoded list of common TLDs (.com, .net, .org, .io, .gov, .edu, etc.)
  • Or integrate a TLD list from IANA/public suffix list
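The hardcoded-list option could be sketched roughly as follows. This is only an illustration, not the project's implementation: the COMMON_TLDS constant and its contents are hypothetical, the function is written as a free function rather than a method on SemanticClassifier, and it assumes the input is the bare domain text with nothing before or after it.

```rust
/// Hypothetical short allow-list for illustration only. A real build would
/// more likely load the IANA TLD list or the Public Suffix List.
const COMMON_TLDS: &[&str] = &[
    "com", "net", "org", "io", "gov", "edu", "co", "info", "biz",
];

/// Returns true when the final dot-separated label of `domain` is in the
/// allow-list. The comparison is ASCII case-insensitive, and one trailing
/// dot (fully qualified form) is ignored.
pub fn has_valid_tld(domain: &str) -> bool {
    let trimmed = domain.strip_suffix('.').unwrap_or(domain);
    match trimmed.rsplit_once('.') {
        // Require a non-empty name before the TLD, so ".com" alone is rejected.
        Some((name, tld)) if !name.is_empty() => COMMON_TLDS
            .iter()
            .any(|known| known.eq_ignore_ascii_case(tld)),
        // No dot at all, or nothing before the last dot.
        _ => false,
    }
}
```

With this list, example.com and api.service.io are accepted while invalid and too.short.x are rejected, which lines up with the expectations in the domain tests. A short list like this will miss valid country-code and newer generic TLDs, so it trades recall for simplicity.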

6. False Positive Reduction

  • Minimum domain length: 4 characters (e.g., a.co)
  • Reject domains with invalid characters
  • Validate TLD is alphabetic and 2+ characters
  • Consider context from section type (string data vs code)
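The first three checks above can be expressed without regex as a cheap pre-filter. The sketch below is one possible shape, under the assumption that the candidate is the bare domain text (no scheme, path, or whitespace); the function name passes_domain_checks is hypothetical, and the section-type context check is not covered because it depends on FoundString fields.

```rust
/// Applies cheap structural checks to a candidate domain string.
/// Assumes `candidate` is the bare domain text.
pub fn passes_domain_checks(candidate: &str) -> bool {
    // Minimum overall length of 4 characters, e.g. "a.co".
    if candidate.len() < 4 {
        return false;
    }
    let labels: Vec<&str> = candidate.split('.').collect();
    // Need at least one label plus a TLD.
    if labels.len() < 2 {
        return false;
    }
    // Every label: 1 to 63 characters, ASCII letters, digits, or hyphens,
    // with no leading or trailing hyphen.
    let labels_ok = labels.iter().all(|label| {
        !label.is_empty()
            && label.len() <= 63
            && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
            && !label.starts_with('-')
            && !label.ends_with('-')
    });
    // The TLD must be alphabetic and at least two characters long.
    let tld = labels[labels.len() - 1];
    let tld_ok = tld.len() >= 2 && tld.chars().all(|c| c.is_ascii_alphabetic());
    labels_ok && tld_ok
}
```

Because the TLD must be alphabetic, dotted-quad strings such as 1.2.3.4 are rejected here and can be left to the IPv4 tag. These checks are structural only; they do not replace validation against a TLD list.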

Testing Requirements

Create unit tests in src/classification/semantic.rs:

#[cfg(test)]
mod tests {
    use super::*;
    use crate::types::{FoundString, Encoding, StringSource};

    fn create_test_string(text: &str) -> FoundString {
        FoundString {
            text: text.to_string(),
            encoding: Encoding::Ascii,
            offset: 0,
            rva: None,
            section: Some("test".to_string()),
            length: text.len() as u32,
            tags: vec![],
            score: 0,
            source: StringSource::SectionData,
        }
    }

    #[test]
    fn test_url_detection() {
        let classifier = SemanticClassifier::new();
        
        // Valid URLs
        assert!(classifier.classify_url("https://example.com").is_some());
        assert!(classifier.classify_url("http://api.malware.com/v1/data").is_some());
        assert!(classifier.classify_url("https://192.168.1.1:8080/path").is_some());
        
        // Not URLs
        assert!(classifier.classify_url("example.com").is_none());
        assert!(classifier.classify_url("not a url").is_none());
    }

    #[test]
    fn test_domain_detection() {
        let classifier = SemanticClassifier::new();
        
        // Valid domains (not URLs)
        assert!(classifier.classify_domain("example.com").is_some());
        assert!(classifier.classify_domain("api.service.io").is_some());
        assert!(classifier.classify_domain("malware-c2.net").is_some());
        
        // Should not match URLs
        assert!(classifier.classify_domain("https://example.com").is_none());
        
        // Invalid domains
        assert!(classifier.classify_domain("invalid").is_none());
        assert!(classifier.classify_domain("too.short.x").is_none());
    }

    #[test]
    fn test_url_classification() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("https://api.example.com/endpoint");
        
        let tags = classifier.classify(&string);
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Url);
    }

    #[test]
    fn test_domain_classification() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("malware.example.com");
        
        let tags = classifier.classify(&string);
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Domain);
    }

    #[test]
    fn test_url_not_double_tagged() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("https://example.com");
        
        let tags = classifier.classify(&string);
        // Should only be tagged as URL, not both URL and Domain
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Url);
    }
}

Acceptance Criteria

  • regex and lazy_static dependencies added to Cargo.toml
  • src/classification/semantic.rs created with SemanticClassifier struct
  • URL pattern matching implemented with regex
  • Domain pattern matching implemented with validation
  • TLD validation prevents false positives
  • URLs are not double-tagged as domains
  • Unit tests cover positive and negative cases
  • Tests verify that URLs and domains are properly distinguished
  • All tests pass with cargo test
  • Code follows existing patterns from container parsers
  • Documentation comments added for public APIs

Dependencies

Security Considerations

This classifier is critical for malware analysis as it identifies:

  • Command & Control (C2) server addresses
  • Data exfiltration endpoints
  • Update servers
  • Legitimate service integrations

Proper validation reduces false positives that could waste analyst time.

References

Task-ID

stringy-analyzer/url-domain-classification
