Implement URL and Domain Pattern Matching in Semantic Classification System

## Overview

This task implements URL and domain name pattern matching and validation as part of the semantic classification framework. This is a foundational component that enables Stringy to automatically identify and tag network-related strings extracted from binary files, which is critical for malware analysis and reverse engineering workflows.

## Context

The [classification system](https://github.com/EvilBit-Labs/StringyMcStringFace/blob/main/docs/src/classification.md) is designed to apply semantic analysis to extracted strings, identifying patterns that indicate specific types of data. URLs and domain names are high-priority network indicators that help analysts quickly identify C2 infrastructure, legitimate services, and other network communication endpoints.

The container parsing module (`src/container/`) is fully implemented with ELF, PE, and Mach-O support. The classification tag variants (`Tag::Url`, `Tag::Domain`, `Tag::IPv4`, `Tag::IPv6`) are already defined on the `Tag` enum in `src/types.rs`. This task creates the actual pattern matching engine in `src/classification/semantic.rs`.

## Requirements Mapping

- **Requirement 3.1**: Network Indicators - URL Pattern Matching
- **Requirement 3.2**: Network Indicators - Domain Name Detection

## Implementation Details

### 1. Add Dependencies to `Cargo.toml`

```toml
[dependencies]
regex = "1.11"
lazy_static = "1.5"  # For regex caching
```

### 2. Create `src/classification/semantic.rs`

Implement the `SemanticClassifier` struct with compiled regex patterns:

```rust
use regex::Regex;
use lazy_static::lazy_static;
use crate::types::{Tag, FoundString};

lazy_static! {
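    // A `r#"..."#` raw string is needed because the pattern contains a double quote,
    // which would terminate a plain `r"..."` literal.
    static ref URL_REGEX: Regex = Regex::new(
        r#"https?://[^\s<>"{}|\\^\[\]`]+"#
    ).unwrap();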
    
    static ref DOMAIN_REGEX: Regex = Regex::new(
        r"\b(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}\b"
    ).unwrap();
}

pub struct SemanticClassifier;

impl SemanticClassifier {
    pub fn new() -> Self {
        Self
    }
    
    pub fn classify_url(&self, text: &str) -> Option<Tag> {
        if URL_REGEX.is_match(text) {
            Some(Tag::Url)
        } else {
            None
        }
    }
    
    pub fn classify_domain(&self, text: &str) -> Option<Tag> {
        // Only tag as domain if it's not already a URL
        if !URL_REGEX.is_match(text) && DOMAIN_REGEX.is_match(text) {
            if self.has_valid_tld(text) {
                Some(Tag::Domain)
            } else {
                None
            }
        } else {
            None
        }
    }
    
    fn has_valid_tld(&self, domain: &str) -> bool {
        // Implement TLD validation
        // Consider common TLDs: .com, .net, .org, .io, etc.
        todo!("Implement TLD validation")
    }
    
    pub fn classify(&self, string: &FoundString) -> Vec<Tag> {
        let mut tags = Vec::new();
        
        if let Some(tag) = self.classify_url(&string.text) {
            tags.push(tag);
        }
        
        if let Some(tag) = self.classify_domain(&string.text) {
            tags.push(tag);
        }
        
        tags
    }
}
```

### 3. Update `src/classification/mod.rs`

```rust
pub mod semantic;

pub use semantic::SemanticClassifier;
```

### 4. Pattern Specifications

Based on the [classification documentation](https://github.com/EvilBit-Labs/StringyMcStringFace/blob/main/docs/src/classification.md):

**URL Pattern:**
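- Regex: `https?://[^\s]+` (baseline from the documentation; `URL_REGEX` in section 2 also excludes quotes, angle brackets, braces, and similar delimiters)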
- Examples: `https://api.example.com/v1/users`, `http://malware.com/payload`
- Validation: Valid TLD, reasonable path structure
- Security relevance: High - indicates network communication

**Domain Pattern:**
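- Regex: `[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}` (baseline from the documentation; `DOMAIN_REGEX` in section 2 is stricter, requiring 1-63 character labels that start and end with an alphanumeric)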
- Examples: `api.example.com`, `malware-c2.net`
- Validation: TLD checking, DNS format compliance (RFC 1035)
- Security relevance: High - C2 domains, legitimate services
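The URL validation bullet above asks for a valid TLD, which means the host has to be pulled out of the matched URL first. A minimal sketch of that step, using only the standard library, might look like the following. The function name `url_host` is hypothetical and not part of the existing codebase, and bracketed IPv6 literals are deliberately left unhandled.

```rust
/// Extracts the host portion of an `http://` or `https://` URL using only
/// string operations. Returns `None` when the scheme is missing or the
/// host is empty. (`url_host` is a hypothetical helper for illustration.)
fn url_host(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    // Drop optional "user:pass@" credentials.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    // Drop an optional ":port" suffix (bracketed IPv6 literals are not handled).
    let host = host_port.split_once(':').map_or(host_port, |(h, _)| h);
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}
```

Note that the unit tests in this issue expect `https://192.168.1.1:8080/path` to be tagged as a URL, so a TLD check applied to the extracted host would presumably need to be skipped for IP-address hosts.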

### 5. TLD Validation

Implement validation against a list of valid TLDs. Consider using:
- Hardcoded list of common TLDs (.com, .net, .org, .io, .gov, .edu, etc.)
- Or integrate a TLD list from IANA/public suffix list
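As a starting point for the hardcoded option, `has_valid_tld` could be filled in roughly as sketched below. It is written as a free function rather than a method to keep the example self-contained. The `KNOWN_TLDS` list is an illustrative assumption and far from complete; a production version would more likely be generated from the IANA list or the Public Suffix List.

```rust
/// Illustrative allowlist only (an assumption, not the project's real list).
const KNOWN_TLDS: &[&str] = &["com", "net", "org", "io", "gov", "edu"];

/// Returns true when the final dot-separated label of `domain` is a known TLD.
/// The comparison is ASCII case-insensitive and tolerates a trailing root dot.
fn has_valid_tld(domain: &str) -> bool {
    let trimmed = domain.strip_suffix('.').unwrap_or(domain);
    match trimmed.rsplit_once('.') {
        // Require a non-empty name before the TLD, which rejects ".com".
        Some((name, tld)) if !name.is_empty() => KNOWN_TLDS
            .iter()
            .any(|known| known.eq_ignore_ascii_case(tld)),
        _ => false,
    }
}
```

With this shape, `too.short.x` from the unit tests is rejected because `x` is not in the list, while `example.com` and `api.service.io` are accepted.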

### 6. False Positive Reduction

- Minimum domain length: 4 characters (e.g., `a.co`)
- Reject domains with invalid characters
- Validate TLD is alphabetic and 2+ characters
- Consider context from section type (string data vs code)
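The first three checks above are purely structural, so they can be sketched without the regex engine. The function name `is_plausible_domain` is hypothetical, and the 253-character ceiling is an added assumption based on the usual textual limit implied by RFC 1035; the section-type context check is not covered here.

```rust
/// Structural false-positive filter for a candidate domain string.
/// (`is_plausible_domain` is a hypothetical helper for illustration.)
fn is_plausible_domain(candidate: &str) -> bool {
    // Shortest accepted form is like "a.co"; 253 is the usual textual
    // upper bound for a domain name (an assumption added in this sketch).
    if candidate.len() < 4 || candidate.len() > 253 {
        return false;
    }
    let labels: Vec<&str> = candidate.split('.').collect();
    if labels.len() < 2 {
        return false;
    }
    // Every label: 1-63 chars, alphanumeric or '-', no leading/trailing '-'.
    let labels_ok = labels.iter().all(|label| {
        (1..=63).contains(&label.len())
            && label.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-')
            && !label.starts_with('-')
            && !label.ends_with('-')
    });
    // The TLD must be purely alphabetic and at least two characters long.
    let tld = labels[labels.len() - 1];
    labels_ok && tld.len() >= 2 && tld.bytes().all(|b| b.is_ascii_alphabetic())
}
```

A filter like this could run before the TLD allowlist lookup, so that obviously malformed candidates are discarded cheaply.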

## Testing Requirements

Create unit tests in `src/classification/semantic.rs`:

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use crate::types::{FoundString, Encoding, StringSource};

    fn create_test_string(text: &str) -> FoundString {
        FoundString {
            text: text.to_string(),
            encoding: Encoding::Ascii,
            offset: 0,
            rva: None,
            section: Some("test".to_string()),
            length: text.len() as u32,
            tags: vec![],
            score: 0,
            source: StringSource::SectionData,
        }
    }

    #[test]
    fn test_url_detection() {
        let classifier = SemanticClassifier::new();
        
        // Valid URLs
        assert!(classifier.classify_url("https://example.com").is_some());
        assert!(classifier.classify_url("http://api.malware.com/v1/data").is_some());
        assert!(classifier.classify_url("https://192.168.1.1:8080/path").is_some());
        
        // Not URLs
        assert!(classifier.classify_url("example.com").is_none());
        assert!(classifier.classify_url("not a url").is_none());
    }

    #[test]
    fn test_domain_detection() {
        let classifier = SemanticClassifier::new();
        
        // Valid domains (not URLs)
        assert!(classifier.classify_domain("example.com").is_some());
        assert!(classifier.classify_domain("api.service.io").is_some());
        assert!(classifier.classify_domain("malware-c2.net").is_some());
        
        // Should not match URLs
        assert!(classifier.classify_domain("https://example.com").is_none());
        
        // Invalid domains
        assert!(classifier.classify_domain("invalid").is_none());
        assert!(classifier.classify_domain("too.short.x").is_none());
    }

    #[test]
    fn test_url_classification() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("https://api.example.com/endpoint");
        
        let tags = classifier.classify(&string);
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Url);
    }

    #[test]
    fn test_domain_classification() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("malware.example.com");
        
        let tags = classifier.classify(&string);
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Domain);
    }

    #[test]
    fn test_url_not_double_tagged() {
        let classifier = SemanticClassifier::new();
        let string = create_test_string("https://example.com");
        
        let tags = classifier.classify(&string);
        // Should only be tagged as URL, not both URL and Domain
        assert_eq!(tags.len(), 1);
        assert_eq!(tags[0], Tag::Url);
    }
}
```

## Acceptance Criteria

- [ ] `regex` and `lazy_static` dependencies added to `Cargo.toml`
- [ ] `src/classification/semantic.rs` created with `SemanticClassifier` struct
- [ ] URL pattern matching implemented with regex
- [ ] Domain pattern matching implemented with validation
- [ ] TLD validation prevents false positives
- [ ] URLs are not double-tagged as domains
- [ ] Unit tests cover positive and negative cases
- [ ] Tests verify that URLs and domains are properly distinguished
- [ ] All tests pass with `cargo test`
- [ ] Code follows existing patterns from container parsers
- [ ] Documentation comments added for public APIs

## Dependencies

- **Blocked by**: Semantic Classification Framework setup
- **Blocks**: IP address classification (#16), File path classification (#17)

## Security Considerations

This classifier is critical for malware analysis as it identifies:
- Command & Control (C2) server addresses
- Data exfiltration endpoints
- Update servers
- Legitimate service integrations

Proper validation reduces false positives that could waste analyst time.

## References

- [Classification System Documentation](https://github.com/EvilBit-Labs/StringyMcStringFace/blob/main/docs/src/classification.md)
- [Architecture Overview](https://github.com/EvilBit-Labs/StringyMcStringFace/blob/main/docs/src/architecture.md)
- [Existing test patterns](https://github.com/EvilBit-Labs/StringyMcStringFace/tree/main/src/container) (ELF/PE/Mach-O parsers)

## Task-ID
`stringy-analyzer/url-domain-classification`
