Summary
Implement comprehensive IPv4 and IPv6 address pattern detection and validation within the semantic classification system to identify and tag IP addresses found in binary strings.
Context
The semantic classifier currently supports URL and domain detection. IP addresses are a critical type of network indicator that appear frequently in binaries (C&C addresses, configuration endpoints, telemetry servers, hardcoded network targets). Adding IPv4 and IPv6 detection will enable security analysts and reverse engineers to quickly identify potential network indicators of compromise (IOCs).
Current State
- Tag::IPv4 and Tag::IPv6 enum variants are already defined in src/types.rs (lines 18-19)
- src/classification/mod.rs exists but is currently empty
- The semantic tagging infrastructure is in place but needs pattern matching implementation
- Architecture supports regex-based classification per concept.md
Dependencies
Proposed Solution
Implementation Approach
Implement IP address detection in src/classification/mod.rs (or a dedicated submodule) with the following components:
1. IPv4 Pattern Matching
Pattern: XXX.XXX.XXX.XXX where each octet is 0-255
Validation Rules:
- Four octets separated by dots
- Each octet must be 0-255 (no leading zeros except for "0" itself)
- No leading/trailing dots
- Treat special addresses such as 0.0.0.0 and 255.255.255.255 as syntactically valid but context-dependent; they may be flagged with lower scores
Example Valid: 192.168.1.1, 10.0.0.1, 172.16.0.1, 8.8.8.8
Example Invalid: 256.1.1.1, 192.168.1, 192.168.1.1.1, 192.168.01.1 (leading zero)
Regex Pattern (starting point):
\b(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
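The validation rules above can also be checked directly, without a regex. The following is a minimal, standard-library-only sketch of one possible body for is_ipv4_address; the helper is_valid_octet is a hypothetical name introduced for illustration, and the whole block is an assumption about how the rules might be applied to a string that is expected to contain nothing but the address.

```rust
/// Returns true when `text` is a dotted-quad IPv4 address that satisfies
/// the validation rules: exactly four octets, each 0-255, no leading zeros.
pub fn is_ipv4_address(text: &str) -> bool {
    let mut count = 0;
    for octet in text.split('.') {
        count += 1;
        // A fifth octet, an empty octet (leading, trailing, or doubled dot),
        // or an out-of-range octet all make the address invalid.
        if count > 4 || !is_valid_octet(octet) {
            return false;
        }
    }
    count == 4
}

/// Hypothetical helper: checks a single octet. It must be 1-3 ASCII digits,
/// have no leading zero unless it is exactly "0", and be at most 255.
fn is_valid_octet(octet: &str) -> bool {
    if octet.is_empty() || octet.len() > 3 {
        return false;
    }
    if !octet.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    if octet.len() > 1 && octet.starts_with('0') {
        return false;
    }
    // At most three digits, so the value always fits in a u16.
    octet.parse::<u16>().map_or(false, |value| value <= 255)
}
```

A check like this could run on each regex candidate, so the regex only has to find plausible spans while the function enforces the exact rules.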
2. IPv6 Pattern Matching
Format Support:
- Full notation: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
- Compressed notation: 2001:db8:85a3::8a2e:370:7334
- Mixed notation (IPv4-mapped): ::ffff:192.0.2.1
- Loopback: ::1
- Link-local: fe80::1
Validation Rules:
- Eight groups of 1-4 hexadecimal digits separated by colons (leading zeros within a group may be omitted)
- Double colon (::) allowed once to represent one or more consecutive all-zero groups
- Trailing/embedded IPv4 addresses in mixed notation
- Case-insensitive hex digits (a-f, A-F)
Example Valid: 2001:db8::1, fe80::1, ::1, 2001:0db8:85a3::8a2e:0370:7334, ::ffff:192.0.2.1
Implementation Note: IPv6 regex validation is complex. Consider using the ipnetwork crate or std::net::IpAddr for validation after initial pattern matching.
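As a rough illustration of the note above, the sketch below pairs a cheap character pre-filter with std::net::Ipv6Addr parsing for the real validation. The helper looks_like_ipv6 and its two-colon threshold are assumptions made for this example, not part of the existing design, and the pre-filter stands in for whatever regex is eventually chosen.

```rust
use std::net::Ipv6Addr;

/// Hypothetical pre-filter: the string contains only characters that can
/// appear in a textual IPv6 address (hex digits, ':' and, for mixed
/// notation, '.') and has at least two colons.
fn looks_like_ipv6(text: &str) -> bool {
    let colons = text.bytes().filter(|&b| b == b':').count();
    colons >= 2
        && text
            .bytes()
            .all(|b| b.is_ascii_hexdigit() || b == b':' || b == b'.')
}

/// Full validation is delegated to the standard library parser, which
/// handles full, compressed, and mixed notation.
pub fn is_ipv6_address(text: &str) -> bool {
    looks_like_ipv6(text) && text.parse::<Ipv6Addr>().is_ok()
}
```

Delegating to the standard parser keeps the group-count and single-:: rules out of hand-written code; the pre-filter only exists to reject obviously unrelated strings cheaply.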
3. Classification Function Structure
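use std::net::{Ipv4Addr, Ipv6Addr};

use crate::types::Tag;

/// Returns every semantic tag that applies to `text`.
pub fn classify_string(text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if is_ipv4_address(text) {
        tags.push(Tag::IPv4);
    }
    if is_ipv6_address(text) {
        tags.push(Tag::IPv6);
    }
    // Add other classification logic (URL, Domain, etc.)
    tags
}

/// Placeholder body: the final implementation should combine the regex with validation.
fn is_ipv4_address(text: &str) -> bool {
    text.parse::<Ipv4Addr>().is_ok()
}

/// Placeholder body: the final implementation should combine the regex with std::net::Ipv6Addr parsing.
fn is_ipv6_address(text: &str) -> bool {
    text.parse::<Ipv6Addr>().is_ok()
}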
4. Integration with Scoring System
IP addresses should receive a semantic boost per the concept.md ranking algorithm:
- IPv4/IPv6 in private ranges: +2 score (internal network indicators)
- IPv4/IPv6 in public ranges: +3 to +5 score (potential C&C, external endpoints)
- IPv4 in special ranges (loopback, multicast): +1 score (informational)
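The boosts listed above could be computed from a parsed address roughly as follows. This is a sketch under several assumptions: the function and constant names are hypothetical, public addresses use the low end of the +3 to +5 range, link-local addresses are grouped with the private ranges, unspecified and broadcast addresses are treated as special, and IPv6 loopback and multicast get the same treatment as their IPv4 counterparts.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

// Hypothetical boost values based on the list above.
const BOOST_SPECIAL: i32 = 1;
const BOOST_PRIVATE: i32 = 2;
const BOOST_PUBLIC: i32 = 3; // low end of the +3 to +5 range

/// Maps a parsed address to its score boost.
pub fn ip_score_boost(addr: IpAddr) -> i32 {
    match addr {
        IpAddr::V4(v4) => ipv4_boost(v4),
        IpAddr::V6(v6) => ipv6_boost(v6),
    }
}

fn ipv4_boost(addr: Ipv4Addr) -> i32 {
    if addr.is_loopback() || addr.is_multicast() || addr.is_unspecified() || addr.is_broadcast() {
        BOOST_SPECIAL
    } else if addr.is_private() || addr.is_link_local() {
        // Assumption: link-local (169.254.0.0/16) counts as internal.
        BOOST_PRIVATE
    } else {
        BOOST_PUBLIC
    }
}

fn ipv6_boost(addr: Ipv6Addr) -> i32 {
    let first = addr.segments()[0];
    // Assumption: "private" for IPv6 means unique-local or link-local.
    let unique_local = (first & 0xfe00) == 0xfc00; // fc00::/7
    let link_local = (first & 0xffc0) == 0xfe80; // fe80::/10
    if addr.is_loopback() || addr.is_multicast() || addr.is_unspecified() {
        BOOST_SPECIAL
    } else if unique_local || link_local {
        BOOST_PRIVATE
    } else {
        BOOST_PUBLIC
    }
}
```

The IPv6 ranges are checked with explicit bit masks so the sketch does not depend on newer standard-library helpers.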
Technical Considerations
- False Positives:
  - Version numbers may look like IPs: 1.2.3.4
  - Add context checks: IPs in networking sections get higher confidence
  - Consider excluding common version patterns (all octets < 20)
- Performance:
  - Use compiled regex with the regex crate
  - Consider aho-corasick for multi-pattern matching if combined with URL/Domain
  - Lazy static initialization for regex patterns
- Dependencies:
  - regex = "1.10" (already in use per architecture)
  - Optional: ipnetwork = "0.20" or use std::net for validation
- Port Handling:
  - Decide if 192.168.1.1:8080 should be tagged as IPv4
  - Suggest: Strip port suffix before validation, still tag as IPv4
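Two of the considerations above, port handling and version-number look-alikes, can be sketched with the standard library alone. Both function names are hypothetical, and using the SocketAddr parser to strip the port is an assumption about one convenient way to follow the suggestion, not a settled design.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Parses `text` as an IP address, accepting an optional port suffix
/// such as `192.168.1.1:8080` or `[2001:db8::1]:8080`.
pub fn parse_ip_ignoring_port(text: &str) -> Option<IpAddr> {
    if let Ok(addr) = text.parse::<IpAddr>() {
        return Some(addr);
    }
    // SocketAddr's parser understands both `v4:port` and `[v6]:port`,
    // and rejects ports outside 0-65535.
    text.parse::<SocketAddr>().ok().map(|sock| sock.ip())
}

/// Heuristic from the list above: an address whose octets are all
/// below 20 resembles a version string such as 1.2.3.4.
pub fn looks_like_version(addr: Ipv4Addr) -> bool {
    addr.octets().iter().all(|&octet| octet < 20)
}
```

Note that 10.0.0.1 also has every octet below 20, so the version heuristic is probably better suited to lowering confidence than to rejecting a match outright.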
Testing Requirements
Unit Tests
Create src/classification/tests.rs or inline tests with coverage for:
IPv4 Tests:
- ✅ Valid addresses: 192.168.1.1, 10.0.0.1, 8.8.8.8, 1.1.1.1
- ✅ Edge cases: 0.0.0.0, 255.255.255.255, 127.0.0.1
- ❌ Invalid: 256.1.1.1, 192.168.1, 192.168.1.1.1, 999.999.999.999
- ❌ Leading zeros: 192.168.01.1
- ❌ Version numbers: 1.2.3.4 (context-dependent)
- ✅ With ports: 192.168.1.1:8080 (should extract IP)
IPv6 Tests:
- ✅ Full notation: 2001:0db8:85a3:0000:0000:8a2e:0370:7334
- ✅ Compressed: 2001:db8::1, ::1, fe80::1
- ✅ Mixed notation: ::ffff:192.0.2.1, 64:ff9b::192.0.2.1
- ✅ All zeros: ::
- ❌ Invalid: gggg::1, 2001:db8::1::2 (double ::), 2001:db8:1
- ✅ With ports/brackets: [2001:db8::1]:8080
Integration Tests:
- Extract IPs from sample binary strings (mix of text, URLs with IPs, config strings)
- Verify tagging applied correctly to FoundString objects
- Test scoring boosts are applied
Documentation
- Add rustdoc comments to classification functions
- Update concept.md with IP classification details
- Add examples to README.md showing IP detection
Acceptance Criteria
References
- std::net::IpAddr documentation
Related Issues
Task-ID: stringy-analyzer/ip-address-classification
Requirements: 3.3
Estimated Effort: 2-3 days (implementation + comprehensive testing)