feat(classification): add semantic classifier for URL and domain detection #118

Merged
unclesp1d3r merged 3 commits into main from 15-implement-url-and-domain-pattern-matching-in-semantic-classification-system
Jan 4, 2026

Conversation

@unclesp1d3r
Member

  • Introduced a new `SemanticClassifier` module that identifies and tags network indicators (URLs and domain names) within extracted strings.
  • Implemented pattern matching with compiled regular expressions, plus TLD validation to reduce false positives.
  • Updated `Cargo.toml` to add two dependencies: `lazy_static` and `regex`.
  • Updated `mod.rs` to expose the new `SemanticClassifier`.

This gives the library a way to flag network-related content in extracted strings, which is useful in binary analysis.
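
To make the detection idea in the bullets above concrete, here is a minimal sketch using only the standard library. It is not the PR's code: the PR uses compiled patterns from the `regex` and `lazy_static` crates, which are swapped here for hand-written string checks so the example has no dependencies. The names `KNOWN_TLDS`, `has_known_tld`, `looks_like_domain`, `looks_like_url`, `NetworkTag`, and this `classify` function, as well as the very short TLD list, are all hypothetical.

```rust
/// Hypothetical, deliberately short TLD allow-list (not the PR's list).
const KNOWN_TLDS: &[&str] = &["com", "org", "net", "io", "dev"];

/// Returns true when the text after the last dot is a known TLD
/// (compared case-insensitively).
fn has_known_tld(host: &str) -> bool {
    match host.rsplit_once('.') {
        Some((_, tld)) => KNOWN_TLDS.iter().any(|t| t.eq_ignore_ascii_case(tld)),
        None => false,
    }
}

/// Rough domain check: dot-separated labels made of ASCII alphanumerics
/// or '-', no empty labels, no leading or trailing '-', and a known TLD.
fn looks_like_domain(s: &str) -> bool {
    let labels_ok = s.split('.').all(|label| {
        !label.is_empty()
            && !label.starts_with('-')
            && !label.ends_with('-')
            && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
    });
    labels_ok && has_known_tld(s)
}

/// Rough URL check: an http:// or https:// prefix with a non-empty remainder.
fn looks_like_url(s: &str) -> bool {
    s.strip_prefix("http://")
        .or_else(|| s.strip_prefix("https://"))
        .map_or(false, |rest| !rest.is_empty())
}

/// Hypothetical tag type for the two indicator kinds named in the PR.
#[derive(Debug, PartialEq)]
enum NetworkTag {
    Url,
    Domain,
}

/// Tries the URL check first, then the domain check.
fn classify(s: &str) -> Option<NetworkTag> {
    if looks_like_url(s) {
        Some(NetworkTag::Url)
    } else if looks_like_domain(s) {
        Some(NetworkTag::Domain)
    } else {
        None
    }
}
```

The TLD check is what keeps strings such as `config.json` or `libfoo.so` from being tagged as domains: they have the right shape but end in something that is not on the allow-list.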

…ction

- Introduced a new `SemanticClassifier` module for identifying and tagging network indicators such as URLs and domain names within extracted strings.
- Implemented pattern matching using compiled regular expressions for efficient detection, including TLD validation to minimize false positives.
- Updated the `Cargo.toml` to include new dependencies: `lazy_static` and `regex`.
- Enhanced the `mod.rs` file to expose the new `SemanticClassifier` functionality.

This addition significantly improves the library's ability to analyze strings for network-related content, enhancing its utility in binary analysis.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
unclesp1d3r linked an issue Jan 4, 2026 that may be closed by this pull request
11 tasks
@coderabbitai
Contributor

coderabbitai Bot commented Jan 4, 2026

Caution

Review failed

The pull request is closed.

Summary by CodeRabbit

  • New Features
    • Added semantic classification to detect URLs and domain names in text, with TLD validation and a public classifier API.
  • Tests
    • Added unit tests covering URL/domain detection, edge cases, and TLD validation.
  • Chores
    • Added runtime dependencies for regex support.
  • Documentation
    • Updated documentation build workflow and refined a doc comment for clarity.

Walkthrough

Adds a new semantic classification module that detects URLs and domain names (with TLD validation) and exposes it via the classification module; adds regex and lazy_static dependencies; updates mdBook version and removes a docs plugin; minor doc-style change in a PE extraction file.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Management (`Cargo.toml`) | Added `lazy_static = "1.5"` and `regex = "1.12.2"` to `[dependencies]`. |
| Module Exposure (`src/classification/mod.rs`) | Added `pub mod semantic;` and `pub use semantic::SemanticClassifier;` to expose the new classifier. |
| Semantic Classification Implementation (`src/classification/semantic.rs`) | New `SemanticClassifier` providing `new()`, `classify_url()`, `classify_domain()`, and `classify()`; uses precompiled regexes and an internal TLD validator; includes unit tests covering URLs, domains, TLDs, and edge cases. |
| Docs Workflow (`.github/workflows/docs.yml`) | Bumped mdBook from 0.4.52 to 0.5.2 and removed the `mdbook-alerts` plugin from install steps. |
| PE Extraction Docs (`src/extraction/pe_resources.rs`) | Minor doc-comment style change: converted return-type description to inline code style in the doc comment (no behavioral change). |
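
The module-exposure row can be illustrated with a single-file sketch. The `pub mod semantic;` and `pub use semantic::SemanticClassifier;` lines come from the summary; everything else is an assumption made for illustration, including the inline module layout, the `bool` return type of `classify_url`, its prefix-check body (the PR uses precompiled regexes), and the `is_url` helper.

```rust
mod classification {
    // Equivalent of `pub mod semantic;`, written inline so the sketch fits in one file.
    pub mod semantic {
        /// Stand-in for the PR's classifier; the method body is an assumption.
        pub struct SemanticClassifier;

        impl SemanticClassifier {
            pub fn new() -> Self {
                SemanticClassifier
            }

            /// Assumed signature: returns true for an http:// or https:// prefix.
            pub fn classify_url(&self, text: &str) -> bool {
                text.starts_with("http://") || text.starts_with("https://")
            }
        }
    }

    // Re-export so callers do not need to name the `semantic` submodule.
    pub use semantic::SemanticClassifier;
}

// The re-exported path and `classification::semantic::SemanticClassifier`
// name the same type; this one is shorter.
use classification::SemanticClassifier;

/// Hypothetical convenience wrapper showing the re-exported path in use.
fn is_url(text: &str) -> bool {
    SemanticClassifier::new().classify_url(text)
}
```

Re-exporting at the `classification` level lets the submodule be renamed or split later without breaking callers that import `classification::SemanticClassifier`.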

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hop through bytes both close and far,
I sniff for domains and every URL star,
TLDs checked with careful regex art,
I tag each string and play my part,
Hooray — semantic hunting from the start! 🎉

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding a semantic classifier for URL and domain detection, which aligns with the primary objective of the PR. |
| Description check | ✅ Passed | The description is directly related to the changeset, covering the new SemanticClassifier module, regex-based pattern matching, dependency updates, and module exposure. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e971b47 and ccfb91f.

📒 Files selected for processing (5)
  • .github/workflows/docs.yml
  • Cargo.toml
  • src/classification/mod.rs
  • src/classification/semantic.rs
  • src/extraction/pe_resources.rs

… workflow

- Upgraded `mdbook` from version 0.4.52 to 0.5.2 to leverage new features and improvements.
- Simplified the installation command for mdBook plugins by removing redundant entries, ensuring a cleaner configuration.

These changes enhance the documentation build process and maintain compatibility with the latest mdBook features.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
- Updated the documentation for the `parse_string_table_block` function to clarify the return type, specifying that it returns a vector of `Option<String>`, where `Some` contains the decoded string and `None` indicates an empty entry.

This change enhances the clarity of the function's purpose and expected output, improving usability for developers.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>
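
As a rough illustration of the return shape described in that doc comment, the sketch below shows one way a caller might consume a `Vec<Option<String>>` in which `None` marks an empty entry. `parse_string_table_block` itself is not reproduced here; the helper name `non_empty_entries` and the idea of pairing each string with its position in the block are assumptions.

```rust
/// Hypothetical helper: pairs each non-empty entry with its position in the block,
/// skipping the `None` slots that mark empty entries.
fn non_empty_entries(block: &[Option<String>]) -> Vec<(usize, &str)> {
    block
        .iter()
        .enumerate()
        // `as_deref` turns `&Option<String>` into `Option<&str>` without cloning.
        .filter_map(|(idx, entry)| entry.as_deref().map(|s| (idx, s)))
        .collect()
}
```

Keeping the `None` slots in the vector, instead of dropping them during parsing, preserves each string's position, so a caller can still recover which slot a decoded string came from.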
unclesp1d3r enabled auto-merge (squash) January 4, 2026 23:44
unclesp1d3r merged commit e2eb692 into main Jan 4, 2026
15 of 17 checks passed
unclesp1d3r deleted the 15-implement-url-and-domain-pattern-matching-in-semantic-classification-system branch January 4, 2026 23:47
unclesp1d3r added a commit that referenced this pull request Feb 25, 2026
…ction (#118)

* feat(classification): add semantic classifier for URL and domain detection

- Introduced a new `SemanticClassifier` module for identifying and tagging network indicators such as URLs and domain names within extracted strings.
- Implemented pattern matching using compiled regular expressions for efficient detection, including TLD validation to minimize false positives.
- Updated the `Cargo.toml` to include new dependencies: `lazy_static` and `regex`.
- Enhanced the `mod.rs` file to expose the new `SemanticClassifier` functionality.

This addition significantly improves the library's ability to analyze strings for network-related content, enhancing its utility in binary analysis.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* chore: Update mdBook version and streamline plugin installation in CI workflow

- Upgraded `mdbook` from version 0.4.52 to 0.5.2 to leverage new features and improvements.
- Simplified the installation command for mdBook plugins by removing redundant entries, ensuring a cleaner configuration.

These changes enhance the documentation build process and maintain compatibility with the latest mdBook features.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

* fix: Improve documentation for parse_string_table_block function

- Updated the documentation for the `parse_string_table_block` function to clarify the return type, specifying that it returns a vector of `Option<String>`, where `Some` contains the decoded string and `None` indicates an empty entry.

This change enhances the clarity of the function's purpose and expected output, improving usability for developers.

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

---------

Signed-off-by: UncleSp1d3r <unclesp1d3r@evilbitlabs.io>

Development

Successfully merging this pull request may close these issues.

Implement URL and Domain Pattern Matching in Semantic Classification System

1 participant