Add URL filtering #72

Voamorim · 2026-01-12T12:16:00Z

No description provided.

cunha

Main points:

The tests do not match the expected behavior; we need to define the number of domain and path levels beyond which "anonymization" starts.
We need tests that cover URL query parameters.
We need tests that cover special domains and ccTLDs.
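
As a possible starting point for the query-parameter tests, here is a minimal sketch of how parameter values could be erased while the parameter names are kept. The helper name `erase_query_values` is hypothetical and not taken from the PR; sorting and de-duplicating the names and dropping the fragment are assumptions that the project may want to decide differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def erase_query_values(url: str) -> str:
    """Return `url` with every query parameter value blanked out.

    Hypothetical helper for illustration only. Parameter names are
    de-duplicated and sorted (an assumption) so that parameter order
    does not produce distinct results. The fragment is dropped
    (also an assumption).
    """
    parts = urlsplit(url)
    # keep_blank_values=True so that "?a=&b=1" still yields both names.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    names = sorted({name for name, _ in pairs})
    query = urlencode([(name, "") for name in names])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(erase_query_values("http://example.com/search?q=cats&page=2"))
    print(erase_query_values("http://example.com/search?page=9&q=dogs"))
```

Under these assumptions, both URLs in the usage example map to the same string, which is the property a query-parameter test would check.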

urls_filter/get-tld-json.py

urls_filter/requirements.txt

urls_filter/test_filter.py

cunha

Forgot to mention CC TLDs in the previous review.

urls_filter/data/dns-keywords.txt

urls_filter/test_filter.py

cunha

Another round of comments. I think we should focus on comparing the URL identity keys (after we "erase" subdomains, path levels, and query parameter values). The filter is just taking one URL for each distinct identity key.
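
The idea that the filter keeps one URL per distinct identity key can be sketched as follows. This is a simplified illustration, not the PR's implementation: `path_key` and `filter_urls` are hypothetical names, the key here only truncates path levels and drops the query string, and `max_path_levels` stands in for whatever the project's configuration defines.

```python
from typing import Callable, Iterable, List
from urllib.parse import urlsplit, urlunsplit


def path_key(url: str, max_path_levels: int = 1) -> str:
    """Hypothetical identity key: keep at most `max_path_levels` path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Leave an empty or root-only path untouched.
    path = "/" + "/".join(segments[:max_path_levels]) if segments else parts.path
    # Query string and fragment are dropped in this simplified key.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def filter_urls(urls: Iterable[str], key: Callable[[str], str] = path_key) -> List[str]:
    """Return the first original URL seen for each distinct identity key."""
    first_seen = {}
    for url in urls:
        # setdefault keeps the earliest URL and ignores later duplicates.
        first_seen.setdefault(key(url), url)
    return list(first_seen.values())


if __name__ == "__main__":
    urls = [
        "http://example.com/a/1",
        "http://example.com/a/2",
        "http://example.com/b/1",
    ]
    print(filter_urls(urls))
```

Keeping the original URL as the value, rather than the key itself, means the output is still a real, crawlable URL.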

urls_filter/data/tlds.json

cunha · 2026-01-28T19:28:20Z

urls_filter/test_filter.py

+        """Testing a single URL identity key."""
+        url = "http://example.com/unique"
+        result = filter_instance.anonymize_url(url)
+        expected = "n1/unique"


I think this should be http://example.com/unique, no? Why the n1?

Looking at the other tests, it seems like we are really anonymizing the URLs, but we do not need to (and should not) anonymize in this project, as we need to know exactly what the URL is to crawl it later.

cunha · 2026-01-28T19:29:33Z

urls_filter/test_filter.py

+            "http://sub.example.com/page",
+            "http://another.example.com/page" 
+        ]
+        result = filter_instance.filter_urls(urls)


We should focus our tests on anonymize_url, to check which URL identity keys are being generated. The filter then just runs set() on the list of anonymized URLs.

urls_filter/test_filter.py

cunha · 2026-01-28T19:38:33Z

urls_filter/test_filter.py

+
+    def test_special_cctld(self, filter_instance):
+        """Test URL with special ccTLD identity key."""
+        url = "http://example.co/page"


We need tests for long subdomains like a.b.c.d.com.br to make sure we're erasing the correct subdomains according to the config.
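
A rough sketch of what such a test could exercise is shown below. It assumes a tiny hard-coded suffix set instead of the project's `tlds.json` (whose format is not shown here), and the names `split_host`, `erase_subdomains`, and `keep_levels` are hypothetical stand-ins for the real code and its config.

```python
from typing import List, Set, Tuple

# Illustrative suffix set only; real code would load the full suffix data.
SUFFIXES: Set[str] = {"com", "br", "com.br", "co"}


def split_host(host: str, suffixes: Set[str] = SUFFIXES) -> Tuple[List[str], str]:
    """Split a host into (subdomain labels, registrable domain).

    The registrable domain is the longest known suffix plus one label.
    """
    labels = host.lower().split(".")
    # Scan from the left so the first match is the longest suffix.
    for start in range(len(labels)):
        if ".".join(labels[start:]) in suffixes:
            break
    else:
        # Unknown suffix: fall back to treating the last label as the TLD.
        start = len(labels) - 1
    registrable_start = max(start - 1, 0)
    return labels[:registrable_start], ".".join(labels[registrable_start:])


def erase_subdomains(host: str, keep_levels: int = 0) -> str:
    """Keep only the `keep_levels` subdomain labels closest to the domain."""
    subdomains, registrable = split_host(host)
    kept = subdomains[max(len(subdomains) - keep_levels, 0):] if keep_levels > 0 else []
    return ".".join(kept + [registrable])


if __name__ == "__main__":
    print(erase_subdomains("a.b.c.d.com.br", keep_levels=0))
    print(erase_subdomains("a.b.c.d.com.br", keep_levels=1))
```

The point of the multi-label case is that `com.br` must be recognized as the suffix, so `d.com.br` is treated as the registrable domain rather than `com.br`.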

Add URL filtering

6db4e3c

cunha requested changes Jan 12, 2026

urls_filter/data/dns-keywords.txt (outdated, resolved)

urls_filter/test_filter.py (resolved)

fix: adress requested changes from review

46b466d

cunha reviewed Jan 28, 2026

3 participants