[WEB-4172] feat: Crawl work item links for title and favicon#7117

Merged

sriramveeraghanta merged 21 commits into preview from feat-crawl-work-item-link on May 30, 2025
Conversation

@sriramveeraghanta (Member) commented May 25, 2025

Description

Adds a function that crawls URLs to extract page titles and favicons.

Type of Change

  • Improvement (change that improves existing functionality)

Summary by CodeRabbit

  • New Features
    • Adds asynchronous fetching of webpage titles and favicons for issue links.
    • Enhances issue link data with metadata for improved context and visualization.
    • Displays favicons alongside issue links, with an optional subtitle showing the link’s page title.

Reference
#7124

coderabbitai bot commented May 25, 2025

Walkthrough

A background task module is added to asynchronously retrieve webpage titles and favicons for issue links. The view triggers this task after creating or updating an issue link, and the task updates the record with the fetched metadata. The frontend now displays the fetched favicon when available, falling back to a default link icon, and can show the page title as an optional subtitle. Error handling and helper functions are included.

Changes

  • apiserver/plane/app/views/issue/link.py: Imports and calls crawl_work_item_link_title asynchronously after saving issue links in the create and partial_update methods; re-serializes the updated IssueLink for the response.
  • apiserver/plane/bgtasks/work_item_link_task.py: Introduces a Celery task, crawl_work_item_link_title, with helpers to validate URLs and to fetch and parse the webpage title and favicon, including fallback mechanisms and error handling.
  • web/core/components/issues/issue-detail/links/link-item.tsx: Renders the favicon when available, falling back to a default Link icon; adds an optional subtitle displaying the link's metadata title.
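
The title-extraction half of the new task isn't excerpted in the reviews below, so as orientation, here is a minimal sketch of how a crawler of this shape typically pulls a title. The helper name and the Open Graph fallback are illustrative assumptions, not the actual code in work_item_link_task.py:

from typing import Optional

import requests
from bs4 import BeautifulSoup


def extract_title(url: str) -> Optional[str]:
    # Fetch and parse the page; the real task would reuse a single
    # download for both title and favicon discovery.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Prefer the <title> tag, then fall back to Open Graph metadata.
    if soup.title and soup.title.string:
        return soup.title.string.strip()
    og_title = soup.find("meta", property="og:title")
    if og_title and og_title.get("content"):
        return og_title["content"].strip()
    return None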

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant IssueLinkViewSet
    participant CeleryWorker
    participant ExternalSite
    participant Database

    User->>IssueLinkViewSet: Create/Update IssueLink (with URL)
    IssueLinkViewSet->>Database: Save IssueLink
    IssueLinkViewSet->>CeleryWorker: crawl_work_item_link_title.delay(id, url)
    CeleryWorker->>ExternalSite: HTTP GET url
    CeleryWorker->>ExternalSite: Fetch favicon
    CeleryWorker->>Database: Update IssueLink with title and favicon data
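
In code, the view-side hook in this flow amounts to roughly the following. This is a sketch assembled from the change summary, not the actual IssueLinkViewSet: the serializer import path is assumed, and the surrounding activity logging and permission checks are omitted:

from rest_framework import status
from rest_framework.response import Response

from plane.app.serializers import IssueLinkSerializer  # assumed path
from plane.bgtasks.work_item_link_task import crawl_work_item_link_title


def create(self, request, *args, **kwargs):
    # Method body sketched outside its ViewSet class for brevity.
    serializer = IssueLinkSerializer(data=request.data)
    serializer.is_valid(raise_exception=True)
    serializer.save()
    # Fire-and-forget: the Celery worker fills in title/metadata
    # after this response has already been returned.
    crawl_work_item_link_title.delay(
        serializer.data.get("id"), serializer.data.get("url")
    )
    return Response(serializer.data, status=status.HTTP_201_CREATED)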

Suggested labels

🧹 chore

Suggested reviewers

  • sriramveeraghanta

Poem

A rabbit hops from link to link,
Fetching titles in a blink.
Favicons gathered, base64 delight,
All in the background, out of sight.
With every hop, the data grows—
A metadata meadow, where knowledge flows! 🥕


makeplane bot commented May 25, 2025

Pull Request Linked with Plane Work Items

Comment Automatically Generated by Plane

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

156-158: Use proper logging instead of print statements.

Using print() for error logging is not ideal for production applications as it doesn't integrate with logging frameworks and can't be easily managed.

Replace print with proper logging:

+import logging
+
+logger = logging.getLogger(__name__)

     except Exception as e:
-        print(f"Failed to fetch favicon: {e}")
+        logger.warning(f"Failed to fetch favicon: {e}")
         return None
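
A variant worth considering (an assumption on our side, not part of the suggested diff): logger.exception also records the traceback, which helps when the favicon failure is a bug rather than a flaky site:

import logging

logger = logging.getLogger(__name__)

try:
    raise RuntimeError("favicon fetch failed")
except Exception:
    # logger.exception logs at ERROR level and appends the traceback.
    logger.exception("Failed to fetch favicon")
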
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 84fc81d and efbccea.

📒 Files selected for processing (2)
  • apiserver/plane/app/views/issue/link.py (2 hunks)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🔇 Additional comments (2)
apiserver/plane/app/views/issue/link.py (2)

48-50: Good integration of the background task.

The task is properly integrated:

  • Called asynchronously with .delay()
  • Executed after the model is saved, ensuring the ID exists
  • Parameters correctly extracted from serializer data
  • Maintains the existing activity logging flow

18-18: Import integration looks good.

The import is correctly placed with other background task imports and follows the project's import organization.

@dheeru0198 (Member)

@sangeethailango Go through the comments from CodeRabbit and make the necessary changes. Also add type hints.

@sriramveeraghanta (Member, Author)

@sangeethailango I think in that function there is a url join function. That might not work. So we need to extract the base url from the provided value.

coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
apiserver/plane/bgtasks/work_item_link_task.py (4)

53-60: Consider merging favicon-related functions

As suggested in the previous review, consider moving find_favicon_url logic into fetch_and_encode_favicon for better encapsulation. This would also allow you to handle the default favicon case more cleanly.

Refactor to merge the functions:

-        # Find favicon URL
-        favicon_url = find_favicon_url(soup, url)
-
-        # Fetch and encode favicon
-        favicon_base64 = (
-            fetch_and_encode_favicon(favicon_url, headers)
-            if favicon_url
-            else DEFAULT_FAVICON
-        )
+        # Fetch and encode favicon
+        favicon_base64 = fetch_and_encode_favicon(soup, url, headers)

And update fetch_and_encode_favicon to include the URL finding logic:

def fetch_and_encode_favicon(soup, base_url, headers):
    """
    Find, fetch and encode favicon as base64.
    """
    # Find favicon URL logic here
    favicon_url = None
    favicon_selectors = [
        'link[rel="icon"]',
        'link[rel="shortcut icon"]',
        'link[rel="apple-touch-icon"]',
        'link[rel="apple-touch-icon-precomposed"]',
    ]
    
    for selector in favicon_selectors:
        favicon_tag = soup.select_one(selector)
        if favicon_tag and favicon_tag.get("href"):
            favicon_url = urljoin(base_url, favicon_tag["href"])
            break
    
    # Try fallback if no favicon found
    if not favicon_url:
        parsed_url = urlparse(base_url)
        fallback_url = f"{parsed_url.scheme}://{parsed_url.netloc}/favicon.ico"
        try:
            response = requests.head(fallback_url, timeout=5)
            if response.status_code == 200:
                favicon_url = fallback_url
        except Exception:
            pass
    
    # Return default if no favicon URL found
    if not favicon_url:
        return DEFAULT_FAVICON
    
    # Fetch and encode favicon
    try:
        response = requests.get(favicon_url, headers=headers, timeout=10)
        response.raise_for_status()
        content_type = response.headers.get("content-type", "image/x-icon")
        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
        return f"data:{content_type};base64,{favicon_base64}"
    except Exception as e:
        print(f"Failed to fetch favicon: {e}")
        return DEFAULT_FAVICON

Also applies to: 94-131, 133-160


15-23: ⚠️ Potential issue

Add error handling and set the title field

The function has two issues:

  1. No error handling if the IssueLink doesn't exist
  2. The title field is not being set despite the function name suggesting it should be

Apply this fix to handle errors and set the title field:

 @shared_task
 def crawl_work_item_link_title(id, url):
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        meta_dict = json.loads(meta_data)
+        issue_link = IssueLink.objects.get(id=id)
+        issue_link.title = meta_dict.get("title")
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")
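
An alternative to swallowing exceptions here is to let Celery retry transient failures. A sketch with illustrative settings, not part of this PR:

import requests
from celery import shared_task


@shared_task(
    bind=True,
    autoretry_for=(requests.RequestException,),  # retry network errors only
    retry_backoff=True,                          # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},
)
def crawl_work_item_link_title(self, id, url):
    # DoesNotExist is deliberately not retried: a deleted link will not
    # reappear, so retrying would only waste worker time.
    ...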

25-92: ⚠️ Potential issue

Add URL validation to prevent SSRF attacks

The function lacks URL validation, which could be exploited for Server-Side Request Forgery (SSRF) attacks to access internal services.

Add URL validation before making requests:

 def crawl_work_item_link_title_and_favicon(url):
     """
     Crawls a URL to extract the title and favicon.
 
     Args:
         url (str): The URL to crawl
 
     Returns:
         str: JSON string containing title and base64-encoded favicon
     """
     try:
+        # Validate URL scheme and prevent internal network access
+        parsed = urlparse(url)
+        if parsed.scheme not in ['http', 'https']:
+            raise ValueError("Only HTTP and HTTPS URLs are allowed")
+        
+        # Prevent access to private IP ranges
+        import ipaddress
+        try:
+            ip = ipaddress.ip_address(parsed.hostname)
+            if ip.is_private or ip.is_loopback or ip.is_reserved:
+                raise ValueError("Access to private/internal networks is not allowed")
+        except ValueError:
+            # Not an IP address, continue with domain validation
+            pass
+
         # Set up headers to mimic a real browser
         headers = {
             "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"  # noqa: E501
         }
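
One gap worth flagging in this suggestion: checking parsed.hostname as a literal IP does not catch hostnames that resolve to private addresses. A fuller check (still a sketch, and still subject to DNS-rebinding races) resolves the name first:

import ipaddress
import socket


def resolves_to_private_address(hostname: str) -> bool:
    # Resolve the hostname and reject it if any returned address
    # is private, loopback, link-local, or reserved.
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True  # unresolvable names are treated as unsafe
    for family, _, _, _, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return True
    return False

Hardened deployments usually also pin the resolved IP for the actual request, so the name cannot re-resolve to something internal between the check and the fetch.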

94-131: ⚠️ Potential issue

Return None instead of base64 string

The function returns DEFAULT_FAVICON (a base64-encoded string) on lines 128 and 130, but the calling code expects a URL or None. This will cause fetch_and_encode_favicon() to fail when trying to fetch the base64 string as a URL.

Fix by returning None instead of DEFAULT_FAVICON:

     # Check if fallback exists
     try:
         response = requests.head(fallback_url, timeout=5)
         if response.status_code == 200:
             return fallback_url
     except Exception:
-        return DEFAULT_FAVICON
+        return None
 
-    return DEFAULT_FAVICON
+    return None
🧹 Nitpick comments (1)
apiserver/plane/app/views/issue/link.py (1)

62-65: Remove unnecessary re-fetching of issue link

The re-fetching and re-serialization of the issue link immediately after calling the asynchronous background task is unnecessary. Since crawl_work_item_link_title is an async task, it won't have completed by the time these lines execute, so the fetched data won't include the updated title/metadata.

The original serializer data already contains the correct response.

Remove the redundant re-fetch in both methods:

             notification=True,
             origin=base_host(request=request, is_app=True),
         )
-
-        issue_link = self.get_queryset().get(id=serializer.data.get("id"))
-        serializer = IssueLinkSerializer(issue_link)
-
         return Response(serializer.data, status=status.HTTP_201_CREATED)

And in partial_update:

             notification=True,
             origin=base_host(request=request, is_app=True),
         )
-        issue_link = self.get_queryset().get(id=serializer.data.get("id"))
-        serializer = IssueLinkSerializer(issue_link)
-
         return Response(serializer.data, status=status.HTTP_200_OK)

Also applies to: 95-98

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between efbccea and 9d097d7.

📒 Files selected for processing (2)
  • apiserver/plane/app/views/issue/link.py (5 hunks)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🔇 Additional comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

116-116: ✅ Verification successful

Verify URL joining handles edge cases correctly

The PR comments mention a potential issue with URL joining. While urljoin should handle most cases correctly, let's verify it handles edge cases properly.


🌐 Web query:

Does Python's urljoin function correctly handle edge cases like protocol-relative URLs (//example.com/favicon.ico) and absolute URLs when joining with a base URL?

💡 Result:

Python's urljoin function from the urllib.parse module is designed to construct absolute URLs by combining a base URL with another URL, but it has specific behaviors when handling edge cases.

Protocol-Relative URLs

When joining a protocol-relative URL (starting with //) with a base URL, urljoin correctly preserves the scheme (protocol) from the base URL while using the hostname from the relative URL:

from urllib.parse import urljoin

# Base URL with http scheme, joined with protocol-relative URL
result = urljoin('http://www.cwi.nl/%7Eguido/Python.html', '//www.python.org/%7Eguido')
# Result: 'http://www.python.org/%7Eguido'

In this example, the function takes the http scheme from the base URL and combines it with the protocol-relative URL, resulting in a proper absolute URL[10].

Absolute URLs

When the second argument to urljoin is an absolute URL (containing a scheme like http: or https:), the function will return that URL largely unchanged, ignoring most parts of the base URL:

from urllib.parse import urljoin

# Joining with an absolute URL
base = "http://example.com/path/"
url = "https://another-site.com/resource"
result = urljoin(base, url)
# Result: 'https://another-site.com/resource'

Edge Cases and Limitations

There are some known issues and edge cases:

  1. Path-relative URLs with . and ..: There are documented issues when joining two path-relative URLs that involve dot notation. This has been reported as a bug in the Python issue tracker[6].

  2. URLs with scheme but no authority: URLs like http:g are treated in a special way for backward compatibility, being translated to http:///g, which can lead to unexpected results[4].

  3. Relative paths with drive letters: Earlier versions had issues with joining relative paths when a drive was present, but this was fixed in Python updates[8].

Practical Usage in Web Scraping

For web scraping, urljoin is commonly used to convert relative URLs found in HTML to absolute URLs:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def make_absolute_links(url, soup):
    # Convert relative links in <a> tags
    for a in soup.findAll('a'):
        if a.get('href'):
            a['href'] = urljoin(url, a.get('href'))
    
    # Similar processing for other tags with links
    for img in soup.findAll('img'):
        if img.get('src'):
            img['src'] = urljoin(url, img.get('src'))

This pattern is commonly used in web scrapers to ensure all links are absolute before following them[7][5].

Despite some edge cases, urljoin generally works correctly for the most common URL joining scenarios, making it a valuable tool for URL manipulation in Python.


URL joining usage is sound

We’ve confirmed that Python’s urljoin correctly handles:

  • Protocol-relative URLs (//example.com/favicon.ico retains the base scheme)
  • Absolute URLs (returns the href unchanged)

While there are rare edge cases (dot segments, scheme-only URLs), they’re unlikely for typical favicon href values. No changes are needed here.

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
apiserver/plane/bgtasks/work_item_link_task.py (2)

16-24: ⚠️ Potential issue

Add error handling for missing IssueLink records.

The task will crash if the IssueLink with the given ID doesn't exist, which is possible in distributed systems where records might be deleted between task creation and execution.

 @shared_task
 def crawl_work_item_link_title(id, url):
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        issue_link = IssueLink.objects.get(id=id)
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")

144-148: ⚠️ Potential issue

Fix DEFAULT_FAVICON handling logic.

When find_favicon_url returns None, the code sets favicon_url to DEFAULT_FAVICON (a base64 string) and then tries to make an HTTP request to it, which will fail since it's not a URL.

     try:
         favicon_url = find_favicon_url(soup, url)
         if favicon_url is None:
-            favicon_url = DEFAULT_FAVICON
-
-        response = requests.get(favicon_url, headers=headers, timeout=10)
-        response.raise_for_status()
-
-        # Get content type
-        content_type = response.headers.get("content-type", "image/x-icon")
-
-        # Convert to base64
-        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
-
-        # Return as data URI
-        return {
-            "favicon_url": favicon_url,
-            "favicon_base64": f"data:{content_type};base64,{favicon_base64}",
-        }
+            # Use default favicon when no URL is found
+            return {
+                "favicon_url": None,
+                "favicon_base64": f"data:image/svg+xml;base64,{DEFAULT_FAVICON}",
+            }
+        
+        response = requests.get(favicon_url, headers=headers, timeout=10)
+        response.raise_for_status()
+
+        # Get content type
+        content_type = response.headers.get("content-type", "image/x-icon")
+
+        # Convert to base64
+        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
+
+        # Return as data URI
+        return {
+            "favicon_url": favicon_url,
+            "favicon_base64": f"data:{content_type};base64,{favicon_base64}",
+        }
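
For context on why requesting DEFAULT_FAVICON fails: the constant is base64 payload, not an address. A sketch of how such a constant is typically defined (the actual SVG bytes in the PR may differ):

import base64

# Hypothetical placeholder icon; the PR ships its own SVG.
_FALLBACK_SVG = b'<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16"/>'
DEFAULT_FAVICON = base64.b64encode(_FALLBACK_SVG).decode("utf-8")

# Usable directly as an <img src> value, but useless as a requests.get() target:
data_uri = f"data:image/svg+xml;base64,{DEFAULT_FAVICON}"
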
🧹 Nitpick comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

7-7: Remove unused import.

The json module is imported but never used in the code.

-import json

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9d097d7 and 4ebbe00.

📒 Files selected for processing (1)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🔇 Additional comments (2)
apiserver/plane/bgtasks/work_item_link_task.py (2)

26-91: Good implementation with SSRF protection.

The function correctly implements URL validation to prevent SSRF attacks and returns a proper dictionary structure with comprehensive error handling.


93-130: LGTM - Fixed return type issue.

The function now correctly returns None instead of DEFAULT_FAVICON when no favicon URL is found, addressing the previous type mismatch issue.

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
apiserver/plane/bgtasks/work_item_link_task.py (2)

169-171: Improve error handling return value.

Returning None on error will cause issues for the caller who expects a dictionary with specific keys (favicon_url and favicon_base64).

This matches the previous review feedback. Apply this fix:

     except Exception as e:
         print(f"Failed to fetch favicon: {e}")
-        return None
+        return {
+            "favicon_url": None,
+            "favicon_base64": f"data:image/svg+xml;base64,{DEFAULT_FAVICON}",
+        }

18-26: ⚠️ Potential issue

Add error handling and missing title assignment.

This function has two critical issues:

  1. No error handling for IssueLink.objects.get(id=id) which will raise DoesNotExist if the record doesn't exist
  2. Missing title assignment - the crawled title should be assigned to issue_link.title

Apply this diff to fix both issues:

 @shared_task
 def crawl_work_item_link_title(id: str, url: str) -> None:
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        issue_link = IssueLink.objects.get(id=id)
+        
+        issue_link.title = meta_data.get("title")
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")
🧹 Nitpick comments (2)
apiserver/plane/bgtasks/work_item_link_task.py (2)

28-37: Fix docstring inconsistency.

The function signature indicates it returns Dict[str, Any] but the docstring states it returns a "JSON string". The implementation actually returns a dictionary.

Update the docstring:

     """
     Crawls a URL to extract the title and favicon.
 
     Args:
         url (str): The URL to crawl
 
     Returns:
-        str: JSON string containing title and base64-encoded favicon
+        Dict[str, Any]: Dictionary containing title, favicon, url, and favicon_url
     """

96-97: Remove debug print statements.

These debug print statements should be removed from production code.

 def find_favicon_url(soup: BeautifulSoup, base_url: str) -> Optional[str]:
-    print(soup, "PRint soup")
-    print(base_url, "BaseURL")
     """
     Find the favicon URL from HTML soup.
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4ebbe00 and e7bbedf.

📒 Files selected for processing (1)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🔇 Additional comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

1-16: LGTM! Clean imports and constant definition.

The imports are well-organized and include all necessary dependencies. The DEFAULT_FAVICON constant provides a good fallback SVG icon.

coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
apiserver/plane/bgtasks/work_item_link_task.py (3)

38-49: Security vulnerability: Incomplete URL validation.

The URL validation is missing scheme validation and has a potential bug when parsed.hostname is None, as identified in previous reviews.

Apply this diff to fix the security issues:

     try:
         # Prevent access to private IP ranges
         parsed = urlparse(url)
+        
+        # Validate URL scheme
+        if parsed.scheme not in ['http', 'https']:
+            raise ValueError("Only HTTP and HTTPS URLs are allowed")
+        
+        # Validate hostname exists
+        if not parsed.hostname:
+            raise ValueError("Invalid URL: no hostname found")
 
         try:
             ip = ipaddress.ip_address(parsed.hostname)
             if ip.is_private or ip.is_loopback or ip.is_reserved:
                 raise ValueError("Access to private/internal networks is not allowed")
-        except ValueError:
+        except ValueError as ve:
+            # Re-raise if it's our security check
+            if "private/internal networks" in str(ve):
+                raise
             # Not an IP address, continue with domain validation
             pass

149-156: Critical bug: Using base64 string as URL.

When find_favicon_url returns None, the code incorrectly assigns DEFAULT_FAVICON (a base64 string) to favicon_url and then attempts to fetch it as a URL, which will cause a request failure. This issue was identified in previous reviews but remains unfixed.

Apply this diff to handle the None case properly:

     try:
         favicon_url = find_favicon_url(soup, url)
         if favicon_url is None:
-            favicon_url = DEFAULT_FAVICON
-
-        response = requests.get(favicon_url, headers=headers, timeout=10)
-        response.raise_for_status()
-
-        # Get content type
-        content_type = response.headers.get("content-type", "image/x-icon")
-
-        # Convert to base64
-        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
-
-        # Return as data URI
-        return {
-            "favicon_url": favicon_url,
-            "favicon_base64": f"data:{content_type};base64,{favicon_base64}",
-        }
+            # Use default favicon directly
+            return {
+                "favicon_url": None,
+                "favicon_base64": f"data:image/svg+xml;base64,{DEFAULT_FAVICON}",
+            }
+        
+        response = requests.get(favicon_url, headers=headers, timeout=10)
+        response.raise_for_status()
+
+        # Get content type
+        content_type = response.headers.get("content-type", "image/x-icon")
+
+        # Convert to base64
+        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
+
+        # Return as data URI
+        return {
+            "favicon_url": favicon_url,
+            "favicon_base64": f"data:{content_type};base64,{favicon_base64}",
+        }

18-26: ⚠️ Potential issue

Critical: Missing title assignment and error handling.

The shared task is missing the title assignment that was discussed in previous reviews and lacks proper error handling for database operations.

Apply this diff to fix both issues:

 @shared_task
 def crawl_work_item_link_title(id: str, url: str) -> None:
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        issue_link = IssueLink.objects.get(id=id)
+        issue_link.title = meta_data.get("title")
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")
🧹 Nitpick comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

96-97: Remove debug print statements.

The debug print statements should be removed from production code.

Apply this diff to remove the debug statements:

 def find_favicon_url(soup: BeautifulSoup, base_url: str) -> Optional[str]:
-    print(soup, "PRint soup")
-    print(base_url, "BaseURL")
     """
     Find the favicon URL from HTML soup.
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between e7bbedf and aea4c32.

📒 Files selected for processing (1)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)

coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
apiserver/plane/bgtasks/work_item_link_task.py (3)

39-49: ⚠️ Potential issue

Add URL scheme validation and fix IP validation bug.

The current implementation has the same security issues identified in previous reviews:

  1. Missing URL scheme validation - should only allow HTTP/HTTPS
  2. IP validation bug: if parsed.hostname is None, ipaddress.ip_address(None) will raise ValueError

Apply this diff to fix both issues:

     try:
         # Prevent access to private IP ranges
         parsed = urlparse(url)
+        
+        # Validate URL scheme
+        if parsed.scheme not in ['http', 'https']:
+            raise ValueError("Only HTTP and HTTPS URLs are allowed")
+        
+        # Validate hostname exists
+        if not parsed.hostname:
+            raise ValueError("Invalid URL: no hostname found")
 
         try:
             ip = ipaddress.ip_address(parsed.hostname)
             if ip.is_private or ip.is_loopback or ip.is_reserved:
                 raise ValueError("Access to private/internal networks is not allowed")
-        except ValueError:
+        except ValueError as ve:
+            # Re-raise if it's our security check
+            if "private/internal networks" in str(ve):
+                raise
             # Not an IP address, continue with domain validation
             pass

147-166: ⚠️ Potential issue

Critical bug: Using base64 string as URL.

This is the same critical issue identified in previous reviews that remains unfixed. When find_favicon_url returns None, the code sets favicon_url to DEFAULT_FAVICON (a base64 string) and then tries to fetch it as a URL on line 152. This will cause a request failure.

Apply this diff to handle the None case properly:

     try:
         favicon_url = find_favicon_url(soup, url)
         if favicon_url is None:
-            favicon_url = DEFAULT_FAVICON
-
-        response = requests.get(favicon_url, headers=headers, timeout=10)
-        response.raise_for_status()
-
-        # Get content type
-        content_type = response.headers.get("content-type", "image/x-icon")
-
-        # Convert to base64
-        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
-
-        # Return as data URI
-        return {
-            "favicon_url": favicon_url,
-            "favicon_base64": f"data:{content_type};base64,{favicon_base64}",
-        }
+            # Use default favicon directly
+            return {
+                "favicon_url": None,
+                "favicon_base64": f"data:image/svg+xml;base64,{DEFAULT_FAVICON}",
+            }
+        
+        response = requests.get(favicon_url, headers=headers, timeout=10)
+        response.raise_for_status()
+
+        # Get content type
+        content_type = response.headers.get("content-type", "image/x-icon")
+
+        # Convert to base64
+        favicon_base64 = base64.b64encode(response.content).decode("utf-8")
+
+        # Return as data URI
+        return {
+            "favicon_url": favicon_url,
+            "favicon_base64": f"data:{content_type};base64,{favicon_base64}",
+        }

18-26: ⚠️ Potential issue

Add error handling for database operations.

The task lacks error handling for IssueLink.objects.get(id=id) which will raise DoesNotExist if the record is not found, causing the background task to fail.

Apply this diff to add proper error handling:

 @shared_task
 def crawl_work_item_link_title(id: str, url: str) -> None:
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        issue_link = IssueLink.objects.get(id=id)
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")
🧹 Nitpick comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

35-37: Fix incorrect docstring return type.

The docstring states the function returns a "JSON string" but it actually returns Dict[str, Any].

Apply this diff to correct the docstring:

     Returns:
-        str: JSON string containing title and base64-encoded favicon
+        Dict[str, Any]: Dictionary containing title and base64-encoded favicon
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between aea4c32 and 5d1a6b1.

📒 Files selected for processing (1)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🔇 Additional comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

95-132: LGTM! Favicon URL handling has been corrected.

The function now properly returns None when no favicon is found, addressing the previous feedback about not returning DEFAULT_FAVICON as a URL.

@sangeethailango (Member)

> @sangeethailango I think in that function there is a url join function. That might not work. So we need to extract the base url from the provided value.

@sriramveeraghanta I have checked it. If href is an absolute URL, urljoin returns it directly. If not, it joins the href with the base_url and returns the result.
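
A quick interpreter check confirming that behavior:

from urllib.parse import urljoin

base = "https://example.com/docs/page.html"
print(urljoin(base, "/favicon.ico"))             # https://example.com/favicon.ico
print(urljoin(base, "icons/fav.png"))            # https://example.com/docs/icons/fav.png
print(urljoin(base, "//cdn.example.net/f.ico"))  # https://cdn.example.net/f.ico
print(urljoin(base, "https://other.com/f.ico"))  # https://other.com/f.ico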

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
apiserver/plane/bgtasks/work_item_link_task.py (2)

38-49: Add missing URL scheme validation.

The URL validation is incomplete and has the same issues identified in past comments. The code lacks URL scheme validation and has potential issues with hostname validation.

     try:
         # Prevent access to private IP ranges
         parsed = urlparse(url)
+        
+        # Validate URL scheme
+        if parsed.scheme not in ['http', 'https']:
+            raise ValueError("Only HTTP and HTTPS URLs are allowed")
+        
+        # Validate hostname exists
+        if not parsed.hostname:
+            raise ValueError("Invalid URL: no hostname found")

         try:
             ip = ipaddress.ip_address(parsed.hostname)
             if ip.is_private or ip.is_loopback or ip.is_reserved:
                 raise ValueError("Access to private/internal networks is not allowed")
-        except ValueError:
+        except ValueError as ve:
+            # Re-raise if it's our security check
+            if "private/internal networks" in str(ve) or "Only HTTP" in str(ve) or "Invalid URL" in str(ve):
+                raise
             # Not an IP address, continue with domain validation
             pass

18-25: ⚠️ Potential issue

Add error handling for IssueLink.objects.get().

The task function lacks error handling for the database query, which could cause the background task to fail if the IssueLink record doesn't exist.

 @shared_task
 def crawl_work_item_link_title(id: str, url: str) -> None:
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        issue_link = IssueLink.objects.get(id=id)
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")
🧹 Nitpick comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

170-170: Remove unused variable assignment.

Static analysis correctly identifies that the exception variable e is assigned but never used in the first except block.

-    except Exception as e:
+    except Exception:

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 5d1a6b1 and bb5db0a.

📒 Files selected for processing (1)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🔇 Additional comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

67-67: ❌ Incorrect review comment

Handle potential None return from fetch_and_encode_favicon.

The fetch_and_encode_favicon function can return None in some error cases, but the code assumes it always returns a dictionary. This could cause a TypeError when accessing favicon_base64["favicon_base64"].


🏁 Script executed:

#!/bin/bash
# Check if fetch_and_encode_favicon can return None in any code paths
ast-grep --pattern $'def fetch_and_encode_favicon($$$) {
  $$$
  return None
  $$$
}'

Length of output: 86


🏁 Script executed:

#!/bin/bash
# Locate and print the definition of fetch_and_encode_favicon
file=$(rg -l "def fetch_and_encode_favicon" -n)
if [[ -z "$file" ]]; then
  echo "No definition found for fetch_and_encode_favicon"
  exit 0
fi
echo "Function defined in: $file"
# Print the first 200 lines to capture its return logic
sed -n '1,200p' "$file"

Length of output: 6289


fetch_and_encode_favicon never returns None – guard is unnecessary

The fetch_and_encode_favicon function always returns a dictionary in every code path (even on errors it falls back to a default dict), so there’s no case where it returns None. You can safely access favicon_base64["favicon_base64"] without an extra None check.

Likely an incorrect or invalid review comment.

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
apiserver/plane/bgtasks/work_item_link_task.py (2)

38-49: Add URL scheme validation and fix hostname validation bug.

The security validation is incomplete and has bugs that were previously identified.

The issues are:

  1. Missing URL scheme validation (should only allow HTTP/HTTPS)
  2. If parsed.hostname is None, ipaddress.ip_address(None) will raise ValueError
     try:
         # Prevent access to private IP ranges
         parsed = urlparse(url)
+        
+        # Validate URL scheme
+        if parsed.scheme not in ['http', 'https']:
+            raise ValueError("Only HTTP and HTTPS URLs are allowed")
+        
+        # Validate hostname exists
+        if not parsed.hostname:
+            raise ValueError("Invalid URL: no hostname found")

         try:
             ip = ipaddress.ip_address(parsed.hostname)
             if ip.is_private or ip.is_loopback or ip.is_reserved:
                 raise ValueError("Access to private/internal networks is not allowed")
-        except ValueError:
+        except ValueError as ve:
+            # Re-raise if it's our security check
+            if "private/internal networks" in str(ve) or "Only HTTP" in str(ve) or "Invalid URL" in str(ve):
+                raise
             # Not an IP address, continue with domain validation
             pass

18-25: ⚠️ Potential issue

Add error handling for database operations.

The function lacks error handling for IssueLink.objects.get(id=id) which will raise DoesNotExist if the record is not found, causing the background task to fail.

 @shared_task
 def crawl_work_item_link_title(id: str, url: str) -> None:
-    meta_data = crawl_work_item_link_title_and_favicon(url)
-    issue_link = IssueLink.objects.get(id=id)
-
-    issue_link.metadata = meta_data
-
-    issue_link.save()
+    try:
+        meta_data = crawl_work_item_link_title_and_favicon(url)
+        issue_link = IssueLink.objects.get(id=id)
+        issue_link.metadata = meta_data
+        issue_link.save()
+    except IssueLink.DoesNotExist:
+        print(f"IssueLink with id {id} not found")
+    except Exception as e:
+        print(f"Failed to update issue link {id}: {e}")
🧹 Nitpick comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

28-37: Fix inconsistent return type documentation.

The docstring incorrectly states the function returns a "JSON string" but it actually returns a Dict[str, Any] as indicated by the type hint and implementation.

     """
     Crawls a URL to extract the title and favicon.

     Args:
         url (str): The URL to crawl

     Returns:
-        str: JSON string containing title and base64-encoded favicon
+        Dict[str, Any]: Dictionary containing title, favicon, and metadata
     """
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 46808d5 and d87d2be.

📒 Files selected for processing (1)
  • apiserver/plane/bgtasks/work_item_link_task.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.11.9)
apiserver/plane/bgtasks/work_item_link_task.py

63-63: f-string without any placeholders

Remove extraneous f prefix

(F541)


169-169: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (1)
apiserver/plane/bgtasks/work_item_link_task.py (1)

103-190: Good modular structure and error handling improvements.

The separation of concerns between find_favicon_url and fetch_and_encode_favicon functions is well-implemented. The error handling with fallback to default favicon and appropriate timeout values (2 seconds) address previous feedback effectively.


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
web/core/components/issues/issue-detail/links/link-item.tsx (1)

67-70: Review subtitle display logic for potential redundancy.

The subtitle display shows linkTitle (metadata title) alongside linkDetail.title. This might create redundancy or confusion if both titles are similar or identical.

Consider this alternative approach to avoid redundant information:

-              {linkDetail.title && linkDetail.title !== "" ? linkDetail.title : linkDetail.url}
-
-              {linkTitle && linkTitle !== "" && <span className="text-custom-text-400 text-xs">{linkTitle}</span>}
+              {linkDetail.title && linkDetail.title !== "" ? linkDetail.title : linkDetail.url}
+
+              {linkTitle && linkTitle !== "" && linkTitle !== linkDetail.title && (
+                <span className="text-custom-text-400 text-xs">{linkTitle}</span>
+              )}

This ensures the metadata title is only shown when it's different from the user-provided title, reducing visual clutter.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0bc2679 and 3b8b340.

📒 Files selected for processing (1)
  • web/core/components/issues/issue-detail/links/link-item.tsx (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
web/core/components/issues/issue-detail/links/link-item.tsx (1)
packages/ui/src/tooltip/tooltip.tsx (1)
  • Tooltip (36-110)
🔇 Additional comments (3)
web/core/components/issues/issue-detail/links/link-item.tsx (3)

5-5: Import addition looks good.

The Link icon import is correctly added for the fallback icon display.


40-42: Good implementation of metadata extraction.

The optional chaining correctly handles cases where metadata might not be available, and the type annotations are appropriate.


65-65: Verify flex container layout.

The flex layout addition supports the new subtitle feature correctly.

@sangeethailango (Member)

This PR contains both frontend and backend changes for this feature; the frontend code has already been approved by Prateek in a separate PR.

sriramveeraghanta merged commit 445c819 into preview on May 30, 2025
5 of 6 checks passed
sriramveeraghanta deleted the feat-crawl-work-item-link branch on May 30, 2025, 13:14
lifeiscontent pushed a commit that referenced this pull request Aug 18, 2025
* feat: added a python bg task to crawl work item links for title and description

* fix: return meta_data in the response

* fix: add validation for accessing IP ranges

* fix: remove json.dumps

* fix: handle exception by returning None

* refactor: call find_favicon_url inside fetch_and_encode_favicon function

* chore: type hints

* fix: Handle None

* fix: remove print statements

* chore: added favicon and title of links

* fix: return null if no title found

* Update apiserver/plane/bgtasks/work_item_link_task.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* fix: remove exception handling

* fix: reduce timeout seconds

* fix: handle timeout exception

* fix: remove request timeout handling

* feat: add Link icon to issue detail links and update rendering logic

* fix: use logger for exception

---------

Co-authored-by: sangeethailango <sangeethailango21@gmail.com>
Co-authored-by: JayashTripathy <jayashtripathy371@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>