New source: Web pages #8

@Tonemon

Add a new web source type that can:

  • Accept a valid URL,
  • Fetch the page using Python packages like trafilatura or readability-lxml,
  • Strip the boilerplate,
  • Ingest the cleaned text.
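
The four steps above could be sketched roughly as follows. This is only an illustration, not a proposed final design: `WebDocument`, `is_valid_url`, and `load_web_page` are hypothetical names, and it assumes trafilatura is installed and that its `fetch_url` and `extract` functions handle the fetching and boilerplate removal. The fetch and extract steps are passed in as parameters so that readability-lxml (or a test stub) could be swapped in.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class WebDocument:
    """Hypothetical container for one ingested page."""
    url: str
    text: str


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def _fetch_with_trafilatura(url: str) -> Optional[str]:
    # Imported lazily so the module loads even without trafilatura installed.
    import trafilatura
    return trafilatura.fetch_url(url)  # returns None if the download fails


def _extract_with_trafilatura(html: str) -> Optional[str]:
    import trafilatura
    return trafilatura.extract(html)  # returns None if no main text is found


def load_web_page(
    url: str,
    fetch: Callable[[str], Optional[str]] = _fetch_with_trafilatura,
    extract: Callable[[str], Optional[str]] = _extract_with_trafilatura,
) -> WebDocument:
    """Validate the URL, fetch the page, strip boilerplate, return clean text."""
    if not is_valid_url(url):
        raise ValueError(f"Not a valid http(s) URL: {url!r}")
    html = fetch(url)
    if not html:
        raise RuntimeError(f"Could not fetch {url}")
    text = extract(html)
    if not text or not text.strip():
        raise RuntimeError(f"No main content extracted from {url}")
    return WebDocument(url=url, text=text.strip())
```

The returned `WebDocument` would then be handed to whatever ingestion step the existing source types already use; that hand-off is not shown here because it depends on the project's internals.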

This would let companies ingest their own content from blog posts, documentation sites, and articles.

Labels: enhancement (New feature or request)