# BR Crawl
Crawl and index existing indieweb/smallweb adjacent blogs by Brazilian authors.

I swear I'll do a better write up sometime soon.
## Adding new URLs from a list
Duplicate feeds are ignored, and domains that block scraping via robots.txt are skipped.
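Scrapy can enforce the robots.txt rule natively through its `ROBOTSTXT_OBEY = True` setting. For context, a standalone check with Python's standard library looks roughly like the sketch below; this is only an illustration, not necessarily how this project implements the skip.

```
# Minimal robots.txt check using only the standard library (illustrative,
# not necessarily the mechanism used by this project).
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)
```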
### If it's a list of URLs
Any page from each site will work; it doesn't need to be the home page.
Use the `ExternalUrlsSpider` crawler. The output is a .jsonl file with the `rss_url` and `domain` of each website listed.
```
# on the scraper/ directory
uv run scrapy crawl rss -a urls_file=urls.txt -o rss.jsonl
```
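For reference, each line of the resulting rss.jsonl is a small JSON object with those two fields; the values below are made-up examples, just to show the shape:

```
{"rss_url": "https://example.com.br/feed.xml", "domain": "example.com.br"}
```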
Import the resulting .jsonl file into the backend's database using the `flask import-feeds` command.
```
# on the backend/ directory
uv run flask import-feeds rss.jsonl
```
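If you're curious how such a command is typically wired up, a Flask CLI command defined with click looks roughly like the sketch below; the actual implementation in backend/ may differ, and the database insert is omitted.

```
# Hypothetical sketch of a `flask import-feeds` CLI command (illustrative only).
import json
import click
from flask import Flask

app = Flask(__name__)

@app.cli.command("import-feeds")
@click.argument("jsonl_path")
def import_feeds(jsonl_path):
    """Read one JSON object per line and insert each feed, skipping duplicates."""
    with open(jsonl_path) as fh:
        for line in fh:
            record = json.loads(line)
            # insert record["rss_url"] into the feeds table here, ignoring duplicates
            click.echo(f"imported {record['rss_url']}")
```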
### If it's a list of valid RSS feeds
Format it to .jsonl before importing with `flask import-feeds`:
```
# on the backend/ directory
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
uv run flask import-feeds rss.jsonl
```
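The jq invocation reads each raw line of rss_urls.txt (`-R`) and wraps it in an object, so a hypothetical input line like https://example.com.br/feed.xml becomes:

```
{"rss_url":"https://example.com.br/feed.xml"}
```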
## Generating the website
Always use the full list of imported feed URLs, ordered randomly to reduce the chance of hammering a small provider.
```
# on the backend/ directory
sqlite3 brcrawl.sqlite3
.output ../website/feeds.txt
SELECT feed_url FROM feeds ORDER BY RANDOM();
.output stdout
.quit
```
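The same export also works non-interactively, if you prefer a one-liner (same query, passed as an argument):

```
# on the backend/ directory
sqlite3 brcrawl.sqlite3 "SELECT feed_url FROM feeds ORDER BY RANDOM();" > ../website/feeds.txt
```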
Now pass the generated feeds.txt file to the `build.sh` script in the website directory.
```
# on the website/ directory
./build.sh feeds.txt
```
The resulting .html files can be deployed anywhere that serves static content (e.g. GitHub Pages or a VPS running nginx).
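If you take the VPS route, a minimal nginx server block for serving the generated files could look like the sketch below; the domain and root path are placeholders.

```
server {
    listen 80;
    server_name blogs.example.com.br;  # placeholder domain
    root /var/www/brcrawl;             # directory holding the built .html files
    index index.html;
}
```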