# BR Crawl
Crawl and index existing indieweb/smallweb adjacent blogs by Brazilian authors.

I swear I'll do a better write up sometime soon.
## Adding new URLs from a list
Duplicate feeds are ignored, and domains that block scraping via robots.txt are skipped.
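Scrapy can enforce the robots.txt rule natively through its `ROBOTSTXT_OBEY = True` setting. For context, a standalone check with Python's standard library looks roughly like the sketch below; this is only an illustration, not necessarily how this project implements the skip.

```
# Minimal robots.txt check using only the standard library (illustrative,
# not necessarily the mechanism used by this project).
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)
```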
### If it's a list of URLs
Any page from each site will work; it doesn't need to be the home page.
Use the `ExternalUrlsSpider` crawler. The output is a .jsonl file with the `rss_url` and `domain` of each website listed.
```
# on the scraper/ directory
uv run scrapy crawl rss -a urls_file=urls.txt -o rss.jsonl
```
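For reference, each line of the resulting rss.jsonl is a small JSON object with those two fields; the values below are made-up examples, just to show the shape:

```
{"rss_url": "https://example.com.br/feed.xml", "domain": "example.com.br"}
```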
Import the resulting .jsonl file into the backend's database using the `flask import-feeds` command.
```
# on the backend/ directory
uv run flask import-feeds rss.jsonl
```
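If you're curious how such a command is typically wired up, a Flask CLI command defined with click looks roughly like the sketch below; the actual implementation in backend/ may differ, and the database insert is omitted.

```
# Hypothetical sketch of a `flask import-feeds` CLI command (illustrative only).
import json
import click
from flask import Flask

app = Flask(__name__)

@app.cli.command("import-feeds")
@click.argument("jsonl_path")
def import_feeds(jsonl_path):
    """Read one JSON object per line and insert each feed, skipping duplicates."""
    with open(jsonl_path) as fh:
        for line in fh:
            record = json.loads(line)
            # insert record["rss_url"] into the feeds table here, ignoring duplicates
            click.echo(f"imported {record['rss_url']}")
```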
### If it's a list of valid RSS feeds
Format it to .jsonl before importing with `flask import-feeds`:
```
# on the backend/ directory
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
uv run flask import-feeds rss.jsonl
```
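The jq invocation reads each raw line of rss_urls.txt (`-R`) and wraps it in an object, so a hypothetical input line like https://example.com.br/feed.xml becomes:

```
{"rss_url":"https://example.com.br/feed.xml"}
```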
## Generating the website
Always use the full list of imported feed URLs, ordered randomly to reduce the chance of hammering a small provider.
```
# on the backend/ directory
sqlite3 brcrawl.sqlite3
.output ../website/feeds.txt
SELECT feed_url FROM feeds ORDER BY RANDOM();
.output stdout
.quit
```
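The same export also works non-interactively, if you prefer a one-liner (same query, passed as an argument):

```
# on the backend/ directory
sqlite3 brcrawl.sqlite3 "SELECT feed_url FROM feeds ORDER BY RANDOM();" > ../website/feeds.txt
```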
Now pass the generated feeds.txt file to the `build.sh` script in the website directory.
```
# on the website/ directory
./build.sh feeds.txt
```
The resulting .html files can be deployed anywhere that serves static content (e.g. GitHub Pages or a VPS running nginx).
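If you take the VPS route, a minimal nginx server block for serving the generated files could look like the sketch below; the domain and root path are placeholders.

```
server {
    listen 80;
    server_name blogs.example.com.br;  # placeholder domain
    root /var/www/brcrawl;             # directory holding the built .html files
    index index.html;
}
```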