# BR Crawl

Crawl and index existing indieweb/smallweb-adjacent blogs by Brazilian authors.

I swear I'll do a better write up sometime soon.

## Adding new URLs from a list

Duplicate feeds are ignored. Domains that block scraping via robots.txt are skipped.
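
For reference, the robots.txt check boils down to something like the sketch below. This is illustrative only, not the spider's actual code; in a Scrapy project this behaviour usually comes from the built-in `ROBOTSTXT_OBEY` setting.

```python
# Illustrative sketch of a robots.txt check using only the standard library.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "*") -> bool:
    """Return False when the domain's robots.txt disallows crawling this URL."""
    parsed = urlparse(url)
    robots = RobotFileParser()
    robots.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)
```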

### If it's a list of URLs

Any page will work; it doesn't need to be the homepage.

Use the `ExternalUrlsSpider` crawler. The output is a .jsonl file with the `rss_url` and `domain` of each website listed.

```
# in the scraper/ directory
uv run scrapy crawl rss -a urls_file=urls.txt -o rss.jsonl
```
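
Each line of the resulting rss.jsonl should look roughly like this (hypothetical values):

```
{"rss_url": "https://example.com.br/feed.xml", "domain": "example.com.br"}
```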

Import the resulting .jsonl file into the backend's database using the `flask import-feeds` command.

```
# in the backend/ directory
uv run flask import-feeds rss.jsonl
```

### If it's a list of valid RSS feeds

Format it to .jsonl before importing with `flask import-feeds`:

```
# in the backend/ directory
jq -R -n -c '[inputs] | map({rss_url: .}) | .[]' rss_urls.txt > rss.jsonl
uv run flask import-feeds rss.jsonl
```
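
For example, a hypothetical rss_urls.txt containing:

```
https://blog.example.com.br/feed.xml
```

is turned by the jq command into one JSON object per line, ready for `flask import-feeds`:

```
{"rss_url":"https://blog.example.com.br/feed.xml"}
```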

## Generating the website

Always export the full list of imported feed URLs, ordered randomly to reduce the chance of hammering a small provider.

```
# in the backend/ directory
sqlite3 brcrawl.sqlite3

.output ../website/feeds.txt
SELECT feed_url FROM feeds ORDER BY RANDOM();
.output stdout
.quit
```
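
If you'd rather not run sqlite3 interactively, the same export can be done with a small Python script. This is just a sketch that assumes the `feeds` table and `feed_url` column used in the query above:

```python
# Sketch: export feed URLs in random order, equivalent to the sqlite3 session above.
import sqlite3

conn = sqlite3.connect("brcrawl.sqlite3")
urls = [row[0] for row in conn.execute("SELECT feed_url FROM feeds ORDER BY RANDOM();")]
conn.close()

with open("../website/feeds.txt", "w") as f:
    f.write("\n".join(urls) + "\n")
```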

Now pass the generated feeds.txt file to the `build.sh` script in the website/ directory.

```
# in the website/ directory
./build.sh feeds.txt
```

The resulting .html files can be deployed anywhere static files are served (e.g. GitHub Pages or a VPS running nginx).
