Search engine for Smalltalk source code and videos.
Soogle indexes Smalltalk packages from repositories across multiple dialects (Pharo, Squeak, Cuis, GemStone, VisualWorks, GNU Smalltalk, Dolphin, and more). It also indexes Smalltalk videos — conference talks, tutorials, and screencasts. Everything is searchable through a clean web interface at soogle.org.
- Full-text search across Smalltalk packages
- Filter by dialect, source site, or category
- Browse indexed sources and recently added packages
- Video index with search, dialect filtering, and sort by views/date
- LLM-powered quality review to filter false positives (C#/.NET repos, PLC code, "small talk" videos)
- Auto-detection of Smalltalk dialect and package categories
- SEO sitemaps for package and video pages
- Submit new Smalltalk sites for indexing
Packages are scraped from:
- GitHub —
language:Smalltalksearch with date segmentation to work around the 1,000-result API limit - SqueakSource — Seaside-based project listings
- SmalltalkHub — Static project archive
- Rosetta Code — MediaWiki API, Smalltalk code blocks
- VS Knowledge Base — Visual Smalltalk source code library
- SqueakMap — Package registry
- Lukas Renggli's archive — Monticello
.mczfiles - SourceForge — Smalltalk projects directory
- Launchpad — Smalltalk branches via REST API
- Web discovery — SerpAPI web search finds new sources automatically
Videos are scraped from:
- YouTube search — SerpAPI queries for Smalltalk-related videos
- Known playlists — Pharo MOOC (English and French)
Trusted video channels (esugboard, Cincom Smalltalk, etc.) are always accepted. Videos about conversation "small talk" or GemStone jewelry are filtered out.
The pipeline has four phases that move data from raw scrapes to what users see on the site.
Each scraper fetches metadata from its source and writes rows into the scrape_raw staging table. Every row gets a SHA-256 checksum so unchanged entries are skipped on future runs. The GitHub scraper handles rate limits automatically (30 req/min for search, sleeps and retries on 403).
python -m scrape process reads pending rows from scrape_raw and for each one:
- Detects the Smalltalk dialect from GitHub topics and name/description keywords (pharo, squeak, cuis, etc.) with a confidence score
- Auto-categorizes into one or more of 19 categories (web, database, testing, ui/graphics, etc.) via keyword matching
- Applies quality gates — skips repos with no description, no stars, and unknown dialect; rejects names matching known non-Smalltalk patterns (Arduino, TensorFlow, Unity, etc.)
- Does an atomic upsert — inserts/updates the package and its categories in a single transaction so data is always consistent
python -m scrape llm-review sends batches of packages to Claude for quality review. The LLM catches false positives that regex alone misses:
- C# / .NET projects (GitHub's linguist confuses
.cschangesets with Smalltalk) - IEC 61131-3 Structured Text / PLC code (
.stextension overlap) - ML/NLP research using StringTemplate
.stfiles - Unity game projects
Packages marked "block" are added to a blocklist and deleted. Packages marked "keep" are stamped with the model name that reviewed them.
python -m scrape video-review does the same for videos — blocks conversation-skills videos, design-pattern talks that only mention Smalltalk in passing, GemStone jewelry content, and spam.
Both commands support a model tier system (haiku < sonnet < opus). The --scope upgrade flag re-reviews items that were previously reviewed by a lower-tier model, so you can upgrade quality without reprocessing everything.
The Django app reads directly from the MySQL database:
- Home page — package count, video count, dialect breakdown, recently added packages
- Search — full-text search with filters for dialect, source site, and category; sort by relevance, stars, update date, or name
- Package detail — description, README excerpt, metadata (license, stars, forks, topics, dates), categories
- Videos — searchable gallery with dialect filter, sort by views or date
- Sources — lists all indexed sites with package counts
- Submit — users can submit new Smalltalk URLs for indexing
Runs the free scrapers and all processing. Intended for daily cron:
github --incremental— fetch repos updated in the last 30 daysweb all— scrape SqueakSource, SmalltalkHub, Rosetta Code, VSKBcustom all— scrape SqueakMap, SourceForge, Launchpad, Lukas Renggliprocess— movescrape_rawrows intopackagesanalyze— LLM assessment of newly discovered domains (if ANTHROPIC_API_KEY set)llm-review— quality review new packages (if ANTHROPIC_API_KEY set)video-review— quality review new videos (if ANTHROPIC_API_KEY set)status— print pipeline stats
Runs paid-API scrapers (SerpAPI free tier: ~100 searches/month), then calls daily.bash:
discover serpapi— web search for new Smalltalk code sourcesyoutube— search YouTube for Smalltalk videos + scrape known playlists- Runs
daily.bash(all free scrapers + processing)
soogle/
daily.bash Daily cron script (free scrapers + processing)
weekly.bash Weekly cron script (paid APIs + daily.bash)
requirements.txt requests, pymysql, beautifulsoup4
db/schema.sql Full database schema and seed data
scrape/
__main__.py CLI entry point (python -m scrape <command>)
config.py DB connection, API keys, rate limits
db.py Database helpers (blocklist, dedup, transactions)
models.py LLM model tier system (haiku < sonnet < opus)
github.py GitHub scraper with date segmentation
web.py Web scrapers (SqueakSource, SmalltalkHub, Rosetta, VSKB, discovery)
custom.py Custom scrapers (SqueakMap, Lukas Renggli, SourceForge, Launchpad)
youtube.py YouTube video scraper
processor.py scrape_raw -> packages processing pipeline
llm_review.py LLM quality review for packages and videos
analyze.py LLM domain analysis for discovered sites
web/
manage.py
soogle_web/ Django project settings, URLs, WSGI
search/
models.py Django ORM models (read-only mappings)
views.py View handlers (search, detail, videos, sources, SEO)
urls.py URL routing
templates/search/ HTML templates (base, index, results, detail, videos, etc.)
www/ Static files (CSS, images)
python -m scrape github [--incremental]
python -m scrape web <source> # squeaksource | smalltalkhub | rosettacode | vskb | all
python -m scrape custom <source> # squeakmap | lukas_renggli | sourceforge | launchpad | all
python -m scrape youtube [--playlists-only]
python -m scrape discover <engine> # brave | serpapi | bing | ddg
python -m scrape process [--limit N]
python -m scrape analyze [--limit N] [--show] [--min-score 50]
python -m scrape llm-review [--model M] [--scope S] [--limit N] [--fetch-only] [--review-only]
python -m scrape video-review [--model M] [--scope S] [--limit N]
python -m scrape block <external_id> [--site github] [--reason '...']
python -m scrape status
- Python 3.10+
- MySQL / MariaDB
pip install -r requirements.txt(requests, pymysql, beautifulsoup4)
Environment variables:
SOOGLE_DB_PASS— MySQL passwordGITHUB_TOKEN— GitHub API token (required for github scraper)SERPAPI_KEY— SerpAPI key (required for weekly.bash: discovery + youtube)ANTHROPIC_API_KEY— Anthropic API key (required for LLM review, analyze)
# Run the daily pipeline
./daily.bash
# Run the weekly pipeline (requires SERPAPI_KEY)
./weekly.bash
# Run individual commands
python -m scrape github --incremental
python -m scrape process
python -m scrape llm-review --model claude-haiku-4-5-20251001 --scope unreviewed
# Run the web server
cd web
python manage.py runserverOther repos in this collection:
- iospharo — Pharo Smalltalk VM for iOS and Mac Catalyst (interpreter-only, low-bit oop encoding for ASLR compatibility).
- pharo-headless-test — Headless Pharo test runner with a fake GUI; clicks menus, takes screenshots, runs SUnit without a display.
- validate_smalltalk_image — Standalone validator and export tool for Spur-format Smalltalk image files (heap integrity, SHA-256 manifests, reference graphs).
- claude-skills — Open source skills for Claude Code: reusable knowledge and algorithms packaged as
.claude/skills/markdown files.
GPL-3.0 — see LICENSE for details.