Bring-your-own-data social network analysis. Given a SQLite database of profiles, follows, comments, and posts plus the corresponding media files on disk, this package encodes content with CLIP, computes per-user embedding clouds, finds communities, and renders an interactive graph viewer.
This package does not scrape, crawl, or call any platform API. You collect the data however you want; this processes it.
It is a data-processing pipeline that operates on a fixed schema (see Architecture). The schema is generic enough to fit any directed social network: Instagram, Twitter/X, TikTok, Mastodon, or any other platform with profiles, follow edges, posts, and comments.
The pipeline gives you:
- CLIP embeddings — every image, video first-frame, and caption encoded into a 1024-dim vector (ViT-H/14, dfn5b)
- Per-user variance — focused vs eclectic content identity
- Wasserstein distance — distributional similarity between two users' embedding clouds
- Louvain community detection on the follow graph, with semantic homogeneity scoring (mean within-community Wasserstein distance)
- Photo matching API — upload an image, get cosine-ranked profile matches
- Interactive HTML viewer — graph colored by community, side panel with profile media, drag-drop photo matching
- CLIP gender classifier (optional service) — zero-shot pfp + name+bio classification, no external APIs
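To make the "focused vs eclectic" idea concrete, here is a minimal sketch of a per-user centroid and variance. It assumes variance means the mean squared Euclidean distance from each embedding to the user's centroid; the package may define it differently. The function names and the toy 2-dim vectors are hypothetical.

```python
from statistics import fmean


def centroid(cloud):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*cloud)]


def cloud_variance(cloud):
    """Mean squared Euclidean distance from each vector to the centroid.

    Assumed definition for illustration; not taken from the package.
    """
    center = centroid(cloud)
    return fmean(
        sum((x - c) ** 2 for x, c in zip(vector, center))
        for vector in cloud
    )


# Toy 2-dim "embeddings"; real vectors would be 1024-dim.
focused = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
eclectic = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(cloud_variance(focused))   # identical posts, so no spread
print(cloud_variance(eclectic))  # scattered posts, so larger spread
```

Under this definition, a user who posts near-identical content scores close to zero, and the score grows as their posts spread out in embedding space.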
- You populate data/network.db with profiles, follows, comments, and posts (schema in the Architecture wiki page) and put media files under data/downloads/<username>/.
- analyze encode walks the downloads, runs CLIP on every image and video first-frame, and stores 1024-dim float32 vectors in the embeddings table.
- analyze metrics computes per-user cloud statistics (centroid, variance) used for fast filtering.
- analyze distances computes the full n-dimensional Wasserstein distance between every pair of user embedding clouds.
- analyze communities runs Louvain on the follow graph, then scores each community's semantic homogeneity using the precomputed pairwise distances.
- python -m socialcloud.analysis.visualize produces data/output/network.html, a self-contained interactive viewer.
- uvicorn socialcloud.analysis.server:app --port 8002 exposes a photo-matching endpoint backed by the embedding table.
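The homogeneity step can be sketched as a mean over precomputed pairwise distances. This is an illustration only: the `distance` lookup keyed by `frozenset` pairs, the user names, and the choice to return `None` for single-member communities are assumptions, not the package's storage format or behavior.

```python
from itertools import combinations
from statistics import fmean


def homogeneity(members, distance):
    """Mean pairwise distance among members; lower means more homogeneous.

    `distance` maps frozenset({a, b}) to a precomputed distance.
    Returns None when the community has fewer than two members.
    """
    pairs = [frozenset(pair) for pair in combinations(sorted(members), 2)]
    if not pairs:
        return None
    return fmean(distance[pair] for pair in pairs)


# Hypothetical precomputed distances between three users.
distance = {
    frozenset({"ana", "ben"}): 0.2,
    frozenset({"ana", "cy"}): 0.9,
    frozenset({"ben", "cy"}): 0.8,
}

# Hypothetical community assignment, e.g. from a Louvain run.
communities = {0: {"ana", "ben"}, 1: {"cy"}}
scores = {cid: homogeneity(m, distance) for cid, m in communities.items()}
print(scores)
```

Because the distances are computed once up front, scoring a community is only a lookup and an average, so it stays cheap to re-run with different community assignments.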
pip install -e . # runtime
pip install -e ".[dev]" # with ruff + pytest
This installs one console script: analyze. CUDA is recommended for CLIP encoding.
# 1. Get your data into data/network.db (your collector, your rules)
# 2. Drop media under data/downloads/<username>/
analyze encode # populate the embeddings table
analyze metrics # per-user cloud metrics
analyze distances # pairwise Wasserstein distances
analyze communities # Louvain + homogeneity scoring
python -m socialcloud.analysis.visualize
# → open data/output/network.html
uvicorn socialcloud.analysis.server:app --port 8002
# POST /match with {"image_base64": "..."} → ranked matches
The optional gender classifier is a FastAPI service that runs a zero-shot CLIP classifier over a profile picture (75% weight) and the name+bio text (25% weight). No external APIs, no quotas.
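One way to read the 75/25 weighting is as a weighted average of two per-label probability distributions. The sketch below assumes exactly that. The label names, the `fuse` helper, and the choice to combine probabilities rather than logits are illustrative assumptions, not confirmed details of the service.

```python
PFP_WEIGHT = 0.75   # weight on the profile-picture score (from the text above)
TEXT_WEIGHT = 0.25  # weight on the name+bio score (from the text above)


def fuse(pfp_probs, text_probs):
    """Weighted average of two label -> probability dicts with the same labels."""
    return {
        label: PFP_WEIGHT * pfp_probs[label] + TEXT_WEIGHT * text_probs[label]
        for label in pfp_probs
    }


# Hypothetical zero-shot outputs for one profile.
pfp = {"female": 0.9, "male": 0.1}
text = {"female": 0.4, "male": 0.6}

fused = fuse(pfp, text)
best = max(fused, key=fused.get)
print(best, fused)
```

Since the weights sum to 1 and each input sums to 1, the fused scores also sum to 1 and can be read as a confidence.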
uvicorn socialcloud.gender_detection.main:app --port 8000
# POST /classify {"username": "...", "display_name": "...", "bio": "...", "pfp_base64": "..."}
Results persist in data/gender_detection.db. This is useful if your collector hasn't already filled the gender column on profiles.
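A request body for the /classify endpoint could be assembled as in the sketch below, using only the standard library. The field names come from the example above. The helper names, the sample values, the localhost URL, and the assumption that pfp_base64 is plain base64 without a data-URI prefix are hypothetical. The response format is not documented here, so the sketch returns the parsed JSON as-is.

```python
import base64
import json
import urllib.request


def build_classify_payload(username, display_name, bio, pfp_bytes):
    """Return the JSON body for POST /classify as a string."""
    return json.dumps({
        "username": username,
        "display_name": display_name,
        "bio": bio,
        # Assumed encoding: plain base64 of the raw image bytes.
        "pfp_base64": base64.b64encode(pfp_bytes).decode("ascii"),
    })


def post_classify(body, url="http://localhost:8000/classify"):
    """Send the body to a running service and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder bytes stand in for a real profile picture file.
body = build_classify_payload(
    "example_user", "Example User", "Coffee and climbing.", b"\x89PNG placeholder"
)
print(body)
```

Calling post_classify(body) requires the service to be running on port 8000, as started by the uvicorn command above.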
docker-compose.yml ships two services: gender_detection and analysis. Both require an NVIDIA GPU.
docker compose up -d gender_detection
docker compose run --rm analysis encode
docker compose run --rm analysis communities
All persistent state lives under ./data, mounted as a single /data volume per service.
CI publishes both services to the GitHub Container Registry on every push to main:
docker pull ghcr.io/t-e-lawrence/socialcloud/analysis:latest
docker pull ghcr.io/t-e-lawrence/socialcloud/gender-detection:latest
Images are also tagged with the short commit SHA and (on tagged releases) vX.Y.Z semver tags.
Full documentation lives in the wiki:
- Architecture — schema, package layout
- Analysis — embeddings, distances, communities, visualization
- API Reference — gender_detection + analysis endpoints
- Docker — compose usage, GPU, env vars
MIT — see LICENSE.