Tiny HTTP service that returns Vietnamese word2vec similarity and nearest
neighbors — Vietnamese sibling of word2sim. Same endpoint
shapes; swap URLs and it's a drop-in replacement.
Backed by PhoW2V (VinAI / Dat Quoc Nguyen), the largest pretrained Vietnamese word vectors available. Chosen over PhoBERT for this purpose because word2vec's similarity distribution is wide enough to drive a Semantle-style warmth meter, whereas raw transformer embeddings saturate at the top.
License note. PhoW2V's research-only license forbids public redistribution, so this service doesn't — and can't — embed or auto-download the vectors from any public URL. You supply your own private mirror (typically a Nextcloud instance you control). See Quick start.
- FastAPI + uvicorn
- gensim (loads PhoW2V
.txtfiles; caches a binary.binalongside for 5× faster restarts)
PhoW2V ships in four flavors. Pick one per deployment.
| Variant | Dims | Size | Best for |
|---|---|---|---|
word-100 |
100 | ~400MB | low-RAM hosts, compound-aware |
word-300 |
300 | ~1.2GB | default — best quality, compound-aware |
syllable-100 |
100 | ~50MB | single-syllable guesses, tiny footprint |
syllable-300 |
300 | ~150MB | single-syllable guesses, richer vectors |
The "word" variants expect underscore-joined compounds (sinh_viên);
the "syllable" variants have no multi-token keys. The canonicalizer
tries both forms, but the client should pre-segment for the word variant
if it wants reliable coverage of compounds.
| Method | Path | Purpose |
|---|---|---|
| GET | /health |
liveness probe |
| GET | /similarity?a=X&b=Y |
cosine similarity between two keys |
| GET | /neighbors?word=X&topn=10 |
nearest-neighbor keys with scores |
| GET | /vocab?word=X |
check in-vocab; return canonical form |
| GET | /random |
random vocab key, filtered for game-friendliness |
Response shape is identical to word2sim.
curl 'http://localhost:8001/similarity?a=con_chó&b=con_mèo'
# {"a":"con_chó","b":"con_mèo","canonical_a":"con_chó","canonical_b":"con_mèo",
# "in_vocab_a":true,"in_vocab_b":true,"similarity":0.78}
curl 'http://localhost:8001/neighbors?word=đại_học&topn=5'
curl 'http://localhost:8001/vocab?word=con%20ch%C3%B3' # "con chó" → tries "con_chó"
# {"word":"con chó","canonical":"con_chó","in_vocab":true}
curl 'http://localhost:8001/random?min_rank=500&max_rank=20000&min_len=3&max_len=12'Out-of-vocab returns in_vocab:false and similarity:null. Lookup
tries exact → lowercase → space-to-underscore variants.
-
Get the vectors once. Download from the upstream Google Drive mirror (the one linked from the PhoW2V README — the original
public.vinai.ioURLs are dead). Four zips; keep the one matching your chosen variant. -
Host the zip somewhere a plain
GETcan reach it. Options:- Nextcloud public share with file upload, then use the
/downloadendpoint:https://cloud.example.com/s/<token>/download. The share token acts as the capability; leave it unguessable and unlisted. - Any signed/pre-signed URL from your object store (S3, R2, BackBlaze B2), or your own HTTP(S) endpoint.
The service sends no auth headers — any authentication must be baked into the URL itself. This keeps the code minimal and puts hosting policy on the operator.
- Nextcloud public share with file upload, then use the
-
Configure env. Copy
.env.exampleto.envand setMODEL_URL:cp .env.example .env # edit .env: # MODEL_URL=https://cloud.example.com/s/abc123XYZ/download
-
Boot.
docker compose up --build
First boot streams ~1.2GB (word-300d) into the
phow2v-cachevolume, then parses ~60s. A binary.binis written alongside so later restarts load in ~10s. Health check start period is 10 min to cover the first-boot cost.
If you've already downloaded the .txt locally and don't want to
re-upload anywhere, skip MODEL_URL entirely and mount the file. In
docker-compose.yml, uncomment the bind mount:
volumes:
- phow2v-cache:/data/phow2v
- ./models/word2vec_vi_words_300dims.txt:/data/phow2v/word2vec_vi_words_300dims.txt:roThen docker compose up boots straight into parse — no download step.
Host the desired zip and update .env:
MODEL_URL=https://cloud.example.com/s/<token-for-syllables-100>/download
MODEL_PATH=/data/phow2v/word2vec_vi_syllables_100dims.txtDelete the phow2v-cache volume when switching, otherwise the stale
.bin from the previous variant loads instead:
docker compose down -v && docker compose up --build| Var | Default | Meaning |
|---|---|---|
MODEL_URL |
"" |
URL that serves the zip via a plain GET. Bake any auth into the URL. Optional if MODEL_PATH is already populated via a bind mount. |
MODEL_PATH |
/data/phow2v/word2vec_vi_words_300dims.txt |
Where the text-format vectors live. A binary .bin sibling is written on first parse. |
The service does not authenticate its callers. Put it behind a reverse proxy (Caddy, nginx, Cloudflare Tunnel) if you need access control.
phow2sim/
├── app/
│ ├── main.py # FastAPI routes
│ └── vectors.py # PhoW2V loader + canonicalize + similarity/neighbors/random
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .env.example # copy to .env and set MODEL_URL