predictive-bigrams

Top-50k ordered bigram frequency tables per language, derived from the Leipzig Corpora Collection. Built to power next-word prediction in the OnitKeyboard iOS keyboard extension.

Files

One file per language under data/:

data/<iso2>_bigrams_50k.txt

Format: one bigram per line, tab-separated, lowercased, top-50k by raw frequency:

prev<TAB>next<TAB>frequency
of	the	115739
in	the	101042
to	be	28519
...
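As an illustration of how a consumer might read this format, here is a minimal Python sketch that groups rows by their first word and returns the most frequent continuations. The helper names `parse_bigrams` and `predict_next` are hypothetical and not part of this repository; the keyboard extension itself presumably does the equivalent in its own code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_bigrams(lines: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Group `prev<TAB>next<TAB>frequency` rows by their first word."""
    table: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or not parts[2].isdecimal():
            continue  # skip blank or malformed rows
        prev, nxt, freq = parts
        table[prev].append((nxt, int(freq)))
    # Sort each candidate list so the most frequent continuation comes first.
    for candidates in table.values():
        candidates.sort(key=lambda pair: pair[1], reverse=True)
    return dict(table)


def predict_next(table: Dict[str, List[Tuple[str, int]]], prev: str, k: int = 3) -> List[str]:
    """Return up to k of the most frequent continuations of `prev`."""
    return [word for word, _ in table.get(prev.lower(), [])[:k]]


# The three sample rows shown above, as they would appear in a file.
sample = ["of\tthe\t115739\n", "in\tthe\t101042\n", "to\tbe\t28519\n"]
table = parse_bigrams(sample)
print(predict_next(table, "to"))  # → ['be']
```

To load a real table, pass an open file instead of the sample list, for example `open("data/en_bigrams_50k.txt", encoding="utf-8")`.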

Languages

27 languages, a subset of DictionaryManager.supportedLanguages in OnitKeyboard:

ca, cs, da, de, el, en, es, fi, fr, he, hr, hu, id, it, ja, ko, nb, nl, pl, pt, ro, ru, sv, tr, uk, vi, zh.

ar, hi, th are intentionally excluded: Leipzig's data for those languages produces too few usable bigrams (small corpus tier, Devanagari tokenisation limitations, or Thai text mixed with English noise). Their SymSpell unigram dictionaries still ship via FrequencyWords.

Build pipeline

The Python pipeline is two scripts:

discover.py: for each two-letter ISO 639-1 language code, sends HEAD requests to find the best available Leipzig corpus archive. Genre priority: web > newscrawl > news > mixed > wikipedia. If the preferred sentence tier is absent, it falls back through progressively smaller tiers (1M → 300K → 100K → 30K → 10K). Writes discovery.tsv.
build.py: reads discovery.tsv, downloads each archive (cached under downloads/), streams <corpus>-words.txt and <corpus>-co_n.txt out of the tar.gz, joins word IDs to surface forms, filters to word-like tokens, and writes data/<iso2>_bigrams_50k.txt.
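The genre and tier fallback in discover.py could be sketched roughly as follows. This is not the script's actual code. It assumes that every sentence tier is tried within a genre before moving to the next genre, which the description above does not spell out, and it replaces the HEAD request with a hypothetical `is_available` callback so the example runs without network access.

```python
from itertools import product
from typing import Callable, Iterator, Optional, Tuple

GENRES = ["web", "newscrawl", "news", "mixed", "wikipedia"]  # highest priority first
TIERS = ["1M", "300K", "100K", "30K", "10K"]                 # largest corpus first


def candidates() -> Iterator[Tuple[str, str]]:
    """Yield (genre, tier) pairs in the assumed probing order."""
    # product() varies its last argument fastest, so all tiers of one genre
    # are yielded before the next genre starts. This nesting is an assumption.
    return product(GENRES, TIERS)


def pick_corpus(is_available: Callable[[str, str], bool]) -> Optional[Tuple[str, str]]:
    """Return the first (genre, tier) pair the callback reports as present."""
    for genre, tier in candidates():
        if is_available(genre, tier):
            return genre, tier
    return None


# Pretend only these two archives exist for some language.
available = {("news", "300K"), ("wikipedia", "1M")}
print(pick_corpus(lambda genre, tier: (genre, tier) in available))  # → ('news', '300K')
```

If the real script prefers corpus size over genre, swapping the two arguments to `product` (and the unpacking order) would give that behaviour instead.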

python3 discover.py > discovery.tsv
python3 build.py
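The join that build.py performs can be illustrated with a short sketch. Several details here are assumptions rather than facts from this repository: that words.txt rows are `id<TAB>word<TAB>frequency`, that co_n.txt rows are `left_id<TAB>right_id<TAB>frequency<TAB>significance`, that counts of bigrams that collide after lowercasing are summed, and that "word-like" means letters with optional internal apostrophes or hyphens. The `WORD_LIKE` pattern and both function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical "word-like" filter: Unicode letters, optionally joined by ' ’ or -.
WORD_LIKE = re.compile(r"^[^\W\d_]+(?:['’-][^\W\d_]+)*$")


def load_words(lines: Iterable[str]) -> Dict[str, str]:
    """Map word ID to surface form (assumed columns: id, word, frequency)."""
    id_to_word: Dict[str, str] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            id_to_word[parts[0]] = parts[1]
    return id_to_word


def top_bigrams(words: Dict[str, str], co_n_lines: Iterable[str],
                k: int = 50_000) -> List[Tuple[str, str, int]]:
    """Join ID pairs to lowercased words, filter, and keep the k most frequent."""
    counts: Counter = Counter()
    for line in co_n_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3 or not parts[2].isdecimal():
            continue  # skip malformed rows
        left = words.get(parts[0], "").lower()
        right = words.get(parts[1], "").lower()
        # Unknown IDs become "" and fail the filter along with punctuation.
        if WORD_LIKE.match(left) and WORD_LIKE.match(right):
            counts[(left, right)] += int(parts[2])
    return [(a, b, n) for (a, b), n in counts.most_common(k)]


words = load_words(["1\tof\t900", "2\tThe\t800", "3\t,\t700", "4\tthe\t100"])
co_n = ["1\t2\t10\t5.0", "1\t4\t7\t3.0", "1\t3\t99\t1.0"]
for prev, nxt, freq in top_bigrams(words, co_n):
    print(f"{prev}\t{nxt}\t{freq}")
```

In this toy input, "of The" and "of the" merge into one lowercased row, and the pair ending in a comma is dropped by the filter.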

The downloads/ directory holds the raw Leipzig archives (~6.5 GB total); it's gitignored to keep the repo small.

Source & license

Bigram counts are derived from the Leipzig Corpora Collection (https://wortschatz.uni-leipzig.de/). Derivatives of the Leipzig corpora are redistributed under CC BY, with attribution to the Leipzig project.

Each derived file in data/ is a filtered, lowercased, top-K subset of the original co_n.txt joined against words.txt. No sentences or personal data from the underlying sources are reproduced.
