synth-inc/predictive-bigrams

predictive-bigrams

Top-50k ordered bigram frequency tables per language, derived from the Leipzig Corpora Collection. Built to power next-word prediction in the OnitKeyboard iOS keyboard extension.

Files

One file per language under data/:

data/<iso2>_bigrams_50k.txt

Format: one bigram per line, tab-separated, lowercased, top-50k by raw frequency:

prev<TAB>next<TAB>frequency
of	the	115739
in	the	101042
to	be	28519
...
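As a rough illustration of how a consumer might read this format, the sketch below groups rows by their first word and returns the most frequent continuations. The function names `load_bigrams` and `predict_next` are hypothetical and are not taken from OnitKeyboard; the sample rows are the three shown above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def load_bigrams(lines: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Group `prev<TAB>next<TAB>frequency` rows by their first word."""
    table: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or not parts[2].isdecimal():
            continue  # skip blank or malformed rows
        prev, nxt, freq = parts
        table[prev].append((nxt, int(freq)))
    # Sort each candidate list so the most frequent continuation comes first.
    for candidates in table.values():
        candidates.sort(key=lambda pair: pair[1], reverse=True)
    return dict(table)


def predict_next(table: Dict[str, List[Tuple[str, int]]],
                 prev: str, k: int = 3) -> List[str]:
    """Return up to k continuations for `prev`, most frequent first."""
    # The tables are lowercased, so the lookup key is lowercased too.
    return [nxt for nxt, _ in table.get(prev.lower(), [])[:k]]


sample = ["of\tthe\t115739\n", "in\tthe\t101042\n", "to\tbe\t28519\n"]
table = load_bigrams(sample)
print(predict_next(table, "To"))
```

For a real table, an open file handle such as `open("data/en_bigrams_50k.txt", encoding="utf-8")` can be passed in place of `sample`.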

Languages

27 languages, a subset of DictionaryManager.supportedLanguages in OnitKeyboard:

ca, cs, da, de, el, en, es, fi, fr, he, hr, hu, id, it, ja, ko, nb, nl, pl, pt, ro, ru, sv, tr, uk, vi, zh.

ar, hi, th are intentionally excluded: Leipzig's data for those languages produces too few usable bigrams (small corpus tier, Devanagari tokenisation limitations, or Thai text mixed with English noise). Their SymSpell unigram dictionaries still ship via FrequencyWords.

Build pipeline

The Python pipeline is two scripts:

  • discover.py: for each ISO-2 language code, sends HEAD requests to find the best available Leipzig corpus archive. Genre priority: web > newscrawl > news > mixed > wikipedia. Falls back through sentence tiers (1M > 300K > 100K > 30K > 10K) if the preferred tier is absent. Writes discovery.tsv.
  • build.py: reads discovery.tsv, downloads each archive (cached under downloads/), streams <corpus>-words.txt and <corpus>-co_n.txt out of the tar.gz, joins IDs to surface forms, filters to word-like tokens, and writes data/<iso2>_bigrams_50k.txt.

python3 discover.py > discovery.tsv
python3 build.py
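
The selection order in discover.py can be pictured with the sketch below. It is not the actual script: it assumes genre takes precedence over tier (every tier of a genre is tried before moving to the next genre), and the `exists` callback is a hypothetical stand-in for the HEAD request, so no Leipzig URL pattern is assumed here.

```python
from typing import Callable, Iterator, Optional, Tuple

GENRES = ["web", "newscrawl", "news", "mixed", "wikipedia"]
TIERS = ["1M", "300K", "100K", "30K", "10K"]


def candidates() -> Iterator[Tuple[str, str]]:
    """Yield (genre, tier) pairs, best first.

    Assumption: genre is the outer loop, so a small corpus of a preferred
    genre is tried before a large corpus of a less preferred one.
    """
    for genre in GENRES:
        for tier in TIERS:
            yield genre, tier


def pick_corpus(exists: Callable[[str, str], bool]) -> Optional[Tuple[str, str]]:
    """Return the first (genre, tier) for which `exists` reports an archive."""
    for genre, tier in candidates():
        if exists(genre, tier):
            return genre, tier
    return None


# Hypothetical availability for one language, standing in for HEAD probes.
available = {("news", "300K"), ("wikipedia", "1M")}
print(pick_corpus(lambda genre, tier: (genre, tier) in available))
```

If the real script prefers corpus size over genre, swapping the two loops in `candidates` gives that behaviour.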

The downloads/ directory holds the raw Leipzig archives (~6.5 GB total); it's gitignored to keep the repo small.
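
The join, filter, and top-K steps that build.py performs can be sketched on in-memory rows as follows. Several details are assumptions rather than facts from this repository: the column layout of words.txt (ID, word, ...) and co_n.txt (left ID, right ID, frequency, ...), the letters-only `is_word_like` filter, and the summing of counts for pairs that become identical after lowercasing.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def read_words(lines: Iterable[str]) -> Dict[str, str]:
    """Map word ID -> surface form (assumed columns: id, word, ...)."""
    words: Dict[str, str] = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            words[cols[0]] = cols[1]
    return words


def is_word_like(token: str) -> bool:
    """Placeholder filter (assumption): keep tokens made of letters only."""
    return token.isalpha()


def top_bigrams(words: Dict[str, str], co_n_lines: Iterable[str],
                k: int = 50_000) -> List[Tuple[str, str, int]]:
    """Join IDs to words, filter, lowercase, and keep the k most frequent."""
    counts: Counter = Counter()
    for line in co_n_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3 or not cols[2].isdecimal():
            continue  # skip malformed rows
        left, right = words.get(cols[0]), words.get(cols[1])
        if left is None or right is None:
            continue  # ID missing from the word list
        if is_word_like(left) and is_word_like(right):
            # Assumption: case variants are merged by summing their counts.
            counts[(left.lower(), right.lower())] += int(cols[2])
    return [(a, b, n) for (a, b), n in counts.most_common(k)]


# Tiny made-up rows in the assumed layouts, for illustration only.
words_txt = ["1\tof\t900\n", "2\tthe\t1200\n", "3\tThe\t300\n", "4\t,\t5000\n"]
co_n_txt = ["1\t2\t100\t0.0\n", "1\t3\t20\t0.0\n", "4\t2\t500\t0.0\n"]
for prev, nxt, freq in top_bigrams(read_words(words_txt), co_n_txt, k=10):
    print(f"{prev}\t{nxt}\t{freq}")
```

In this toy input the comma row is dropped by the filter, and the `of the` and `of The` rows collapse into a single lowercased entry.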

Source & license

Bigram counts are derived from the Leipzig Corpora Collection (https://wortschatz.uni-leipzig.de/). Derivatives of the Leipzig corpora are redistributed under CC BY, with attribution to the Leipzig project.

Each derived file in data/ is a filtered, lowercased, top-K subset of the original co_n.txt joined against words.txt — no sentences or personal data from the underlying sources are reproduced.
