Simple and Performant Language detection library for NodeJS

plaidbean/tinyld-old

 
 

TinyLD

TinyLD (Tiny Language Detector) detects the language of a Unicode (UTF-8) text:

  • pure JavaScript, no API calls, and no dependencies (Node and browser compatible)
  • an alternative to libraries like CLD
  • fast, with a low memory footprint (unlike ML-based methods)
  • supports 62 languages (30 in the web version)
  • returns language codes in ISO 639-1 format

Getting Started

Install

yarn add tinyld # or npm install --save tinyld

API

import { detect, detectAll } from 'tinyld'

// Detect
detect('これは日本語です.') // ja
detect('and this is english.') // en

// DetectAll
detectAll('ceci est un text en francais.')
// [ { lang: 'fr', accuracy: 0.5238 }, { lang: 'ro', accuracy: 0.3802 }, ... ]
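
Since detectAll returns several candidates with an accuracy score, a caller may want to accept the best guess only when it is confident enough. The sketch below is one way to do that; pickLanguage and the 0.5 default threshold are illustrative assumptions, not part of TinyLD, and the sample data is simply the first two entries of the output shown above.

```javascript
// Hypothetical helper, not part of tinyld: return the best language code only
// when its accuracy reaches the given threshold, otherwise null.
function pickLanguage(candidates, minAccuracy = 0.5) {
  if (candidates.length === 0) return null
  // Do not assume the input is sorted; find the highest accuracy explicitly.
  const best = candidates.reduce((a, b) => (b.accuracy > a.accuracy ? b : a))
  return best.accuracy >= minAccuracy ? best.lang : null
}

// Sample data copied from the detectAll output above (first two entries only).
const sample = [
  { lang: 'fr', accuracy: 0.5238 },
  { lang: 'ro', accuracy: 0.3802 },
]

console.log(pickLanguage(sample)) // fr
console.log(pickLanguage(sample, 0.9)) // null
```

In an application, the array would come from detectAll(text) rather than from hard-coded sample data.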

TinyLD CLI

tinyld This is the text that I want to check
# [ { lang: 'en', accuracy: 1 } ]

Benchmark

Benchmark run on the Tatoeba dataset (~9M sentences), covering 16 of the most common languages.

| Library | Script | Properly identified | Improperly identified | Not identified | Avg. execution time | Disk size |
| --- | --- | --- | --- | --- | --- | --- |
| TinyLD | yarn bench:tinyld | 96.1747% | 2.6938% | 1.1315% | 0.1315 ms | 778KB |
| TinyLD Web | yarn bench:tinyld-light | 92.1169% | 3.9536% | 3.9295% | 0.0616 ms | 89KB |
| node-cld | yarn bench:cld | 88.9148% | 1.7489% | 9.3363% | 0.0612 ms | > 10MB |
| node-lingua | yarn bench:lingua | 82.3157% | 0.2158% | 17.4685% | 0.7085 ms | ~100MB |
| franc | yarn bench:franc | 68.7783% | 26.3432% | 4.8785% | 0.1381 ms | 267KB |
| franc-min | yarn bench:franc-min | 65.5163% | 23.5794% | 10.9044% | 0.0614 ms | 119KB |
| languagedetect | yarn bench:languagedetect | 61.6068% | 12.295% | 26.0982% | 0.1585 ms | 240KB |

Remark

  • For each category, the top 3 results are in bold
  • Languages evaluated in this benchmark:
    • Asia: jpn, cmn, kor, hin
    • Europe: fra, spa, por, ita, nld, eng, deu, fin, rus
    • Middle East: tur, heb, ara
  • This kind of benchmark is not perfect and the percentages can vary over time, but it gives a good idea of overall performance
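
The three percentage columns in the table sum to 100% for each library, which suggests every sentence falls into exactly one of the three outcomes. The sketch below shows how such rates could be tallied. It is not the project's benchmark code: scoreResults, the field names, and the convention that an empty string means "not identified" are all assumptions, and the demo data is made up.

```javascript
// Hypothetical scoring helper, not the project's benchmark script.
// Assumption: `detected` is an empty string when no language was returned.
function scoreResults(pairs) {
  let proper = 0
  let improper = 0
  let none = 0
  for (const { expected, detected } of pairs) {
    if (!detected) none++ // library returned nothing
    else if (detected === expected) proper++ // correct language
    else improper++ // wrong language
  }
  // Convert a count to a percentage of all evaluated sentences.
  const pct = (n) => (pairs.length ? (100 * n) / pairs.length : 0)
  return { proper: pct(proper), improper: pct(improper), none: pct(none) }
}

// Made-up demo data: two correct, one wrong, one undetected.
const demo = [
  { expected: 'fr', detected: 'fr' },
  { expected: 'en', detected: 'en' },
  { expected: 'ja', detected: 'ko' },
  { expected: 'he', detected: '' },
]

console.log(scoreResults(demo)) // { proper: 50, improper: 25, none: 25 }
```

Separating "improperly identified" from "not identified" matters when comparing libraries: a detector that declines to answer may be preferable to one that answers wrongly.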

Conclusion

Recommended

  • For NodeJS: TinyLD or node-cld (fast and accurate)
  • For browsers: TinyLD Light (listed as TinyLD Web in the table) or franc-min (both are small with decent accuracy; franc is less accurate but supports more languages)

Not recommended

  • node-lingua is too large (~100MB) and too slow (0.7085 ms average execution time)
  • languagedetect is lightweight but not accurate enough, and it is mostly focused on Indo-European languages (it supports Kazakh but not Chinese, Korean, or Japanese)
