The spaCy pipelines for Estonian are trained on the Estonian UD v2.5 treebank using Python 3.6, spaCy 3.0 and PyTorch 1.7.1, and follow the Universal Dependencies tagsets for part-of-speech, morphology and syntactic relations.
Pipelines:
- et_dep_ud_sm: components tok2vec, morphologizer, parser
- et_dep_ud_estbert: components transformer (EstBERT), morphologizer, parser
- et_dep_ud_xlmroberta: components transformer (xlm-roberta-base), morphologizer, parser
All pipelines predict part-of-speech tags, morphological information (except lemmas) and syntactic information. The transformer-based spaCy models outperform the Stanza model published by Stanford NLP on many dependency relations. Measured by LAS, the model using XLM-RoBERTa performs best. spaCy is mostly better on the more frequent relations but less accurate on, for example, some modifiers. Tokenization, sentence segmentation and part-of-speech tagging, on the other hand, are better with the Stanza model. More detailed scores are available in model_comparison.md. The difference between the models' scores is mostly smaller than 5%; larger differences appear for rarer relations (e.g. discourse and fixed). If a GPU is not available, or if you do not want a large model, it is therefore sensible to use the Stanza model. With a GPU, the model can be chosen according to which aspect of parsing matters most for the task at hand.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To use the pipelines, Python 3.6+ is required and spaCy 3.0 must be installed according to whether you want to run on a CPU or a GPU (a GPU is strongly recommended for the transformer pipelines). The following are some instructions for installing spaCy via pip and conda; for more detailed information, refer to the spaCy installation instructions.
To use the model without a transformer on a CPU, installing the pipeline with
pip install https://github.com/EstSyntax/EstSpaCy/releases/download/v1.0/et_dep_ud_sm-1.0.0.tar.gz
installs all required dependencies. Otherwise, install spaCy and the other packages as follows.
CPU:
Using pip:
pip install -U pip setuptools wheel
pip install -U spacy
or, for a transformer pipeline:
pip install -U pip setuptools wheel
pip install -U spacy[transformers]
Using conda:
conda install -c conda-forge spacy
For a transformer pipeline, spacy-transformers must be installed via pip:
pip install sentencepiece
pip install protobuf
pip install spacy-transformers
GPU:
Using pip:
The correct CUDA version must be specified in the CUDA extra (cuda102 in the example); it determines which version of cupy is installed.
pip install -U pip setuptools wheel
pip install -U spacy[cuda102]
or, for a transformer pipeline:
pip install sentencepiece
pip install protobuf
pip install -U spacy[cuda102,transformers]
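The extra name in the commands above follows a simple pattern: `cuda` followed by the CUDA version without the dot. The helper below is a minimal sketch of that pattern; `spacy_cuda_requirement` is a hypothetical name, not part of spaCy, and spaCy does not publish an extra for every CUDA version, so the result should be checked against the spaCy installation instructions.

```python
def spacy_cuda_requirement(cuda_version, transformers=False):
    """Build a pip requirement such as 'spacy[cuda102]' from a CUDA version string.

    Assumption: the extra is named 'cuda' + major + minor (e.g. 10.2 -> cuda102).
    Whether that extra actually exists must be checked in the spaCy docs.
    """
    parts = cuda_version.split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError("expected a version like '10.2', got %r" % cuda_version)
    extras = ["cuda" + parts[0] + parts[1]]
    if transformers:
        extras.append("transformers")
    return "spacy[" + ",".join(extras) + "]"


print(spacy_cuda_requirement("10.2"))
print(spacy_cuda_requirement("10.2", transformers=True))
```

The printed strings can be passed to `pip install -U`; quoting them is advisable in shells that treat square brackets specially.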
Using conda:
conda install -c conda-forge spacy
conda install -c conda-forge cupy
# in case of transformer pipeline:
pip install sentencepiece
pip install protobuf
pip install spacy-transformers
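After installing, it can be useful to confirm which of the packages mentioned above are visible to the current Python interpreter. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, and it only checks that the packages can be found, not which versions they are.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 6)):
    """Report whether the interpreter and the optional packages are available."""

    def has(module_name):
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(module_name) is not None

    return {
        "python_ok": sys.version_info >= min_python,
        "spacy": has("spacy"),
        "spacy_transformers": has("spacy_transformers"),  # transformer pipelines only
        "cupy": has("cupy"),  # GPU use only
    }


for name, ok in check_environment().items():
    print(name, "OK" if ok else "missing")
```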
Pipelines can be installed via pip:
- et_dep_ud_xlmroberta
pip install https://github.com/EstSyntax/EstSpaCy/releases/download/v1.0/et_dep_ud_xlmroberta-1.0.0.tar.gz
- et_dep_ud_estbert
pip install https://github.com/EstSyntax/EstSpaCy/releases/download/v1.0/et_dep_ud_estbert-1.0.0.tar.gz
- et_dep_ud_sm
pip install https://github.com/EstSyntax/EstSpaCy/releases/download/v1.0/et_dep_ud_sm-1.0.0.tar.gz
or by downloading the chosen model from the releases page and installing it from the local file, e.g.:
pip install /Users/you/et_dep_ud_sm-1.0.0.tar.gz
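The three download URLs above differ only in the pipeline name, so they can be generated rather than copied. This is a sketch under the assumption that the v1.0 release layout shown above is kept; `pipeline_url` and `pip_install_command` are hypothetical helper names.

```python
import sys

RELEASE_BASE = "https://github.com/EstSyntax/EstSpaCy/releases/download"
PIPELINES = ("et_dep_ud_sm", "et_dep_ud_estbert", "et_dep_ud_xlmroberta")


def pipeline_url(name, release="v1.0", version="1.0.0"):
    """Return the release URL of a pipeline archive, following the URLs listed above."""
    if name not in PIPELINES:
        raise ValueError("unknown pipeline: %r" % name)
    return "%s/%s/%s-%s.tar.gz" % (RELEASE_BASE, release, name, version)


def pip_install_command(name):
    """Return the argument list for installing a pipeline with the current interpreter."""
    return [sys.executable, "-m", "pip", "install", pipeline_url(name)]


print(pipeline_url("et_dep_ud_sm"))
```

The list returned by `pip_install_command` can be passed to `subprocess.check_call` to run the installation from a script.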
After installation, a pipeline can be used simply by importing and loading it.
CPU:
import et_dep_ud_xlmroberta
nlp = et_dep_ud_xlmroberta.load()
doc = nlp('Jazz oli midagi rohkemat kui lihtsalt üks musitseerimisviis.')
GPU:
import et_dep_ud_xlmroberta
import spacy
spacy.require_gpu()
# or spacy.prefer_gpu()
nlp = et_dep_ud_xlmroberta.load() # Estonian Language object
doc = nlp('Jazz oli midagi rohkemat kui lihtsalt üks musitseerimisviis.') # Doc object
A Doc object can be iterated to extract Token objects. The following properties and attributes can be extracted from tokens:
- Token.i: index of the token within the document
- Token.morph: MorphAnalysis object that stores a single morphological analysis
- Token.pos_: part-of-speech tag
- Token.head: Token object of the current node's head
- Token.dep_: arc label between the current token and its head
The parser makes it possible to iterate over sentences using the Doc.sents property.
NB! The syntax iterator for extracting noun chunks is not implemented.
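Code that is shared between languages may still try to read `Doc.noun_chunks`. The sketch below assumes that spaCy 3 signals a missing syntax iterator by raising `NotImplementedError`; `noun_chunks_or_empty` is a hypothetical helper, and the small class is only a stand-in so that the snippet runs without a pipeline installed.

```python
def noun_chunks_or_empty(doc):
    """Return the noun chunks of a Doc as a list, or [] if they are unavailable."""
    try:
        return list(doc.noun_chunks)
    except NotImplementedError:
        # Assumed behaviour for languages without a noun-chunk syntax iterator.
        return []


class DocWithoutNounChunks:
    """Stand-in for a Doc produced by a pipeline without noun-chunk support."""

    @property
    def noun_chunks(self):
        raise NotImplementedError("noun_chunks is not implemented for this language")


chunks = noun_chunks_or_empty(DocWithoutNounChunks())
print(chunks)
```

With a real pipeline, the Doc returned by `nlp(...)` would be passed in place of the stand-in object.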
For a comprehensive list of properties and attributes, refer to the spaCy API documentation.
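Token.i and Token.head.i are offsets into the whole document and start from 0, whereas the CoNLL-U format numbers tokens from 1 within each sentence and gives the root a head of 0. A minimal sketch of that conversion follows; `conllu_rows` is a hypothetical helper, it relies on spaCy marking the root as its own head, and the tokens in the demonstration are hand-made stand-ins with illustrative tags rather than real pipeline output.

```python
from types import SimpleNamespace


def conllu_rows(sentence):
    """Return CoNLL-U-like rows for one sentence with sentence-relative ids.

    `sentence` is a spaCy Span (e.g. from Doc.sents) or any sequence of
    objects with the attributes i, text, pos_, morph, head and dep_.
    """
    tokens = list(sentence)
    offset = tokens[0].i  # document-level index of the sentence's first token
    rows = []
    for token in tokens:
        is_root = token.head.i == token.i  # in spaCy the root is its own head
        head_id = 0 if is_root else token.head.i - offset + 1
        deprel = "root" if is_root else token.dep_
        feats = str(token.morph) or "_"  # empty analysis becomes '_'
        rows.append("\t".join([
            str(token.i - offset + 1), token.text, "_", token.pos_, "_",
            feats, str(head_id), deprel, "_", "_",
        ]))
    return rows


# Stand-in tokens for a sentence starting at document index 5 (illustrative values).
jazz = SimpleNamespace(i=5, text="Jazz", pos_="NOUN", morph="Case=Nom|Number=Sing", dep_="nsubj")
verb = SimpleNamespace(i=6, text="kõlas", pos_="VERB", morph="Mood=Ind|Tense=Past", dep_="ROOT")
dot = SimpleNamespace(i=7, text=".", pos_="PUNCT", morph="", dep_="punct")
jazz.head, verb.head, dot.head = verb, verb, verb

rows = conllu_rows([jazz, verb, dot])
print("\n".join(rows))
```

With a loaded pipeline, the same function can be applied to each sentence in `doc.sents`, printing a blank line between sentences.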
Example of outputting text in a format similar to CoNLL-U by iterating through sentences and tokens:
import et_dep_ud_xlmroberta
import spacy

spacy.require_gpu()
nlp = et_dep_ud_xlmroberta.load()
# Adjacent string literals are concatenated, so each part ends with a space.
doc = nlp('Jazz oli midagi rohkemat kui lihtsalt üks musitseerimisviis. '
          'Kui enne sõda sümboliseeris saksofoniga neeger paljude õpetatud meeste '
          'silmis Õhtumaa allakäiku, siis sõjajärgses Euroopas hakkas ta tähistama '
          'pigem vabanemist lämmatavast konventsioonide koorikust.')
for sentence in doc.sents:
    for word in sentence:
        # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
        print('\t'.join([str(word.i), str(word), '_', word.pos_, '_',
                         str(word.morph), str(word.head.i), word.dep_, '_', '_']))
    print()  # blank line between sentences
The scores below were computed using the CoNLL 2018 Shared Task evaluation script. More detailed scores for dependency labels can be found in model_comparison.md.
et_dep_ud_sm:

| Metric | Precision | Recall | F1 Score | AligndAcc |
|---|---|---|---|---|
| Tokens | 97.70 | 99.14 | 98.41 | |
| Sentences | 89.56 | 87.80 | 88.67 | |
| Words | 97.70 | 99.14 | 98.41 | |
| UPOS | 88.09 | 89.40 | 88.74 | 90.17 |
| UFeats | 77.88 | 79.03 | 78.45 | 79.71 |
| UAS | 70.79 | 71.84 | 71.31 | 72.46 |
| LAS | 62.71 | 63.64 | 63.17 | 64.19 |
| CLAS | 57.42 | 58.86 | 58.13 | 59.64 |
| MLAS | 45.22 | 46.35 | 45.78 | 46.96 |
| BLEX | 0.00 | 0.00 | 0.00 | 0.00 |

et_dep_ud_estbert:

| Metric | Precision | Recall | F1 Score | AligndAcc |
|---|---|---|---|---|
| Tokens | 97.70 | 99.14 | 98.41 | |
| Sentences | 92.54 | 91.44 | 91.99 | |
| Words | 97.70 | 99.14 | 98.41 | |
| UPOS | 94.69 | 96.09 | 95.38 | 96.92 |
| UFeats | 94.15 | 95.55 | 94.84 | 96.37 |
| UAS | 86.21 | 87.49 | 86.84 | 88.25 |
| LAS | 83.47 | 84.71 | 84.09 | 85.44 |
| CLAS | 81.39 | 82.95 | 82.16 | 84.05 |
| MLAS | 76.28 | 77.74 | 77.00 | 78.77 |

et_dep_ud_xlmroberta:

| Metric | Precision | Recall | F1 Score | AligndAcc |
|---|---|---|---|---|
| Tokens | 97.70 | 99.14 | 98.41 | |
| Sentences | 94.81 | 95.58 | 95.20 | |
| Words | 97.70 | 99.14 | 98.41 | |
| UPOS | 95.43 | 96.84 | 96.13 | 97.68 |
| UFeats | 94.50 | 95.90 | 95.20 | 96.73 |
| UAS | 88.08 | 89.38 | 88.73 | 90.16 |
| LAS | 85.49 | 86.75 | 86.11 | 87.50 |
| CLAS | 83.71 | 85.32 | 84.51 | 86.45 |
| MLAS | 79.60 | 81.14 | 80.36 | 82.21 |
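
The F1 Score column in the tables above can be read as the harmonic mean of the Precision and Recall columns, which is the standard definition and is assumed here to be what the evaluation script reports. The short check below recomputes it for the LAS rows of the three tables; because the table values are rounded to two decimals, the recomputed value may differ in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the LAS rows of the three tables.
las_rows = [
    (62.71, 63.64, 63.17),
    (83.47, 84.71, 84.09),
    (85.49, 86.75, 86.11),
]

for precision, recall, reported in las_rows:
    computed = f1_score(precision, recall)
    # Inputs are rounded, so compare with a small tolerance instead of equality.
    agrees = abs(computed - reported) < 0.02
    print("P=%.2f R=%.2f F1=%.2f reported=%.2f agrees=%s"
          % (precision, recall, computed, reported, agrees))
```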
