Matcher does not match "LOWER" pattern when the vocabulary is loaded from disk #5668

hertelm · 2020-06-29T14:23:32Z

I ran into an issue with the entity linking example from https://spacy.io/usage/training#entity-linker.
Apparently, the Matcher used in the EntityRuler does not match the "LOWER" patterns.

How to reproduce the behaviour

I created and saved the knowledge base and vocabulary with the provided code, by calling python3 create_kb.py en_core_web_lg -o kb (using create_kb.py from https://spacy.io/usage/training#kb).
Then I tried to start the training with python3 train_entity_linker.py kb/kb kb/vocab (using train_entity_linker.py from https://spacy.io/usage/training#entity-linker-model),
but I get [E188] Could not match the gold entity links to entities in the doc.

By printing the entities in the training documents, I found that the EntityRuler does not recognize any entity. However, when I change nlp = spacy.blank("en", vocab=vocab) into

nlp = spacy.blank("en")
nlp.vocab = vocab

everything works and I get the expected output.

Minimum reproduction example

I narrowed the problem down further. It looks like a bug with the "LOWER" patterns in the Matcher that only occurs when the vocabulary is passed to the spacy.blank(...) method. This is a minimal example:

import spacy
from spacy.matcher import Matcher
from spacy.vocab import Vocab


vocab_model = spacy.load("en_core_web_lg")
vocab_model.vocab.to_disk("vocab")
vocab = Vocab().from_disk("vocab")

nlp = spacy.blank("en", vocab=vocab)

matcher = Matcher(nlp.vocab)
# pairs of identifiers and patterns for the matcher:
patterns = [("ceo", [{"TEXT": "CEO"}]),
            ("san_francisco", [{"TEXT": "San"}, {"TEXT": "Francisco"}]),
            ("apple", [{"LOWER": "apple"}]),
            ("russ_cochran", [{"LOWER": "russ"}, {"LOWER": "cochran"}])]
for identifier, pattern in patterns:
    matcher.add(identifier, None, pattern)

text = "Russ Cochran from San Francisco is the CEO of Apple."

doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

The expected outcome is

674883140622982222 russ_cochran 0 2 Russ Cochran
10446637608577279376 san_francisco 3 5 San Francisco
18285542385449738063 ceo 7 8 CEO
8566208034543834098 apple 9 10 Apple

but the actual outcome is the following, where only the "TEXT" patterns matched, but not the "LOWER" patterns:

10446637608577279376 san_francisco 3 5 San Francisco
18285542385449738063 ceo 7 8 CEO

Again, the fix from above works. Also, providing the vocabulary without saving and loading from disk works:
nlp = spacy.blank("en", vocab=vocab_model.vocab)
However, as the code examples show, the Matcher should also work when the vocabulary is loaded from disk and provided in spacy.blank(...), but this currently fails.

Your Environment

spaCy version: 2.3.0
Platform: Linux-4.15.0-108-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9

Answered by adrianeboyd

adrianeboyd · 2020-06-30T09:05:17Z

The to_disk() methods save all the model data that can be serialized safely but don't save the entire object. In particular, they don't save any of the methods that are part of the language configuration in spacy/lang/lg. To load the object back in an identical state, you need the language-specific initialization with the language settings from the library plus the data from the saved model. See https://spacy.io/usage/saving-loading#pipeline and https://spacy.io/usage/processing-pipelines#pipelines.

The crucial part is this:

vocab = Vocab()

When you initialize a Vocab without any of its arguments, it's missing many of its normal language-specific settings including the lexical attribute settings related to lower, which are part of lex_attr_getters.

Instead you can use:

nlp = spacy.blank("en")
nlp.vocab.from_disk("/path/to/vocab")

Or if you want to see more explicitly how it's using the English settings when creating the vocab, another alternative is:

from spacy.lang.en import EnglishDefaults

vocab = EnglishDefaults.create_vocab().from_disk("/path/to/vocab")
nlp = spacy.blank("en", vocab=vocab)
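
To illustrate why the missing lex_attr_getters matter, here is a rough toy model in plain Python. It is not spaCy's implementation: ToyVocab and toy_match are hypothetical names made up for this sketch. It only assumes that derived attributes such as LOWER are computed by getter functions registered on the vocab, while the verbatim text is always available.

```python
from typing import Callable, Dict, List, Optional


class ToyVocab:
    """Hypothetical stand-in for a vocab; NOT spaCy's Vocab class."""

    def __init__(self, lex_attr_getters: Optional[Dict[str, Callable[[str], str]]] = None):
        # With no getters (like a Vocab created without arguments),
        # no derived attributes can be computed.
        self.lex_attr_getters = lex_attr_getters or {}

    def lexeme(self, text: str) -> Dict[str, str]:
        attrs = {"TEXT": text}  # the verbatim text is always stored
        for name, getter in self.lex_attr_getters.items():
            attrs[name] = getter(text)
        return attrs


def toy_match(vocab: ToyVocab, words: List[str], pattern: List[Dict[str, str]]) -> List[int]:
    """Return the start indices where every token spec in `pattern` matches."""
    lexemes = [vocab.lexeme(w) for w in words]
    hits = []
    for start in range(len(lexemes) - len(pattern) + 1):
        window = lexemes[start:start + len(pattern)]
        # A missing attribute is treated as an empty string, so it never
        # equals a non-empty pattern value.
        if all(lex.get(attr, "") == value
               for lex, spec in zip(window, pattern)
               for attr, value in spec.items()):
            hits.append(start)
    return hits


words = "Russ Cochran from San Francisco is the CEO of Apple .".split()

bare = ToyVocab()                              # analogous to Vocab()
english_like = ToyVocab({"LOWER": str.lower})  # analogous to language defaults

text_pattern = [{"TEXT": "San"}, {"TEXT": "Francisco"}]
lower_pattern = [{"LOWER": "russ"}, {"LOWER": "cochran"}]

print(toy_match(bare, words, text_pattern))           # TEXT needs no getter
print(toy_match(bare, words, lower_pattern))          # LOWER was never computed
print(toy_match(english_like, words, lower_pattern))  # LOWER getter registered
```

In this sketch the "TEXT" pattern matches with either vocab, while the "LOWER" pattern only matches once a getter for it is registered, which mirrors the behaviour reported above.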
