Matcher does not match "LOWER" pattern when the vocabulary is loaded from disk #5668
I ran into an issue with the entity linking example from https://spacy.io/usage/training#entity-linker.

How to reproduce the behaviour

I created and saved the knowledge base and vocabulary with the provided code. By printing the entities in the training documents, I found that the EntityRuler does not recognize any entity. However, when I change the pipeline setup to

```python
nlp = spacy.blank("en")
nlp.vocab = vocab
```

everything works and I get the expected output.

Minimum reproduction example

I was able to narrow the problem down further. It looks like a bug with the "LOWER" patterns in the Matcher, which only occurs when the vocabulary is passed to the spacy.blank(...) method. This is a minimum example:

```python
import spacy
from spacy.matcher import Matcher
from spacy.vocab import Vocab

# save the vocabulary of a pretrained model, then load it back from disk
vocab_model = spacy.load("en_core_web_lg")
vocab_model.vocab.to_disk("vocab")
vocab = Vocab().from_disk("vocab")

nlp = spacy.blank("en", vocab=vocab)
matcher = Matcher(nlp.vocab)

# pairs of identifiers and patterns for the matcher:
patterns = [
    ("ceo", [{"TEXT": "CEO"}]),
    ("san_francisco", [{"TEXT": "San"}, {"TEXT": "Francisco"}]),
    ("apple", [{"LOWER": "apple"}]),
    ("russ_cochran", [{"LOWER": "russ"}, {"LOWER": "cochran"}]),
]
for identifier, pattern in patterns:
    matcher.add(identifier, None, pattern)

text = "Russ Cochran from San Francisco is the CEO of Apple."
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
```

The expected outcome is that all four patterns match, but in the actual outcome only the "TEXT" patterns matched, not the "LOWER" patterns.

Again, the fix from above works. Also, providing the vocabulary without saving and loading it from disk works.
Replies: 1 comment
The to_disk() methods save all the model data that can be serialized safely but don't save the entire object. In particular, they don't save any of the methods that are part of the language configuration in spacy/lang/lg. To load the object back in an identical state, you need the language-specific initialization with the language settings from the library plus the data from the saved model. See https://spacy.io/usage/saving-loading#pipeline and https://spacy.io/usage/processing-pipelines#pipelines.

The crucial part is this:

When you initialize a Vocab without any of its arguments, it's missing many of its normal language-specific settings including the lexical attribute…

Instead, you can initialize the vocab with the language-specific settings and then load the saved data into it. Or, if you want to see more explicitly how it's using the English settings when creating the vocab, another alternative is to pass those settings when constructing the Vocab.
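To make the mechanism concrete, here is a toy, pure-Python sketch of why a vocabulary restored without its attribute functions can still match on the raw text but not on a derived attribute such as "LOWER". This is not spaCy's implementation: `ToyVocab`, `find_matches`, and `ENGLISH_GETTERS` are hypothetical names made up for illustration, and the JSON file stands in for whatever the real serialization writes.

```python
import json
import tempfile
from pathlib import Path


class ToyVocab:
    """Hypothetical stand-in for a vocabulary: plain data plus getter functions."""

    def __init__(self, lex_attr_getters=None):
        # Getter functions live in code (the "language settings"), not in saved data.
        self.lex_attr_getters = lex_attr_getters or {}
        self.strings = []

    def to_disk(self, path):
        # Only serializable data is written; the getter functions are left out.
        Path(path).write_text(json.dumps({"strings": self.strings}))

    def from_disk(self, path):
        # Loading restores the data but keeps whatever getters this instance was built with.
        self.strings = json.loads(Path(path).read_text())["strings"]
        return self

    def get_attr(self, name, text):
        if name == "TEXT":
            # The verbatim text is always available.
            return text
        getter = self.lex_attr_getters.get(name)
        # Without a getter the derived attribute is never filled in.
        return getter(text) if getter else None


def find_matches(vocab, pattern, tokens):
    """Return indices of tokens whose attributes equal every entry in the pattern."""
    return [
        i
        for i, token in enumerate(tokens)
        if all(vocab.get_attr(attr, token) == value for attr, value in pattern.items())
    ]


# Assumed "language settings": how to compute the LOWER attribute.
ENGLISH_GETTERS = {"LOWER": str.lower}

tokens = ["Apple", "CEO"]
original = ToyVocab(lex_attr_getters=ENGLISH_GETTERS)
original.strings = list(tokens)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab.json"
    original.to_disk(path)
    # Built without arguments: data comes back, getters do not.
    bare = ToyVocab().from_disk(path)
    # Built with the language settings first, then loaded: both are present.
    restored = ToyVocab(lex_attr_getters=ENGLISH_GETTERS).from_disk(path)

bare_text = find_matches(bare, {"TEXT": "CEO"}, tokens)
bare_lower = find_matches(bare, {"LOWER": "apple"}, tokens)
restored_lower = find_matches(restored, {"LOWER": "apple"}, tokens)
print(bare_text, bare_lower, restored_lower)
```

In this toy model the bare vocabulary finds the "TEXT" pattern but nothing for "LOWER", while the one constructed with its getters before loading finds both. In spaCy itself, the corresponding settings are the `lex_attr_getters` argument of `Vocab`, which language classes such as `English` provide through their `Defaults`.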