Add rule lower lemmatization by jademlc · Pull Request #1 · jademlc/spaCy

jademlc · 2022-09-07T08:06:30Z

The special lemmatization will lowercase the whole text, even proper nouns

Description

Added a function that defines a special_lemmatization, which returns text.lower() even when the token is detected as a proper noun. This change is useful when using spaCy Matchers on noisy texts that contain a lot of uppercase words. This was discussed in this issue: explosion#11051
With the special lemmatizer, the lemma of 'CAT' will be 'cat' instead of 'CAT'.
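To illustrate why this matters for lemma-based matching, here is a minimal sketch that does not use spaCy at all. `toy_lemma` and `match_lemma` are hypothetical stand-ins, and the assumption that a tagger labels all-caps words as proper nouns is only for illustration; the sketch models just the casing decision described above, not the real rule tables.

```python
from typing import List, Tuple

# Hypothetical (text, coarse POS tag) pairs standing in for spaCy tokens.
TaggedToken = Tuple[str, str]


def toy_lemma(text: str, pos: str, lower_propn: bool) -> str:
    """Model only the casing decision: proper nouns keep their casing
    unless lower_propn is set; everything else is lowercased."""
    if pos == "PROPN" and not lower_propn:
        return text
    return text.lower()


def match_lemma(tokens: List[TaggedToken], target: str, lower_propn: bool) -> List[int]:
    """Return the indices of tokens whose toy lemma equals target."""
    return [
        i
        for i, (text, pos) in enumerate(tokens)
        if toy_lemma(text, pos, lower_propn) == target
    ]


# Assumed tagging of a noisy, all-caps fragment (illustrative only).
noisy = [("THE", "PROPN"), ("CAT", "PROPN"), ("sat", "VERB")]

default_hits = match_lemma(noisy, "cat", lower_propn=False)
lowered_hits = match_lemma(noisy, "cat", lower_propn=True)
```

Under these assumptions, a pattern on the lemma "cat" finds nothing with the default casing rule and finds the second token once proper nouns are lowercased.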

Types of change

This change is an enhancement of the lemmatizer.

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

YaYaB · 2022-09-07T12:38:34Z

        self.cache[cache_key] = forms
        return forms
+
+    def special_lemmatize(self, token: Token) -> List[str]:


It would be better to rename it rule_lower_lemmatize so the name is explicit.

YaYaB · 2022-09-07T12:51:45Z

+        if not forms:
+            forms.append(orig)
+        self.cache[cache_key] = forms
+        return forms


This duplicates a lot of code just to remove

    if univ_pos == "propn":
        return [string]
    else:

I would move the body of rule_lemmatize into a private method, __rule_lemmatize__, that takes a tolower argument.

    # __ as prefix to define it as private
    def __rule_lemmatize__(self, token: Token, tolower: bool = False):
        ...
        ...
        if not tolower and univ_pos == "propn":
            return [string]
        else:
            return [string.lower()]
        ...
        ...

and then create

    def rule_lemmatize(self, token: Token) -> List[str]:
        """Lemmatize using a rule-based approach.

        token (Token): The token to lemmatize.
        RETURNS (list): The available lemmas for the string.

        DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize
        """
        return self.__rule_lemmatize__(token)

    def rule_lower_lemmatize(self, token: Token) -> List[str]:
        """Lemmatize using a special approach that lowercases the whole text.

        token (Token): The token to lemmatize.
        RETURNS (list): The available lemmas for the string.

        DOCS: https://spacy.io/api/lemmatizer#rule_lower_lemmatize
        """
        return self.__rule_lemmatize__(token, tolower=True)
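The refactor suggested in this comment could look roughly like the following self-contained sketch. It is not spaCy code: `FakeToken` and `SketchLemmatizer` are hypothetical stand-ins, the rule tables are omitted, and only the proper-noun casing branch is modeled. Two choices here are assumptions that differ from the comment: the helper uses a single leading underscore (the usual Python convention for private methods), and `tolower` is added to the cache key so the two modes cannot return each other's cached results.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (text and coarse POS only)."""
    text: str
    pos_: str


class SketchLemmatizer:
    """Toy lemmatizer showing one shared helper behind two public methods."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple[str, str, bool], List[str]] = {}

    def _rule_lemmatize(self, token: FakeToken, tolower: bool = False) -> List[str]:
        # Include tolower in the key so both modes can share one cache safely.
        cache_key = (token.text, token.pos_, tolower)
        if cache_key in self.cache:
            return self.cache[cache_key]
        string = token.text
        univ_pos = token.pos_.lower()
        # Only the casing decision is modeled; real rule tables are omitted.
        if not tolower and univ_pos == "propn":
            forms = [string]
        else:
            forms = [string.lower()]
        self.cache[cache_key] = forms
        return forms

    def rule_lemmatize(self, token: FakeToken) -> List[str]:
        """Keep the casing of proper nouns."""
        return self._rule_lemmatize(token)

    def rule_lower_lemmatize(self, token: FakeToken) -> List[str]:
        """Lowercase everything, including proper nouns."""
        return self._rule_lemmatize(token, tolower=True)


lemmatizer = SketchLemmatizer()
cat = FakeToken("CAT", "PROPN")
kept = lemmatizer.rule_lemmatize(cat)
lowered = lemmatizer.rule_lower_lemmatize(cat)
```

Both public methods return the helper's result explicitly; a wrapper that only calls the helper without `return` would yield `None`.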

YaYaB · 2022-09-07T12:52:19Z

        elif self.mode == "rule":
            self.lemmatize = self.rule_lemmatize
+        elif self.mode == "special":
+            self.lemmatize = self.special_lemmatize


        elif self.mode == "rule_lower":
            self.lemmatize = self.rule_lower_lemmatize
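As a rough, self-contained sketch of the mode dispatch being discussed (not the actual spaCy implementation), the class below binds `lemmatize` to a method chosen by the mode string. `ModeSketch`, its placeholder methods, and the error message are hypothetical.

```python
from typing import List


class ModeSketch:
    """Toy class that selects a lemmatize method from a mode string."""

    def __init__(self, mode: str) -> None:
        self.mode = mode
        if self.mode == "rule":
            self.lemmatize = self.rule_lemmatize
        elif self.mode == "rule_lower":
            self.lemmatize = self.rule_lower_lemmatize
        else:
            raise ValueError(f"unknown lemmatizer mode: {mode!r}")

    def rule_lemmatize(self, text: str) -> List[str]:
        # Placeholder: keeps the original casing.
        return [text]

    def rule_lower_lemmatize(self, text: str) -> List[str]:
        # Placeholder: always lowercases.
        return [text.lower()]


rule_result = ModeSketch("rule").lemmatize("CAT")
lower_result = ModeSketch("rule_lower").lemmatize("CAT")
```

Naming the mode after the method ("rule_lower" and `rule_lower_lemmatize`) keeps the configuration value and the implementation easy to match up.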


YaYaB reviewed Sep 7, 2022


jademlc force-pushed the jademlc-patch-1 branch from 9cd5373 to 7cbb15d · September 7, 2022 14:43

jademlc changed the title ~~(WIP) Add special lemmatization~~ Add rule lower lemmatization Sep 7, 2022

Add special lemmatization

5ed72c5

The special lemmatization will lowercase the whole text, even proper nouns

jademlc force-pushed the jademlc-patch-1 branch from 7cbb15d to 5ed72c5 · September 7, 2022 15:03

jademlc wants to merge 1 commit into master from jademlc-patch-1
