Add rule lower lemmatization #1

Open
jademlc wants to merge 1 commit into master from jademlc-patch-1

jademlc (Owner) commented Sep 7, 2022

The special lemmatization will lowercase the whole text, even proper nouns

Description

Added a function, special_lemmatize, that defines a special lemmatization mode: it returns text.lower() even when the token is tagged as a proper noun. This is useful when using spaCy Matchers on noisy text that contains a lot of uppercase. It was discussed in this issue: explosion#11051
With the special lemmatizer, the lemma of 'CAT' will be 'cat' instead of 'CAT'.
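
To make the difference concrete, here is a minimal sketch of the fallback behavior described above. It is a toy stand-in and not spaCy's implementation: the function name fallback_lemma and the lower_proper_nouns flag are hypothetical, and it assumes the token was tagged PROPN and that no lemma rules apply to it.

```python
from typing import List


def fallback_lemma(text: str, pos: str, lower_proper_nouns: bool = False) -> List[str]:
    """Toy model of the fallback branch (hypothetical, not spaCy's API).

    By default a proper noun keeps its original casing; with
    lower_proper_nouns=True every token is lowercased.
    """
    if pos.lower() == "propn" and not lower_proper_nouns:
        return [text]
    return [text.lower()]


# Default behavior: the proper noun is returned unchanged.
print(fallback_lemma("CAT", "PROPN"))  # ['CAT']
# Proposed behavior: the proper noun is lowercased as well.
print(fallback_lemma("CAT", "PROPN", lower_proper_nouns=True))  # ['cat']
```

On noisy all-caps input, a tagger may label ordinary words as proper nouns, which is why a lemma-based pattern for 'cat' could miss 'CAT' under the default behavior.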

Types of change

This change is an enhancement of the lemmatizer.

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Comment thread spacy/pipeline/lemmatizer.py Outdated
self.cache[cache_key] = forms
return forms

def special_lemmatize(self, token: Token) -> List[str]:

It would be better to rename this to rule_lower_lemmatize so that the name is explicit.

Comment thread spacy/pipeline/lemmatizer.py Outdated
if not forms:
forms.append(orig)
self.cache[cache_key] = forms
return forms

This duplicates a lot of code just to remove this check:

  if univ_pos == "propn":
      return [string]
  else:

I would move the body of rule_lemmatize into a private helper method, _rule_lemmatize, that takes a tolower argument.

# A single leading underscore marks the method as private by convention
def _rule_lemmatize(self, token: Token, tolower: bool = False) -> List[str]:
    ...
    ...
    if not tolower and univ_pos == "propn":
        return [string]
    else:
        return [string.lower()]
    ...
    ...

and then create

def rule_lemmatize(self, token: Token) -> List[str]:
    """Lemmatize using a rule-based approach.
    token (Token): The token to lemmatize.
    RETURNS (list): The available lemmas for the string.
    DOCS: https://spacy.io/api/lemmatizer#rule_lemmatize
    """
    return self._rule_lemmatize(token)

def rule_lower_lemmatize(self, token: Token) -> List[str]:
    """Lemmatize using a rule-based approach that lowercases every token, including proper nouns.
    token (Token): The token to lemmatize.
    RETURNS (list): The available lemmas for the string.
    DOCS: https://spacy.io/api/lemmatizer#rule_lower_lemmatize
    """
    return self._rule_lemmatize(token, tolower=True)
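
The suggested refactoring can be sketched as a self-contained toy class. This is only an illustration under stated assumptions: FakeToken and ToyLemmatizer are hypothetical stand-ins (not spaCy classes), the helper name _rule_lemmatize is an illustrative choice, and the rule, exception, and index table lookups of the real method are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token with only the fields used here."""
    text: str
    pos_: str


class ToyLemmatizer:
    """Toy class showing one shared helper behind two thin public wrappers."""

    def _rule_lemmatize(self, token: FakeToken, tolower: bool = False) -> List[str]:
        # Shared logic; the real method would consult lemma tables first.
        univ_pos = token.pos_.lower()
        if not tolower and univ_pos == "propn":
            return [token.text]
        return [token.text.lower()]

    def rule_lemmatize(self, token: FakeToken) -> List[str]:
        # Existing behavior: proper nouns keep their casing.
        return self._rule_lemmatize(token)

    def rule_lower_lemmatize(self, token: FakeToken) -> List[str]:
        # New behavior: everything is lowercased, proper nouns included.
        return self._rule_lemmatize(token, tolower=True)


lemmatizer = ToyLemmatizer()
token = FakeToken(text="CAT", pos_="PROPN")
print(lemmatizer.rule_lemmatize(token))        # ['CAT']
print(lemmatizer.rule_lower_lemmatize(token))  # ['cat']
```

Both wrappers return the helper's result, so the proper-noun check lives in exactly one place.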

Comment thread spacy/pipeline/lemmatizer.py Outdated
elif self.mode == "rule":
self.lemmatize = self.rule_lemmatize
elif self.mode == "special":
self.lemmatize = self.special_lemmatize

elif self.mode == "rule_lower":
    self.lemmatize = self.rule_lower_lemmatize
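
As a rough illustration of how a mode string could select the lemmatization method, here is a small self-contained sketch. ModeDispatchSketch is a hypothetical class, not the spaCy Lemmatizer; it uses a dictionary instead of an if/elif chain and plain strings instead of tokens, purely to keep the example runnable on its own.

```python
from typing import Callable, Dict, List


class ModeDispatchSketch:
    """Hypothetical class that binds self.lemmatize according to a mode string."""

    def __init__(self, mode: str = "rule") -> None:
        handlers: Dict[str, Callable[[str, str], List[str]]] = {
            "rule": self.rule_lemmatize,
            "rule_lower": self.rule_lower_lemmatize,
        }
        if mode not in handlers:
            raise ValueError(f"Unsupported mode: {mode!r}")
        self.mode = mode
        self.lemmatize = handlers[mode]

    def rule_lemmatize(self, text: str, pos: str) -> List[str]:
        # Proper nouns keep their casing in the default mode.
        return [text] if pos.lower() == "propn" else [text.lower()]

    def rule_lower_lemmatize(self, text: str, pos: str) -> List[str]:
        # Every token is lowercased in the proposed mode.
        return [text.lower()]


print(ModeDispatchSketch("rule").lemmatize("CAT", "PROPN"))        # ['CAT']
print(ModeDispatchSketch("rule_lower").lemmatize("CAT", "PROPN"))  # ['cat']
```

An explicit mode name such as rule_lower keeps the configuration value aligned with the method it selects.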

jademlc changed the title from "(WIP) Add special lemmatization" to "Add rule lower lemmatization" on Sep 7, 2022