No new annotation when keywords are repeated (Window strategy) #18

@scossin

from iamsystem import Matcher
matcher = Matcher.build(
    keywords=["cancer"]
)
text = "cancer cancer"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# cancer	0 6	cancer

It outputs a single annotation even though the word 'cancer' occurs twice. This behavior is explained by a comment in the code:

    # Don't create multiple annotations for the same transition. For example, 'cancer cancer' with keyword 'cancer': if an annotation was created for the first 'cancer' occurrence, don't create a new one for the second occurrence.

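The rationale was to avoid creating two annotations for a repeated word when the window is large. In the example below, the trailing 'prostate' lies within the window (w=20) of 'cancer de', yet only one annotation is produced: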

from iamsystem import Matcher
matcher = Matcher.build(
    keywords=["cancer de prostate"],
    w=20
)
text = "cancer de prostate token token token token prostate"
annots = matcher.annot_text(text=text)
for annot in annots:
    print(annot)
# cancer de prostate	0 18	cancer de prostate

However, this is not appropriate for all use cases and is not what a user expects. By default, each sequence of words that matches a keyword should therefore produce its own annotation.
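To make the difference concrete, here is a minimal sketch in plain Python that contrasts the two behaviors for a single-token keyword. It is not iamsystem's algorithm: `annotate` and its `repeat` flag are hypothetical names introduced only for illustration, and the sketch ignores windows and multi-token keywords.

```python
import re


def annotate(text, keyword, repeat=True):
    """Toy single-token matcher (hypothetical helper, not iamsystem code).

    Returns (keyword, start, end) tuples with an exclusive end offset,
    mirroring the offsets printed in the examples above.
    """
    annots = []
    for m in re.finditer(r"\w+", text):
        if m.group().lower() != keyword.lower():
            continue
        if annots and not repeat:
            # Mimics the reported behavior: the keyword was already
            # annotated once, so later occurrences are skipped.
            continue
        annots.append((keyword, m.start(), m.end()))
    return annots


# Reported behavior: only the first occurrence is annotated.
print(annotate("cancer cancer", "cancer", repeat=False))
# Expected default: one annotation per occurrence.
print(annotate("cancer cancer", "cancer"))
```

With `repeat=True`, the second 'cancer' (offsets 7 to 13) gets its own annotation, which is the default this issue asks for.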
