English text normalization MoneyFst conflict with SerialFst and small weight does not take effect #126

@trunglebka

Description

Rule conflict between the MoneyFst and SerialFst taggers

Steps/Code to reproduce bug

Command:

python nemo_text_processing/text_normalization/normalize.py --verbose --text 'Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is $5, each bottle of peanut butter is $3'

Output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is dollar three

Expected behavior

Expected output:

Thank you for the quantities. Now, lets talk about the pricing. The price for each canned salmon is five dollars, each bottle of peanut butter is three dollars

Environment overview

  • Environment location: Bare-metal
  • Method of NeMo install: pip install

Environment details

  • OS version: Fedora 38
  • PyTorch version: 2.0.0
  • Python version: 3.10.10

Additional information
I found that there is a conflict between the MoneyFst and SerialFst taggers.
Both taggers return the same weight, 2404.29785,
computed using pynini.shortestdistance(tagged_lattice, delta=10**-8)[-1].


This is due to the following code:

    classify = (
        pynutil.add_weight(whitelist_graph, 1.01)
        | pynutil.add_weight(time_graph, 1.1)
        | pynutil.add_weight(date_graph, 1.09)
        | pynutil.add_weight(decimal_graph, 1.1)
        | pynutil.add_weight(measure_graph, 1.1)
        | pynutil.add_weight(cardinal_graph, 1.1)
        | pynutil.add_weight(ordinal_graph, 1.1)
        | pynutil.add_weight(money_graph, 1.1)
        | pynutil.add_weight(telephone_graph, 1.1)
        | pynutil.add_weight(electonic_graph, 1.1)
        | pynutil.add_weight(fraction_graph, 1.1)
        | pynutil.add_weight(range_graph, 1.1)
        | pynutil.add_weight(serial_graph, 1.1001)  # should be higher than the rest of the classes
    )

I think the path through serial_graph should get a higher total weight than the path through money_graph, but it does not. To check this, I disabled MoneyFst so that the weight comes from SerialFst (I also changed its olabel to make sure the weight belongs to the best path that contains SerialFst). For this text, here is the total weight for each value of SerialFst's weight in ClassifyFst.classify:

SerialFst weight	Total weight
1.1000	2404.29785
1.1001	2404.29785
1.1002	2404.2981
1.1003	2404.29858
1.1004	2404.29858
1.1005	2404.29883
1.1006	2404.29883
1.1007	2404.29907
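
One possible explanation for the numbers above, assuming the lattice weights are stored as single-precision floats (I have not verified this for this build): near 2404 the gap between two adjacent float32 values is larger than 0.0001, so an extra 0.0001 on serial_graph can be rounded away. The sketch below uses only the Python standard library. The names to_float32, float32_spacing and base are hypothetical helpers for illustration, and base is simply the total from this issue minus 1.1, not a value read from the lattice. It is not meant to reproduce every row of the table, only the tie between 1.1000 and 1.1001.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between float32(x) and the next larger float32 (positive x only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    next_value = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_value - to_float32(x)


TOTAL = 2404.29785   # total weight reported in this issue
base = TOTAL - 1.1   # hypothetical weight of everything except the class weight

money_total = to_float32(base + 1.1)      # path through money_graph
serial_total = to_float32(base + 1.1001)  # path through serial_graph

print("float32 spacing near total:", float32_spacing(TOTAL))
print("money total :", money_total)
print("serial total:", serial_total)
print("tie after rounding:", money_total == serial_total)
```

If this assumption holds, the two paths tie and the extra 0.0001 cannot break the tie in favor of MoneyFst. A larger gap between the two weights (larger than the float32 spacing at the total weight of the sentence) would be needed for the difference to survive.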

English is not my native language, so please forgive me if there is any ambiguity.

Labels: bug
