
fix: split trailing punctuation in sentence mode (#21)#25

Merged
Alex-Wengg merged 1 commit into main from fix/issue-21-trailing-punctuation
Apr 27, 2026

Conversation

Member

@Alex-Wengg Alex-Wengg commented Apr 27, 2026

Summary

  • Fixes #21 ("Punctuation behind number caused number not processed") — trailing punctuation glued to a word (`"eight,"`) used to block the cardinal/aviation taggers in sentence mode, producing wrong output like `"United 780 eight,"` instead of `"United 788,"`.
  • Adds a pretokenize helper that peels leading/trailing ASCII punctuation (, . ; : ! ? ( ) [ ] { } ") into separate pretokens while remembering each token's original leading separator, so output spacing matches input exactly.
  • Factors the three duplicated sliding-window loops (normalize_sentence_inner, tn_normalize_sentence_with_max_span, tn_normalize_sentence_with_max_span_lang) into a single shared sentence_loop that consumes pretokens.
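The peeling step described above can be sketched as follows. This is an illustrative reconstruction, not the actual `src/lib.rs` code: the `Pretoken` struct, `is_split_punct`, and `pretokenize` names mirror the PR description, but their exact shapes here are assumptions.

```rust
// Illustrative sketch of the pretokenize idea: peel leading/trailing ASCII
// punctuation into separate pretokens, remembering each token's original
// leading separator so output spacing can match the input exactly.
#[derive(Debug, PartialEq)]
struct Pretoken {
    text: String,
    sep: &'static str, // original leading separator: "" or " "
}

fn is_split_punct(c: char) -> bool {
    matches!(
        c,
        ',' | '.' | ';' | ':' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '"'
    )
}

fn pretokenize(input: &str) -> Vec<Pretoken> {
    let mut out = Vec::new();
    for (i, word) in input.split_whitespace().enumerate() {
        // the first pretoken of each word carries the word's leading separator
        let mut sep = if i == 0 { "" } else { " " };
        let chars: Vec<char> = word.chars().collect();
        let mut start = 0;
        while start < chars.len() && is_split_punct(chars[start]) {
            start += 1;
        }
        let mut end = chars.len();
        while end > start && is_split_punct(chars[end - 1]) {
            end -= 1;
        }
        // emit leading punctuation, the core word, then trailing punctuation
        for j in 0..start {
            out.push(Pretoken { text: chars[j].to_string(), sep });
            sep = "";
        }
        if start < end {
            out.push(Pretoken { text: chars[start..end].iter().collect(), sep });
            sep = "";
        }
        for j in end..chars.len() {
            out.push(Pretoken { text: chars[j].to_string(), sep });
            sep = "";
        }
    }
    out
}
```

Under this sketch, `"eight,"` becomes the two pretokens `["eight", ","]` with the comma's `sep` empty, so the comma re-attaches on output, while an explicit `" ,"` in the input keeps its space.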

Behavior

| Input | Before | After |
| --- | --- | --- |
| `United seven eighty eight, please ...` | `United 780 eight, please ...` | `United 788, please ...` |
| `United seven eighty eight. please ...` | `United 780 eight. please ...` | `United 788. please ...` |
| `United seven eighty eight , please ...` | `United 788 , please ...` (already worked) | `United 788 , please ...` (unchanged) |
| `I have (twenty one) apples!` | `I have (twenty one) apples!` | `I have (21) apples!` |
| `don't eat twenty one apples` | `don't eat 21 apples` | `don't eat 21 apples` (unchanged) |

Apostrophes and hyphens are intentionally excluded from the split set so contractions (don't) and hyphenated words (twenty-one) stay intact.

Test plan

  • cargo fmt
  • cargo build --features ffi
  • cargo test --features ffi — full suite passes (lib + en/de/fr/es/zh/ja/hi/lang/extensive/ffi/doc)
  • 4 new regression tests in tests/en_tests.rs:
    • test_issue_21_trailing_comma
    • test_issue_21_trailing_period
    • test_issue_21_space_before_punct_preserved
    • test_issue_21_other_punctuation (parens, quotes, contractions)


`normalize_sentence` and the TN sentence loops tokenized via
`split_whitespace`, which left punctuation attached to the previous word
(e.g. `"eight,"`). The cardinal/aviation taggers then never saw a clean
number phrase, producing wrong output like `"United 780 eight,"` for
`"United seven eighty eight, please ..."`.

Add a `pretokenize` helper that peels off leading/trailing ASCII
punctuation (`, . ; : ! ? ( ) [ ] { } "`) into separate pretokens while
remembering each token's original leading separator (`" "` or `""`), and
factor the three identical sliding-window loops (`normalize_sentence_inner`,
`tn_normalize_sentence_with_max_span`, `tn_normalize_sentence_with_max_span_lang`)
into a single shared `sentence_loop`. Output is reconstructed using the
preserved separators so spacing matches the input exactly: `"eight,"`
becomes `["eight", ","]` internally and re-emerges as `"788,"`, while
`"eight , please"` (explicit space) still emits `"788 , please"`.

Apostrophes and hyphens are intentionally excluded from the split set so
contractions ("don't") and hyphenated words ("twenty-one") stay intact.
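The shared loop described in this commit message can be sketched roughly as below. Names and signatures are illustrative stand-ins, not the real `src/lib.rs` code; `parse_span` here substitutes for the tagger/whitelist machinery, and the `.join(" ")` span construction matches this PR as merged (the Devin review below flags it, and PR #26 later switched span construction to use `sep`).

```rust
// Hedged sketch of a shared sliding-window sentence loop over pretokens:
// longest span first, shrinking until something parses, falling back to
// emitting the pretoken verbatim. Output is rebuilt from each pretoken's
// preserved leading separator so spacing matches the input.
struct Pretoken {
    text: String,
    sep: &'static str, // original leading separator, "" or " "
}

fn sentence_loop(
    pretokens: &[Pretoken],
    max_span: usize,
    parse_span: impl Fn(&str) -> Option<String>,
) -> String {
    let mut out = String::new();
    let mut i = 0;
    while i < pretokens.len() {
        let limit = (i + max_span).min(pretokens.len());
        let mut matched = false;
        // try the longest window first, shrinking on failure
        for end in (i + 1..=limit).rev() {
            let span = pretokens[i..end]
                .iter()
                .map(|p| p.text.as_str())
                .collect::<Vec<_>>()
                .join(" ");
            if let Some(norm) = parse_span(&span) {
                out.push_str(pretokens[i].sep); // preserved input spacing
                out.push_str(&norm);
                i = end;
                matched = true;
                break;
            }
        }
        if !matched {
            out.push_str(pretokens[i].sep);
            out.push_str(&pretokens[i].text);
            i += 1;
        }
    }
    out
}
```

With a toy `parse_span` that maps only `"seven eighty eight"` to `"788"`, the pretokens for `"United seven eighty eight, please"` come back out as `"United 788, please"`: the split-off comma is a separate pretoken, so the number phrase is seen clean, and the comma's empty separator glues it back on.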
@Alex-Wengg Alex-Wengg merged commit bd70a7c into main Apr 27, 2026
5 checks passed

@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.


Comment thread: src/lib.rs, lines +1115 to +1119

```rust
let span: String = pretokens[i..end]
    .iter()
    .map(|p| p.text.as_str())
    .collect::<Vec<_>>()
    .join(" ");
```

🔴 Pretokenizer breaks TN whitelist abbreviations ending with '.' in sentence mode

The `pretokenize` function splits trailing periods from whitespace tokens (e.g., `"Dr."` → `["Dr", "."]`). In `sentence_loop`, pretokens are then joined with spaces to form spans for the parser (`.join(" ")`), so the parser sees `"Dr ."` instead of `"Dr."`. This causes two classes of TN sentence-mode regressions:

  1. Entries with only the period form (e.g., "e.g.", "etc.", "Prof.", "Inc.", "approx." — ~25 entries in the English TN whitelist alone): the abbreviation is never matched. "e.g." → pretokens ["e.g", "."] → single-pretoken span "e.g" has no whitelist hit, two-pretoken span "e.g ." also has no hit → abbreviation passes through unnormalized.

  2. Entries with both forms (e.g., "Dr"/"Dr.", "vs"/"vs."): the bare form matches, but the period remains as a separate pretoken, producing a spurious period: `"Dr. Smith"` → `"doctor. Smith"` instead of `"doctor Smith"`.

This affects all 7 TN language whitelists (en, fr, es, de, zh, hi, ja) in tn_normalize_sentence, tn_normalize_sentence_with_max_span, and tn_normalize_sentence_with_max_span_lang.

Trace for `"e.g."` through `pretokenize` + `sentence_loop`:

```
word = "e.g." → chars [('e',0), ('.',1), ('g',2), ('.',3)]
start scan: 'e' not punct → start = 0
end scan: chars[3] = '.' is punct → end = 3; chars[2] = 'g' not punct → stop
core: word[0..3] = "e.g", trailing: "."
pretokens: ["e.g", "."]

In sentence_loop:
span [0..2]: "e.g" + " " + "." = "e.g ." → WHITELIST.get("e.g .") → None
span [0..1]: "e.g" → WHITELIST.get("e.g") → None (only "e.g." is in the whitelist)
fallback: output "e.g" then "." — no normalization occurs
```

Prompt for agents
The root cause is that sentence_loop joins all pretokens with spaces when constructing spans for the parser. This destroys the original adjacency of split-off punctuation: pretokens ["Dr", "."] from the word "Dr." become span "Dr ." instead of "Dr.", breaking TN whitelist matching.

The fix needs to reconstruct the original text when building spans, not unconditionally insert spaces. The Pretoken.sep field already records the original separator ("" for adjacent/split pretokens, " " for whitespace-separated). The span construction at src/lib.rs:1115-1119 could use sep values for pretokens after the first:

```rust
pretokens[i..end].iter().enumerate().fold(String::new(), |mut s, (j, p)| {
    if j > 0 { s.push_str(p.sep); }
    s.push_str(&p.text);
    s
})
```

However, this would make the span include the glued punctuation (e.g., "twenty one,"), which was the original problem the pretokenizer aimed to fix. A more nuanced approach is needed, such as:

1. Reconstruct spans using sep (preserving adjacency) so parsers see original text like "Dr." or "e.g.", and ALSO try the sub-span without trailing punctuation pretokens so "twenty one" can still be tried without the comma.

2. Alternatively, make the pretokenizer TN-aware: skip splitting when the full token matches a whitelist entry.

3. Or exclude '.' from is_split_punct and handle trailing periods differently (e.g., the parser strips trailing periods before matching, or the sentence_loop retries without the last character on failure).

All TN language whitelists in src/tn/{en,fr,es,de,zh,hi,ja}/whitelist.rs are affected. Affected public APIs: tn_normalize_sentence, tn_normalize_sentence_with_max_span, tn_normalize_sentence_with_max_span_lang.
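As a standalone function, option 1's sep-aware span reconstruction might look like the sketch below. The `Pretoken` struct and `span_text` name are illustrative, not the actual repository code.

```rust
// Minimal sketch of option 1 above: rebuild a candidate span from each
// pretoken's recorded leading separator, so split-off punctuation keeps
// its original adjacency ("Dr." stays "Dr.") while explicitly spaced
// punctuation keeps its space.
struct Pretoken {
    text: String,
    sep: &'static str, // "" for adjacent/split pretokens, " " otherwise
}

fn span_text(pretokens: &[Pretoken]) -> String {
    pretokens
        .iter()
        .enumerate()
        .fold(String::new(), |mut s, (j, p)| {
            if j > 0 {
                s.push_str(p.sep); // skip the first token's leading separator
            }
            s.push_str(&p.text);
            s
        })
}
```

With this reconstruction the parser sees `"Dr."` for the pretokens `["Dr", "."]`, and the longest-span-first iteration can still try the sub-span without the trailing punctuation pretoken for cases like `"twenty one,"`.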


Alex-Wengg added a commit that referenced this pull request Apr 27, 2026
…ching (PR #25 review) (#26)

Issue #22: bare "second" was always normalized to "2nd" by the ordinal
tagger, breaking sentences like "Give me a second to check." Adds an
opt-in `disable_bare_second` flag on NormalizeOptions that skips the
ordinal tagger only for the single-token, case-insensitive word
"second". Compound forms ("twenty second" -> "22nd") and date contexts
are unaffected. Default is `false` so existing behavior is preserved.

Plumbed through: NormalizeOptions builder, parse_span,
normalize_sentence_inner, normalize_inner, FFI signatures
(nemo_normalize_with_options, nemo_normalize_sentence_with_options),
WASM bindings, Swift wrapper, and C headers.

Also addresses Devin AI review on PR #25
(#25 (review)):
the new pretokenizer splits trailing punctuation off of words so ITN
can match "twenty one," as "twenty one", but the shared sentence_loop
re-joined pretokens with a literal space. That made the TN whitelist
see "Dr ." instead of "Dr.", so abbreviation entries like "e.g.",
"Prof.", "Inc.", "etc.", "vs." stopped matching, and both-form entries
like "Dr"/"Dr." left an orphaned period ("I see Dr. Smith" -> "I see
doctor. Smith"). sentence_loop now reconstructs each candidate span
using the per-pretoken `sep`, preserving original adjacency. The
longest-span-first iteration still tries shorter spans, so trailing
punctuation cases ("twenty one,") continue to work.

Tests:
- tests/en_tests.rs: 3 issue #22 regression tests (default unchanged,
  sentence-mode opt-in, single-expression opt-in).
- tests/en_tn_tests.rs: test_pr25_tn_abbreviation_regression covering
  Dr., vs., e.g., Inc., Co., Prof.
- src/ffi.rs: updated existing FFI tests and added
  test_ffi_normalize_sentence_with_options_disable_bare_second.
- All 1070 tests pass with --features ffi.
@Alex-Wengg Alex-Wengg mentioned this pull request Apr 27, 2026
4 tasks
Alex-Wengg added a commit that referenced this pull request Apr 27, 2026
- feat: unified NormalizeOptions API + fix #23 compound concat (#24)
- fix: split trailing punctuation in sentence mode (#21) (#25)
- fix: add disable_bare_second flag (#22) + restore TN abbreviation matching (#26)
- fix(ci): pass --features to cargo via -- separator in wasm-pack (#27)


Development

Successfully merging this pull request may close these issues.

Punctuation behind number caused number not processed

1 participant