fix: split trailing punctuation in sentence mode (#21) #25
Conversation
`normalize_sentence` and the TN sentence loops tokenized via
`split_whitespace`, which left punctuation attached to the previous word
(e.g. `"eight,"`). The cardinal/aviation taggers then never saw a clean
number phrase, producing wrong output like `"United 780 eight,"` for
`"United seven eighty eight, please ..."`.
Add a `pretokenize` helper that peels off leading/trailing ASCII
punctuation (`, . ; : ! ? ( ) [ ] { } "`) into separate pretokens while
remembering each token's original leading separator (`" "` or `""`), and
factor the three identical sliding-window loops (`normalize_sentence_inner`,
`tn_normalize_sentence_with_max_span`, `tn_normalize_sentence_with_max_span_lang`)
into a single shared `sentence_loop`. Output is reconstructed using the
preserved separators so spacing matches the input exactly: `"eight,"`
becomes `["eight", ","]` internally and re-emerges as `"788,"`, while
`"eight , please"` (explicit space) still emits `"788 , please"`.
Apostrophes and hyphens are intentionally excluded from the split set so
contractions ("don't") and hyphenated words ("twenty-one") stay intact.
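The description above can be sketched as follows. This is a hypothetical reconstruction of the `pretokenize` helper, not the PR's actual code: the `Pretoken` field names and the handling of the very first token's separator are assumptions.

```rust
/// Minimal sketch of a `pretokenize` helper as described above.
#[derive(Debug, PartialEq)]
pub struct Pretoken<'a> {
    pub text: &'a str, // core text, or a single split-off punctuation char
    pub sep: &'a str,  // original leading separator: " " or "" (adjacent)
}

fn is_split_punct(c: char) -> bool {
    // Apostrophes and hyphens are deliberately NOT in the split set,
    // so "don't" and "twenty-one" survive intact.
    matches!(c, ',' | '.' | ';' | ':' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '"')
}

pub fn pretokenize(input: &str) -> Vec<Pretoken<'_>> {
    let mut out = Vec::new();
    for (w, word) in input.split_whitespace().enumerate() {
        // Pieces split off the same word get "" as their separator;
        // whitespace-separated words keep " ".
        let mut sep = if w == 0 { "" } else { " " };
        let start = word
            .char_indices()
            .find(|&(_, c)| !is_split_punct(c))
            .map(|(i, _)| i)
            .unwrap_or(word.len());
        let end = word
            .char_indices()
            .rev()
            .find(|&(_, c)| !is_split_punct(c))
            .map(|(i, c)| i + c.len_utf8())
            .unwrap_or(start);
        // Emit leading punctuation, the core, then trailing punctuation.
        for (i, c) in word[..start].char_indices() {
            out.push(Pretoken { text: &word[i..i + c.len_utf8()], sep });
            sep = "";
        }
        if start < end {
            out.push(Pretoken { text: &word[start..end], sep });
            sep = "";
        }
        for (i, c) in word[end..].char_indices() {
            out.push(Pretoken { text: &word[end + i..end + i + c.len_utf8()], sep });
            sep = "";
        }
    }
    out
}
```

Under this sketch, `"eight, please"` yields `["eight", ",", "please"]` with separators `["", "", " "]`, so joining `text` pieces with their `sep` values reproduces the input spacing exactly.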
```rust
let span: String = pretokens[i..end]
    .iter()
    .map(|p| p.text.as_str())
    .collect::<Vec<_>>()
    .join(" ");
```
🔴 Pretokenizer breaks TN whitelist abbreviations ending with '.' in sentence mode
The `pretokenize` function splits trailing periods from whitespace tokens (e.g., `"Dr."` → `["Dr", "."]`). In `sentence_loop`, pretokens are then joined with spaces to form spans for the parser (`.join(" ")`), so the parser sees `"Dr ."` instead of `"Dr."`. This causes two classes of TN sentence-mode regressions:

- Entries with only the period form (e.g., `"e.g."`, `"etc."`, `"Prof."`, `"Inc."`, `"approx."` — ~25 entries in the English TN whitelist alone): the abbreviation is never matched. `"e.g."` → pretokens `["e.g", "."]` → the single-pretoken span `"e.g"` has no whitelist hit, and the two-pretoken span `"e.g ."` also has no hit, so the abbreviation passes through unnormalized.
- Entries with both forms (e.g., `"Dr"`/`"Dr."`, `"vs"`/`"vs."`): the bare form matches, but the period remains as a separate pretoken, producing a spurious period: `"Dr. Smith"` → `"doctor. Smith"` instead of `"doctor Smith"`.
This affects all 7 TN language whitelists (en, fr, es, de, zh, hi, ja) in tn_normalize_sentence, tn_normalize_sentence_with_max_span, and tn_normalize_sentence_with_max_span_lang.
Trace for "e.g." through pretokenize + sentence_loop
```
word = "e.g." → chars [('e',0), ('.',1), ('g',2), ('.',3)]
start scan: 'e' not punct → start=0
end scan: chars[3]='.' is punct → end=3; chars[2]='g' not punct → stop
Core: word[0..3] = "e.g", trailing: "."
Pretokens: ["e.g", "."]

In sentence_loop:
Span [0..2]: "e.g" + " " + "." = "e.g ." → WHITELIST.get("e.g .") → None
Span [0..1]: "e.g" → WHITELIST.get("e.g") → None (only "e.g." is in the whitelist)
Fallback: output "e.g" then ".", no normalization occurs
```
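The trace above boils down to a simple key-lookup mismatch. The toy reproduction below uses a stand-in `HashMap` for the real static whitelist in `src/tn/en/whitelist.rs`; the entry and function name are illustrative only.

```rust
use std::collections::HashMap;

// Toy reproduction of the whitelist miss traced above: the space-joined
// span can never match the period form of an abbreviation, and neither
// can the bare core, but the adjacency-preserving join can.
fn whitelist_lookups() -> [Option<&'static str>; 3] {
    let whitelist: HashMap<&'static str, &'static str> =
        [("e.g.", "for example")].into_iter().collect();
    let pretokens = ["e.g", "."]; // what pretokenize produces for "e.g."
    let space_joined = pretokens.join(" "); // "e.g ." — what sentence_loop builds
    let adjacent = pretokens.concat();      // "e.g."  — the original text
    [
        whitelist.get(space_joined.as_str()).copied(), // None: "e.g ." is not a key
        whitelist.get(pretokens[0]).copied(),          // None: bare "e.g" is not a key
        whitelist.get(adjacent.as_str()).copied(),     // Some: "e.g." matches
    ]
}
```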
Prompt for agents
The root cause is that sentence_loop joins all pretokens with spaces when constructing spans for the parser. This destroys the original adjacency of split-off punctuation: pretokens ["Dr", "."] from the word "Dr." become span "Dr ." instead of "Dr.", breaking TN whitelist matching.
The fix needs to reconstruct the original text when building spans, not unconditionally insert spaces. The Pretoken.sep field already records the original separator ("" for adjacent/split pretokens, " " for whitespace-separated). The span construction at src/lib.rs:1115-1119 could use sep values for pretokens after the first:
```rust
pretokens[i..end].iter().enumerate().fold(String::new(), |mut s, (j, p)| {
    if j > 0 { s.push_str(p.sep); }
    s.push_str(&p.text);
    s
})
```
However, this would make the span include the glued punctuation (e.g., "twenty one,"), which was the original problem the pretokenizer aimed to fix. A more nuanced approach is needed, such as:
1. Reconstruct spans using sep (preserving adjacency) so parsers see original text like "Dr." or "e.g.", and ALSO try the sub-span without trailing punctuation pretokens so "twenty one" can still be tried without the comma.
2. Alternatively, make the pretokenizer TN-aware: skip splitting when the full token matches a whitelist entry.
3. Or exclude '.' from is_split_punct and handle trailing periods differently (e.g., the parser strips trailing periods before matching, or the sentence_loop retries without the last character on failure).
All TN language whitelists in src/tn/{en,fr,es,de,zh,hi,ja}/whitelist.rs are affected. Affected public APIs: tn_normalize_sentence, tn_normalize_sentence_with_max_span, tn_normalize_sentence_with_max_span_lang.
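A self-contained sketch of option 1's span reconstruction, with `Pretoken` as a stand-in for the type in `src/lib.rs` (field names are assumptions):

```rust
// Rebuild each candidate span from the per-pretoken `sep`, preserving
// original adjacency instead of forcing a space between every pretoken.
pub struct Pretoken {
    pub text: String,
    pub sep: &'static str, // "" if split off an adjacent word, " " otherwise
}

pub fn build_span(pretokens: &[Pretoken]) -> String {
    pretokens
        .iter()
        .enumerate()
        .fold(String::new(), |mut s, (j, p)| {
            if j > 0 {
                s.push_str(p.sep); // adjacent pieces glue back together
            }
            s.push_str(&p.text);
            s
        })
}
```

With this, `["Dr", "."]` (second `sep` = `""`) rebuilds as `"Dr."` for whitelist matching, while `["twenty", "one"]` (`sep` = `" "`) still rebuilds as `"twenty one"`; longest-span-first iteration can separately try the sub-span without the trailing punctuation pretoken.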
…ching (PR #25 review) (#26)

Issue #22: bare "second" was always normalized to "2nd" by the ordinal tagger, breaking sentences like "Give me a second to check." Adds an opt-in `disable_bare_second` flag on `NormalizeOptions` that skips the ordinal tagger only for the single-token, case-insensitive word "second". Compound forms ("twenty second" -> "22nd") and date contexts are unaffected. Default is `false` so existing behavior is preserved. Plumbed through: `NormalizeOptions` builder, `parse_span`, `normalize_sentence_inner`, `normalize_inner`, FFI signatures (`nemo_normalize_with_options`, `nemo_normalize_sentence_with_options`), WASM bindings, Swift wrapper, and C headers.

Also addresses the Devin AI review on PR #25 (#25 (review)): the new pretokenizer splits trailing punctuation off of words so ITN can match "twenty one," as "twenty one", but the shared `sentence_loop` re-joined pretokens with a literal space. That made the TN whitelist see "Dr ." instead of "Dr.", so abbreviation entries like "e.g.", "Prof.", "Inc.", "etc.", "vs." stopped matching, and both-form entries like "Dr"/"Dr." left an orphaned period ("I see Dr. Smith" -> "I see doctor. Smith"). `sentence_loop` now reconstructs each candidate span using the per-pretoken `sep`, preserving original adjacency. The longest-span-first iteration still tries shorter spans, so trailing-punctuation cases ("twenty one,") continue to work.

Tests:
- tests/en_tests.rs: 3 issue #22 regression tests (default unchanged, sentence-mode opt-in, single-expression opt-in).
- tests/en_tn_tests.rs: `test_pr25_tn_abbreviation_regression` covering Dr., vs., e.g., Inc., Co., Prof.
- src/ffi.rs: updated existing FFI tests and added `test_ffi_normalize_sentence_with_options_disable_bare_second`.
- All 1070 tests pass with `--features ffi`.
Summary
- Trailing punctuation glued to the previous word (e.g. `"eight,"`) used to block the cardinal/aviation taggers in sentence mode, producing wrong output like `"United 780 eight,"` instead of `"United 788,"`.
- Adds a `pretokenize` helper that peels leading/trailing ASCII punctuation (`, . ; : ! ? ( ) [ ] { } "`) into separate pretokens while remembering each token's original leading separator, so output spacing matches input exactly.
- Factors the three identical sliding-window loops (`normalize_sentence_inner`, `tn_normalize_sentence_with_max_span`, `tn_normalize_sentence_with_max_span_lang`) into a single shared `sentence_loop` that consumes pretokens.

Behavior
| Input | Before | After |
| --- | --- | --- |
| `United seven eighty eight, please ...` | `United 780 eight, please ...` | `United 788, please ...` |
| `United seven eighty eight. please ...` | `United 780 eight. please ...` | `United 788. please ...` |
| `United seven eighty eight , please ...` | `United 788 , please ...` | `United 788 , please ...` (unchanged) |
| `I have (twenty one) apples!` | `I have (twenty one) apples!` | `I have (21) apples!` |
| `don't eat twenty one apples` | `don't eat 21 apples` | `don't eat 21 apples` (unchanged) |

Apostrophes and hyphens are intentionally excluded from the split set so contractions (`don't`) and hyphenated words (`twenty-one`) stay intact.

Test plan
- `cargo fmt`
- `cargo build --features ffi`
- `cargo test --features ffi` — full suite passes (lib + en/de/fr/es/zh/ja/hi/lang/extensive/ffi/doc)
- New tests in tests/en_tests.rs:
  - `test_issue_21_trailing_comma`
  - `test_issue_21_trailing_period`
  - `test_issue_21_space_before_punct_preserved`
  - `test_issue_21_other_punctuation` (parens, quotes, contractions)