fix: split trailing punctuation in sentence mode (#21) #25
Conversation
`normalize_sentence` and the TN sentence loops tokenized via
`split_whitespace`, which left punctuation attached to the previous word
(e.g. `"eight,"`). The cardinal/aviation taggers then never saw a clean
number phrase, producing wrong output like `"United 780 eight,"` for
`"United seven eighty eight, please ..."`.
Add a `pretokenize` helper that peels off leading/trailing ASCII
punctuation (`, . ; : ! ? ( ) [ ] { } "`) into separate pretokens while
remembering each token's original leading separator (`" "` or `""`), and
factor the three identical sliding-window loops (`normalize_sentence_inner`,
`tn_normalize_sentence_with_max_span`, `tn_normalize_sentence_with_max_span_lang`)
into a single shared `sentence_loop`. Output is reconstructed using the
preserved separators so spacing matches the input exactly: `"eight,"`
becomes `["eight", ","]` internally and re-emerges as `"788,"`, while
`"eight , please"` (explicit space) still emits `"788 , please"`.
Apostrophes and hyphens are intentionally excluded from the split set so
contractions ("don't") and hyphenated words ("twenty-one") stay intact.
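The description above can be sketched as follows. This is a hypothetical reconstruction of the `pretokenize` helper, not the PR's actual code: the `Pretoken` field names and the handling of the very first token's separator are assumptions.

```rust
/// Minimal sketch of a `pretokenize` helper as described above.
#[derive(Debug, PartialEq)]
pub struct Pretoken<'a> {
    pub text: &'a str, // core text, or a single split-off punctuation char
    pub sep: &'a str,  // original leading separator: " " or "" (adjacent)
}

fn is_split_punct(c: char) -> bool {
    // Apostrophes and hyphens are deliberately NOT in the split set,
    // so "don't" and "twenty-one" survive intact.
    matches!(c, ',' | '.' | ';' | ':' | '!' | '?' | '(' | ')' | '[' | ']' | '{' | '}' | '"')
}

pub fn pretokenize(input: &str) -> Vec<Pretoken<'_>> {
    let mut out = Vec::new();
    for (w, word) in input.split_whitespace().enumerate() {
        // Pieces split off the same word get "" as their separator;
        // whitespace-separated words keep " ".
        let mut sep = if w == 0 { "" } else { " " };
        let start = word
            .char_indices()
            .find(|&(_, c)| !is_split_punct(c))
            .map(|(i, _)| i)
            .unwrap_or(word.len());
        let end = word
            .char_indices()
            .rev()
            .find(|&(_, c)| !is_split_punct(c))
            .map(|(i, c)| i + c.len_utf8())
            .unwrap_or(start);
        // Emit leading punctuation, the core, then trailing punctuation.
        for (i, c) in word[..start].char_indices() {
            out.push(Pretoken { text: &word[i..i + c.len_utf8()], sep });
            sep = "";
        }
        if start < end {
            out.push(Pretoken { text: &word[start..end], sep });
            sep = "";
        }
        for (i, c) in word[end..].char_indices() {
            out.push(Pretoken { text: &word[end + i..end + i + c.len_utf8()], sep });
            sep = "";
        }
    }
    out
}
```

Under this sketch, `"eight, please"` yields `["eight", ",", "please"]` with separators `["", "", " "]`, so joining `text` pieces with their `sep` values reproduces the input spacing exactly.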
```rust
let span: String = pretokens[i..end]
    .iter()
    .map(|p| p.text.as_str())
    .collect::<Vec<_>>()
    .join(" ");
```
🔴 Pretokenizer breaks TN whitelist abbreviations ending with '.' in sentence mode
The `pretokenize` function splits trailing periods from whitespace tokens (e.g., `"Dr."` → `["Dr", "."]`). In `sentence_loop`, pretokens are then joined with spaces to form spans for the parser (`.join(" ")`), so the parser sees `"Dr ."` instead of `"Dr."`. This causes two classes of TN sentence-mode regressions:

- Entries with only the period form (e.g., `"e.g."`, `"etc."`, `"Prof."`, `"Inc."`, `"approx."` — ~25 entries in the English TN whitelist alone): the abbreviation is never matched. `"e.g."` → pretokens `["e.g", "."]` → the single-pretoken span `"e.g"` has no whitelist hit, and the two-pretoken span `"e.g ."` also has no hit, so the abbreviation passes through unnormalized.
- Entries with both forms (e.g., `"Dr"`/`"Dr."`, `"vs"`/`"vs."`): the bare form matches, but the period remains as a separate pretoken, producing a spurious period: `"Dr. Smith"` → `"doctor. Smith"` instead of `"doctor Smith"`.
This affects all 7 TN language whitelists (en, fr, es, de, zh, hi, ja) in tn_normalize_sentence, tn_normalize_sentence_with_max_span, and tn_normalize_sentence_with_max_span_lang.
Trace for "e.g." through pretokenize + sentence_loop
```
word = "e.g." → chars [('e',0), ('.',1), ('g',2), ('.',3)]
start scan: 'e' not punct → start=0
end scan: chars[3]='.' is punct → end=3; chars[2]='g' not punct → stop
Core: word[0..3] = "e.g", trailing: "."
Pretokens: ["e.g", "."]

In sentence_loop:
Span [0..2]: "e.g" + " " + "." = "e.g ." → WHITELIST.get("e.g .") → None
Span [0..1]: "e.g" → WHITELIST.get("e.g") → None (only "e.g." is in the whitelist)
Fallback: output "e.g" then ".", no normalization occurs
```
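The trace above boils down to a simple key-lookup mismatch. The toy reproduction below uses a stand-in `HashMap` for the real static whitelist in `src/tn/en/whitelist.rs`; the entry and function name are illustrative only.

```rust
use std::collections::HashMap;

// Toy reproduction of the whitelist miss traced above: the space-joined
// span can never match the period form of an abbreviation, and neither
// can the bare core, but the adjacency-preserving join can.
fn whitelist_lookups() -> [Option<&'static str>; 3] {
    let whitelist: HashMap<&'static str, &'static str> =
        [("e.g.", "for example")].into_iter().collect();
    let pretokens = ["e.g", "."]; // what pretokenize produces for "e.g."
    let space_joined = pretokens.join(" "); // "e.g ." — what sentence_loop builds
    let adjacent = pretokens.concat();      // "e.g."  — the original text
    [
        whitelist.get(space_joined.as_str()).copied(), // None: "e.g ." is not a key
        whitelist.get(pretokens[0]).copied(),          // None: bare "e.g" is not a key
        whitelist.get(adjacent.as_str()).copied(),     // Some: "e.g." matches
    ]
}
```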
Prompt for agents
The root cause is that sentence_loop joins all pretokens with spaces when constructing spans for the parser. This destroys the original adjacency of split-off punctuation: pretokens ["Dr", "."] from the word "Dr." become span "Dr ." instead of "Dr.", breaking TN whitelist matching.
The fix needs to reconstruct the original text when building spans, not unconditionally insert spaces. The Pretoken.sep field already records the original separator ("" for adjacent/split pretokens, " " for whitespace-separated). The span construction at src/lib.rs:1115-1119 could use sep values for pretokens after the first:
```rust
pretokens[i..end].iter().enumerate().fold(String::new(), |mut s, (j, p)| {
    if j > 0 { s.push_str(p.sep); }
    s.push_str(&p.text);
    s
})
```
However, this would make the span include the glued punctuation (e.g., "twenty one,"), which was the original problem the pretokenizer aimed to fix. A more nuanced approach is needed, such as:
1. Reconstruct spans using sep (preserving adjacency) so parsers see original text like "Dr." or "e.g.", and ALSO try the sub-span without trailing punctuation pretokens so "twenty one" can still be tried without the comma.
2. Alternatively, make the pretokenizer TN-aware: skip splitting when the full token matches a whitelist entry.
3. Or exclude '.' from is_split_punct and handle trailing periods differently (e.g., the parser strips trailing periods before matching, or the sentence_loop retries without the last character on failure).
All TN language whitelists in src/tn/{en,fr,es,de,zh,hi,ja}/whitelist.rs are affected. Affected public APIs: tn_normalize_sentence, tn_normalize_sentence_with_max_span, tn_normalize_sentence_with_max_span_lang.
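A self-contained sketch of option 1's span reconstruction, with `Pretoken` as a stand-in for the type in `src/lib.rs` (field names are assumptions):

```rust
// Rebuild each candidate span from the per-pretoken `sep`, preserving
// original adjacency instead of forcing a space between every pretoken.
pub struct Pretoken {
    pub text: String,
    pub sep: &'static str, // "" if split off an adjacent word, " " otherwise
}

pub fn build_span(pretokens: &[Pretoken]) -> String {
    pretokens
        .iter()
        .enumerate()
        .fold(String::new(), |mut s, (j, p)| {
            if j > 0 {
                s.push_str(p.sep); // adjacent pieces glue back together
            }
            s.push_str(&p.text);
            s
        })
}
```

With this, `["Dr", "."]` (second `sep` = `""`) rebuilds as `"Dr."` for whitelist matching, while `["twenty", "one"]` (`sep` = `" "`) still rebuilds as `"twenty one"`; longest-span-first iteration can separately try the sub-span without the trailing punctuation pretoken.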
…ching (PR #25 review) (#26)

Issue #22: bare "second" was always normalized to "2nd" by the ordinal tagger, breaking sentences like "Give me a second to check." Adds an opt-in `disable_bare_second` flag on `NormalizeOptions` that skips the ordinal tagger only for the single-token, case-insensitive word "second". Compound forms ("twenty second" -> "22nd") and date contexts are unaffected. Default is `false` so existing behavior is preserved. Plumbed through: `NormalizeOptions` builder, `parse_span`, `normalize_sentence_inner`, `normalize_inner`, FFI signatures (`nemo_normalize_with_options`, `nemo_normalize_sentence_with_options`), WASM bindings, Swift wrapper, and C headers.

Also addresses the Devin AI review on PR #25 (#25 (review)): the new pretokenizer splits trailing punctuation off of words so ITN can match "twenty one," as "twenty one", but the shared `sentence_loop` re-joined pretokens with a literal space. That made the TN whitelist see "Dr ." instead of "Dr.", so abbreviation entries like "e.g.", "Prof.", "Inc.", "etc.", "vs." stopped matching, and both-form entries like "Dr"/"Dr." left an orphaned period ("I see Dr. Smith" -> "I see doctor. Smith"). `sentence_loop` now reconstructs each candidate span using the per-pretoken `sep`, preserving original adjacency. The longest-span-first iteration still tries shorter spans, so trailing-punctuation cases ("twenty one,") continue to work.

Tests:
- tests/en_tests.rs: 3 issue #22 regression tests (default unchanged, sentence-mode opt-in, single-expression opt-in).
- tests/en_tn_tests.rs: `test_pr25_tn_abbreviation_regression` covering Dr., vs., e.g., Inc., Co., Prof.
- src/ffi.rs: updated existing FFI tests and added `test_ffi_normalize_sentence_with_options_disable_bare_second`.
- All 1070 tests pass with `--features ffi`.
Summary
- Trailing punctuation glued to the previous word (e.g. `"eight,"`) used to block the cardinal/aviation taggers in sentence mode, producing wrong output like `"United 780 eight,"` instead of `"United 788,"`.
- Adds a `pretokenize` helper that peels leading/trailing ASCII punctuation (`, . ; : ! ? ( ) [ ] { } "`) into separate pretokens while remembering each token's original leading separator, so output spacing matches input exactly.
- Factors the three identical sliding-window loops (`normalize_sentence_inner`, `tn_normalize_sentence_with_max_span`, `tn_normalize_sentence_with_max_span_lang`) into a single shared `sentence_loop` that consumes pretokens.

Behavior
| Input | Before | After |
| --- | --- | --- |
| `United seven eighty eight, please ...` | `United 780 eight, please ...` | `United 788, please ...` |
| `United seven eighty eight. please ...` | `United 780 eight. please ...` | `United 788. please ...` |
| `United seven eighty eight , please ...` | `United 788 , please ...` | `United 788 , please ...` (unchanged) |
| `I have (twenty one) apples!` | `I have (twenty one) apples!` | `I have (21) apples!` |
| `don't eat twenty one apples` | `don't eat 21 apples` | `don't eat 21 apples` (unchanged) |

Apostrophes and hyphens are intentionally excluded from the split set so contractions (`don't`) and hyphenated words (`twenty-one`) stay intact.

Test plan
- `cargo fmt`
- `cargo build --features ffi`
- `cargo test --features ffi` — full suite passes (lib + en/de/fr/es/zh/ja/hi/lang/extensive/ffi/doc)
- New tests in tests/en_tests.rs:
  - `test_issue_21_trailing_comma`
  - `test_issue_21_trailing_period`
  - `test_issue_21_space_before_punct_preserved`
  - `test_issue_21_other_punctuation` (parens, quotes, contractions)