Skip to content

fix: add disable_bare_second flag (#22) + restore TN abbreviation matching#26

Merged
Alex-Wengg merged 1 commit intomainfrom
fix/issue-22-disable-bare-second
Apr 27, 2026
Merged

fix: add disable_bare_second flag (#22) + restore TN abbreviation matching#26
Alex-Wengg merged 1 commit intomainfrom
fix/issue-22-disable-bare-second

Conversation

@Alex-Wengg
Copy link
Copy Markdown
Member

@Alex-Wengg Alex-Wengg commented Apr 27, 2026

Summary

  • Closes Should not have conversion: second -> 2nd in "Give me a second to check" #22: adds opt-in disable_bare_second flag on NormalizeOptions so the bare word second (case-insensitive, single-token) is left alone instead of being normalized to 2nd. Compound ordinals (twenty second22nd) and date contexts are unaffected. Default is false so existing behavior is preserved.
  • Also addresses Devin AI review on PR #25: the new pretokenizer split trailing punctuation off of words so ITN could match twenty one, as twenty one, but the shared sentence_loop re-joined pretokens with a literal space. That made the TN whitelist see Dr . instead of Dr., so abbreviation entries like e.g., Prof., Inc., etc., vs. stopped matching, and both-form entries like Dr/Dr. left an orphaned period (I see Dr. SmithI see doctor. Smith).

Issue #22 fix

disable_bare_second is plumbed through:

  • NormalizeOptions builder (with_disable_bare_second)
  • parse_span and normalize_sentence_inner / normalize_inner
  • FFI: nemo_normalize_with_options, nemo_normalize_sentence_with_options (extra uint32_t disable_bare_second arg)
  • WASM bindings (normalizeWithOptions, normalizeSentenceWithOptions)
  • Swift wrapper (disableBareSecond parameter)
  • C headers in both `swift/include` and `swift-test/Sources/CNemoTextProcessing/include`
let opts = NormalizeOptions::new().with_disable_bare_second(true);
normalize_sentence_with_options(\"Give me a second to check.\", opts);
// -> \"Give me a second to check.\"

normalize_sentence_with_options(\"He finished in the twenty second place\", opts);
// -> \"He finished in the 22nd place\"

PR #25 review fix

sentence_loop now reconstructs each candidate span using the per-pretoken sep, preserving original adjacency. The longest-span-first iteration still tries shorter spans, so trailing-punctuation cases like twenty one,21, continue to work.

for end in (i + 1..=max_end).rev() {
    let mut span = String::new();
    for (idx, p) in pretokens[i..end].iter().enumerate() {
        if idx > 0 { span.push_str(p.sep); }
        span.push_str(&p.text);
    }
    // ...
}

After the fix:

  • I see Dr. Smith today.I see doctor Smith today.
  • e.g. she is herefor example she is here
  • Inc. and Co.incorporated and company
  • Prof. Jonesprofessor Jones
  • vs. themversus them

Test plan

  • cargo fmt
  • cargo build --features ffi
  • cargo test --features ffi — 1070 tests pass, 0 failures
  • tests/en_tests.rs::test_issue_22_default_behavior_unchanged
  • tests/en_tests.rs::test_issue_22_sentence_disable_bare_second
  • tests/en_tests.rs::test_issue_22_single_expression_disable_bare_second
  • tests/en_tn_tests.rs::test_pr25_tn_abbreviation_regression (Dr., vs., e.g., Inc., Co., Prof.)
  • src/ffi.rs::test_ffi_normalize_sentence_with_options_disable_bare_second

Open in Devin Review

…ching (PR #25 review)

Issue #22: bare "second" was always normalized to "2nd" by the ordinal
tagger, breaking sentences like "Give me a second to check." Adds an
opt-in `disable_bare_second` flag on NormalizeOptions that skips the
ordinal tagger only for the single-token, case-insensitive word
"second". Compound forms ("twenty second" -> "22nd") and date contexts
are unaffected. Default is `false` so existing behavior is preserved.

Plumbed through: NormalizeOptions builder, parse_span,
normalize_sentence_inner, normalize_inner, FFI signatures
(nemo_normalize_with_options, nemo_normalize_sentence_with_options),
WASM bindings, Swift wrapper, and C headers.

Also addresses Devin AI review on PR #25
(#25 (review)):
the new pretokenizer splits trailing punctuation off of words so ITN
can match "twenty one," as "twenty one", but the shared sentence_loop
re-joined pretokens with a literal space. That made the TN whitelist
see "Dr ." instead of "Dr.", so abbreviation entries like "e.g.",
"Prof.", "Inc.", "etc.", "vs." stopped matching, and both-form entries
like "Dr"/"Dr." left an orphaned period ("I see Dr. Smith" -> "I see
doctor. Smith"). sentence_loop now reconstructs each candidate span
using the per-pretoken `sep`, preserving original adjacency. The
longest-span-first iteration still tries shorter spans, so trailing
punctuation cases ("twenty one,") continue to work.

Tests:
- tests/en_tests.rs: 3 issue #22 regression tests (default unchanged,
  sentence-mode opt-in, single-expression opt-in).
- tests/en_tn_tests.rs: test_pr25_tn_abbreviation_regression covering
  Dr., vs., e.g., Inc., Co., Prof.
- src/ffi.rs: updated existing FFI tests and added
  test_ffi_normalize_sentence_with_options_disable_bare_second.
- All 1070 tests pass with --features ffi.
@Alex-Wengg Alex-Wengg merged commit 8934a57 into main Apr 27, 2026
5 checks passed
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

Open in Devin Review

@Alex-Wengg Alex-Wengg mentioned this pull request Apr 27, 2026
4 tasks
Alex-Wengg added a commit that referenced this pull request Apr 27, 2026
- feat: unified NormalizeOptions API + fix #23 compound concat (#24)
- fix: split trailing punctuation in sentence mode (#21) (#25)
- fix: add disable_bare_second flag (#22) + restore TN abbreviation matching (#26)
- fix(ci): pass --features to cargo via -- separator in wasm-pack (#27)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Should not have conversion: second -> 2nd in "Give me a second to check"

1 participant