fix: add disable_bare_second flag (#22) + restore TN abbreviation matching #26
Merged
Alex-Wengg merged 1 commit into main on Apr 27, 2026
…ching (PR #25 review)

Issue #22: bare "second" was always normalized to "2nd" by the ordinal tagger, breaking sentences like "Give me a second to check."

Adds an opt-in `disable_bare_second` flag on NormalizeOptions that skips the ordinal tagger only for the single-token, case-insensitive word "second". Compound forms ("twenty second" -> "22nd") and date contexts are unaffected. Default is `false` so existing behavior is preserved.

Plumbed through: NormalizeOptions builder, parse_span, normalize_sentence_inner, normalize_inner, FFI signatures (nemo_normalize_with_options, nemo_normalize_sentence_with_options), WASM bindings, Swift wrapper, and C headers.

Also addresses the Devin AI review on PR #25: the new pretokenizer splits trailing punctuation off of words so ITN can match "twenty one," as "twenty one", but the shared sentence_loop re-joined pretokens with a literal space. That made the TN whitelist see "Dr ." instead of "Dr.", so abbreviation entries like "e.g.", "Prof.", "Inc.", "etc.", "vs." stopped matching, and both-form entries like "Dr"/"Dr." left an orphaned period ("I see Dr. Smith" -> "I see doctor. Smith").

sentence_loop now reconstructs each candidate span using the per-pretoken `sep`, preserving original adjacency. The longest-span-first iteration still tries shorter spans, so trailing punctuation cases ("twenty one,") continue to work.

Tests:
- tests/en_tests.rs: 3 issue #22 regression tests (default unchanged, sentence-mode opt-in, single-expression opt-in).
- tests/en_tn_tests.rs: test_pr25_tn_abbreviation_regression covering Dr., vs., e.g., Inc., Co., Prof.
- src/ffi.rs: updated existing FFI tests and added test_ffi_normalize_sentence_with_options_disable_bare_second.
- All 1070 tests pass with --features ffi.
Summary
- `second` -> `2nd` in "Give me a second to check" #22: adds an opt-in `disable_bare_second` flag on `NormalizeOptions` so the bare word `second` (case-insensitive, single-token) is left alone instead of being normalized to `2nd`. Compound ordinals (`twenty second` → `22nd`) and date contexts are unaffected. Default is `false` so existing behavior is preserved.
- PR #25 review: the new pretokenizer splits trailing punctuation off of words so ITN can match `twenty one,` as `twenty one`, but the shared `sentence_loop` re-joined pretokens with a literal space. That made the TN whitelist see `Dr .` instead of `Dr.`, so abbreviation entries like `e.g.`, `Prof.`, `Inc.`, `etc.`, `vs.` stopped matching, and both-form entries like `Dr`/`Dr.` left an orphaned period (`I see Dr. Smith` → `I see doctor. Smith`).

Issue #22 fix
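The flag's described semantics can be sketched roughly as follows. This is a minimal, self-contained illustration, not the crate's code: the `NormalizeOptions` struct here is a stand-in that borrows only the field and builder names from this PR, the `bool` argument of `with_disable_bare_second` is an assumption, and `skip_ordinal_tagger`, `toy_ordinal`, and `normalize_span` are hypothetical helpers.

```rust
// Stand-in for the real NormalizeOptions. Only the names
// `disable_bare_second` and `with_disable_bare_second` come from the PR;
// the shape of the struct and the bool argument are assumptions.
#[derive(Clone, Copy, Debug, Default)]
struct NormalizeOptions {
    disable_bare_second: bool, // defaults to false, preserving old behavior
}

impl NormalizeOptions {
    fn with_disable_bare_second(mut self, value: bool) -> Self {
        self.disable_bare_second = value;
        self
    }
}

/// Hypothetical guard: true only when the flag is set and the span is the
/// single token "second", compared case-insensitively.
fn skip_ordinal_tagger(span: &str, opts: &NormalizeOptions) -> bool {
    opts.disable_bare_second && span.trim().eq_ignore_ascii_case("second")
}

/// Toy ordinal tagger that knows just two spans, for illustration only.
fn toy_ordinal(span: &str) -> Option<String> {
    match span.trim().to_ascii_lowercase().as_str() {
        "second" => Some("2nd".to_string()),
        "twenty second" => Some("22nd".to_string()),
        _ => None,
    }
}

/// Hypothetical span normalizer: consult the guard before the tagger and
/// fall back to the unchanged span when nothing matches.
fn normalize_span(span: &str, opts: &NormalizeOptions) -> String {
    if skip_ordinal_tagger(span, opts) {
        return span.to_string();
    }
    toy_ordinal(span).unwrap_or_else(|| span.to_string())
}
```

With default options the sketch still maps `second` to `2nd`. With the flag set, only the bare token is left alone, while `twenty second` still becomes `22nd` because the guard compares the whole span rather than looking for a substring.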
`disable_bare_second` is plumbed through:
- `NormalizeOptions` builder (`with_disable_bare_second`)
- `parse_span` and `normalize_sentence_inner` / `normalize_inner`
- FFI: `nemo_normalize_with_options`, `nemo_normalize_sentence_with_options` (extra `uint32_t disable_bare_second` arg)
- WASM bindings (`normalizeWithOptions`, `normalizeSentenceWithOptions`)
- Swift wrapper (`disableBareSecond` parameter)

PR #25 review fix
`sentence_loop` now reconstructs each candidate span using the per-pretoken `sep`, preserving original adjacency. The longest-span-first iteration still tries shorter spans, so trailing-punctuation cases like `twenty one,` → `21,` continue to work.

After the fix:
- `I see Dr. Smith today.` → `I see doctor Smith today.`
- `e.g. she is here` → `for example she is here`
- `Inc. and Co.` → `incorporated and company`
- `Prof. Jones` → `professor Jones`
- `vs. them` → `versus them`

Test plan
- `cargo fmt`
- `cargo build --features ffi`
- `cargo test --features ffi`: 1070 tests pass, 0 failures
- `tests/en_tests.rs::test_issue_22_default_behavior_unchanged`
- `tests/en_tests.rs::test_issue_22_sentence_disable_bare_second`
- `tests/en_tests.rs::test_issue_22_single_expression_disable_bare_second`
- `tests/en_tn_tests.rs::test_pr25_tn_abbreviation_regression` (Dr., vs., e.g., Inc., Co., Prof.)
- `src/ffi.rs::test_ffi_normalize_sentence_with_options_disable_bare_second`
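For reviewers who want a concrete picture of the `sep`-based span reconstruction from the PR #25 review fix, here is a small sketch under stated assumptions. The `Pretoken` struct, its `text` field, and the helpers `tok`, `join_with_space`, `join_with_sep`, and `candidates` are all hypothetical; only the idea of a per-pretoken `sep` and longest-span-first iteration comes from the description above.

```rust
/// Hypothetical pretoken: the token text plus the separator that followed
/// it in the original input ("" when the next pretoken was adjacent).
#[derive(Debug, Clone, PartialEq)]
struct Pretoken {
    text: String,
    sep: String,
}

/// Convenience constructor for the examples.
fn tok(text: &str, sep: &str) -> Pretoken {
    Pretoken { text: text.to_string(), sep: sep.to_string() }
}

/// Old behavior as described in the PR: join with a literal space, which
/// turns the pretokens "Dr" + "." into "Dr .".
fn join_with_space(span: &[Pretoken]) -> String {
    span.iter().map(|t| t.text.as_str()).collect::<Vec<_>>().join(" ")
}

/// New behavior as described in the PR: re-join using each pretoken's own
/// separator, so adjacency from the input is preserved. The separator of
/// the last pretoken in the span is not emitted.
fn join_with_sep(span: &[Pretoken]) -> String {
    let mut out = String::new();
    for (i, t) in span.iter().enumerate() {
        out.push_str(&t.text);
        if i + 1 < span.len() {
            out.push_str(&t.sep);
        }
    }
    out
}

/// Candidate spans starting at `start`, longest first, so shorter spans
/// (for example without trailing punctuation) are still tried.
fn candidates(tokens: &[Pretoken], start: usize, max_len: usize) -> Vec<String> {
    let longest = max_len.min(tokens.len().saturating_sub(start));
    (1..=longest)
        .rev()
        .map(|len| join_with_sep(&tokens[start..start + len]))
        .collect()
}
```

For the pretokens `Dr`, `.`, `Smith`, the space join yields `Dr .` for the first two tokens while the `sep` join yields `Dr.`, which is the form a whitelist entry would need to see. For `twenty`, `one`, `,`, the candidates come out as `twenty one,`, then `twenty one`, then `twenty`, so the number span can still match once the comma is excluded.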