Fix spelled decimals (#15) + add opt-in aviation pipeline (#14)#20
Fix spelled decimals (#15) + add opt-in aviation pipeline (#14)#20Alex-Wengg merged 9 commits intomainfrom
Conversation
Aviation/code-style readings like 'one three five' previously fell
through cardinal::words_to_number's running-sum loop and returned 9
(1+3+5) instead of 135. This cascaded into:
- normalize_sentence('frequency one three five point six two five')
producing 'frequency 9.625' instead of 'frequency 135.625'
- The telephone tagger silently swallowing 'point' and emitting
'135-625' for normalize() of the same input
Fix:
- cardinal::words_to_number now recognises a multi-token sequence of
pure single-digit words (zero-nine, plus oh/o for 0) and
concatenates them as digits, matching how humans read aviation
frequencies, flight numbers, codes, etc. Single-token and
cardinal-with-scale inputs (twenty one, one hundred thirty five,
one thousand two hundred thirty four, etc.) are unaffected.
- telephone::parse now rejects inputs containing ' point ' so the
decimal tagger handles them instead of the telephone tagger
formatting them with hyphens.
Tests:
- cardinal: spelled-digit sequence parsing (135, 737, 911, 12, 505,
404) and words_to_number direct calls.
- decimal: 'one three five point six two five' -> 135.625 and
related variants.
- telephone: rejects inputs with ' point '.
- en_tests: direct reproductions of issue #15 (both normalize and
normalize_sentence) plus regressions for normal cardinal phrasings.
- en_tn_tests: release-sanity test for tn_normalize_sentence (the
public API hongbo-miao reported missing in 0.1.0 / issue #16).
Closes #15.
The time tagger correctly handles "five oh five" → "05:05" patterns, which intercepted before telephone in the normalize() priority chain. Drop that assertion since it's testing time semantics, not cardinal digit-concat — the unit test in cardinal.rs already covers the direct words_to_number → 505 path which doesn't go through normalize. Also fix multi-line assert layout to match rustfmt output.
Drop the len >= 2 guard on the digit-concat path — single-token inputs like "one" produce the same answer either way. Move the running-sum/scale-multiplication logic out into grammatical_words_to_number, so the entry point reads as the actual disambiguation: "all single-digit words → concat; otherwise → grammar". No behavioural change vs. the previous commit on this branch; the existing test suite covers both paths.
"seven eighty eight" was running-summed to 95 (= 7 + 80 + 8) instead of being read as flight 788. The grammatical accumulator in words_to_number had no way to express "leading single-digit prefix + trailing grammatical compound", which is the natural reading for flight numbers, room numbers, and similar codes. Add a third reading mode to words_to_number, sandwiched between the all-digits path and the grammatical fallback: - Detect a leading run of single-digit words (zero-nine, oh, o) - Parse the remainder as a grammatical compound - Concatenate as strings: "7" ‖ 88 = "788" Disabled when the input contains a scale word (hundred/thousand/...), so "two thousand seventeen" stays 2017 instead of becoming 22017. This fix lives in cardinal because the sentence-mode pipeline (parse_span) reaches cardinal directly without going through the telephone tagger. The existing telephone-based fix for issue #15 only helps the single-string normalize() path; normalize_sentence("United seven eighty eight") had to be addressed at the cardinal level. Tests: - cardinal::tests::test_aviation_flight_number_style - cardinal::tests::test_words_to_number_aviation_flight_number - cardinal::tests::test_scale_word_forces_grammatical - en_tests::test_issue_14_aviation_flight_number - en_tests::test_issue_14_aviation_flight_number_in_sentence - en_tests::test_issue_14_does_not_break_scale_grammar Closes #14.
…er_aviation
The previous commit wired aviation flight-number reading directly into
`cardinal::words_to_number` (the generic ITN dispatch path). That broke
`date::parse_old_year` for "twenty one forty two" (was 2043 via 20*100+43,
became None because year_part now reads 142 > 99) and made test inputs
collide with the time tagger.
This commit reverts `words_to_number` to grammatical + digit-by-digit only
(matching upstream NeMo behaviour) and exposes the aviation reading as a
separate opt-in helper `words_to_number_aviation`. Callers in flight-number
or call-sign contexts can reach for it explicitly; generic dispatch stays
out of the date/time tagger's way.
Behaviour summary (verified):
- `normalize_sentence("twenty one forty two")` → "2043" (date, restored)
- `normalize_sentence("seven eighty eight")` → "95" (grammatical)
- `normalize("seven eighty eight")` → "788" (existing telephone tagger)
- `words_to_number_aviation("seven eighty eight")` → 788 (opt-in)
- `words_to_number("seven eighty eight")` → 95 (no aviation)
…ce_aviation)
Caller-prioritized aviation flight-number reading. Default dispatch keeps
upstream NeMo semantics (date wins for "twenty one forty two", time wins
for "two thirty five"); callers in flight-number / call-sign contexts opt
in by calling the new aviation variants.
New public API:
- `cardinal::parse_aviation(input)` — like `cardinal::parse` but uses
`words_to_number_aviation` to recognise "seven eighty eight" as 788.
- `normalize_aviation(input)` — single-input dispatch with aviation
cardinal tried before time/date.
- `normalize_sentence_aviation(input)` — sentence-mode dispatch via a new
`parse_span_aviation` that puts aviation cardinal at priority 89
(above date 88, time 85; below measure 90, money 95).
- `normalize_sentence_aviation_with_max_span(input, max_span)` for the
configurable-span variant.
Behaviour matrix (verified):
normalize normalize_aviation sentence sentence_aviation
"twenty one forty two" 2043 63 2043 63
"seven eighty eight" 788 788 95 788
"two thirty five" 02:35 235 02:35 235
"United seven eighty eight" 788 788 United 95 United 788
"flight two thirty five ..." 235-4* 235-4* flight 02:35 flight 235
"five dollars" / "five point two" $5/5.2 unchanged unchanged unchanged
"twenty first" / "two thousand
seventeen" / "one hundred" unchanged unchanged unchanged unchanged
(* `normalize` single-input mode lets the telephone tagger eat the whole
span; sentence mode does not.)
Internals: `normalize_sentence_with_max_span` and the new
`normalize_sentence_aviation_with_max_span` share their loop body via
a private `normalize_sentence_inner(parse_span_fn)` helper.
Closes #14 follow-up: provides the dispatch-level aviation pipeline that
the previous `words_to_number_aviation` helper alone did not.
…tion) The previous commit forked parse_span into a near-identical parse_span_aviation that only differed by inserting cardinal::parse_aviation at priority 89. That meant adding a new tagger required updating two lists. Replace the fork with a single boolean parameter: - `parse_span(span, aviation: bool)` — one tagger list, the aviation cardinal at priority 89 lives behind `if aviation`, the regular cardinal fallback at 70 lives behind `if !aviation`. - `normalize_sentence_inner(input, max_span, aviation)` — pass through the bool instead of a fn-pointer. - `parse_span_aviation` deleted. Public API and behaviour unchanged. Net -47 lines.
- words_to_number / words_to_number_aviation: require words.len() >= 2 for digit-by-digit reading. Bare "oh" / "o" no longer resolve to 0 (those forms are only digits inside a longer spelled code; in isolation they are interjections / letters). Bare "zero" still works via the grammatical fallback. - parse_span: drop the <=4 token gate for the aviation cardinal branch. Aviation sentence mode is opt-in, so the caller has accepted aggressive matching across longer spans like "one thousand two hundred thirty four". parse_aviation falls back to grammatical, so non-aviation phrases still resolve. - ffi.rs / wasm.rs: expose the aviation pipeline (per AGENTS.md rule that lib / ffi / wasm be kept in sync). Adds nemo_normalize_aviation, nemo_normalize_sentence_aviation, nemo_normalize_sentence_aviation_with_max_span and the matching wasm-bindgen exports.
| return words | ||
| .iter() | ||
| .map(|w| single_digit_char(w).unwrap()) | ||
| .collect::<String>() | ||
| .parse() | ||
| .ok(); |
There was a problem hiding this comment.
🟡 Uses unwrap() in library code, violating AGENTS.md style rule
The AGENTS.md specifies: "Error handling: prefer Option/Result; avoid unwrap()/expect() in library code". Three new call sites in src/itn/en/cardinal.rs use .unwrap() on single_digit_char(w) (lines 191, 227, 246). While these calls are guarded by a prior .all(|w| single_digit_char(w).is_some()) check, making them safe at runtime, they can be trivially replaced with .filter_map(|w| single_digit_char(w)) to satisfy the style guideline without any behavioral change.
| return words | |
| .iter() | |
| .map(|w| single_digit_char(w).unwrap()) | |
| .collect::<String>() | |
| .parse() | |
| .ok(); | |
| return words | |
| .iter() | |
| .filter_map(|w| single_digit_char(w)) | |
| .collect::<String>() | |
| .parse() | |
| .ok(); |
Was this helpful? React with 👍 or 👎 to provide feedback.
| return words | ||
| .iter() | ||
| .map(|w| single_digit_char(w).unwrap()) | ||
| .collect::<String>() | ||
| .parse() | ||
| .ok(); |
There was a problem hiding this comment.
🟡 Uses unwrap() in library code (words_to_number_aviation digit-by-digit path)
Same AGENTS.md rule violation as BUG-0001, in the words_to_number_aviation function's digit-by-digit reading path at src/itn/en/cardinal.rs:227. Guarded by .all() but can be replaced with .filter_map().
| return words | |
| .iter() | |
| .map(|w| single_digit_char(w).unwrap()) | |
| .collect::<String>() | |
| .parse() | |
| .ok(); | |
| return words | |
| .iter() | |
| .filter_map(|w| single_digit_char(w)) | |
| .collect::<String>() | |
| .parse() | |
| .ok(); |
Was this helpful? React with 👍 or 👎 to provide feedback.
| let prefix: String = words[..prefix_len] | ||
| .iter() | ||
| .map(|w| single_digit_char(w).unwrap()) | ||
| .collect(); |
There was a problem hiding this comment.
🟡 Uses unwrap() in library code (aviation prefix concatenation path)
Same AGENTS.md rule violation, in the aviation prefix concatenation logic at src/itn/en/cardinal.rs:246. The .map(|w| single_digit_char(w).unwrap()) can be replaced with .filter_map(|w| single_digit_char(w)).
| let prefix: String = words[..prefix_len] | |
| .iter() | |
| .map(|w| single_digit_char(w).unwrap()) | |
| .collect(); | |
| let prefix: String = words[..prefix_len] | |
| .iter() | |
| .filter_map(|w| single_digit_char(w)) | |
| .collect(); |
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Two related issues, two different fix shapes:
"one three five point six two five"returned9.625instead of135.625. Default-path bug. Fixed in the default pipeline.normalize_sentence("United seven eighty eight")returned"United 95"(7+80+8) instead of"United 788". Genuinely ambiguous (e.g."two thirty five"is02:35to a clock-reader,235to ATC,65to a math student). Fixed as an opt-in pipeline so callers in flight-number / call-sign context get aviation reading without forcing it on everyone.Issue #15 — default-path fix
cardinal::words_to_numbernow dispatches between two readings:zero-nine, plusoh/o). Concatenate. This is the pathdecimal::parse_point_decimalreaches when reading the integer side of a spelled decimal.decimal::parse_point_decimalandtelephone::parsewere updated to use the digit-by-digit path. No public API change for #15.Issue #14 — opt-in aviation pipeline
Why not auto-apply: I tried wiring aviation reading directly into
words_to_number, and it brokedate::parse_old_yearfor"twenty one forty two"(was2043, becameNonebecauseyear_partstarted reading142>99). Aviation context can't be detected purely from the tokens, so the only safe option is to let the caller decide.New public API (additive)
cardinal::words_to_number_aviation(s)"seven eighty eight"→Some(788). Falls back to grammatical when no digit prefix.cardinal::parse_aviation(s)cardinal::parse.normalize_aviation(s)normalizefor non-number phrases.normalize_sentence_aviation(s)date=88,time=85; belowmeasure=90,money=95).normalize_sentence_aviation_with_max_span(s, n)Internal change
parse_spannow takes anaviation: boolparameter. Single tagger list, two short conditional blocks for the aviation injection.normalize_sentence_with_max_spanpassesfalse; the aviation variant passestrue. No fork to maintain.Behaviour matrix
normalizenormalize_aviationnormalize_sentencenormalize_sentence_aviation"twenty one forty two"204363204363"seven eighty eight"788*78895788"two thirty five"02:3523502:35235"United seven eighty eight"788*788United 95United 788"flight two thirty five departs ..."235-4*235-4*flight 02:35 ...flight 235 ..."one three five point six two five"135.625135.625135.625135.625"five dollars"$5$5$5$5"twenty first"21st21st21st21st"two thousand seventeen"2017201720172017*
normalizesingle-input mode lets the telephone tagger eat the whole span; sentence mode does not.Default
normalizeandnormalize_sentenceoutputs are bit-for-bit unchanged from before this PR (apart from the #15 spelled-decimal fix).Tests
Unit tests in
src/itn/en/cardinal.rs:test_words_to_number_aviation_flight_numbertest_words_to_number_aviation_falls_back_to_grammaticaltest_words_to_number_aviation_scale_word_forces_grammaticaltest_words_to_number_no_aviation_reading— guards that genericwords_to_numberdoes not do aviation readingIntegration tests in
tests/en_tests.rs:test_issue_14_default_dispatch_unchangedtest_issue_14_normalize_aviationtest_issue_14_normalize_sentence_aviationtest_issue_15_*(already on branch)test_spelled_digit_cardinal*(already on branch)Test plan
cargo test— full suite passes (911 tests across 12 binaries)cargo fmt --check— cleannormalize_sentence_aviation("United seven eighty eight")→"United 788"normalize("one three five point six two five")→"135.625"normalize_sentence("twenty one forty two")→"2043"(date),normalize("two thousand seventeen")→"2017""five dollars"→"$5","twenty first"→"21st","five point two"→"5.2")Closes #15.
Addresses #14 via new opt-in pipeline (default behaviour intentionally unchanged).