Skip to content

Fix spelled decimals (#15) + add opt-in aviation pipeline (#14)#20

Merged
Alex-Wengg merged 9 commits intomainfrom
fix/issue-15-spelled-decimals
Apr 26, 2026
Merged

Fix spelled decimals (#15) + add opt-in aviation pipeline (#14)#20
Alex-Wengg merged 9 commits intomainfrom
fix/issue-15-spelled-decimals

Conversation

@Alex-Wengg
Copy link
Copy Markdown
Member

@Alex-Wengg Alex-Wengg commented Apr 26, 2026

Summary

Two related issues, two different fix shapes:

  • Issue Failed on case "frequency one three five point six two five" #15 — spelled decimals (e.g. aviation radio frequencies): "one three five point six two five" returned 9.625 instead of 135.625. Default-path bug. Fixed in the default pipeline.
  • Issue normalize_sentence failed on "United seven eighty eight" #14 — flight-number reading: normalize_sentence("United seven eighty eight") returned "United 95" (7+80+8) instead of "United 788". Genuinely ambiguous (e.g. "two thirty five" is 02:35 to a clock-reader, 235 to ATC, 65 to a math student). Fixed as an opt-in pipeline so callers in flight-number / call-sign context get aviation reading without forcing it on everyone.

Issue #15 — default-path fix

cardinal::words_to_number now dispatches between two readings:

  1. Digit-by-digit — every token is a single-digit word (zero-nine, plus oh/o). Concatenate. This is the path decimal::parse_point_decimal reaches when reading the integer side of a spelled decimal.
  2. Grammatical — original running-sum + scale-multiplication accumulator. Unchanged from upstream NeMo.

decimal::parse_point_decimal and telephone::parse were updated to use the digit-by-digit path. No public API change for #15.

Issue #14 — opt-in aviation pipeline

Why not auto-apply: I tried wiring aviation reading directly into words_to_number, and it broke date::parse_old_year for "twenty one forty two" (was 2043, became None because year_part started reading 142 > 99). Aviation context can't be detected purely from the tokens, so the only safe option is to let the caller decide.

New public API (additive)

Function Purpose
cardinal::words_to_number_aviation(s) Low-level: digit-prefix + grammatical-compound reader. "seven eighty eight"Some(788). Falls back to grammatical when no digit prefix.
cardinal::parse_aviation(s) String wrapper, mirrors cardinal::parse.
normalize_aviation(s) Single-input dispatch. Aviation cardinal tried before time/date; falls through to standard normalize for non-number phrases.
normalize_sentence_aviation(s) Sentence-mode dispatch. Aviation cardinal at priority 89 (above date=88, time=85; below measure=90, money=95).
normalize_sentence_aviation_with_max_span(s, n) Configurable-span variant.

Internal change

parse_span now takes an aviation: bool parameter. Single tagger list, two short conditional blocks for the aviation injection. normalize_sentence_with_max_span passes false; the aviation variant passes true. No fork to maintain.

fn parse_span(span: &str, aviation: bool) -> Option<(String, u8)> {
    // ... money / measure ...
    if aviation && token_count <= 4 {
        if let Some(r) = cardinal::parse_aviation(span) { return Some((r, 89)); }
    }
    // ... date / time / electronic / decimal / ordinal ...
    if !aviation && token_count <= 4 {
        if let Some(r) = cardinal::parse(span) { return Some((r, 70)); }
    }
    None
}

Behaviour matrix

normalize normalize_aviation normalize_sentence normalize_sentence_aviation
"twenty one forty two" 2043 63 2043 63
"seven eighty eight" 788* 788 95 788
"two thirty five" 02:35 235 02:35 235
"United seven eighty eight" 788* 788 United 95 United 788
"flight two thirty five departs ..." 235-4* 235-4* flight 02:35 ... flight 235 ...
"one three five point six two five" 135.625 135.625 135.625 135.625
"five dollars" $5 $5 $5 $5
"twenty first" 21st 21st 21st 21st
"two thousand seventeen" 2017 2017 2017 2017

* normalize single-input mode lets the telephone tagger eat the whole span; sentence mode does not.

Default normalize and normalize_sentence outputs are bit-for-bit unchanged from before this PR (apart from the #15 spelled-decimal fix).

Tests

Unit tests in src/itn/en/cardinal.rs:

  • test_words_to_number_aviation_flight_number
  • test_words_to_number_aviation_falls_back_to_grammatical
  • test_words_to_number_aviation_scale_word_forces_grammatical
  • test_words_to_number_no_aviation_reading — guards that generic words_to_number does not do aviation reading

Integration tests in tests/en_tests.rs:

  • test_issue_14_default_dispatch_unchanged
  • test_issue_14_normalize_aviation
  • test_issue_14_normalize_sentence_aviation
  • test_issue_15_* (already on branch)
  • test_spelled_digit_cardinal* (already on branch)

Test plan

  • cargo test — full suite passes (911 tests across 12 binaries)
  • cargo fmt --check — clean
  • Issue normalize_sentence failed on "United seven eighty eight" #14 reporter's exact input via opt-in: normalize_sentence_aviation("United seven eighty eight")"United 788"
  • Issue Failed on case "frequency one three five point six two five" #15 reporter's exact input: normalize("one three five point six two five")"135.625"
  • Regression: normalize_sentence("twenty one forty two")"2043" (date), normalize("two thousand seventeen")"2017"
  • Money / measure / decimal / ordinal preserved in aviation pipeline ("five dollars""$5", "twenty first""21st", "five point two""5.2")

Closes #15.
Addresses #14 via new opt-in pipeline (default behaviour intentionally unchanged).

Aviation/code-style readings like 'one three five' previously fell
through cardinal::words_to_number's running-sum loop and returned 9
(1+3+5) instead of 135. This cascaded into:

  - normalize_sentence('frequency one three five point six two five')
    producing 'frequency 9.625' instead of 'frequency 135.625'
  - The telephone tagger silently swallowing 'point' and emitting
    '135-625' for normalize() of the same input

Fix:

  - cardinal::words_to_number now recognises a multi-token sequence of
    pure single-digit words (zero-nine, plus oh/o for 0) and
    concatenates them as digits, matching how humans read aviation
    frequencies, flight numbers, codes, etc. Single-token and
    cardinal-with-scale inputs (twenty one, one hundred thirty five,
    one thousand two hundred thirty four, etc.) are unaffected.

  - telephone::parse now rejects inputs containing ' point ' so the
    decimal tagger handles them instead of the telephone tagger
    formatting them with hyphens.

Tests:

  - cardinal: spelled-digit sequence parsing (135, 737, 911, 12, 505,
    404) and words_to_number direct calls.
  - decimal: 'one three five point six two five' -> 135.625 and
    related variants.
  - telephone: rejects inputs with ' point '.
  - en_tests: direct reproductions of issue #15 (both normalize and
    normalize_sentence) plus regressions for normal cardinal phrasings.
  - en_tn_tests: release-sanity test for tn_normalize_sentence (the
    public API hongbo-miao reported missing in 0.1.0 / issue #16).

Closes #15.
The time tagger correctly handles "five oh five" → "05:05" patterns,
which intercepted before telephone in the normalize() priority chain.
Drop that assertion since it's testing time semantics, not cardinal
digit-concat — the unit test in cardinal.rs already covers the
direct words_to_number → 505 path which doesn't go through normalize.

Also fix multi-line assert layout to match rustfmt output.
Drop the len >= 2 guard on the digit-concat path — single-token inputs
like "one" produce the same answer either way.

Move the running-sum/scale-multiplication logic out into
grammatical_words_to_number, so the entry point reads as the actual
disambiguation: "all single-digit words → concat; otherwise →
grammar".

No behavioural change vs. the previous commit on this branch; the
existing test suite covers both paths.
devin-ai-integration[bot]

This comment was marked as resolved.

"seven eighty eight" was running-summed to 95 (= 7 + 80 + 8) instead of
being read as flight 788. The grammatical accumulator in
words_to_number had no way to express "leading single-digit prefix +
trailing grammatical compound", which is the natural reading for flight
numbers, room numbers, and similar codes.

Add a third reading mode to words_to_number, sandwiched between the
all-digits path and the grammatical fallback:

- Detect a leading run of single-digit words (zero-nine, oh, o)
- Parse the remainder as a grammatical compound
- Concatenate as strings: "7" ‖ 88 = "788"

Disabled when the input contains a scale word (hundred/thousand/...),
so "two thousand seventeen" stays 2017 instead of becoming 22017.

This fix lives in cardinal because the sentence-mode pipeline
(parse_span) reaches cardinal directly without going through the
telephone tagger. The existing telephone-based fix for issue #15 only
helps the single-string normalize() path; normalize_sentence("United
seven eighty eight") had to be addressed at the cardinal level.

Tests:
- cardinal::tests::test_aviation_flight_number_style
- cardinal::tests::test_words_to_number_aviation_flight_number
- cardinal::tests::test_scale_word_forces_grammatical
- en_tests::test_issue_14_aviation_flight_number
- en_tests::test_issue_14_aviation_flight_number_in_sentence
- en_tests::test_issue_14_does_not_break_scale_grammar

Closes #14.
@Alex-Wengg Alex-Wengg changed the title fix(en): treat spelled-digit sequences as digit concatenation (#15) Fix spoken→digit cardinal readings (issues #14, #15) Apr 26, 2026
…er_aviation

The previous commit wired aviation flight-number reading directly into
`cardinal::words_to_number` (the generic ITN dispatch path). That broke
`date::parse_old_year` for "twenty one forty two" (was 2043 via 20*100+43,
became None because year_part now reads 142 > 99) and made test inputs
collide with the time tagger.

This commit reverts `words_to_number` to grammatical + digit-by-digit only
(matching upstream NeMo behaviour) and exposes the aviation reading as a
separate opt-in helper `words_to_number_aviation`. Callers in flight-number
or call-sign contexts can reach for it explicitly; generic dispatch stays
out of the date/time tagger's way.

Behaviour summary (verified):
- `normalize_sentence("twenty one forty two")` → "2043" (date, restored)
- `normalize_sentence("seven eighty eight")` → "95" (grammatical)
- `normalize("seven eighty eight")` → "788" (existing telephone tagger)
- `words_to_number_aviation("seven eighty eight")` → 788 (opt-in)
- `words_to_number("seven eighty eight")` → 95 (no aviation)
…ce_aviation)

Caller-prioritized aviation flight-number reading. Default dispatch keeps
upstream NeMo semantics (date wins for "twenty one forty two", time wins
for "two thirty five"); callers in flight-number / call-sign contexts opt
in by calling the new aviation variants.

New public API:
- `cardinal::parse_aviation(input)` — like `cardinal::parse` but uses
  `words_to_number_aviation` to recognise "seven eighty eight" as 788.
- `normalize_aviation(input)` — single-input dispatch with aviation
  cardinal tried before time/date.
- `normalize_sentence_aviation(input)` — sentence-mode dispatch via a new
  `parse_span_aviation` that puts aviation cardinal at priority 89
  (above date 88, time 85; below measure 90, money 95).
- `normalize_sentence_aviation_with_max_span(input, max_span)` for the
  configurable-span variant.

Behaviour matrix (verified):
                                normalize  normalize_aviation  sentence  sentence_aviation
"twenty one forty two"             2043          63             2043           63
"seven eighty eight"                788         788               95          788
"two thirty five"                 02:35         235            02:35          235
"United seven eighty eight"         788         788          United 95   United 788
"flight two thirty five ..."     235-4*       235-4*       flight 02:35  flight 235
"five dollars" / "five point two"   $5/5.2  unchanged       unchanged    unchanged
"twenty first" / "two thousand
 seventeen" / "one hundred"     unchanged   unchanged       unchanged    unchanged

(* `normalize` single-input mode lets the telephone tagger eat the whole
   span; sentence mode does not.)

Internals: `normalize_sentence_with_max_span` and the new
`normalize_sentence_aviation_with_max_span` share their loop body via
a private `normalize_sentence_inner(parse_span_fn)` helper.

Closes #14 follow-up: provides the dispatch-level aviation pipeline that
the previous `words_to_number_aviation` helper alone did not.
…tion)

The previous commit forked parse_span into a near-identical
parse_span_aviation that only differed by inserting cardinal::parse_aviation
at priority 89. That meant adding a new tagger required updating two lists.

Replace the fork with a single boolean parameter:
- `parse_span(span, aviation: bool)` — one tagger list, the aviation
  cardinal at priority 89 lives behind `if aviation`, the regular cardinal
  fallback at 70 lives behind `if !aviation`.
- `normalize_sentence_inner(input, max_span, aviation)` — pass through the
  bool instead of a fn-pointer.
- `parse_span_aviation` deleted.

Public API and behaviour unchanged. Net -47 lines.
devin-ai-integration[bot]

This comment was marked as resolved.

@Alex-Wengg Alex-Wengg changed the title Fix spoken→digit cardinal readings (issues #14, #15) Fix spelled decimals (#15) + add opt-in aviation pipeline (#14) Apr 26, 2026
- words_to_number / words_to_number_aviation: require words.len() >= 2
  for digit-by-digit reading. Bare "oh" / "o" no longer resolve to 0
  (those forms are only digits inside a longer spelled code; in
  isolation they are interjections / letters). Bare "zero" still works
  via the grammatical fallback.
- parse_span: drop the <=4 token gate for the aviation cardinal branch.
  Aviation sentence mode is opt-in, so the caller has accepted aggressive
  matching across longer spans like "one thousand two hundred thirty four".
  parse_aviation falls back to grammatical, so non-aviation phrases still
  resolve.
- ffi.rs / wasm.rs: expose the aviation pipeline (per AGENTS.md rule that
  lib / ffi / wasm be kept in sync). Adds nemo_normalize_aviation,
  nemo_normalize_sentence_aviation,
  nemo_normalize_sentence_aviation_with_max_span and the matching
  wasm-bindgen exports.
Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

View 15 additional findings in Devin Review.

Open in Devin Review

Comment thread src/itn/en/cardinal.rs
Comment on lines +189 to +194
return words
.iter()
.map(|w| single_digit_char(w).unwrap())
.collect::<String>()
.parse()
.ok();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Uses unwrap() in library code, violating AGENTS.md style rule

The AGENTS.md specifies: "Error handling: prefer Option/Result; avoid unwrap()/expect() in library code". Three new call sites in src/itn/en/cardinal.rs use .unwrap() on single_digit_char(w) (lines 191, 227, 246). While these calls are guarded by a prior .all(|w| single_digit_char(w).is_some()) check, making them safe at runtime, they can be trivially replaced with .filter_map(|w| single_digit_char(w)) to satisfy the style guideline without any behavioral change.

Suggested change
return words
.iter()
.map(|w| single_digit_char(w).unwrap())
.collect::<String>()
.parse()
.ok();
return words
.iter()
.filter_map(|w| single_digit_char(w))
.collect::<String>()
.parse()
.ok();
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread src/itn/en/cardinal.rs
Comment on lines +225 to +230
return words
.iter()
.map(|w| single_digit_char(w).unwrap())
.collect::<String>()
.parse()
.ok();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Uses unwrap() in library code (words_to_number_aviation digit-by-digit path)

Same AGENTS.md rule violation as BUG-0001, in the words_to_number_aviation function's digit-by-digit reading path at src/itn/en/cardinal.rs:227. Guarded by .all() but can be replaced with .filter_map().

Suggested change
return words
.iter()
.map(|w| single_digit_char(w).unwrap())
.collect::<String>()
.parse()
.ok();
return words
.iter()
.filter_map(|w| single_digit_char(w))
.collect::<String>()
.parse()
.ok();
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread src/itn/en/cardinal.rs
Comment on lines +244 to +247
let prefix: String = words[..prefix_len]
.iter()
.map(|w| single_digit_char(w).unwrap())
.collect();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Uses unwrap() in library code (aviation prefix concatenation path)

Same AGENTS.md rule violation, in the aviation prefix concatenation logic at src/itn/en/cardinal.rs:246. The .map(|w| single_digit_char(w).unwrap()) can be replaced with .filter_map(|w| single_digit_char(w)).

Suggested change
let prefix: String = words[..prefix_len]
.iter()
.map(|w| single_digit_char(w).unwrap())
.collect();
let prefix: String = words[..prefix_len]
.iter()
.filter_map(|w| single_digit_char(w))
.collect();
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@Alex-Wengg Alex-Wengg merged commit 9035284 into main Apr 26, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failed on case "frequency one three five point six two five"

1 participant