Fix spelled decimals (#15) + add opt-in aviation pipeline (#14) by Alex-Wengg · Pull Request #20 · FluidInference/text-processing-rs

Alex-Wengg · 2026-04-26T16:33:19Z

Summary

Two related issues, two different fix shapes:

Issue Failed on case "frequency one three five point six two five" #15 — spelled decimals (e.g. aviation radio frequencies): "one three five point six two five" returned 9.625 instead of 135.625. Default-path bug. Fixed in the default pipeline.
Issue normalize_sentence failed on "United seven eighty eight" #14 — flight-number reading: normalize_sentence("United seven eighty eight") returned "United 95" (7+80+8) instead of "United 788". Genuinely ambiguous (e.g. "two thirty five" is 02:35 to a clock-reader, 235 to ATC, 65 to a math student). Fixed as an opt-in pipeline so callers in flight-number / call-sign context get aviation reading without forcing it on everyone.

Issue #15 — default-path fix

cardinal::words_to_number now dispatches between two readings:

Digit-by-digit — every token is a single-digit word (zero-nine, plus oh/o). Concatenate. This is the path decimal::parse_point_decimal reaches when reading the integer side of a spelled decimal.
Grammatical — original running-sum + scale-multiplication accumulator. Unchanged from upstream NeMo.

decimal::parse_point_decimal and telephone::parse were updated to use the digit-by-digit path. No public API change for #15.

Issue #14 — opt-in aviation pipeline

Why not auto-apply: I tried wiring aviation reading directly into words_to_number, and it broke date::parse_old_year for "twenty one forty two" (was 2043, became None because year_part started reading 142 > 99). Aviation context can't be detected purely from the tokens, so the only safe option is to let the caller decide.

New public API (additive)

Function	Purpose
`cardinal::words_to_number_aviation(s)`	Low-level: digit-prefix + grammatical-compound reader. `"seven eighty eight"` → `Some(788)`. Falls back to grammatical when no digit prefix.
`cardinal::parse_aviation(s)`	String wrapper, mirrors `cardinal::parse`.
`normalize_aviation(s)`	Single-input dispatch. Aviation cardinal tried before time/date; falls through to standard `normalize` for non-number phrases.
`normalize_sentence_aviation(s)`	Sentence-mode dispatch. Aviation cardinal at priority 89 (above `date`=88, `time`=85; below `measure`=90, `money`=95).
`normalize_sentence_aviation_with_max_span(s, n)`	Configurable-span variant.

Internal change

parse_span now takes an aviation: bool parameter. Single tagger list, two short conditional blocks for the aviation injection. normalize_sentence_with_max_span passes false; the aviation variant passes true. No fork to maintain.

fn parse_span(span: &str, aviation: bool) -> Option<(String, u8)> {
    // ... money / measure ...
    if aviation && token_count <= 4 {
        if let Some(r) = cardinal::parse_aviation(span) { return Some((r, 89)); }
    }
    // ... date / time / electronic / decimal / ordinal ...
    if !aviation && token_count <= 4 {
        if let Some(r) = cardinal::parse(span) { return Some((r, 70)); }
    }
    None
}

Behaviour matrix

	`normalize`	`normalize_aviation`	`normalize_sentence`	`normalize_sentence_aviation`
`"twenty one forty two"`	`2043`	`63`	`2043`	`63`
`"seven eighty eight"`	`788`*	`788`	`95`	`788`
`"two thirty five"`	`02:35`	`235`	`02:35`	`235`
`"United seven eighty eight"`	`788`*	`788`	`United 95`	`United 788`
`"flight two thirty five departs ..."`	`235-4`*	`235-4`*	`flight 02:35 ...`	`flight 235 ...`
`"one three five point six two five"`	`135.625`	`135.625`	`135.625`	`135.625`
`"five dollars"`	`$5`	`$5`	`$5`	`$5`
`"twenty first"`	`21st`	`21st`	`21st`	`21st`
`"two thousand seventeen"`	`2017`	`2017`	`2017`	`2017`

* normalize single-input mode lets the telephone tagger eat the whole span; sentence mode does not.

Default normalize and normalize_sentence outputs are bit-for-bit unchanged from before this PR (apart from the #15 spelled-decimal fix).

Tests

Unit tests in src/itn/en/cardinal.rs:

test_words_to_number_aviation_flight_number
test_words_to_number_aviation_falls_back_to_grammatical
test_words_to_number_aviation_scale_word_forces_grammatical
test_words_to_number_no_aviation_reading — guards that generic words_to_number does not do aviation reading

Integration tests in tests/en_tests.rs:

test_issue_14_default_dispatch_unchanged
test_issue_14_normalize_aviation
test_issue_14_normalize_sentence_aviation
test_issue_15_* (already on branch)
test_spelled_digit_cardinal* (already on branch)

Test plan

cargo test — full suite passes (911 tests across 12 binaries)
cargo fmt --check — clean
Issue normalize_sentence failed on "United seven eighty eight" #14 reporter's exact input via opt-in: normalize_sentence_aviation("United seven eighty eight") → "United 788"
Issue Failed on case "frequency one three five point six two five" #15 reporter's exact input: normalize("one three five point six two five") → "135.625"
Regression: normalize_sentence("twenty one forty two") → "2043" (date), normalize("two thousand seventeen") → "2017"
Money / measure / decimal / ordinal preserved in aviation pipeline ("five dollars" → "$5", "twenty first" → "21st", "five point two" → "5.2")

Closes #15.
Addresses #14 via new opt-in pipeline (default behaviour intentionally unchanged).

Aviation/code-style readings like 'one three five' previously fell through cardinal::words_to_number's running-sum loop and returned 9 (1+3+5) instead of 135. This cascaded into: - normalize_sentence('frequency one three five point six two five') producing 'frequency 9.625' instead of 'frequency 135.625' - The telephone tagger silently swallowing 'point' and emitting '135-625' for normalize() of the same input Fix: - cardinal::words_to_number now recognises a multi-token sequence of pure single-digit words (zero-nine, plus oh/o for 0) and concatenates them as digits, matching how humans read aviation frequencies, flight numbers, codes, etc. Single-token and cardinal-with-scale inputs (twenty one, one hundred thirty five, one thousand two hundred thirty four, etc.) are unaffected. - telephone::parse now rejects inputs containing ' point ' so the decimal tagger handles them instead of the telephone tagger formatting them with hyphens. Tests: - cardinal: spelled-digit sequence parsing (135, 737, 911, 12, 505, 404) and words_to_number direct calls. - decimal: 'one three five point six two five' -> 135.625 and related variants. - telephone: rejects inputs with ' point '. - en_tests: direct reproductions of issue #15 (both normalize and normalize_sentence) plus regressions for normal cardinal phrasings. - en_tn_tests: release-sanity test for tn_normalize_sentence (the public API hongbo-miao reported missing in 0.1.0 / issue #16). Closes #15.

The time tagger correctly handles "five oh five" → "05:05" patterns, which intercepted before telephone in the normalize() priority chain. Drop that assertion since it's testing time semantics, not cardinal digit-concat — the unit test in cardinal.rs already covers the direct words_to_number → 505 path which doesn't go through normalize. Also fix multi-line assert layout to match rustfmt output.

Drop the len >= 2 guard on the digit-concat path — single-token inputs like "one" produce the same answer either way. Move the running-sum/scale-multiplication logic out into grammatical_words_to_number, so the entry point reads as the actual disambiguation: "all single-digit words → concat; otherwise → grammar". No behavioural change vs. the previous commit on this branch; the existing test suite covers both paths.

"seven eighty eight" was running-summed to 95 (= 7 + 80 + 8) instead of being read as flight 788. The grammatical accumulator in words_to_number had no way to express "leading single-digit prefix + trailing grammatical compound", which is the natural reading for flight numbers, room numbers, and similar codes. Add a third reading mode to words_to_number, sandwiched between the all-digits path and the grammatical fallback: - Detect a leading run of single-digit words (zero-nine, oh, o) - Parse the remainder as a grammatical compound - Concatenate as strings: "7" ‖ 88 = "788" Disabled when the input contains a scale word (hundred/thousand/...), so "two thousand seventeen" stays 2017 instead of becoming 22017. This fix lives in cardinal because the sentence-mode pipeline (parse_span) reaches cardinal directly without going through the telephone tagger. The existing telephone-based fix for issue #15 only helps the single-string normalize() path; normalize_sentence("United seven eighty eight") had to be addressed at the cardinal level. Tests: - cardinal::tests::test_aviation_flight_number_style - cardinal::tests::test_words_to_number_aviation_flight_number - cardinal::tests::test_scale_word_forces_grammatical - en_tests::test_issue_14_aviation_flight_number - en_tests::test_issue_14_aviation_flight_number_in_sentence - en_tests::test_issue_14_does_not_break_scale_grammar Closes #14.

…er_aviation The previous commit wired aviation flight-number reading directly into `cardinal::words_to_number` (the generic ITN dispatch path). That broke `date::parse_old_year` for "twenty one forty two" (was 2043 via 20*100+43, became None because year_part now reads 142 > 99) and made test inputs collide with the time tagger. This commit reverts `words_to_number` to grammatical + digit-by-digit only (matching upstream NeMo behaviour) and exposes the aviation reading as a separate opt-in helper `words_to_number_aviation`. Callers in flight-number or call-sign contexts can reach for it explicitly; generic dispatch stays out of the date/time tagger's way. Behaviour summary (verified): - `normalize_sentence("twenty one forty two")` → "2043" (date, restored) - `normalize_sentence("seven eighty eight")` → "95" (grammatical) - `normalize("seven eighty eight")` → "788" (existing telephone tagger) - `words_to_number_aviation("seven eighty eight")` → 788 (opt-in) - `words_to_number("seven eighty eight")` → 95 (no aviation)

…ce_aviation) Caller-prioritized aviation flight-number reading. Default dispatch keeps upstream NeMo semantics (date wins for "twenty one forty two", time wins for "two thirty five"); callers in flight-number / call-sign contexts opt in by calling the new aviation variants. New public API: - `cardinal::parse_aviation(input)` — like `cardinal::parse` but uses `words_to_number_aviation` to recognise "seven eighty eight" as 788. - `normalize_aviation(input)` — single-input dispatch with aviation cardinal tried before time/date. - `normalize_sentence_aviation(input)` — sentence-mode dispatch via a new `parse_span_aviation` that puts aviation cardinal at priority 89 (above date 88, time 85; below measure 90, money 95). - `normalize_sentence_aviation_with_max_span(input, max_span)` for the configurable-span variant. Behaviour matrix (verified): normalize normalize_aviation sentence sentence_aviation "twenty one forty two" 2043 63 2043 63 "seven eighty eight" 788 788 95 788 "two thirty five" 02:35 235 02:35 235 "United seven eighty eight" 788 788 United 95 United 788 "flight two thirty five ..." 235-4* 235-4* flight 02:35 flight 235 "five dollars" / "five point two" $5/5.2 unchanged unchanged unchanged "twenty first" / "two thousand seventeen" / "one hundred" unchanged unchanged unchanged unchanged (* `normalize` single-input mode lets the telephone tagger eat the whole span; sentence mode does not.) Internals: `normalize_sentence_with_max_span` and the new `normalize_sentence_aviation_with_max_span` share their loop body via a private `normalize_sentence_inner(parse_span_fn)` helper. Closes #14 follow-up: provides the dispatch-level aviation pipeline that the previous `words_to_number_aviation` helper alone did not.

…tion) The previous commit forked parse_span into a near-identical parse_span_aviation that only differed by inserting cardinal::parse_aviation at priority 89. That meant adding a new tagger required updating two lists. Replace the fork with a single boolean parameter: - `parse_span(span, aviation: bool)` — one tagger list, the aviation cardinal at priority 89 lives behind `if aviation`, the regular cardinal fallback at 70 lives behind `if !aviation`. - `normalize_sentence_inner(input, max_span, aviation)` — pass through the bool instead of a fn-pointer. - `parse_span_aviation` deleted. Public API and behaviour unchanged. Net -47 lines.

- words_to_number / words_to_number_aviation: require words.len() >= 2 for digit-by-digit reading. Bare "oh" / "o" no longer resolve to 0 (those forms are only digits inside a longer spelled code; in isolation they are interjections / letters). Bare "zero" still works via the grammatical fallback. - parse_span: drop the <=4 token gate for the aviation cardinal branch. Aviation sentence mode is opt-in, so the caller has accepted aggressive matching across longer spans like "one thousand two hundred thirty four". parse_aviation falls back to grammatical, so non-aviation phrases still resolve. - ffi.rs / wasm.rs: expose the aviation pipeline (per AGENTS.md rule that lib / ffi / wasm be kept in sync). Adds nemo_normalize_aviation, nemo_normalize_sentence_aviation, nemo_normalize_sentence_aviation_with_max_span and the matching wasm-bindgen exports.

devin-ai-integration

Devin Review found 3 new potential issues.

View 15 additional findings in Devin Review.

devin-ai-integration · 2026-04-26T18:53:20Z

+        return words
+            .iter()
+            .map(|w| single_digit_char(w).unwrap())
+            .collect::<String>()
+            .parse()
+            .ok();


🟡 Uses unwrap() in library code, violating AGENTS.md style rule

The AGENTS.md specifies: "Error handling: prefer Option/Result; avoid unwrap()/expect() in library code". Three new call sites in src/itn/en/cardinal.rs use .unwrap() on single_digit_char(w) (lines 191, 227, 246). While these calls are guarded by a prior .all(|w| single_digit_char(w).is_some()) check, making them safe at runtime, they can be trivially replaced with .filter_map(|w| single_digit_char(w)) to satisfy the style guideline without any behavioral change.

Suggested change

return words

.iter()

.map(|w| single_digit_char(w).unwrap())

.collect::<String>()

.parse()

.ok();

return words

.iter()

.filter_map(|w| single_digit_char(w))

.collect::<String>()

.parse()

.ok();

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-04-26T18:53:21Z

+        return words
+            .iter()
+            .map(|w| single_digit_char(w).unwrap())
+            .collect::<String>()
+            .parse()
+            .ok();


🟡 Uses unwrap() in library code (words_to_number_aviation digit-by-digit path)

Same AGENTS.md rule violation as BUG-0001, in the words_to_number_aviation function's digit-by-digit reading path at src/itn/en/cardinal.rs:227. Guarded by .all() but can be replaced with .filter_map().

Suggested change

return words

.iter()

.map(|w| single_digit_char(w).unwrap())

.collect::<String>()

.parse()

.ok();

return words

.iter()

.filter_map(|w| single_digit_char(w))

.collect::<String>()

.parse()

.ok();

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-04-26T18:53:22Z

+                let prefix: String = words[..prefix_len]
+                    .iter()
+                    .map(|w| single_digit_char(w).unwrap())
+                    .collect();


🟡 Uses unwrap() in library code (aviation prefix concatenation path)

Same AGENTS.md rule violation, in the aviation prefix concatenation logic at src/itn/en/cardinal.rs:246. The .map(|w| single_digit_char(w).unwrap()) can be replaced with .filter_map(|w| single_digit_char(w)).

Suggested change

let prefix: String = words[..prefix_len]

.iter()

.map(|w| single_digit_char(w).unwrap())

.collect();

let prefix: String = words[..prefix_len]

.iter()

.filter_map(|w| single_digit_char(w))

.collect();

Was this helpful? React with 👍 or 👎 to provide feedback.

Alex-Wengg added 4 commits April 26, 2026 12:32

fmt: collapse short assert_eq to single line in decimal tests

6cb52bd

This comment was marked as resolved.

Sign in to view

Alex-Wengg changed the title ~~fix(en): treat spelled-digit sequences as digit concatenation (#15)~~ Fix spoken→digit cardinal readings (issues #14, #15) Apr 26, 2026

Alex-Wengg added 3 commits April 26, 2026 13:46

This comment was marked as resolved.

Sign in to view

Alex-Wengg changed the title ~~Fix spoken→digit cardinal readings (issues #14, #15)~~ Fix spelled decimals (#15) + add opt-in aviation pipeline (#14) Apr 26, 2026

devin-ai-integration Bot reviewed Apr 26, 2026

View reviewed changes

Alex-Wengg mentioned this pull request Apr 26, 2026

Failed on case "frequency one three five point six two five" #15

Closed

Alex-Wengg merged commit 9035284 into main Apr 26, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix spelled decimals (#15) + add opt-in aviation pipeline (#14)#20

Fix spelled decimals (#15) + add opt-in aviation pipeline (#14)#20
Alex-Wengg merged 9 commits intomainfrom
fix/issue-15-spelled-decimals

Alex-Wengg commented Apr 26, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot Apr 26, 2026

Uh oh!

devin-ai-integration Bot Apr 26, 2026

Uh oh!

devin-ai-integration Bot Apr 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Alex-Wengg commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Issue #15 — default-path fix

Issue #14 — opt-in aviation pipeline

New public API (additive)

Internal change

Behaviour matrix

Tests

Test plan

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Alex-Wengg commented Apr 26, 2026 •

edited

Loading