fix: add FLAG num support for agglutinative languages #1090

Open
tolgakaratas wants to merge 1 commit into vale-cli:v3 from Denomas:fix/flag-num-support

Summary

  • The spell checker's affix parser used rune (single character) as the map key for affix rules, which broke dictionaries using FLAG num format
  • Numeric flags like 14308,10482,4720 were parsed by reading only the first digit, making most affix rules unreachable
  • This affected all agglutinative languages (Turkish, Hungarian, Finnish, etc.) whose Hunspell dictionaries use FLAG num with tens of thousands of suffix groups
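
To make the failure mode concrete, here is a minimal sketch of the two ways a flag field can be split. Both function names are hypothetical and neither is taken from the Vale source. `flagsByRune` is only a reconstruction of per-character handling, assuming the old code walked the field one rune at a time.

```go
package main

import (
	"fmt"
	"strings"
)

// flagsByRune is a hypothetical reconstruction of per-character flag handling:
// every rune in the field, commas included, becomes its own flag.
func flagsByRune(field string) []string {
	var out []string
	for _, r := range field {
		out = append(out, string(r))
	}
	return out
}

// flagsByComma treats the field as FLAG num: comma-separated decimal IDs,
// each kept whole as a string.
func flagsByComma(field string) []string {
	var out []string
	for _, f := range strings.Split(field, ",") {
		if f = strings.TrimSpace(f); f != "" {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	field := "14308,10482,4720"
	fmt.Println(len(flagsByRune(field)), flagsByRune(field))
	fmt.Println(len(flagsByComma(field)), flagsByComma(field))
}
```

Under this assumption, the rune-based split never produces the key "14308", so a rule registered under that flag cannot be looked up.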

Changes

  • Change AffixMap key from rune to string to support multi-character flags
  • Change compoundMap key from rune to string
  • Add parseFlags() method that correctly handles all Hunspell flag formats: ASCII, num, long, and UTF-8
  • Update expand() to use parsed flag slices instead of rune iteration
  • Update compound rule parsing in gospell.go
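
As a rough illustration of what a flag parser covering the four formats might look like, here is a standalone sketch. It is not the `parseFlags()` method from this PR: the signature, the `mode` strings, and the error handling are assumptions made for the example. The format semantics follow the Hunspell documentation: `num` is comma-separated decimal numbers, `long` is two characters per flag, and `UTF-8` or the default is one character per flag.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFlags is a hypothetical standalone sketch, not the method added in the PR.
// mode is the value of the FLAG directive ("" when the directive is absent).
func parseFlags(mode, field string) ([]string, error) {
	switch mode {
	case "num":
		// Comma-separated decimal IDs, e.g. "14308,10482,4720".
		var out []string
		for _, f := range strings.Split(field, ",") {
			f = strings.TrimSpace(f)
			if f == "" {
				continue
			}
			for _, c := range f {
				if c < '0' || c > '9' {
					return nil, fmt.Errorf("invalid numeric flag %q", f)
				}
			}
			out = append(out, f)
		}
		return out, nil
	case "long":
		// Two characters per flag; this sketch assumes single-byte characters.
		if len(field)%2 != 0 {
			return nil, fmt.Errorf("odd-length long flag field %q", field)
		}
		out := make([]string, 0, len(field)/2)
		for i := 0; i < len(field); i += 2 {
			out = append(out, field[i:i+2])
		}
		return out, nil
	case "", "ASCII", "UTF-8":
		// One character per flag. Go strings are decoded as UTF-8 here, so the
		// default and UTF-8 modes share one code path in this sketch.
		var out []string
		for _, r := range field {
			out = append(out, string(r))
		}
		return out, nil
	}
	return nil, fmt.Errorf("unknown FLAG mode %q", mode)
}

func main() {
	for _, tc := range [][2]string{{"num", "14308,10482,4720"}, {"long", "AaBb"}, {"UTF-8", "é§"}, {"", "AB"}} {
		flags, err := parseFlags(tc[0], tc[1])
		fmt.Printf("%-5s %q -> %q (err: %v)\n", tc[0], tc[1], flags, err)
	}
}
```

Returning every flag as a `string` regardless of format is what lets a single `map[string]...` serve all four modes.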

Impact

For the Turkish dictionary (tr_TR), this fix enables correct recognition of ~59,000 suffix groups and ~15.8 million inflected word forms that were previously unreachable. The tr_TR.aff file uses FLAG num with comma-separated numeric IDs.

Before: hunspell -d tr_TR -l accepts "belediyeye", "adaletli", and "ancak", but Vale flags them as unknown.
After: Vale correctly recognizes these words using the same dictionary files.
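
The sketch below shows, in simplified form, why string keys matter when expanding a FLAG num entry. Everything in it is illustrative: the `suffixRule` and `affixMap` types, the `expand` signature, and the flag numbers 1001 and 2002 are invented for the example and are not taken from tr_TR.aff or from the Vale code.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixRule is a simplified, hypothetical affix rule: it only appends Add.
type suffixRule struct{ Add string }

// affixMap is keyed by the whole flag string, so "1001" and "1" stay distinct.
type affixMap map[string][]suffixRule

// expand returns the stem followed by every form produced by the rules that
// the entry's comma-separated flags select. Unknown flags are ignored.
func expand(entry string, rules affixMap) []string {
	stem, flagField, _ := strings.Cut(entry, "/")
	forms := []string{stem}
	for _, flag := range strings.Split(flagField, ",") {
		for _, r := range rules[strings.TrimSpace(flag)] {
			forms = append(forms, stem+r.Add)
		}
	}
	return forms
}

func main() {
	// Made-up rules: flag 1001 appends "ye"; flag 1 is unrelated.
	rules := affixMap{
		"1001": {{Add: "ye"}},
		"1":    {{Add: "x"}},
	}
	fmt.Println(expand("belediye/1001,2002", rules))
}
```

With a rune-keyed map, the entry's flags could at best match the unrelated rule stored under "1". With string keys, the lookup for "1001" finds the intended rule and yields "belediyeye".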

Test plan

  • New unit tests for parseFlags() covering ASCII, num, long, and UTF-8 formats
  • New unit test for FLAG num affix parsing and expansion
  • New integration test: newGoSpellReader with FLAG num dictionary
  • Backward compatibility test: ASCII flag dictionaries still work correctly
  • All 8 new tests pass
=== RUN   TestParseFlagsASCII
--- PASS: TestParseFlagsASCII (0.00s)
=== RUN   TestParseFlagsNum
--- PASS: TestParseFlagsNum (0.00s)
=== RUN   TestParseFlagsLong
--- PASS: TestParseFlagsLong (0.00s)
=== RUN   TestParseFlagsUTF8
--- PASS: TestParseFlagsUTF8 (0.00s)
=== RUN   TestFlagNumAffixParsing
--- PASS: TestFlagNumAffixParsing (0.00s)
=== RUN   TestFlagNumExpand
--- PASS: TestFlagNumExpand (0.00s)
=== RUN   TestFlagNumGoSpellReader
--- PASS: TestFlagNumGoSpellReader (0.00s)
=== RUN   TestASCIFlagBackwardCompatibility
--- PASS: TestASCIFlagBackwardCompatibility (0.00s)
PASS

The spell checker's affix parser used `rune` (single character) as the
map key for affix rules. This broke dictionaries that use `FLAG num`
format, where flags are comma-separated numbers (e.g., "14308,10482").

Only the first digit of each numeric flag was read, causing most affix
rules to be unreachable. This affected all agglutinative languages
(Turkish, Hungarian, Finnish, etc.) whose Hunspell dictionaries use
`FLAG num` with tens of thousands of suffix groups.

Changes:
- Change AffixMap key from `rune` to `string`
- Change compoundMap key from `rune` to `string`
- Add parseFlags() method that handles ASCII, num, long, and UTF-8 formats
- Update expand() to use parsed flag slices instead of rune iteration
- Update compound rule parsing in gospell.go

For the Turkish dictionary (tr_TR), this enables correct recognition of
~59,000 suffix groups and ~15.8M inflected word forms that were
previously unreachable.