fix: add FLAG num support for agglutinative languages #1090
Open
tolgakaratas wants to merge 1 commit into vale-cli:v3 from
The spell checker's affix parser used `rune` (single character) as the map key for affix rules. This broke dictionaries that use the `FLAG num` format, where flags are comma-separated numbers (e.g., "14308,10482"). Only the first digit of each numeric flag was read, so most affix rules were unreachable. This affected all agglutinative languages (Turkish, Hungarian, Finnish, etc.) whose Hunspell dictionaries use `FLAG num` with tens of thousands of suffix groups.

Changes:
- Change AffixMap key from `rune` to `string`
- Change compoundMap key from `rune` to `string`
- Add parseFlags() method that handles ASCII, num, long, and UTF-8 formats
- Update expand() to use parsed flag slices instead of rune iteration
- Update compound rule parsing in gospell.go

For the Turkish dictionary (tr_TR), this enables correct recognition of ~59,000 suffix groups and ~15.8M inflected word forms that were previously unreachable.
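The flag handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function signature, the `mode` strings, and the rune-based handling of `long` flags are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFlags splits a raw flag field into individual flags according to the
// FLAG mode declared in the .aff file. The signature is hypothetical; the
// PR's actual method may differ.
func parseFlags(mode, raw string) []string {
	switch mode {
	case "num":
		// Flags are comma-separated decimal numbers, e.g. "14308,10482".
		var flags []string
		for _, f := range strings.Split(raw, ",") {
			if f = strings.TrimSpace(f); f != "" {
				flags = append(flags, f)
			}
		}
		return flags
	case "long":
		// Each flag is two characters wide, e.g. "AaBb" -> "Aa", "Bb".
		// Splitting on runes is a simplification for this sketch.
		runes := []rune(raw)
		flags := make([]string, 0, (len(runes)+1)/2)
		for i := 0; i < len(runes); i += 2 {
			end := i + 2
			if end > len(runes) {
				end = len(runes)
			}
			flags = append(flags, string(runes[i:end]))
		}
		return flags
	default:
		// ASCII and UTF-8 modes: one character per flag.
		flags := make([]string, 0, len(raw))
		for _, r := range raw {
			flags = append(flags, string(r))
		}
		return flags
	}
}

func main() {
	fmt.Println(parseFlags("num", "14308,10482")) // [14308 10482]
	fmt.Println(parseFlags("long", "AaBb"))       // [Aa Bb]
	fmt.Println(parseFlags("UTF-8", "Aé"))        // [A é]
}
```

Returning `[]string` in every mode is what lets a single `string`-keyed map serve all four flag formats.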
Summary
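- The spell checker's affix parser used `rune` (single character) as the map key for affix rules, which broke dictionaries using the `FLAG num` format
- Flags like `14308,10482,4720` were parsed by reading only the first digit, making most affix rules unreachable
- This affected all agglutinative languages (Turkish, Hungarian, Finnish, etc.) whose Hunspell dictionaries use `FLAG num` with thousands of suffix groups

Changes
- Change `AffixMap` key from `rune` to `string` to support multi-character flags
- Change `compoundMap` key from `rune` to `string`
- Add `parseFlags()` method that correctly handles all Hunspell flag formats: `ASCII`, `num`, `long`, and `UTF-8`
- Update `expand()` to use parsed flag slices instead of rune iteration
- Update compound rule parsing in `gospell.go`

Impact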
For the Turkish dictionary (`tr_TR`), this fix enables correct recognition of ~59,000 suffix groups and ~15.8 million inflected word forms that were previously unreachable. The `tr_TR.aff` file uses `FLAG num` with comma-separated numeric IDs.

Before: `hunspell -d tr_TR -l` recognizes "belediyeye", "adaletli", "ancak", but Vale flags them as unknown.

After: Vale correctly recognizes these words using the same dictionary files.
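To illustrate why single-character keys lose rules with numeric flags, here is a small sketch of the failure mode. It is not Vale's code: `runeKey` and `countKeys` are hypothetical helpers, and the flag values are the sample IDs quoted in the summary.

```go
package main

import "fmt"

// runeKey mimics keying a flag by its first character only, which is the
// pre-fix behavior as described in this PR (hypothetical helper).
func runeKey(flag string) rune {
	for _, r := range flag {
		return r
	}
	return 0
}

// countKeys stores each flag under a rune key and under a string key, then
// reports how many distinct entries survive in each map.
func countKeys(flags []string) (byRune, byString int) {
	runeMap := map[rune]string{}
	strMap := map[string]string{}
	for _, f := range flags {
		runeMap[runeKey(f)] = f // "14308" and "10482" both land on '1'
		strMap[f] = f           // every numeric flag keeps its own entry
	}
	return len(runeMap), len(strMap)
}

func main() {
	r, s := countKeys([]string{"14308", "10482", "4720"})
	fmt.Printf("rune keys: %d, string keys: %d\n", r, s) // rune keys: 2, string keys: 3
}
```

With only ten possible leading digits, a rune-keyed map can hold at most ten numeric suffix groups, which would be consistent with most of the rules being unreachable.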
Test plan
- `parseFlags()` covering ASCII, num, long, and UTF-8 formats
- `FLAG num` affix parsing and expansion
- `newGoSpellReader` with `FLAG num` dictionary