Improve: [newmm tokenizer] Change regular expression of "non-thai-characters"#856
wannaphong merged 13 commits into PyThaiNLP:dev
Before: non-Thai characters were described directly by explicit rules. After: non-Thai is simply defined as "anything except Thai characters".
It seems that this change makes the tokenization finer-grained than the test cases expect.
Can you update the pull request? 9df5a4a

Greetings PR check
I don't understand this. Is this error common in this project?
Update thai2fit tokenizer
Updated

It seems that 9df5a4a causes a unit-test error. Can it be ignored?
Yes, it's a self-hosted runner issue, but I don't have time to set it up again. The unit tests run by GitHub look good: https://github.com/PyThaiNLP/pythainlp/actions/runs/6718024461/job/18256943528

I added some rules to fix the error. konbraphat51#2
Fixed regex
To make further maintenance easier

Hello @konbraphat51! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2023-11-01 13:08:42 UTC

I merged and modified @wannaphong's PR. Please check.

OK. It look

I fixed it.

In my case, fb3e7bb showed The last
3d889f7 showed
Oh sorry, I was testing by
Intention for ` \t\r\n`

Kudos, SonarCloud Quality Gate passed!

Added comments for future maintenance.
bact left a comment
Looks fine. Doesn't break the number grouping.
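
One way to read "number grouping" is sketched below. This is an assumption for illustration, not the actual newmm pattern: both regular expressions and their names are hypothetical. If a number alternative is tried before a catch-all "anything non-Thai" class, a number such as 1,234.50 is still matched on its own instead of being merged with the non-Thai text that follows it.

```python
import re

# Hypothetical combined pattern: the number alternative is tried first,
# then a catch-all for any other run of non-Thai characters.
NUMBER_FIRST = re.compile(r"\d+(?:[,.]\d+)*|[^\u0E00-\u0E7F]+")

# Hypothetical catch-all on its own, for comparison.
CATCH_ALL_ONLY = re.compile(r"[^\u0E00-\u0E7F]+")

text = "1,234.50 USD"
number_first = NUMBER_FIRST.match(text).group(0)      # the number alone
catch_all_only = CATCH_ALL_ONLY.match(text).group(0)  # number, space and letters together
print(repr(number_first), repr(catch_all_only))
```

In this sketch the order of the alternatives is what keeps the number separate, since `,` and `.` are themselves non-Thai characters.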








What does this change
Make the newmm tokenization more accurate; recognize more characters as "non-thai"
What was wrong
#855
Non-Thai symbols were sometimes not recognized as non-Thai, so they stayed attached to the neighboring Thai text:
"(คนไม่เอา)" -> ['(คน', 'ไม่', 'เอา', ')']
"กม/ชม" -> ['กม', '/ชม']
"สีหน้า(รถ)" -> ['สีหน้า', '(รถ)']
How this fixes it
Fixed the recognition method of "non-thai-character".
The examples above are all improved.
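
A minimal sketch of the idea, assuming simplified patterns rather than the ones actually used in newmm. `RULE_BASED` lists a few kinds of non-Thai text explicitly, while `ANYTHING_NON_THAI` treats any run of characters outside the Thai Unicode block (U+0E00 to U+0E7F) as non-Thai. Both patterns and the helper `leading_non_thai` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical "before" pattern: only the listed character kinds count as non-Thai.
RULE_BASED = re.compile(r"[a-zA-Z]+|\d+|[ \t]+|\r?\n")

# Hypothetical "after" pattern: anything outside the Thai block U+0E00-U+0E7F.
ANYTHING_NON_THAI = re.compile(r"[^\u0E00-\u0E7F]+")


def leading_non_thai(pattern, text):
    """Return the non-Thai run at the start of text, or '' if there is none."""
    match = pattern.match(text)
    return match.group(0) if match else ""


# Compare what each pattern recognizes at the start of the problem inputs.
for sample in ["(คนไม่เอา)", "/ชม", "(รถ)"]:
    print(
        repr(sample),
        repr(leading_non_thai(RULE_BASED, sample)),
        repr(leading_non_thai(ANYTHING_NON_THAI, sample)),
    )
```

With the explicit list, symbols such as `(` and `/` are not matched at all, so a tokenizer built on it has no non-Thai boundary to split on. The negated character class picks them up without having to list every symbol.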