Preserve French and EU language characters in normalize_unicode_spaces #221

Merged
gauravdawar-e6 merged 3 commits into e6data:main from tkaunlaky-e6:fix/preserve-french-eu-characters
Mar 2, 2026
@tkaunlaky-e6

The isascii() check replaced every non-ASCII character with a space, which corrupted accented letters such as é, ç, and ü (e.g. Téléchargement became T l chargement). normalize_unicode_spaces now uses unicodedata.category() to normalize only actual Unicode separators (categories Zs, Zl, Zp) and the replacement character U+FFFD, so letters from French and other EU languages are preserved.
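
For readers without the diff at hand, here is a minimal sketch of the before/after behaviour described above. It is not the merged code: the function bodies, the `_old` suffix, and the constant names are assumptions made for illustration, and the real normalize_unicode_spaces may have a different signature or do more than this.

```python
import unicodedata

# Unicode general categories for space, line, and paragraph separators.
# (Constant names are hypothetical, not taken from the repository.)
_SEPARATOR_CATEGORIES = {"Zs", "Zl", "Zp"}
_REPLACEMENT_CHAR = "\ufffd"


def normalize_unicode_spaces_old(text: str) -> str:
    """Assumed reconstruction of the old behaviour: every non-ASCII
    character is replaced with a plain space."""
    return "".join(ch if ch.isascii() else " " for ch in text)


def normalize_unicode_spaces(text: str) -> str:
    """Sketch of the new behaviour: only Unicode separators (Zs, Zl, Zp)
    and U+FFFD become a plain space; all other characters pass through."""
    return "".join(
        " "
        if ch == _REPLACEMENT_CHAR
        or unicodedata.category(ch) in _SEPARATOR_CATEGORIES
        else ch
        for ch in text
    )


if __name__ == "__main__":
    word = "T\u00e9l\u00e9chargement"  # "Téléchargement" with precomposed é
    print(normalize_unicode_spaces_old(word))  # T l chargement
    print(normalize_unicode_spaces(word))      # Téléchargement
    print(normalize_unicode_spaces("a\u00a0b\ufffdc"))  # a b c
```

U+FFFD needs its own check because its general category is So (other symbol), not a separator category. Control characters such as tab and newline are category Cc, so this sketch leaves them untouched.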

@tkaunlaky-e6 tkaunlaky-e6 marked this pull request as draft March 2, 2026 13:40
@tkaunlaky-e6 tkaunlaky-e6 marked this pull request as ready for review March 2, 2026 13:41
@gauravdawar-e6 gauravdawar-e6 self-requested a review March 2, 2026 13:43
Collaborator

@gauravdawar-e6 gauravdawar-e6 left a comment

Looks good!

@gauravdawar-e6 gauravdawar-e6 merged commit 8e40544 into e6data:main Mar 2, 2026
6 checks passed