Strip non-ascii characters in pod name on k8s executor #17057
- return ''.join(ch.lower() for ch in list(string) if ch.isalnum()).encode('ascii', 'ignore').decode()
+ return string.encode('ascii', 'ignore').decode().lower()
I think this will be a bit more performant (less split and join happening)
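For comparison, here is a minimal sketch that runs both expressions side by side. The wrapper names `strip_original` and `strip_suggested` and the sample string are hypothetical, chosen only for illustration. Note that the two forms are not strictly equivalent: the first also filters on `isalnum()`, so punctuation is handled differently.

```python
def strip_original(string: str) -> str:
    # Expression from the removed line: keep alphanumerics only, lowercase,
    # then drop anything that cannot be encoded as ASCII.
    return ''.join(ch.lower() for ch in list(string) if ch.isalnum()).encode('ascii', 'ignore').decode()


def strip_suggested(string: str) -> str:
    # Expression from the suggestion: drop non-ASCII characters, then lowercase.
    # There is no isalnum() filter here, so '-' and '_' survive.
    return string.encode('ascii', 'ignore').decode().lower()


sample = "My-Tásk_1"  # hypothetical input
print(strip_original(sample))   # mytsk1
print(strip_suggested(sample))  # my-tsk_1
```

If the surrounding code already removes characters such as `-` and `_` elsewhere, the difference may not matter in practice.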
How about using unidecode instead of stripping? https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string
Ah, I see the comment above. But I think it would be better to use unidecode. Maybe a little extreme, but there are a few words in Polish, for example, that contain only, or mostly, accented characters. Not that they are often used, but this might lead to quite some ambiguity. And it might be worse in other languages.
Example words in Polish with only accented characters:
żółć, łóż, łżą, żąć, żął
A few short ones:
łódź, łażącą, łożącą, łóżmyż, łóżże, łżącą, żąłeś, żęłaś, żółcą, żółcę
A few longer words with mostly accented characters:
niedołężność, współdźwięcznością, żółtoróżowością, pięćdziesięciopięcioipółlatkąście
:D
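To make the ambiguity concrete, here is a small sketch that applies plain ASCII stripping to some of the words above. The helper name `strip_non_ascii` is hypothetical; it only mirrors the `encode('ascii', 'ignore')` approach discussed in this thread.

```python
def strip_non_ascii(string: str) -> str:
    # Drop every character that has no ASCII encoding, then lowercase.
    return string.encode("ascii", "ignore").decode().lower()


words = ["żółć", "łżą", "łódź", "niedołężność"]
for word in words:
    print(repr(word), "->", repr(strip_non_ascii(word)))

# 'żółć' -> ''
# 'łżą' -> ''
# 'łódź' -> 'd'
# 'niedołężność' -> 'niedono'
```

Words made up entirely of accented letters collapse to the same empty string, which is the collision risk described above.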
FWIW if we're going for a complex solution, Mozilla's unicode-slugify normalises even more things, including characters from Chinese (my native language) which consists entirely of non-ASCII characters 😁
@potiuk Yeah, but we would be adding a package dependency to Airflow only to handle the complexity of transliterating characters in the pod name, where we already strip incompatible characters like - or _.
@potiuk Anyway, that's just my opinion. If everybody agrees on adding the unicode-slugify dependency to Airflow, I can make the changes in the PR...
I see no problem whatsoever with adding a proven dependency that does a good job and has rather straightforward and expected dependencies - especially if it can save not only us but also our current and future users from even seeing similar errors.
From the product point of view, getting such a proven solution that saves us headaches in the future is the right way. The case where the pod name has only Chinese characters, mentioned by @uranusjr, WILL eventually happen, and when it does, we will have to do it anyway, so why not do it now :).
unicode-slugify==0.1.3
- six [required: Any, installed: 1.16.0]
- unidecode [required: Any, installed: 1.2.0]
I just realised unicode-slugify has a hard dependency on unidecode, which brings back the GPL issue. Since unicode-slugify is a pretty thin layer over unidecode anyway (plus unicodedata, which is built-in), it's probably better to hand-roll an implementation based on text-unidecode in Airflow instead. I'll take some time to look into this tomorrow.
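As a rough illustration of what the built-in unicodedata part alone can do, here is a sketch that decomposes accented characters before stripping. It does not use text-unidecode, and the function name `fold_to_ascii` is hypothetical; this is not the implementation that ended up in Airflow.

```python
import unicodedata


def fold_to_ascii(string: str) -> str:
    # NFKD splits characters such as 'ó' into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", string)
    # The combining marks (and anything else non-ASCII) are then dropped.
    return decomposed.encode("ascii", "ignore").decode().lower()


print(fold_to_ascii("żółć"))  # zoc
print(fold_to_ascii("łódź"))  # odz
print(repr(fold_to_ascii("中文")))  # ''
```

The limits show up quickly: 'ł' has no Unicode decomposition, so it is still dropped, and Chinese text still collapses to an empty string. Covering those cases requires a transliteration table of the kind unidecode-style packages provide.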
Ah yeah. VERY GOOD CALL @uranusjr . I did not notice it's GPL. In this case we should definitely NOT bring it.
potiuk
left a comment
Ah yeah... We already have python-slugify as dependency :)
:D
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Some "real" tests are failing in the "upgrade" case; do not worry about MSSQL.

Strip non-ascii characters in pod name on k8s executor.
(Maybe we could use the unidecode package to transliterate, for example, á -> a or ñ -> n, but I think that's not necessary: this is only for the pod name, so ignoring the characters is simpler.)
closes: #16992