Skip to content

Conversation

@flolas
Copy link
Contributor

@flolas flolas commented Jul 17, 2021

Strip non-ascii characters in pod name on k8s executor.
(Maybe we can use package unidecode to translate for example á -> a or ñ -> n, but I think that's not necessary, because this is only for the pod name, ignoring the characters is simpler)

closes: #16992

@flolas flolas requested a review from dimberman as a code owner July 17, 2021 00:23
@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues area:Scheduler including HA (high availability) scheduler labels Jul 17, 2021
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ''.join(ch.lower() for ch in list(string) if ch.isalnum()).encode('ascii', 'ignore').decode()
return string.encode('ascii', 'ignore').decode().lower()

I think this will be a bit more performant (less split and join happening)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@potiuk potiuk Jul 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the comment, above. But I think it would be better to use unidecode. Maybe a little extreme but there are a few words in Polish for example, that contain only, or mostly accented characters. Not that they are often used, but this might lead to quite some ambiguity. And it might be worse in other languages.

Example words in Polish with only accented characters:

żółć, łóż, łżą, żąć, żął

Few short ones:

łódź, łażącą, łożącą, łóżmyż, łóżże, łżącą, żąłeś, żęłaś, żółcą, żółcę

Few longer words with mostly accented ones:

niedołężność, współdźwięcznością, żółtoróżowością, pięćdziesięciopięcioipółlatkąście

:D

Copy link
Member

@potiuk potiuk Jul 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW. In case you had any doubts, Polish is one of the hardest languages in the world.

Here you can see about 1/3 of all the possible 'variants' of words that you can make from "plays" = "gra".

Screenshot 2021-07-18 16 03 14

Copy link
Member

@uranusjr uranusjr Jul 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW if we're going for a complex solution, Mozilla's unicode-slugify normalises even more things, including characters from Chinese (my native language) which consists entirely of non-ASCII characters 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@potiuk Yeah, but we are adding a package dependency to airflow only for handling the complexity of translating chars in pod name which we already strip characters incompatible ascii chars like - or _.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@potiuk anyways, its my opinion, if everybody agree adding the dependency unicode-uglify to airflow, i can make the changes in the PR...

Copy link
Member

@potiuk potiuk Jul 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no problem whatsoever with adding a proven dependency that does a good job and has rather straightforward and expected dependencies - especially if it can save not only us but also our current and future users from even seeing similar errors.

From the product point of view, if we can get such proven solution that saves us headeaches in the future is the right way. The case where POD has Chinese-only characters mentioned by @uranusjr WILL eventually happen, and when it will, we will have to do it anyway, so why not doing it now :).

unicode-slugify==0.1.3
  - six [required: Any, installed: 1.16.0]
  - unidecode [required: Any, installed: 1.2.0]

Copy link
Member

@uranusjr uranusjr Jul 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised unicode-slugify has a hard dependency on unidecode, which brings back the GPL issue. Since unicode-slugify is a pretty thin layer over unidecode anyway (plus unicodedata, which is built-in), it’s probably to hand-roll an implementation based on text-unidecode in Airflow instead. I’ll take some time to look into this tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah. VERY GOOD CALL @uranusjr . I did not notice it's GPL. In this case we should definitely NOT bring it.

@flolas
Copy link
Contributor Author

flolas commented Jul 18, 2021

@potiuk @uranusjr Could you take a look to the new commit? Uses the python-slugify package which doesnt have GPL issues. Feel free to change the current PR or create another PR for resolving the issue.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah... We already have python-slugify as dependency :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:D

@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jul 18, 2021
@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk
Copy link
Member

potiuk commented Jul 18, 2021

some "real" tests failing in "upgrade" case - do not worry about MSSQl

@potiuk potiuk merged commit a4af964 into apache:main Jul 19, 2021
@flolas flolas deleted the fix-bug-16992 branch July 19, 2021 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler full tests needed We need to run full set of tests for this PR to merge provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pod fails to run when task or dag name contains non ASCII characters (k8s executor)

3 participants