Skip to content

Conversation

@pingzh
Copy link
Contributor

@pingzh pingzh commented Dec 14, 2022

With Path.mkdirs, If mode is given, it is combined with the process’ umask value to determine the file mode and access flags. if parents is true, any missing parents of this path are created as needed; they are created with the default permissions without taking mode into account. However, in the mkdirs, it specifies the umask is ignored

This mkdirs was replaced by Path(xx).mkdir in the current codebase, e.g.

Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)
.

But I think we should use mkdirs, as the Path(xx).mkdir does not set the mode correctly, since it factors in the umask , see this code:

e.g. code:

import os
from tempfile import TemporaryDirectory
from uuid import uuid4
from pathlib import Path

with TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, str(uuid4()), str(uuid4()))
    Path(path).mkdir(mode=0o777, parents=True, exist_ok=True)
    mode = oct(os.stat(path).st_mode)[-3:]
    print('mode is', mode) 
    assert mode == "777"

The output is :mode is, 755, which is not 777

see: https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

With Path.mkdirs, if parents is true, any missing parents of this path
are created as needed; they are created with the default permissions
without taking mode into account

see: https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir
@uranusjr
Copy link
Member

It may not be a good idea to reuse mkdirs. Its docstring specifically says If directory already exists, this is a no-op and the new implementation is not compatible with the previous behaviour.

@Taragolis
Copy link
Contributor

In additional airflow.utils.file.mkdirs does not use in airflow and community providers.
So any changes here does not improve anything.

@pingzh
Copy link
Contributor Author

pingzh commented Dec 15, 2022

In additional airflow.utils.file.mkdirs does not use in airflow and community providers. So any changes here does not improve anything.

@Taragolis it is true, it is not used. it was replaced by Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True), eg.

Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)

However, it is incorrect since the 0o777 won't be correctly set, see my PR description.

@uranusjr the main gap here is the about the dir mode, see this code block:

import os
from tempfile import TemporaryDirectory
from uuid import uuid4
from pathlib import Path

with TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, str(uuid4()), str(uuid4()))
    Path(path).mkdir(mode=0o777, parents=True, exist_ok=True)
    mode = oct(os.stat(path).st_mode)[-3:]
    print('mode is', mode) 
    assert mode == "777"

the mode is 755 not 777

@pingzh
Copy link
Contributor Author

pingzh commented Dec 15, 2022

It may not be a good idea to reuse mkdirs. Its docstring specifically says If directory already exists, this is a no-op and the new implementation is not compatible with the previous behaviour.

@uranusjr i think the new implementation is no-op when the dir already exists. it is actually from the previous implementation, e.g. 1.10.15

def mkdirs(path, mode):
"""
Creates the directory specified by path, creating intermediate directories
as necessary. If directory already exists, this is a no-op.
:param path: The directory to create
:type path: str
:param mode: The mode to give to the directory e.g. 0o755, ignores umask
:type mode: int
"""
try:
o_umask = os.umask(0)
os.makedirs(path, mode)
except OSError:
if not os.path.isdir(path):
raise
finally:
os.umask(o_umask)

@Taragolis
Copy link
Contributor

This implementation completely ignored umask for subdirectories, see:

import os
from tempfile import TemporaryDirectory
from pathlib import Path
from airflow.utils.file import mkdirs


def cur_umask() -> int:
    tmp = os.umask(0o022)
    os.umask(tmp)
    return oct(tmp)


def sample():
    print(f" Current umask: {cur_umask()} ".center(72, "="))
    with TemporaryDirectory() as tmpdir:
        sub_path = Path(tmpdir) / "child1"
        path = sub_path / "child2"

        mkdirs(path, 0o770)

        print(f"{sub_path} mode {oct(os.stat(sub_path).st_mode)}")
        print(f"{path} mode {oct(os.stat(path).st_mode)}")

sample()
# child1 mode 0o40777
# child2 mode 0o40770

os.umask(0)
sample()
# child1 mode 0o40777
# child2 mode 0o40770

os.umask(0o002)
sample()
# child1 mode 0o40777
# child2 mode 0o40770

@pingzh
Copy link
Contributor Author

pingzh commented Dec 15, 2022

@Taragolis thanks for the code sample, it boils down to the expectation that when users use this method mkdirs.

To me, it is a little bit confusing if i want a folder with 0o777, but it factors in the umask and gives me a folder that is not 0o777.

@potiuk
Copy link
Member

potiuk commented Dec 15, 2022

I think there is far too much of confusion here in general.

I'd rather say having "mkdir" which behaves different than Python and POSIX standard is confusing.

This was a decision made by Python creators and while we might argue with it, it has a good reason and it is explained in the docs of Python very clearly and explicitly https://docs.python.org/3/library/pathlib.html#pathlib.Path.mkdir - so you might say a lot of users will be confused if aiflow's mkdir behaves differently.

If parents is true, any missing parents of this path are created as needed; they are created with the default permissions without taking mode into account (mimicking the POSIX mkdir -p command).

I'd say having a different behaviour will only add to confusion (umask behaviour is sufficiently complex without it). I would also say that if we want to change behaviour of this in specific places and we have a good reason for it (do we?) undeprecating a method that have been deprecated > 2 years ago #10117 (which would add even more confusion) is a bad idea. Now - we want to add a new mkdirs util method that not only creates parents by default (contrary to the system lib), but also does it differently. That's super-confusing, and there is even no attempt to explain it in the docs of hte newly added method.

If we want to propose a change in behaviour of some pleces where currently Path.mkdir(parent=True) is used, then it should be a new feature added rather than silently changing it back. This ship has sailed. And rather than adding similarly named function to utils, it should have different name and signature to make it clear it is not similar to mkdirs behaviour.

But - this PR does not do that (it does not use the newly added method), so I am rather sceptical if we want to add it at all because it seems there is no intention to use it ? If we want to just add a new util file which is not used in airflow anyway why do we want to add it at all?

If it's not used, there is no reason to add it. YAGNI.

If it "just" needs to be used elsewhere, outside of Airflow (I guess AirBnB uses it), it should be added as a library there, not in Airflow.

We ere just now discussing about being very explicit what Public Airflow Interface is #28300 and the idea is there to be very explicit about what is "public" and treat all the rest as "internal detail". Adding - effectively (after old one is deprecated for 2.5 years) a new util API which is unused makes very little sense to me. We only want to expose as external interface things that are "extremely" important to be exposed. We do not want others relying on random utils and small things in Airflow, because it makes it more difficult to maintain Airflow in the future.

Are there any particular uses of it @pingzh in Airflow that you want to add now or soon (and if so - why it's not part of this PR)?

@pingzh
Copy link
Contributor Author

pingzh commented Dec 16, 2022

@potiuk thanks for chiming me. I agree with that:

And rather than adding similarly named function to utils, it should have different name and signature to make it clear it is not similar to mkdirs behaviour.

A better method name with very clear expectation is very important.

as for the usage of this method, i was planning to update places that use Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True). I should have called it out. as I don't want to make those changes, before we agree on the expected behavior of mkdirs. I noticed that while I was testing airflow 2 in Airbnb, but this method isn't directly used by our data pipelines)

For example, the places that I plan to change is the

# To handle log writing when tasks are impersonated, the log files need to
# be writable by the user that runs the Airflow command and the user
# that is impersonated. This is mainly to handle corner cases with the
# SubDagOperator. When the SubDagOperator is run, all of the operators
# run under the impersonated user and create appropriate log files
# as the impersonated user. However, if the user manually runs tasks
# of the SubDagOperator through the UI, then the log files are created
# by the user that runs the Airflow command. For example, the Airflow
# run command may be run by the `airflow_sudoable` user, but the Airflow
# tasks may be run by the `airflow` user. If the log files are not
# writable by both users, then it's possible that re-running a task
# via the UI (or vice versa) results in a permission error as the task
# tries to write to a log file created by the other user.
relative_path = self._render_filename(ti, ti.try_number)
full_path = os.path.join(self.local_base, relative_path)
directory = os.path.dirname(full_path)
# Create the log file and give it group writable permissions
# TODO(aoen): Make log dirs and logs globally readable for now since the SubDag
# operator is not compatible with impersonation (e.g. if a Celery executor is used
# for a SubDag operator and the SubDag operator has a different owner than the
# parent DAG)
Path(directory).mkdir(mode=0o777, parents=True, exist_ok=True)

To me, the mode of directory has to be 0o777 otherwise other users cannot write logs.

@Taragolis
Copy link
Contributor

This is the only one place in project where we tried to set 0o777 and it related only to SubDag, which nowadays deprecated due to tons of limitations.

And fix (it it actually required) could be Path(directory).chmod(0o777) or os.chmod(directory, 0o777) after directory created

@pingzh
Copy link
Contributor Author

pingzh commented Dec 16, 2022

@Taragolis @potiuk it looks like it's better just to update the file_task_handler.py given the fact that mkdir with mode isn't widely used. Let me know your thoughts.

@Taragolis
Copy link
Contributor

Yep. If this issue still exists better to fix them. It could be a situation that initial issue solved but comments still exists.

@potiuk
Copy link
Member

potiuk commented Dec 16, 2022

+1

@pingzh
Copy link
Contributor Author

pingzh commented Dec 19, 2022

close this in favor of #28477

@pingzh pingzh closed this Dec 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants