-
Notifications
You must be signed in to change notification settings - Fork 16.4k
[AIRFLOW-2651] Add file system hooks with a common interface #3526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ashb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of new code here that duplicates existing hooks.
That is going to be confusing for on-going maintenance.
I haven't given a fuller review than the few comments at the moment as it's hard to see what is new and what is duplicated.
airflow/hooks/fs_hooks/base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Local FS is not good practice and shouldn't be included -- we don't want people to get in to the habit of storing files on disk IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree partially with you there @ashb, but a local path could also be a network mounted disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides this, the local hook can be useful for reading files locally and uploading them to a remote location. In that sense it mirrors some of the upload_file methods that exist now, but generalises by allowing files to be read anywhere by using different source/destination hooks, including the local file system.
airflow/hooks/fs_hooks/base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these methods really add any value over just calling posixpath.join directly? The smaller the interface the easier it is to reason about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that a smaller interface is better, however I wanted to keep the flexibility for supporting file systems that do not necessarily use posix-path style joins. If this is only different on windows, we can consider just using posixpath. (Is Airflow supposed to support windows?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also keep the methods internal and default to posixpath, allowing subclasses to override if necessary. This would keep the interface cleaner, at the cost of losing some agnosticity for the caller concerning the used file system (they may need to account for the separators of the concerned fs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All versions of windows since 7? Vista? have supported / in paths - i.e. c:/foo.txt works so we don't need to worry about this and can remove it.
airflow/hooks/fs_hooks/_fnmatch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why latin-1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same as the stdlib I guess? That is where most of the fnmatch code originated from.
airflow/hooks/fs_hooks/_fnmatch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different to fnmatch from the stdlib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It allows for recursive globs, which the standard fnmatch does not allow. We can also not support recursive globbing, which allows for a simpler codebase by allowing us to rely on the standard fnmatch.
airflow/hooks/fs_hooks/s3.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be a new hook - it should be methods added to the existing S3Hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, otherwise these will diverge and people will get confused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't necessarily agree, as the current S3 hook represents S3 more as a key-based blob store than a file system. See my comment below for more details. A compromise may be to combine the two interfaces, allowing the hook to be used as a file system or the current interface. This would bloat the interface somewhat though.
airflow/hooks/fs_hooks/s3.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S3 doesn't have the concept of directories -- you can just write to paths at will.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why makedirs is essentially a no-op. It does check for existence of the 'directory', to keep the idea of a 'file system' in line with the other hooks.
airflow/hooks/fs_hooks/sftp.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this shouldn't be a new hook, it should be part of the existing sftp hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I believe @NielsZeilemaker enriched the existing SFTP hook recently.
tests/hooks/fs_hooks/test_sftp.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove this
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jrderuiter. Thanks for taking the time to contribute this to Airflow. I really like the concepts and it adds some structure that Airflow is currently lacking. @ashb has some valid comments, and I also added a few.
Apart from that, I think it would be also really valuable to add a bit of documentation, so that other contributors can also add support, for example GCS. cc @fenglu-g @kaxil
Cheers, Fokko
airflow/hooks/fs_hooks/_fnmatch.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can rewrite this a bit:
if os.path is posixpath:
names = [os.path.normcase(name) for name in names]
names = [match(name) for name in names]
I like short code ;)
airflow/hooks/fs_hooks/base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree partially with you there @ashb, but a local path could also be a network mounted disk.
airflow/hooks/fs_hooks/base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe shorten this to:
return fnmatch.filter(all_paths, pattern, sep=self.sep)
airflow/hooks/fs_hooks/hdfs3.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being autistic, but sometimes I see:
def disconnect(self):
if self._conn is not None:
self._conn.disconnect()
self._conn = None
and
def disconnect(self):
if self._conn is not None:
self._conn.disconnect()
self._conn = None
airflow/hooks/fs_hooks/local.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one isn't used
airflow/hooks/fs_hooks/base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's this one used?
airflow/hooks/fs_hooks/s3.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, otherwise these will diverge and people will get confused
airflow/hooks/fs_hooks/sftp.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I believe @NielsZeilemaker enriched the existing SFTP hook recently.
|
Can you also rebase onto master to resolve conflicts? :) |
|
Thanks @ashb and @Fokko for the comments. I think the discussion about how to integrate this with existing hooks is an important issue, but it is not quite clear for me how this would best be solved. The new SFTP hook implements most functionality of the existing hook, so dropping that hook from contrib would be possible, at the cost of changing the interface. I think this would be acceptable if the changes are integrated into a new major version of Airflow, rather than a new minor version. We can integrate any missing functionality into the new hook. The existing S3 hook is a bit more difficult, as the new hook represents S3 as a file system, whilst the existing hook stays closer to the notion of S3 as a key-based blob store. As such, methods such as I think the two options we have are to:
I would be keen to hear your ideas on this. |
|
I would like to stress that the biggest advantage of having a common interface for file systems is that it becomes easier to write operators/hooks that read/write to different (combinations of) file systems. For example, by combining these file system hooks we can essentially copy files between any of the file systems (by using specific source/destination hooks), rather than being limited to copying from local to/from a specific file system. |
|
Definitely not having two versions of the hooks is crucial. For the case of S3: yes, s3 is not a filesystem. But the "fs"-like interfaces should be implemented as method on the existing S3 hook using the existing methods where possible. i.e. Multiple inheritance is entirely possible in python, and having: would be reasonable in this case. |
|
@jrderuiter I like the possibilities that this will deliver, but I think some architectural updates are required. The |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
JIRA
https://issues.apache.org/jira/browse/AIRFLOW-2651
Description
This PR adds a set of file system hooks (FsHooks) that implement a common interface for manipulating files on different types of file systems. For more details, see the JIRA issue above. Currently the PR implements FsHooks for FTP, HDFS3, S3, SFTP and the local file system.
Tests
My PR adds the following unit tests OR does not need testing for this extremely good reason:
tests/hooks/fs_hooks/test_base.py
tests/hooks/fs_hooks/test_ftp.py
tests/hooks/fs_hooks/test_hdfs3.py
tests/hooks/fs_hooks/test_local.py
tests/hooks/fs_hooks/test_s3.py
tests/hooks/fs_hooks/test_sftp.py
Commits
TODO.
Documentation
TODO.
Code Quality
git diff upstream/master -u -- "*.py" | flake8 --diff