Skip to content

Conversation

@jrderuiter
Copy link
Contributor

JIRA

  • My PR addresses the following Airflow JIRA issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

https://issues.apache.org/jira/browse/AIRFLOW-2651

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR adds a set of file system hooks (FsHooks) that implement a common interface for manipulating files on different types of file systems. For more details, see the JIRA issue above. Currently the PR implements FsHooks for FTP, HDFS3, S3, SFTP and the local file system.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

  • tests/hooks/fs_hooks/test_base.py

  • tests/hooks/fs_hooks/test_ftp.py

  • tests/hooks/fs_hooks/test_hdfs3.py

  • tests/hooks/fs_hooks/test_local.py

  • tests/hooks/fs_hooks/test_s3.py

  • tests/hooks/fs_hooks/test_sftp.py

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

TODO.

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

TODO.

Code Quality

  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of new code here that duplicates existing hooks.

That is going to be confusing for on-going maintenance.

I haven't given a fuller review than the few comments at the moment as it's hard to see what is new and what is duplicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local FS is not good practice and shouldn't be included -- we don't want people to get in to the habit of storing files on disk IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree partially with you there @ashb, but a local path could also be a network mounted disk.

Copy link
Contributor Author

@jrderuiter jrderuiter Jun 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides this, the local hook can be useful for reading files locally and uploading them to a remote location. In that sense it mirrors some of the upload_file methods that exist now, but generalises by allowing files to be read anywhere by using different source/destination hooks, including the local file system.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these methods really add any value over just calling posixpath.join directly? The smaller the interface the easier it is to reason about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a smaller interface is better, however I wanted to keep the flexibility for supporting file systems that do not necessarily use posix-path style joins. If this is only different on windows, we can consider just using posixpath. (Is Airflow supposed to support windows?)

Copy link
Contributor Author

@jrderuiter jrderuiter Jun 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also keep the methods internal and default to posixpath, allowing subclasses to override if necessary. This would keep the interface cleaner, at the cost of losing some agnosticity for the caller concerning the used file system (they may need to account for the separators of the concerned fs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All versions of windows since 7? Vista? have supported / in paths - i.e. c:/foo.txt works so we don't need to worry about this and can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why latin-1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as the stdlib I guess? That is where most of the fnmatch code originated from.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different to fnmatch from the stdlib?

Copy link
Contributor Author

@jrderuiter jrderuiter Jun 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It allows for recursive globs, which the standard fnmatch does not allow. We can also not support recursive globbing, which allows for a simpler codebase by allowing us to rely on the standard fnmatch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be a new hook - it should be methods added to the existing S3Hook.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, otherwise these will diverge and people will get confused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily agree, as the current S3 hook represents S3 more as a key-based blob store than a file system. See my comment below for more details. A compromise may be to combine the two interfaces, allowing the hook to be used as a file system or the current interface. This would bloat the interface somewhat though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S3 doesn't have the concept of directories -- you can just write to paths at will.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why makedirs is essentially a no-op. It does check for existence of the 'directory', to keep the idea of a 'file system' in line with the other hooks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this shouldn't be a new hook, it should be part of the existing sftp hook.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I believe @NielsZeilemaker enriched the existing SFTP hook recently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove this

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jrderuiter. Thanks for taking the time to contribute this to Airflow. I really like the concepts and it adds some structure that Airflow is currently lacking. @ashb has some valid comments, and I also added a few.

Apart from that, I think it would be also really valuable to add a bit of documentation, so that other contributors can also add support, for example GCS. cc @fenglu-g @kaxil

Cheers, Fokko

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can rewrite this a bit:

if os.path is posixpath:
    names = [os.path.normcase(name) for name in names]
names = [match(name) for name in names]

I like short code ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree partially with you there @ashb, but a local path could also be a network mounted disk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe shorten this to:

return fnmatch.filter(all_paths, pattern, sep=self.sep)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being autistic, but sometimes I see:

def disconnect(self):
    if self._conn is not None:
        self._conn.disconnect()
    self._conn = None

and

def disconnect(self):
    if self._conn is not None:
        self._conn.disconnect()
        self._conn = None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one isn't used

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's this one used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, otherwise these will diverge and people will get confused

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I believe @NielsZeilemaker enriched the existing SFTP hook recently.

@Fokko
Copy link
Contributor

Fokko commented Jun 20, 2018

Can you also rebase onto master to resolve conflicts? :)

@jrderuiter
Copy link
Contributor Author

jrderuiter commented Jun 20, 2018

Thanks @ashb and @Fokko for the comments. I think the discussion about how to integrate this with existing hooks is an important issue, but it is not quite clear for me how this would best be solved.

The new SFTP hook implements most functionality of the existing hook, so dropping that hook from contrib would be possible, at the cost of changing the interface. I think this would be acceptable if the changes are integrated into a new major version of Airflow, rather than a new minor version. We can integrate any missing functionality into the new hook.

The existing S3 hook is a bit more difficult, as the new hook represents S3 as a file system, whilst the existing hook stays closer to the notion of S3 as a key-based blob store. As such, methods such as select_key and list_prefixes have no clear meaning in my proposed S3FsHook, as these aren't file system concepts. We could integrate some of these methods into the S3FsHook, but it would dilute the interface considerably, I fear.

I think the two options we have are to:

  1. Keep two separate hooks, one for S3 as a file system and one for S3 as a blob store.
  2. Keep both interfaces, but separate the blob store interface into a sub-property of the hook to keep the interfaces separate. This would still require separating the existing S3Hook from the AwsHook base class, to avoid multiple inheritance (as we would need to inherit from both FsHook and AwsHook otherwise). S3Hook could still use the AwsHook through composition, as the only thing the AwsHook is really needed for is fetching the client and/or credentials. The only other option would be to make FsHook a mixin, but I fear that that would lead to dilution of the interface.

I would be keen to hear your ideas on this.

@jrderuiter
Copy link
Contributor Author

I would like to stress that the biggest advantage of having a common interface for file systems is that it becomes easier to write operators/hooks that read/write to different (combinations of) file systems. For example, by combining these file system hooks we can essentially copy files between any of the file systems (by using specific source/destination hooks), rather than being limited to copying from local to/from a specific file system.

@ashb
Copy link
Member

ashb commented Jun 21, 2018

Definitely not having two versions of the hooks is crucial.

For the case of S3: yes, s3 is not a filesystem. But the "fs"-like interfaces should be implemented as method on the existing S3 hook using the existing methods where possible. i.e. walk() would use list_prefixes() in someway. Anything else is just going to make long-term maintenance too hard.

Multiple inheritance is entirely possible in python, and having:

class S3Hook(FsHook, AwsHook):

would be reasonable in this case.

@bolkedebruin
Copy link
Contributor

@jrderuiter I like the possibilities that this will deliver, but I think some architectural updates are required. The lineage improvements basically allow for the same kind of functionality and these changes will need to tie in with it. Maybe have a discussion offline or on th emaling list (or an improvement proposal) can speed this up?

@stale
Copy link

stale bot commented Dec 10, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 10, 2018
@stale stale bot closed this Dec 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants