Conversation

@feluelle
Member

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

  • Here are some details about my PR, including screenshots of any UI changes:

This PR adds an operator that transfers a mail attachment into an S3 bucket.
It also:

  • adds tests

NOTE: This operator only transfers the latest attachment by name.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and classes in the PR contain docstrings that explain what they do

Code Quality

  • Passes flake8

@feluelle
Member Author

feluelle commented Jan 10, 2019

PTAL: @feng-tao @kaxil @ashb

Especially because of this:

NOTE: This operator only transfers the latest attachment by name.

I am thinking of changing this so that you can transfer multiple attachments that, for example, match a regular expression you pass in, but I don't know if this is really a good idea and, if so, what the best way to implement it would be.

Something like adding imap_latest_only (set to False) + passing s3_bucket so that the attachments just keep their names looks very strange to me. :/ ...but I don't know any other way of doing this.
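For illustration, here is a minimal, self-contained sketch of the selection logic being discussed. The names `select_attachments` and `latest_only` are hypothetical and not part of this PR; real code would fetch the attachment names over IMAP rather than take them as a list:

```python
import re

def select_attachments(names, name_pattern, latest_only=True):
    """Select attachment names matching a regular expression.

    With latest_only=True this mimics the current behaviour (only the
    newest match is returned); with latest_only=False every match is
    returned, which is the multi-attachment variant discussed above.
    """
    # Assume `names` is ordered oldest-to-newest, as fetched from IMAP.
    matches = [n for n in names if re.search(name_pattern, n)]
    if latest_only:
        return matches[-1:]  # at most one element: the newest match
    return matches

names = ["report_2018.csv", "notes.txt", "report_2019.csv"]
print(select_attachments(names, r"report_\d{4}\.csv"))
# ['report_2019.csv']
print(select_attachments(names, r"report_\d{4}\.csv", latest_only=False))
# ['report_2018.csv', 'report_2019.csv']
```

With this shape, the regex alone decides which attachments qualify, and the flag only decides whether all matches or just the newest one are transferred, so no separate naming scheme in the bucket is needed.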

Contributor

@Fokko Fokko left a comment


Thanks @feluelle

Looking good, but I have some minor comments. PTAL

@feluelle feluelle force-pushed the feature/AIRFLOW-3552-imap-to-s3-transfer branch from 775e108 to b146d14 Compare January 14, 2019 08:42
@codecov-io

Codecov Report

Merging #4476 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4476      +/-   ##
==========================================
+ Coverage   74.76%   74.78%   +0.02%     
==========================================
  Files         429      430       +1     
  Lines       29648    29670      +22     
==========================================
+ Hits        22165    22189      +24     
+ Misses       7483     7481       -2
Impacted Files Coverage Δ
.../contrib/operators/imap_to_s3_transfer_operator.py 100% <100%> (ø)
airflow/models/__init__.py 92.54% <0%> (+0.04%) ⬆️
airflow/contrib/operators/ssh_operator.py 85% <0%> (+1.25%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e990aef...b146d14.

@feluelle
Member Author

@Fokko resolved, and I also added template fields.
Please take another look when you get the time. Thanks :)

@feluelle
Member Author

This PR kind of depends on #4475. @Fokko, could you also take a look there? :)

@OmerJog
Contributor

OmerJog commented Jan 23, 2019

Is there a naming policy?
This PR introduces ImapToS3TransferOperator in file imap_to_s3_transfer_operator.py.

However, there are two conventions:

Without Operator:
OracleToOracleTransfer in file oracle_to_oracle_transfer.py
OracleToAzureDataLakeTransfer in file oracle_to_azure_data_lake_transfer.py

With Operator:
GoogleCloudStorageToGoogleCloudStorageTransferOperator in file gcs_to_gcs_transfer_operator.py
S3ToGoogleCloudStorageTransferOperator in file s3_to_gcs_transfer_operator.py

It's very confusing.

@ashb
Member

ashb commented Jan 23, 2019

Most things have without Operator - I think the Oracle ones you mention are the exception (but I haven't checked all of them)

(Edit: meant to say with operator)

@OmerJog
Contributor

OmerJog commented Jan 23, 2019

@ashb Actually, most have Operator.
I'm not sure why the Transfer operators are treated differently; in the end they are still Operators like the rest:
PostgresToGoogleCloudStorageOperator
SFTPToS3Operator
etc..

Actually there is a 3rd kind HiveToDynamoDBTransferOperator in file hive_to_dynamodb.py (No Transfer nor Operator in the file name).

I think there should be specific rules for this in CONTRIBUTING.md and that will be enforced when code-reviewing. But this discussion is out of scope of this specific PR.

@feluelle
Member Author

@OmerJog Absolutely agree. I don't like how this is currently done either.
Maybe you can open a separate PR for this, so we can continue discussing it there - or even better, send a mail to the mailing list, especially for naming conventions.

@OmerJog
Contributor

OmerJog commented Jan 23, 2019

I'm not on the mailing list; that's usually too spammy for me. Just thought it was worth mentioning.

@feluelle
Member Author

Please do not merge this yet. I need to adapt this to the latest changes from #4475.

@ashb Do you think you should be able to specify the not_found_mode here? Or not for this kind of operator? (The default is raise then.)

Member

@ashb ashb left a comment


From the class name I wouldn't have expected this to be about attachments, but just the message. Maybe ImapAttachmentToS3TransferOperator is a little bit long; ImapAttachmentToS3Operator?

If the sole purpose of this operator is to transfer attachments, then failing when there are none probably makes sense.

@feluelle feluelle force-pushed the feature/AIRFLOW-3552-imap-to-s3-transfer branch from b146d14 to b4de8ea Compare January 28, 2019 14:13
@feluelle feluelle changed the title [AIRFLOW-3552] Add ImapToS3TransferOperator [AIRFLOW-3552] Add ImapAttachmentToS3Operator Jan 28, 2019
NOTE: This operator only transfers the latest attachment by name.
- adds tests

[AIRFLOW-3552] Add ImapToS3TransferOperator

- add missing template fields
- add missing apply_defaults decorator
- change s3_conn_id to default to aws_default connection id

[AIRFLOW-3602] Rename class and its references

[AIRFLOW-3602] Fix imap_to_s3 test class
@feluelle feluelle force-pushed the feature/AIRFLOW-3552-imap-to-s3-transfer branch from b4de8ea to e6f8492 Compare January 28, 2019 16:16
@feluelle
Member Author

Hey @ashb, I changed the class name. Any other comments about this feature? :)

Member

@ashb ashb left a comment


Not a comment on this PR, more a general observation:

I'd like to find a way to make this sort of XToYOperator more generic.

Something like:

CopyOperator(
    source=ImapAttachmentSomething(imap_conn_id...),
    dest=S3Fs(....)
)

Very half-baked idea, but supporting arbitrary sources and dests would be nice.

@ashb ashb merged commit cd9d543 into apache:master Jan 29, 2019
@feluelle
Member Author

feluelle commented Jan 29, 2019

@ashb I understand your idea, but I don't think this can be more generic. It won't get simpler.

What I have done for our custom Airflow plugin is add a kind of interface:

class BaseTransferOperator(BaseOperator):

  def __init__(self, source_conn_id, destination_conn_id, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.source_conn_id = source_conn_id
    self.destination_conn_id = destination_conn_id

So all transfer operators need to have two connections.

@mik-laj
Member

mik-laj commented Jan 29, 2019

@ashb Look at:
#3526
Here was the beginning of work on this idea.

@mik-laj
Member

mik-laj commented Jan 29, 2019

@ashb
We need to think about classifying different types of systems (e.g. filesystem, database) and, on that basis, write operators which copy data between the different types of systems.
Currently, we have GenericTransfer, which copies between a database and a database.
Another operator could be created which copies data between a database and a file system.
However, this requires a lot of deeper architectural thought.
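As a rough illustration of the database-to-database pattern that GenericTransfer follows (run a query on the source, bulk-insert the rows on the destination), here is a self-contained sketch using the stdlib sqlite3 module instead of Airflow hooks. The function name `generic_transfer` and the use of raw connections are assumptions for the example, not Airflow's actual API:

```python
import sqlite3

def generic_transfer(source_conn, dest_conn, sql, destination_table):
    """Minimal database-to-database transfer in the spirit of Airflow's
    GenericTransfer: query the source, bulk-insert into the destination."""
    rows = source_conn.execute(sql).fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        dest_conn.executemany(
            f"INSERT INTO {destination_table} VALUES ({placeholders})", rows
        )
        dest_conn.commit()
    return len(rows)

# Two in-memory databases stand in for the two connections.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE mail (subject TEXT, size INTEGER)")
src.executemany("INSERT INTO mail VALUES (?, ?)",
                [("invoice", 120), ("report", 450)])

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE mail_copy (subject TEXT, size INTEGER)")

n = generic_transfer(src, dst, "SELECT subject, size FROM mail", "mail_copy")
print(n)  # 2
```

A database-to-filesystem variant would keep the same query step but replace the insert step with a serializer (CSV, Parquet, ...), which is exactly where the classification of system types discussed above starts to matter.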

ashb pushed a commit that referenced this pull request Apr 3, 2019
@feluelle feluelle deleted the feature/AIRFLOW-3552-imap-to-s3-transfer branch April 9, 2019 12:09
wmorris75 pushed a commit to modmed-external/incubator-airflow that referenced this pull request Jul 29, 2019