@rafalh
Contributor

@rafalh rafalh commented Feb 3, 2022

Use a temporary file in GCSToS3Operator instead of keeping the copied file's content in process memory. This allows copying large files on machines with little RAM.
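
A minimal sketch of the idea, not the operator's actual code: the object is written to a temporary file on disk and then uploaded from that file, so memory use stays bounded by the copy buffer rather than by the file size. The names `copy_via_temp_file`, `copy_between_paths`, `download`, and `upload` are hypothetical stand-ins for the hook calls; local files play the role of GCS and S3 here.

```python
import shutil
import tempfile
from typing import BinaryIO, Callable


def copy_via_temp_file(
    download: Callable[[BinaryIO], None],
    upload: Callable[[str], None],
) -> None:
    """Copy one object through a temporary file instead of a bytes buffer.

    `download` writes the source object into an open binary file;
    `upload` sends the file at the given path to the destination.
    Both are hypothetical placeholders for real hook methods.
    """
    # Note: reopening a NamedTemporaryFile by name while it is still open
    # works on POSIX systems but not on Windows with the default settings.
    with tempfile.NamedTemporaryFile() as tmp:
        download(tmp)      # data goes to disk, not into a Python bytes object
        tmp.flush()        # make sure everything is visible under tmp.name
        upload(tmp.name)   # destination reads the data back from disk


def copy_between_paths(src_path: str, dst_path: str) -> None:
    """Local-file stand-in for a GCS -> S3 copy, for illustration only."""

    def download(fileobj: BinaryIO) -> None:
        with open(src_path, "rb") as src:
            # Copy in 1 MiB pieces so only one piece is in memory at a time.
            shutil.copyfileobj(src, fileobj, length=1024 * 1024)

    def upload(filename: str) -> None:
        shutil.copyfile(filename, dst_path)

    copy_via_temp_file(download, upload)
```

The trade-off is extra disk I/O and a need for enough free temporary disk space to hold the largest copied file.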

@boring-cyborg boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Feb 3, 2022
@boring-cyborg

boring-cyborg bot commented Feb 3, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst).
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@github-actions github-actions bot added the okay to merge (It's ok to merge this PR as it does not require more tests) label Feb 4, 2022
@github-actions

github-actions bot commented Feb 4, 2022

The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

Member

@potiuk potiuk left a comment

Cool!

@potiuk
Member

potiuk commented Feb 4, 2022

Some tests are failing (related), though.

Contributor

@raphaelauv raphaelauv left a comment

Maybe add an option to the operator, so the user has the choice:

in_memory: bool = True

@potiuk
Member

potiuk commented Feb 6, 2022

Maybe add an option to the operator, so the user has the choice:

in_memory: bool = True

Is there any drawback to not having it? I believe it is highly unlikely for a machine to have less disk than memory.

@potiuk potiuk closed this Feb 6, 2022
@potiuk potiuk reopened this Feb 6, 2022
@rafalh
Contributor Author

rafalh commented Feb 6, 2022

I was considering adding an in_memory argument, but I agree it does not bring much and unnecessarily increases the operator's implementation and API complexity. AFAIK other operators also use temporary files and don't have an option to change that.
I thought the test failures were unrelated to my changes, but I was wrong. I am going to look into them this week.

@raphaelauv
Contributor

If it's the common pattern to write to a temp file, then you are right, it's better to align the operators.

But since this operator could be used for transferring big files, streaming from GCS to S3 with a multipart upload would be great.

@potiuk
Member

potiuk commented Feb 7, 2022

But since this operator could be used for transferring big files, streaming from GCS to S3 with a multipart upload would be great.

Very much so, but it wasn't doing that - it was reading the whole file into memory and pushing it.

@potiuk
Member

potiuk commented Feb 7, 2022

Any PRs for that are most welcome :)

@rafalh
Contributor Author

rafalh commented Feb 8, 2022

But since this operator could be used for transferring big files, streaming from GCS to S3 with a multipart upload would be great.

It would be cool, but it would require more changes because AFAIK the hook classes do not support streaming right now.
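
For illustration only, the streaming idea might look roughly like the sketch below: read the source in fixed-size parts and hand each part to an uploader, so that at most one part is held in memory and nothing touches the disk. This is not something the hooks provide today; `read_up_to`, `stream_in_parts`, and the `upload_part` callable are hypothetical names, and a real version would also have to start and complete the multipart upload on the S3 side. S3 requires every part except the last to be at least 5 MiB, hence the default part size.

```python
from typing import BinaryIO, Callable

# S3 multipart uploads require each part except the last to be >= 5 MiB.
MIN_PART_SIZE = 5 * 1024 * 1024


def read_up_to(source: BinaryIO, size: int) -> bytes:
    """Read up to `size` bytes, tolerating short reads from network streams."""
    buf = bytearray()
    while len(buf) < size:
        chunk = source.read(size - len(buf))
        if not chunk:  # end of stream
            break
        buf.extend(chunk)
    return bytes(buf)


def stream_in_parts(
    source: BinaryIO,
    upload_part: Callable[[int, bytes], None],
    part_size: int = MIN_PART_SIZE,
) -> int:
    """Feed `source` to `upload_part` one part at a time.

    `upload_part(part_number, data)` is a hypothetical callback; part
    numbers start at 1, as they do in S3. Returns the number of parts sent.
    """
    part_number = 0
    while True:
        data = read_up_to(source, part_size)
        if not data:
            break
        part_number += 1
        upload_part(part_number, data)
    return part_number
```

Peak memory use here is roughly one part, independent of the total object size.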

@eladkal eladkal merged commit 2c5f636 into apache:main Feb 11, 2022
@boring-cyborg

boring-cyborg bot commented Feb 11, 2022

Awesome work, congrats on your first merged pull request!
