-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Use temporary file in GCSToS3Operator #21295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Use temporary file in GCSToS3Operator instead of keeping copied file content in the process memory. It allows copying big files on machines with small RAM size.
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
|
The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease. |
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
|
Some tests are failing (related) though |
raphaelauv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add an option to the operator, so user have the choice
in_memory:bool = True
Is there any drawback to not having it ? I believe this is higly unlikely to have less disk than memory? |
|
I was considering to add |
|
If It's the commun pattern to write to a temp file , then you are right it's better to align the operators. But for this operator since it could be about transferring big files , streaming from GCS to S3 with a multiparty upload would be great. |
Very much so, but it wasn't doing it - it was reading whole file to memory and pusing it. |
|
Any PRs for that are most welcome :) |
It would be cool but it would require more changes because AFAIK hook classes do not support streaming right now. |
|
Awesome work, congrats on your first merged pull request! |
Use temporary file in GCSToS3Operator instead of keeping copied file content in the process memory. It allows copying big files on machines with small RAM size.