
Conversation

@Ghoul-SSZ

Problem

  1. Currently the DynamoDBToS3Operator supports only two ways of downloading data: a full export using scan, or a full export using boto3's export_table_to_point_in_time. When using the export_table_to_point_in_time method, you can also choose an "Incremental Export" instead of a "Full Export" if you only need the delta of data changes between a specified period. However, this functionality is currently not supported in the DynamoDBToS3Operator.
  2. There are times we need to export data from one AWS account (Account A) to another (Account B). In the full-export-using-scan method, you can specify two AWS connections to facilitate this. But for export_table_to_point_in_time, we need an additional argument (s3_bucket_owner) to allow cross-account export.
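For context, both gaps map directly onto parameters of boto3's export_table_to_point_in_time API (ExportType, IncrementalExportSpecification, S3BucketOwner). A minimal sketch of assembling such a request; build_export_params is an illustrative helper, not operator code, and the ARN/account IDs are placeholders:

```python
from datetime import datetime, timezone


def build_export_params(table_arn, s3_bucket, *, bucket_owner=None,
                        export_type="FULL_EXPORT", from_time=None, to_time=None):
    """Assemble keyword arguments for boto3's export_table_to_point_in_time.

    bucket_owner enables cross-account export (the bucket lives in Account B);
    export_type="INCREMENTAL_EXPORT" exports only the delta of changes
    between from_time and to_time.
    """
    params = {"TableArn": table_arn, "S3Bucket": s3_bucket, "ExportType": export_type}
    if bucket_owner:
        params["S3BucketOwner"] = bucket_owner
    if export_type == "INCREMENTAL_EXPORT":
        params["IncrementalExportSpecification"] = {
            "ExportFromTime": from_time,
            "ExportToTime": to_time,
            "ExportViewType": "NEW_AND_OLD_IMAGES",
        }
    return params


# The dict would then be passed as: client.export_table_to_point_in_time(**params)
params = build_export_params(
    "arn:aws:dynamodb:us-east-1:111111111111:table/my_table",
    "my-bucket-in-account-b",
    bucket_owner="222222222222",
    export_type="INCREMENTAL_EXPORT",
    from_time=datetime(2024, 8, 1, tzinfo=timezone.utc),
    to_time=datetime(2024, 8, 2, tzinfo=timezone.utc),
)
```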

Proposed Changes

  1. Add a boolean point_in_time_export flag to make it more visible that the Operator is doing a point-in-time export.
  2. Add export_type, incremental_export_from_time, incremental_export_to_time and incremental_export_view_type arguments to allow Incremental Export.
  3. Add an s3_bucket_owner argument to allow cross-account export.
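To make the proposed surface concrete, here is an illustrative sketch of how a DAG author might pass these settings. The class below is a plain-Python stand-in, not the real operator; the argument names mirror those discussed in this PR, and routing the incremental settings through an export_table_to_point_in_time_kwargs dict reflects the shape visible in the reviewed code:

```python
from datetime import datetime, timezone


class DynamoDBToS3OperatorStub:
    """Stand-in mirroring the proposed argument surface (illustrative only)."""

    def __init__(self, *, task_id, dynamodb_table_name, s3_bucket_name,
                 point_in_time_export=False, export_time=None,
                 export_table_to_point_in_time_kwargs=None):
        self.task_id = task_id
        self.dynamodb_table_name = dynamodb_table_name
        self.s3_bucket_name = s3_bucket_name
        self.point_in_time_export = point_in_time_export
        self.export_time = export_time
        # Default to {} so downstream **-unpacking never sees None
        self.export_table_to_point_in_time_kwargs = export_table_to_point_in_time_kwargs or {}


incremental = DynamoDBToS3OperatorStub(
    task_id="incremental_export",
    dynamodb_table_name="my_table",
    s3_bucket_name="my-bucket",
    point_in_time_export=True,
    export_table_to_point_in_time_kwargs={
        "ExportType": "INCREMENTAL_EXPORT",
        "IncrementalExportSpecification": {
            "ExportFromTime": datetime(2024, 8, 1, tzinfo=timezone.utc),
            "ExportToTime": datetime(2024, 8, 2, tzinfo=timezone.utc),
            "ExportViewType": "NEW_AND_OLD_IMAGES",
        },
    },
)
```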

closes: #40737


@boring-cyborg

boring-cyborg bot commented Aug 7, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide, and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@eladkal eladkal requested review from ferruzzi and vincbeck August 10, 2024 04:23
@eladkal eladkal changed the title Add incremental export and cross account export functionality Add incremental export and cross account export functionality in DynamoDBToS3Operator Aug 10, 2024
Contributor

@vincbeck vincbeck left a comment

Very nice work!

@vincbeck
Contributor

Static checks are failing. Could you please run them?

@Ghoul-SSZ
Author

I have run them and updated the imports, so the static checks should pass.

However, I have noticed that documentation build and spell checks have failed in the GitHub Action for commit 0a195aa but are not related to my changes.

Should I try to fix those now or maybe open a different Pull Request for it?

@vincbeck
Contributor

I have run them and updated the imports, so the static checks should pass.

However, I have noticed that documentation build and spell checks have failed in the GitHub Action for commit 0a195aa but are not related to my changes.

Should I try to fix those now or maybe open a different Pull Request for it?

I think it has been fixed by #41449, could you please update your fork and update this branch? It should solve the issue

@Ghoul-SSZ
Author

I have run them and updated the imports, so the static checks should pass.
However, I have noticed that documentation build and spell checks have failed in the GitHub Action for commit 0a195aa but are not related to my changes.
Should I try to fix those now or maybe open a different Pull Request for it?

I think it has been fixed by #41449, could you please update your fork and update this branch? It should solve the issue

Looks like it is still failing. 😞

@vincbeck
Contributor

vincbeck commented Aug 14, 2024

I resolved the doc building issue. Now, there is one unit test (Kubernetes provider) that is failing. It is definitely not related to this PR

@vincbeck
Contributor

But I am not sure what is going on either ....

@Ghoul-SSZ
Author

But I am not sure what is going on either ....

yeah it is strange. I will also try to look at it tomorrow when I have time

@vincbeck
Contributor

It has been fixed in #41500, please rebase your PR :)

@Ghoul-SSZ
Author

It has been fixed in #41500, please rebase your PR :)

All checks have passed! 🎉

@vincbeck vincbeck merged commit a70ee72 into apache:main Aug 15, 2024
@boring-cyborg

boring-cyborg bot commented Aug 15, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@Ghoul-SSZ Ghoul-SSZ deleted the feature/add-point-in-time-export-functionality-for-DynamoDBToS3Operator branch August 15, 2024 16:57
Contributor

@ferruzzi ferruzzi left a comment

This change has caused the test to start failing. I have found at least three issues; two I have left the fix for in the comments, the third I'm not sure yet. @Ghoul-SSZ - can you have a look and try to make the fixes?

Comment on lines 227 to +228
export_time,
backup_db_to_point_in_time,
backup_db_to_point_in_time_full_export,
Contributor

Insert latest_export_time between these two lines

       export_time,
       latest_export_time,
       backup_db_to_point_in_time_full_export,

Without it listed in the chain(), it fires off at the very beginning and causes the test to fail because env_id isn't populated yet.
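A toy model of why the omitted task fires immediately: chain() only creates dependencies between the tasks passed to it, so any task left out has no upstream and is runnable at once. The chain/ready_at_start functions below are a simplification for illustration, not Airflow's actual implementation:

```python
def chain(*tasks):
    """Toy model of Airflow's chain(): each task depends on the previous one."""
    upstream = {t: set() for t in tasks}
    for prev, nxt in zip(tasks, tasks[1:]):
        upstream[nxt].add(prev)
    return upstream


def ready_at_start(task, upstream):
    """A task with no recorded upstream dependencies is runnable immediately."""
    return not upstream.get(task)


deps = chain("export_time", "latest_export_time",
             "backup_db_to_point_in_time_full_export")
# A task never passed to chain() has no upstream entries, so the scheduler
# considers it runnable right away -- before env_id is populated.
```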

Contributor

fixed in #41517

self.point_in_time_export = point_in_time_export
self.export_time = export_time
self.export_format = export_format
self.export_table_to_point_in_time_kwargs = export_table_to_point_in_time_kwargs
Contributor

this needs to be

self.export_table_to_point_in_time_kwargs = export_table_to_point_in_time_kwargs or {}

otherwise if None is passed in, Line 199 fails because None can't be unpacked.
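The failure mode can be reproduced in isolation: Python raises a TypeError when ** is applied to None, which the `or {}` fallback avoids. In this sketch, run_export is a hypothetical stand-in for the call that unpacks the stored kwargs:

```python
def run_export(**kwargs):
    """Stand-in for the call that **-unpacks the stored kwargs."""
    return {"ExportFormat": "DYNAMODB_JSON", **kwargs}


export_table_to_point_in_time_kwargs = None

# The buggy form stores None directly, so unpacking it later fails.
failed = False
try:
    run_export(**export_table_to_point_in_time_kwargs)
except TypeError:  # argument after ** must be a mapping, not NoneType
    failed = True

# The fix: fall back to an empty dict so ** always sees a mapping.
safe_kwargs = export_table_to_point_in_time_kwargs or {}
result = run_export(**safe_kwargs)  # {'ExportFormat': 'DYNAMODB_JSON'}
```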

Contributor

fixed in #41517

Comment on lines +196 to +197
"ExportFromTime": export_time,
"ExportToTime": latest_export_time,
Contributor

This is an issue I ran out of time on today. With the other two fixes applied locally, the test keeps failing, and I've tried a number of solutions. Here's what I've found so far:

  • error: "from" time can't be equal to "to" time
    tried: set "from" to export_time - 1 minute
  • error: "from" can't be less than table creation time
    tried: revert the above and set "to" to latest_export_time + 1 minute
  • error: difference between "from" time and "to" time is less than 15 minutes
    tried: revert the above and set "to" to latest_export_time + 16 minutes
  • error: "to" time cannot be greater than the current time
    tried: revert the above and set "to" to utcnow()
  • error: this gets set at parse time, so "to" is now in the past

We will either need to figure out a way to make this happy or set that task to not run using an unreachable trigger_rule or a branch operator to skip it.
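The constraints hit above can be collected into a small client-side validator to see why they cannot all be satisfied for a freshly created table. This is a sketch only; the names are illustrative, and the authoritative validation happens server-side in DynamoDB:

```python
from datetime import datetime, timedelta, timezone


def validate_incremental_window(export_from, export_to, *, table_created_at, now):
    """Check an incremental-export window against the constraints observed
    during testing; returns an error string, or None if the window passes."""
    if export_from >= export_to:
        return '"from" time must be earlier than "to" time'
    if export_from < table_created_at:
        return '"from" time cannot be earlier than table creation time'
    if export_to - export_from < timedelta(minutes=15):
        return 'window must span at least 15 minutes'
    if export_to > now:
        return '"to" time cannot be greater than the current time'
    return None


now = datetime(2024, 8, 15, 12, 0, tzinfo=timezone.utc)
created = now - timedelta(hours=2)
# Satisfying all four constraints at once requires the table to be at least
# 15 minutes old -- which is why a freshly created system-test table cannot pass.
ok = validate_incremental_window(created, created + timedelta(minutes=20),
                                 table_created_at=created, now=now)
```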

Contributor

The error "Difference between "from" time and "to" time is less than 15 minutes" is the scariest. It means we would need to wait 15 minutes before we could create the incremental export. In that case we don't want to run it as part of the system test, so either we delete it from the system test or we somehow skip it.

Contributor

I agree. I just put up #41546 which skips that task. That way we can still have the snippet in the docs.

Artuz37 pushed a commit to Artuz37/airflow that referenced this pull request Aug 19, 2024
…amoDBToS3Operator` (apache#41304)



---------

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Aug 20, 2024
…amoDBToS3Operator` (apache#41304)



---------

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
Successfully merging this pull request may close these issues:

DynamoDBToS3Operator using native export functionality.