@dabla
Contributor

@dabla dabla commented Dec 10, 2024

As explained in my Airflow blog post on Medium, I've refactored the GenericTransfer operator to support deferred paginated reads.

When dealing with large datasets, the whole dataset no longer has to be read into memory before it is persisted, which could otherwise lead to out-of-memory errors on the worker executing the code.

I also took the opportunity to introduce an SQLExecuteQueryTrigger in the common.sql provider, which lets GenericTransfer handle the paginated reads in deferred mode. This decouples the reads from the writes: the worker is no longer blocked continuously, because it can offload reading the next page to the triggerer while it persists the previous one.
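To illustrate the idea of overlapping reads and writes, here is a minimal sketch in plain asyncio. It does not use Airflow's deferral or trigger API; `read_page`, `transfer`, and the in-memory `sink` are hypothetical stand-ins for the read done by the triggerer and the write done by the worker.

```python
import asyncio


async def read_page(data, offset, page_size):
    """Hypothetical stand-in for the read that would run in the triggerer."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return data[offset : offset + page_size]


async def transfer(data, sink, page_size):
    """Start reading page N+1 before page N is written."""
    offset = 0
    next_read = asyncio.create_task(read_page(data, offset, page_size))
    while True:
        page = await next_read
        if not page:
            break
        offset += page_size
        # Kick off the next read before persisting the current page.
        next_read = asyncio.create_task(read_page(data, offset, page_size))
        # Run the (blocking) write off the event loop.
        await asyncio.to_thread(sink.extend, page)
    return len(sink)


sink = []
written = asyncio.run(transfer(list(range(23)), sink, page_size=5))
```

Only one page being written and one page being read are alive at any time, which is the memory property the refactoring is after.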

Once the dialects PR is done, we could improve how GenericTransfer handles paginated SQL queries across different databases. For now, the paginated SQL query can be customized through the paginated_sql_statement_format parameter, and the read size is set through the page_size parameter. Another (better) name could be preferred here, but I'll let you decide how it's best named. If no page_size is specified, the original implementation is used and everything is read and persisted in one go.
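As a rough sketch of what a paginated transfer looks like in general, assuming a LIMIT/OFFSET style format string: the placeholder names in `PAGINATED_SQL_FORMAT` and the helper `transfer_paginated` are illustrative only and are not taken from the operator; sqlite3 is used just to keep the example self-contained.

```python
import sqlite3

# Assumed format; the operator's real default for paginated_sql_statement_format may differ.
PAGINATED_SQL_FORMAT = "{sql} LIMIT {page_size} OFFSET {offset}"


def transfer_paginated(source, destination, sql, insert_sql, page_size):
    """Copy rows page by page so only one page is held in memory at a time."""
    offset = 0
    total = 0
    while True:
        page_sql = PAGINATED_SQL_FORMAT.format(sql=sql, page_size=page_size, offset=offset)
        rows = source.execute(page_sql).fetchall()
        if not rows:
            break  # no more pages
        destination.executemany(insert_sql, rows)
        destination.commit()
        total += len(rows)
        offset += page_size
    return total


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany(
    "INSERT INTO src (id, name) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(25)],
)

destination = sqlite3.connect(":memory:")
destination.execute("CREATE TABLE dst (id INTEGER PRIMARY KEY, name TEXT)")

copied = transfer_paginated(
    source,
    destination,
    "SELECT id, name FROM src ORDER BY id",  # a stable order keeps pages disjoint
    "INSERT INTO dst (id, name) VALUES (?, ?)",
    page_size=10,
)
```

The exact pagination syntax differs between databases, which is why a customizable format (and later the dialects work) matters.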

Last but not least, I've moved the test code for deferrable operators out of the Microsoft Azure provider and into the common test utils, so it can be reused across multiple modules.



@dabla
Contributor Author

dabla commented Dec 11, 2024

The following dependency check is failing in breeze:

pytest.param(
            ("providers/src/airflow/providers/standard/operators/bash.py",),
            {
                "selected-providers-list-as-string": "common.compat standard",
                "all-python-versions": "['3.9']",
                "all-python-versions-list-as-string": "3.9",
                "python-versions": "['3.9']",
                "python-versions-list-as-string": "3.9",
                "ci-image-build": "true",
                "prod-image-build": "false",
                "needs-helm-tests": "false",
                "run-tests": "true",
                "run-amazon-tests": "false",
                "docs-build": "true",
                "run-kubernetes-tests": "false",
                "skip-pre-commits": "identity,lint-helm-chart,mypy-airflow,mypy-dev,mypy-docs,mypy-providers,mypy-task-sdk,"
                "ts-compile-format-lint-ui,ts-compile-format-lint-www",
                "upgrade-to-newer-dependencies": "false",
                "core-test-types-list-as-string": "Always Core Serialization",
                "providers-test-types-list-as-string": "Providers[common.compat] Providers[standard]",
                "needs-mypy": "true",
                "mypy-checks": "['mypy-providers']",
            },
            id="Providers standard tests and Serialization tests to run when airflow bash.py changed",
        ),

@eladkal @potiuk This failure is expected: I had to add the common.sql provider as a dependency, because GenericTransfer now relies on the newly introduced SQLExecuteQueryTrigger for the deferred paging mechanism.

But after some reflection, it still feels illogical to me that the GenericTransfer operator is part of the standard provider package, unless it allows more than just transferring data from database to database. If not, it would make more sense for it to live in the common.sql provider, or am I missing something?

I've been going through the code and checked the implementations of the get_records and insert_rows methods, which were all implemented by a Hook extending DbApiHook, but I suspect DbApiHook was introduced after GenericTransfer already existed.

@dabla
Contributor Author

dabla commented Jan 1, 2025

Hello @potiuk @eladkal, could you check whether my question above makes sense? Thanks.

@potiuk
Member

potiuk commented Jan 1, 2025

Absolutely. GenericTransfer should be in common.sql, no doubt about that.

@dabla
Contributor Author

dabla commented Jan 1, 2025

@potiuk Okay, but this would then have an impact on imports, no? Or would you keep the same structure as is and just move GenericTransfer from the standard provider to common.sql?

@potiuk
Member

potiuk commented Jan 1, 2025

GenericTransfer was only moved to the "standard" provider recently, as part of the preparation for Airflow 3. The "standard" provider has not YET been released in a 1.0.* version (it is at 0.0.3 now), because we have not yet finished extracting everything into it and we expected some changes here and there. So moving GenericTransfer to the standard provider can be classified as a mistake: it should have been moved to common.sql in the first place, and we can do that now without worrying about backwards compatibility.

The only backwards-compatibility issue is that the old GenericTransfer should be redirected in Airflow 3, but we can simply redirect it to the new place in common.sql, no problem with it whatsoever:
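A minimal sketch of what such a redirect could look like, assuming a plain module-level `__getattr__` (PEP 562) rather than whatever helper Airflow actually uses for this. The module names `old_home` and `new_home` and the placeholder class are hypothetical stand-ins, built in memory so the example runs on its own.

```python
import importlib
import sys
import types
import warnings

# Hypothetical new location of the class.
new_module = types.ModuleType("new_home")


class GenericTransfer:
    """Placeholder standing in for the real operator."""


new_module.GenericTransfer = GenericTransfer
sys.modules["new_home"] = new_module

# Hypothetical old location: holds no code, only a map of moved names.
old_module = types.ModuleType("old_home")
_MOVED = {"GenericTransfer": "new_home"}


def _redirect(name):
    """Module-level __getattr__: resolve moved names lazily and warn."""
    if name not in _MOVED:
        raise AttributeError(f"module 'old_home' has no attribute {name!r}")
    warnings.warn(
        f"old_home.{name} has moved to {_MOVED[name]}.{name}",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(importlib.import_module(_MOVED[name]), name)


old_module.__getattr__ = _redirect
sys.modules["old_home"] = old_module
```

With this in place, code that still accesses `old_home.GenericTransfer` gets the class from its new location together with a DeprecationWarning.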

@potiuk potiuk force-pushed the feature/paginated-generic-transfer branch from fa178af to 59eae35 Compare January 2, 2025 12:37
@potiuk
Member

potiuk commented Jan 2, 2025

@dabla - I rebased it -> together with @jscheffl we found an issue with the new caching scheme that would run the "main" version of the tests; it is fixed in #45347.

@dabla
Contributor Author

dabla commented Jan 2, 2025

Thx @jscheffl and @potiuk

@dabla
Contributor Author

dabla commented Jan 31, 2025

I'm having trouble fixing this mypy error, as both the .py and .pyi files have a correct signature, at least from what I can see:

providers/common/sql/src/airflow/providers/common/sql/triggers/sql.pyi:45: error:
Return type "Coroutine[Any, Any, AsyncIterator[TriggerEvent]]" of "run"
incompatible with return type "AsyncIterator[TriggerEvent]" in supertype
"BaseTrigger"  [override]
        async def run(self) -> AsyncIterator[TriggerEvent]: ...
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Found 1 error in 1 file (checked 3609 source files)
Error 1 returned
You are running mypy with the folders selected. If you want to reproduce it l
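One plausible reading of this error, offered as an assumption since the thread does not say how it was resolved: an `async def` whose body contains no `yield` (as in a stub, where the body is `...`) is treated as a coroutine function that returns an AsyncIterator, not as an async generator. The class names below are hypothetical; the runtime check mirrors the distinction mypy draws.

```python
import inspect
from collections.abc import AsyncIterator


class WithYield:
    async def run(self) -> AsyncIterator[int]:
        yield 1  # the yield makes this an async generator function


class StubLike:
    # No yield in the body: this is a plain coroutine function, so its
    # effective return type is Coroutine[Any, Any, AsyncIterator[int]].
    async def run(self) -> AsyncIterator[int]: ...


class StubWithoutAsync:
    # Declaring the stub signature without `async` keeps the declared
    # return type as AsyncIterator[int].
    def run(self) -> AsyncIterator[int]: ...


is_generator = inspect.isasyncgenfunction(WithYield.run)
is_coroutine = inspect.iscoroutinefunction(StubLike.run)
```

The mypy documentation recommends the `StubWithoutAsync` form for stubs, protocols, and abstract base classes.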

@dabla
Contributor Author

dabla commented Feb 12, 2025

@eladkal @potiuk tests are green again

Member

@potiuk potiuk left a comment

Really nice and clean! Thanks @dabla !

@potiuk potiuk merged commit 310f5cd into apache:main Feb 26, 2025
149 checks passed
ambika-garg pushed a commit to ambika-garg/airflow that referenced this pull request Feb 28, 2025
Refactored the GenericTransfer operator to support paginated reads (in deferred mode) and introduce a SQLExecuteQueryTrigger and moved it to common.sql


---------

Co-authored-by: David Blain <david.blain@infrabel.be>
Co-authored-by: Aritra Basu <24430013+aritra24@users.noreply.github.com>
Co-authored-by: max <42827971+moiseenkov@users.noreply.github.com>
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
Co-authored-by: Brent Bovenzi <brent@astronomer.io>
Co-authored-by: Pierre Jeambrun <pierrejbrun@gmail.com>
Co-authored-by: Kalyan R <kalyan.ben10@live.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Amogh Desai <amoghrajesh1999@gmail.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Kunal Bhattacharya <kunal.jubce@gmail.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: jj.lee <63435794+jx2lee@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>
Co-authored-by: Pratiksha <128999446+Prab-27@users.noreply.github.com>
Co-authored-by: pratiksha rajendrabhai badheka <pratiksha@DESKTOP-T5HUA05>
Co-authored-by: Josix <josixwang@gmail.com>
Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025