Add DatabricksWorkflowTaskGroup #39771
Conversation
tatiana
left a comment
It's great to see how far you've advanced on this, @pankajkoti ! You're reimplementing the original feature in a better way. Thank you!
tatiana
left a comment
@pankajkoti I left some minor comments, but it looks great. It's impressive how you were able to keep the original interfaces, while being consistent with the Airflow Databricks provider. Thank you!
Lee-W
left a comment
Looks good! Left some minor suggestions; most of them are just annotation improvements.
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
@tatiana @Lee-W @phanikumv I have addressed all the review comments so far. Would appreciate another review, please.
Lee-W
left a comment
Thanks @pankajkoti! Overall it looks great! Left one suggestion, but we probably can do it in the next PR.
Hi @pankajkoti , thanks for addressing all the feedback, you made it better than the original implementation!
Some minor feedback that can be addressed in follow-up PRs:

- When using `DatabricksNotebookOperator` from `DatabricksWorkflowTaskGroup`, the Databricks Workflow Job Tasks should not be prefixed with the DAG & TaskGroup. Using the example shown in the PR description:
  - The task (notebook identifier) is named `workflow_notebook_1` in the Airflow Graph view
  - The task is currently named `example_Databricks_workflow_test_workflow_root_1234_run_workflow_notebook_1` in the Databricks Graph view - unnecessarily, since tasks are not displayed as jobs in the Databricks job list. The current name in Databricks makes it hard to understand how different notebooks relate to each other in the Databricks UI.
- To have a clear bullet list or table showing how the concepts map between Airflow & Databricks:
  - Airflow DAG - Databricks Workflow Job name prefix
  - Airflow TaskGroup - Databricks Workflow Job
  - Airflow Notebook task - Databricks Workflow Job Task
- Move a log line outside of the loop, since it does not give meaningful data to be printed for each iteration of the for loop: #39771 (comment)
- Switch the `Using token auth` log from `INFO` to `DEBUG` (unrelated to the current PR, but it improves the log messages)
- Refactor other parts of the Databricks provider to use the just-introduced `RunLifeCycleState` enum (see the sketch after this list)
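For context on that last item, here is a minimal sketch of the kind of refactor meant there, replacing hard-coded life-cycle strings with the enum. The enum values are assumed from the Databricks Jobs API run life-cycle states rather than copied from the PR, and the `run_is_finished` helper is purely illustrative, not the provider's actual code.

```python
from enum import Enum


class RunLifeCycleState(Enum):
    """Databricks Jobs API run life-cycle states (values assumed from the Databricks API docs)."""

    BLOCKED = "BLOCKED"
    INTERNAL_ERROR = "INTERNAL_ERROR"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SKIPPED = "SKIPPED"
    TERMINATED = "TERMINATED"
    TERMINATING = "TERMINATING"
    WAITING_FOR_RETRY = "WAITING_FOR_RETRY"


def run_is_finished(life_cycle_state: str) -> bool:
    # Hypothetical helper: before such a refactor, call sites would compare against raw
    # strings such as `state == "TERMINATED"`; the enum keeps the states in one place.
    return RunLifeCycleState(life_cycle_state) in {
        RunLifeCycleState.TERMINATED,
        RunLifeCycleState.SKIPPED,
        RunLifeCycleState.INTERNAL_ERROR,
    }
```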
As part of Astronomer's internal plans and decisions, we've decided to contribute the existing functionality provided by the operators and plugins in this repository to the official Apache Airflow Databricks provider. To achieve this, we submitted the following PRs to the Airflow provider:

1. apache/airflow#39178
2. apache/airflow#39771
3. apache/airflow#40013
4. apache/airflow#40724
5. apache/airflow#39295

All functionality has now been contributed to the Airflow Databricks provider, and ongoing support will be maintained there. As a result, we're deprecating the operators and plugins in this repository. Users are encouraged to transition to the official Apache Airflow Databricks provider as soon as possible. The migration process is straightforward: simply update the import path to point to the Airflow provider and ensure that you install `apache-airflow-providers-databricks>=6.8.0`, which includes all the contributions mentioned above.

closes: astronomer/issues-airflow#715
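As an illustration of the migration mentioned above, the import change might look roughly like this. The old module paths follow the astro-provider-databricks source tree linked earlier and the new ones are the Airflow provider modules the operators were contributed to; verify both against your installed versions.

```python
# Deprecated imports (astro-provider-databricks):
# from astro_databricks.operators.workflow import DatabricksWorkflowTaskGroup
# from astro_databricks.operators.notebook import DatabricksNotebookOperator

# Official provider imports (apache-airflow-providers-databricks>=6.8.0):
from airflow.providers.databricks.operators.databricks import DatabricksNotebookOperator
from airflow.providers.databricks.operators.databricks_workflow import DatabricksWorkflowTaskGroup
```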
This pull request introduces the [DatabricksWorkflowTaskGroup](https://github.com/astronomer/astro-provider-databricks/blob/main/src/astro_databricks/operators/workflow.py#L226) to the Airflow Databricks provider from the [astro-provider-databricks](https://github.com/astronomer/astro-provider-databricks/tree/main) repository. It marks another pull request aimed at contributing operators and features from that repository into the Airflow Databricks provider, the previous PR being #39178.

The task group launches a [Databricks Workflow](https://docs.databricks.com/en/workflows/index.html) and runs the notebook jobs from within it, resulting in a [75% cost reduction](https://www.databricks.com/product/pricing) ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute) when compared to executing ``DatabricksNotebookOperator`` outside of ``DatabricksWorkflowTaskGroup``. There are several advantages to defining your Databricks Workflows in Airflow.
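For reference, a minimal sketch of how the task group is meant to be used, loosely following the example DAG added in this PR. The connection id, cluster spec, and notebook paths are placeholders, and the parameter names follow the astro-provider-databricks interface this PR ports over, so treat the exact signatures as assumptions to verify against the merged provider code.

```python
from datetime import datetime

from airflow.models.dag import DAG
from airflow.providers.databricks.operators.databricks import DatabricksNotebookOperator
from airflow.providers.databricks.operators.databricks_workflow import DatabricksWorkflowTaskGroup

# Placeholder job cluster definition shared by the notebook tasks below.
job_cluster_spec = [
    {
        "job_cluster_key": "Shared_job_cluster",
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
    }
]

with DAG(
    dag_id="example_databricks_workflow",
    start_date=datetime(2024, 1, 1),
    schedule=None,
) as dag:
    # The task group creates and launches a single Databricks Workflow job;
    # the notebook operators inside it run as tasks of that job on the shared job cluster.
    workflow = DatabricksWorkflowTaskGroup(
        group_id="test_workflow",
        databricks_conn_id="databricks_default",
        job_clusters=job_cluster_spec,
    )
    with workflow:
        notebook_1 = DatabricksNotebookOperator(
            task_id="workflow_notebook_1",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_1",
            source="WORKSPACE",
            job_cluster_key="Shared_job_cluster",
        )
        notebook_2 = DatabricksNotebookOperator(
            task_id="workflow_notebook_2",
            databricks_conn_id="databricks_default",
            notebook_path="/Shared/notebook_2",
            source="WORKSPACE",
            job_cluster_key="Shared_job_cluster",
        )
        notebook_1 >> notebook_2
```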
The screenshots below depict a successful Airflow DAG run and the corresponding successful Databricks job run.
Airflow DAG view

Databricks job Graph view
Co-authored by: @dimberman @tatiana in the original repo.
Co-authored-by: Daniel Imberman daniel.imberman@gmail.com
Co-authored-by: Tatiana Al-Chueyr tatiana.alchueyr@gmail.com
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in newsfragments.