From 6347db2c10b7c10cd1fa6052c6920280a0d6c105 Mon Sep 17 00:00:00 2001
From: TJaniF
Date: Fri, 9 Aug 2024 12:33:43 +0200
Subject: [PATCH 1/4] Typo fix dataset guide

---
 docs/apache-airflow/authoring-and-scheduling/datasets.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/apache-airflow/authoring-and-scheduling/datasets.rst b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
index 07eb571153853..dbc85f92eb50a 100644
--- a/docs/apache-airflow/authoring-and-scheduling/datasets.rst
+++ b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
@@ -338,7 +338,7 @@ In this example, the DAG ``waiting_for_dataset_1_and_2`` will be triggered when
 
     ...
 
-``quededEvent`` API endpoints are introduced to manipulate such records.
+``queuedEvent`` API endpoints are introduced to manipulate such records.
 
 * Get a queued Dataset event for a DAG: ``/datasets/queuedEvent/{uri}``
 * Get queued Dataset events for a DAG: ``/dags/{dag_id}/datasets/queuedEvent``
@@ -347,7 +347,7 @@ In this example, the DAG ``waiting_for_dataset_1_and_2`` will be triggered when
 * Get queued Dataset events for a Dataset: ``/dags/{dag_id}/datasets/queuedEvent/{uri}``
 * Delete queued Dataset events for a Dataset: ``DELETE /dags/{dag_id}/datasets/queuedEvent/{uri}``
 
-    For how to use REST API and the parameters needed for these endpoints, please refer to :doc:`Airflow API `
+    For how to use REST API and the parameters needed for these endpoints, please refer to :doc:`Airflow API `.
 
 Advanced dataset scheduling with conditional expressions
 --------------------------------------------------------

From d41743970730ae71abf08cd889d4f8efe6a2e99d Mon Sep 17 00:00:00 2001
From: TJaniF
Date: Fri, 9 Aug 2024 12:47:10 +0200
Subject: [PATCH 2/4] remove extra bracket in example code

---
 docs/apache-airflow/authoring-and-scheduling/datasets.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/apache-airflow/authoring-and-scheduling/datasets.rst b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
index dbc85f92eb50a..d311c3c5ad227 100644
--- a/docs/apache-airflow/authoring-and-scheduling/datasets.rst
+++ b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
@@ -444,7 +444,7 @@ The following example creates a dataset event against the S3 URI ``f"s3://bucket
 
     @task(outlets=[DatasetAlias("my-task-outputs")])
     def my_task_with_metadata():
-        s3_dataset = Dataset("s3://bucket/my-task}")
+        s3_dataset = Dataset("s3://bucket/my-task")
         yield Metadata(s3_dataset, extra={"k": "v"}, alias="my-task-outputs")
 
 Only one dataset event is emitted for an added dataset, even if it is added to the alias multiple times, or added to multiple aliases. However, if different ``extra`` values are passed, it can emit multiple dataset events. In the following example, two dataset events will be emitted.

From 33e184711dfe74a30e6ee5c6ae9b6f2fda740c13 Mon Sep 17 00:00:00 2001
From: TJaniF
Date: Fri, 9 Aug 2024 13:31:07 +0200
Subject: [PATCH 3/4] missing word

---
 docs/apache-airflow/authoring-and-scheduling/datasets.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/apache-airflow/authoring-and-scheduling/datasets.rst b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
index d311c3c5ad227..f6b194fa7b89a 100644
--- a/docs/apache-airflow/authoring-and-scheduling/datasets.rst
+++ b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
@@ -470,7 +470,7 @@ Only one dataset event is emitted for an added dataset, even if it is added to t
 
 Scheduling based on dataset aliases
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Since dataset events added to an alias are just simple dataset events, a downstream depending on the actual dataset can read dataset events of it normally, without considering the associated aliases. A downstream can also depend on a dataset alias. The authoring syntax is referencing the ``DatasetAlias`` by name, and the associated dataset events are picked up for scheduling. Note that a DAG can be triggered by a task with ``outlets=DatasetAlias("xxx")`` if and only if the alias is resolved into ``Dataset("s3://bucket/my-task")``. The DAG runs whenever a task with outlet ``DatasetAlias("out")`` gets associated with at least one dataset at runtime, regardless of the dataset's identity. The downstream DAG is not triggered if no datasets are associated to the alias for a particular given task run. This also means we can do conditional dataset-triggering.
+Since dataset events added to an alias are just simple dataset events, a downstream DAG depending on the actual dataset can read dataset events of it normally, without considering the associated aliases. A downstream can also depend on a dataset alias. The authoring syntax is referencing the ``DatasetAlias`` by name, and the associated dataset events are picked up for scheduling. Note that a DAG can be triggered by a task with ``outlets=DatasetAlias("xxx")`` if and only if the alias is resolved into ``Dataset("s3://bucket/my-task")``. The DAG runs whenever a task with outlet ``DatasetAlias("out")`` gets associated with at least one dataset at runtime, regardless of the dataset's identity. The downstream DAG is not triggered if no datasets are associated to the alias for a particular given task run. This also means we can do conditional dataset-triggering.
 
 The dataset alias is resolved to the datasets during DAG parsing. Thus, if the "min_file_process_interval" configuration is set to a high value, there is a possibility that the dataset alias may not be resolved. To resolve this issue, you can trigger DAG parsing.

From 9aac66c36f4f215b6661b384bdc74b486bb39586 Mon Sep 17 00:00:00 2001
From: TJaniF
Date: Fri, 9 Aug 2024 13:31:49 +0200
Subject: [PATCH 4/4] missing word

---
 docs/apache-airflow/authoring-and-scheduling/datasets.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/apache-airflow/authoring-and-scheduling/datasets.rst b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
index f6b194fa7b89a..a69c09bc13b0f 100644
--- a/docs/apache-airflow/authoring-and-scheduling/datasets.rst
+++ b/docs/apache-airflow/authoring-and-scheduling/datasets.rst
@@ -470,7 +470,7 @@ Only one dataset event is emitted for an added dataset, even if it is added to t
 
 Scheduling based on dataset aliases
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Since dataset events added to an alias are just simple dataset events, a downstream DAG depending on the actual dataset can read dataset events of it normally, without considering the associated aliases. A downstream can also depend on a dataset alias. The authoring syntax is referencing the ``DatasetAlias`` by name, and the associated dataset events are picked up for scheduling. Note that a DAG can be triggered by a task with ``outlets=DatasetAlias("xxx")`` if and only if the alias is resolved into ``Dataset("s3://bucket/my-task")``. The DAG runs whenever a task with outlet ``DatasetAlias("out")`` gets associated with at least one dataset at runtime, regardless of the dataset's identity. The downstream DAG is not triggered if no datasets are associated to the alias for a particular given task run. This also means we can do conditional dataset-triggering.
+Since dataset events added to an alias are just simple dataset events, a downstream DAG depending on the actual dataset can read dataset events of it normally, without considering the associated aliases. A downstream DAG can also depend on a dataset alias. The authoring syntax is referencing the ``DatasetAlias`` by name, and the associated dataset events are picked up for scheduling. Note that a DAG can be triggered by a task with ``outlets=DatasetAlias("xxx")`` if and only if the alias is resolved into ``Dataset("s3://bucket/my-task")``. The DAG runs whenever a task with outlet ``DatasetAlias("out")`` gets associated with at least one dataset at runtime, regardless of the dataset's identity. The downstream DAG is not triggered if no datasets are associated to the alias for a particular given task run. This also means we can do conditional dataset-triggering.
 
 The dataset alias is resolved to the datasets during DAG parsing. Thus, if the "min_file_process_interval" configuration is set to a high value, there is a possibility that the dataset alias may not be resolved. To resolve this issue, you can trigger DAG parsing.
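A note on the ``queuedEvent`` endpoints corrected in patch 1/4: the documented paths take a ``dag_id`` and a dataset ``uri`` as path segments. As a sanity-check sketch (not part of the patch), the snippet below just builds those paths; the base URL and the percent-encoding of the dataset URI are assumptions, and the ``dag_id``/``uri`` values are hypothetical placeholders.

```python
# Sketch: constructing the queuedEvent endpoint paths listed in patch 1/4.
# Assumptions: a local API base URL, and that a dataset URI used as a path
# segment must be percent-encoded (slashes included) to survive routing.
from urllib.parse import quote

BASE = "http://localhost:8080/api/v1"  # hypothetical webserver address


def queued_events_for_dag(dag_id: str) -> str:
    # "Get queued Dataset events for a DAG"
    return f"{BASE}/dags/{dag_id}/datasets/queuedEvent"


def queued_events_for_dataset(dag_id: str, uri: str) -> str:
    # "Get/Delete queued Dataset events for a Dataset" (DELETE uses the
    # same path with the DELETE HTTP method)
    return f"{BASE}/dags/{dag_id}/datasets/queuedEvent/{quote(uri, safe='')}"


print(queued_events_for_dag("waiting_for_dataset_1_and_2"))
print(queued_events_for_dataset("waiting_for_dataset_1_and_2", "s3://bucket/my-task"))
```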
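The text around the example fixed in patch 2/4 states that only one dataset event is emitted when the same dataset is added to an alias repeatedly, but distinct ``extra`` values yield distinct events. A minimal toy model of that dedup rule, in plain Python (this mimics the documented behavior only; it is not Airflow's actual implementation, and ``emitted_events`` is a hypothetical helper name):

```python
# Toy model of the dedup rule described in patch 2/4's surrounding text:
# repeated (dataset, extra) additions collapse to one event, while a
# different ``extra`` payload produces an additional event.
import json


def emitted_events(additions):
    """additions: iterable of (dataset_uri, extra_dict) pairs."""
    seen = set()
    events = []
    for uri, extra in additions:
        # JSON-serialize the extra dict to get a hashable identity
        key = (uri, json.dumps(extra, sort_keys=True))
        if key not in seen:
            seen.add(key)
            events.append((uri, extra))
    return events


adds = [
    ("s3://bucket/my-task", {"k": "v"}),
    ("s3://bucket/my-task", {"k": "v"}),    # duplicate: no new event
    ("s3://bucket/my-task", {"k2": "v2"}),  # different extra: second event
]
```

Under this model, the three additions above produce the two events the docs describe for the two-``extra`` example.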