Skip to content

[Bug] Strict validation in Dataset URI in Airflow 2.9 breaks some DAGs #39486

@tatiana

Description

@tatiana

Apache Airflow version

2.9.0 & 2.9.1

What happened?

Valid DAGs that worked in Airflow 2.8.x and had tasks with outlets with specific URIs, such as Dataset("postgres://postgres:5432/postgres.dbt.stg_customers"), stopped working in Airflow 2.9.0, after #37005 was merged.

What you think should happen instead?

This is a breaking change in an Airflow minor version. We should avoid this.

Airflow < 3.0 should raise a warning, and from Airflow 3.0, we can make errors by default. We can have a feature flag to allow users who want to see this in advance to enable errors in Airflow 2. x, but this should not be the default behaviour.

The DAGs should continue working on Airflow 2.x minor/micro releases without errors (unless the user opts in via configuration).

How to reproduce

By running the following DAG, as an example:

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator



with DAG(dag_id='empty_operator_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:

    task1 = EmptyOperator(
        task_id='empty_task1',
        dag=dag,
        outlets=[Dataset("postgres://postgres:5432/postgres.dbt.stg_customers")]
    )

    task2 = EmptyOperator(
        task_id='empty_task2',
        dag=dag
    )

    task1 >> task2

Causes to the exception:

Broken DAG: [/usr/local/airflow/dags/example_issue.py]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/datasets/__init__.py", line 81, in _sanitize_uri
    parsed = normalizer(parsed)
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/postgres/datasets/postgres.py", line 34, in sanitize_uri
    raise ValueError("URI format postgres:// must contain database, schema, and table names")
ValueError: URI format postgres:// must contain database, schema, and table names

Operating System

All

Versions of Apache Airflow Providers

apache-airflow-providers-postgres==5.11.0

Deployment

Astronomer

Deployment details

This issue can be seen in any Airflow 2.9.0 or 2.9.1 deployment.

Anything else?

An apparent "quick-fix" could be to replace "." by "/" in the Dataset URI of the DAG. However, this would break any DAGs that use data-aware schedules and consume from this Dataset - in a silent way. For this reason, let's fix this in Airflow 2.9 and prepare users for a breaking change in Airflow 3.0.

Suggested approach to fix the problem:

  1. Introduce a boolean configuration within [core], named strict_dataset_uri_validation, which should be False by default.

  2. When this configuration is False, Airflow should raise a warning saying:

From Airflow 3, Airflow will be more strict with Dataset URIs, and the URI xx will no longer be valid. Please, follow the expected standard as documented in XX.
  1. If this configuration is True, Airflow should raise the exception, as it does now in Airflow 2.9.0 and 2.9.1

  2. From Airflow 3.0, we change this configuration to be True by default.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions