-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Fix access_control syncing; faster cli sync-perm #15293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This consolidates the cli sync-perm command and the syncing that happens by default during webserver startup. Prior to this change, `access_control` wasn't supported in the default sync and the cli sync-perm command was slow when you have many DAGs.
| .all() | ||
| ) | ||
| dagbag = DagBag(read_dags_from_db=True) | ||
| dagbag.collect_dags_from_db() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooff -- This will load all the Serialized DAGs though here and start again with an incremental DagBag in
airflow/airflow/www/extensions/init_dagbag.py
Lines 24 to 32 in 9dd14aa
| def init_dagbag(app): | |
| """ | |
| Create global DagBag for webserver and API. To access it use | |
| ``flask.current_app.dag_bag``. | |
| """ | |
| if os.environ.get('SKIP_DAGS_PARSING') == 'True': | |
| app.dag_bag = DagBag(os.devnull, include_examples=False) | |
| else: | |
| app.dag_bag = DagBag(DAGS_FOLDER, read_dags_from_db=True) |
The Serialized DAG will be loaded when required by:
airflow/airflow/models/dagbag.py
Lines 184 to 200 in 9dd14aa
| # If DAG is in the DagBag, check the following | |
| # 1. if time has come to check if DAG is updated (controlled by min_serialized_dag_fetch_secs) | |
| # 2. check the last_updated column in SerializedDag table to see if Serialized DAG is updated | |
| # 3. if (2) is yes, fetch the Serialized DAG. | |
| min_serialized_dag_fetch_secs = timedelta(seconds=settings.MIN_SERIALIZED_DAG_FETCH_INTERVAL) | |
| if ( | |
| dag_id in self.dags_last_fetched | |
| and timezone.utcnow() > self.dags_last_fetched[dag_id] + min_serialized_dag_fetch_secs | |
| ): | |
| sd_last_updated_datetime = SerializedDagModel.get_last_updated_datetime( | |
| dag_id=dag_id, | |
| session=session, | |
| ) | |
| if sd_last_updated_datetime and sd_last_updated_datetime > self.dags_last_fetched[dag_id]: | |
| self._add_dag_from_db(dag_id=dag_id, session=session) | |
| return self.dags.get(dag_id) |
Let's say if someone changes the DAG File with a change in access_control and the Parsing process writes the serialized_dag, it will hit the above code-block and if 10 seconds have passed (AIRFLOW__CORE__MIN_SERIALIZED_DAG_FETCH_INTERVAL -https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#min-serialized-dag-fetch-interval), it will re-fetch and update the DAG. I think we should refresh the permission for that DAG in that nested if statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add if dag.access_control: sync_dag_perm.... in:
airflow/airflow/models/dagbag.py
Lines 231 to 245 in 9dd14aa
| def _add_dag_from_db(self, dag_id: str, session: Session): | |
| """Add DAG to DagBag from DB""" | |
| from airflow.models.serialized_dag import SerializedDagModel | |
| row = SerializedDagModel.get(dag_id, session) | |
| if not row: | |
| raise SerializedDagNotFound(f"DAG '{dag_id}' not found in serialized_dag table") | |
| row.load_op_links = self.load_op_links | |
| dag = row.dag | |
| for subdag in dag.subdags: | |
| self.dags[subdag.dag_id] = subdag | |
| self.dags[dag.dag_id] = dag | |
| self.dags_last_fetched[dag.dag_id] = timezone.utcnow() | |
| self.dags_hash[dag.dag_id] = row.dag_hash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, we could just remove syncing DAG level permissions from sync-perm command
|
I'm going to merge this with pr #15311. |
This consolidates the cli sync-perm command and the syncing that happens by default during webserver startup. Prior to this change,
access_controlwasn't supported in the default sync and the cli sync-perm command was slow when you have many DAGs.With ~5k simple DAGs, this makes sync-perms faster (~24s -> ~10s). It does make the webserver startup slower (due to loading DagBag from the db vs just querying
DagModel, but it also syncsaccess_controlwhere it wasn't before.