Skip to content

Conversation

@ashb
Copy link
Member

@ashb ashb commented Jun 15, 2022

Goal here is to make the healthchecks for triggerer/scheduler quicker by having to do less

Roughly:

  • Don't import jsonschema (module or schemas) until we need it - it was surprisingly slow.
  • Dont import all models when importing any one -- the only times we need that are when auto-generating migrations, or when checking DB against schema. (That said, accesing any class will cause the SQLA event to import all models, otherwise we often end up with issues due to the relationships on classes)
  • Move a few more things to be lazy/delayed import too.

@boring-cyborg boring-cyborg bot added area:Scheduler including HA (high availability) scheduler area:serialization labels Jun 15, 2022
@ashb
Copy link
Member Author

ashb commented Jun 15, 2022

/cc @ianbuss

@potiuk
Copy link
Member

potiuk commented Jun 16, 2022

I really hope one day https://peps.python.org/pep-0690/ will be in place so that we don't have to do all those any more.

@ashb
Copy link
Member Author

ashb commented Jun 17, 2022

I really hope one day https://peps.python.org/pep-0690/ will be in place so that we don't have to do all those any more.

Yeah, that would be nice.

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 2, 2022
@uranusjr
Copy link
Member

uranusjr commented Aug 2, 2022

I recently listened to some of the folks behind PEP 690 discuss lazy imports and came away with the impression that even after it lands Airflow is highly likely not be able to use it. Lazy imports has a lot of implications and PEP 690 quite heavy-handedly make lazy imports the default, and Airflow has so many dependencies I can easily it might never be able to be fully compatible.

@potiuk
Copy link
Member

potiuk commented Aug 2, 2022

I recently listened to some of the folks behind PEP 690 discuss lazy imports and came away with the impression that even after it lands Airflow is highly likely not be able to use it. Lazy imports has a lot of implications and PEP 690 quite heavy-handedly make lazy imports the default, and Airflow has so many dependencies I can easily it might never be able to be fully compatible.

Yeah. We've probably listened to the same Talk Python episode :). I think it's not even likely to have PEP 690 around in 3.11 or maybe even 3.12. Unless we roll our own sleeves up and help. I think it would be great to actually HELP in PEP 690 by making Airlfow works with lazy imports (maybe by improving PEP 690 implementation as well).

@ashb
Copy link
Member Author

ashb commented Aug 2, 2022

Shall I resume this and get the SQLA model loading working do you think?

@potiuk
Copy link
Member

potiuk commented Aug 2, 2022

I think it might be worth it. Would be nice to see how much we gain though in those casese

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Aug 3, 2022
@kaxil
Copy link
Member

kaxil commented Aug 12, 2022

Shall I resume this and get the SQLA model loading working do you think?

Yup, lets try it out

@ashb ashb added the pinned Protect from Stalebot auto closing label Aug 17, 2022
@ashb ashb force-pushed the more-lazy-loading branch 5 times, most recently from ec86106 to 711ecf1 Compare September 1, 2022 14:57
@ashb ashb force-pushed the more-lazy-loading branch from 711ecf1 to 68fab20 Compare September 5, 2022 10:59
@ashb
Copy link
Member Author

ashb commented Sep 5, 2022

Wow mypy is confusing/wrong sometimes:


airflow/example_dags/example_latest_only_with_trigger.py:39: error: Unexpected
keyword argument "task_id" for "LatestOnlyOperator"  [call-arg]
        latest_only = LatestOnlyOperator(task_id='latest_only')
                      ^
airflow/utils/log/logging_mixin.py: note: "LatestOnlyOperator" defined here
airflow/example_dags/example_latest_only.py:34: error: Unexpected keyword
argument "task_id" for "LatestOnlyOperator"  [call-arg]
        latest_only = LatestOnlyOperator(task_id='latest_only')
                      ^
airflow/utils/log/logging_mixin.py: note: "LatestOnlyOperator" defined here

Edit: turns out Mypy gets confused by Pep 562 lazy imports (not a great surprise really) -- something was importing BaseOperator from airflow.models, not airflow.models.baseoperator.

@ashb ashb force-pushed the more-lazy-loading branch from a596456 to bb9fa8f Compare September 5, 2022 15:59
Comment on lines +28 to +29
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change much?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was high up in the list of slow imports!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A schema validation library being slow… I guess that’s how third-party libraries work.

@ashb
Copy link
Member Author

ashb commented Sep 6, 2022

I've found an easy way to make sure all models are loaded when needed:

    event.listen(sqlalchemy.orm.mapper, "before_configured", import_all_models, once=True)

This will ensure that on the first time any model class is queried or instantiated all other models will be loaded, which ensures we don't get the "class not found in registry" error message, but still means imports are "quick".

@ashb ashb force-pushed the more-lazy-loading branch from 43cec31 to 685498f Compare September 6, 2022 10:58
@ashb
Copy link
Member Author

ashb commented Sep 6, 2022

I wonder why this change causes these mypy errors to appear:

tests/system/providers/dbt/cloud/example_dbt_cloud.py:80: error: Unsupported
operand types for >> ("List[object]" and "EmptyOperator")  [operator]
        [get_run_results_artifact, job_run_sensor] >> end
        ^
tests/system/providers/microsoft/azure/example_adf_run_pipeline.py:76: error:
Unsupported operand types for >> ("List[object]" and "EmptyOperator") 
[operator]
        [run_pipeline1, pipeline_run_sensor] >> end
        ^
Found 2 errors in 2 files (checked 2258 source files)

Run mypy for providers.................................................................Failed
- hook id: run-mypy
- exit code: 1

airflow/providers/atlassian/jira/operators/jira.py:70: error: Extra argument
"conf" from **args for "execute" of "JiraOperator"  [misc]
                        resource = self.get_jira_resource_method.execute(*...
                                   ^
airflow/providers/qubole/operators/qubole.py:244: error: Cannot determine type
of "on_failure_callback"  [has-type]
            if self.on_failure_callback is None:
               ^
airflow/providers/qubole/operators/qubole.py:247: error: Cannot determine type
of "on_retry_callback"  [has-type]
            if self.on_retry_callback is None:

@ashb
Copy link
Member Author

ashb commented Sep 6, 2022

I was testing this with python -X importtime -c 'import airflow.jobs.scheduler_job' 2>import-before-schedjob.log and tuna for display:

Before:

image

After:
image

@ashb
Copy link
Member Author

ashb commented Sep 6, 2022

I tried adding a airflow/models/__init__.pyi containing this:

# MyPy doesn't/can't recoginze the PEP 562 style lazy imports, so we have to
# tell it about the imports

from .dag import DAG
from .base import ID_LEN
from .xcom import XCOM_RETURN_KEY
from .base import Base
from .baseoperator import BaseOperator
from .baseoperator import BaseOperatorLink
from .connection import Connection
from .dagbag import DagBag
from .dag import DagModel
from .dagpickle import DagPickle
from .dagrun import DagRun
from .dag import DagTag
from .db_callback_request import DbCallbackRequest
from .errors import ImportError
from .log import Log
from .mappedoperator import MappedOperator
from .operator import Operator
from .param import Param
from .pool import Pool
from .renderedtifields import RenderedTaskInstanceFields
from .skipmixin import SkipMixin
from .slamiss import SlaMiss
from .taskfail import TaskFail
from .taskinstance import TaskInstance
from .taskreschedule import TaskReschedule
from .trigger import Trigger
from .variable import Variable
from .xcom import XCom
from .taskinstance import clear_task_instances

def import_all_models(): ...

But I ended up with loads of errors along the lines of:

airflow/www/views.py:109: error: Module "airflow.models" has no attribute "Connection"  [attr-defined]

I wonder what is going on ... Any ideas?

@ashb ashb requested a review from XD-DENG as a code owner September 6, 2022 13:56
@ashb ashb force-pushed the more-lazy-loading branch from 6955b6c to 5dc14ce Compare September 6, 2022 20:00
@ashb ashb force-pushed the more-lazy-loading branch 3 times, most recently from bd213c4 to 4817ccf Compare September 7, 2022 16:04
@ashb ashb force-pushed the more-lazy-loading branch from 4817ccf to c54280a Compare September 7, 2022 17:59
@ashb
Copy link
Member Author

ashb commented Sep 7, 2022

Rebasing to check still good.

@ashb ashb merged commit 2f7007c into apache:main Sep 7, 2022
@ashb ashb deleted the more-lazy-loading branch September 7, 2022 19:51
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler area:serialization pinned Protect from Stalebot auto closing type:misc/internal Changelog: Misc changes that should appear in change log

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants