Skip to content

Conversation

@mathbou
Copy link
Contributor

@mathbou mathbou commented Mar 30, 2023

Context:

  • An Airflow instance (2.5.2) with debug logging level
  • A Dag with an ExternalPythonOperator or PythonVirtualenvOperator task

In the upper context, the task fails to retrieve the Airflow context, and this line appears in the task log:

[2023-03-30, 10:15:17 CEST] {python.py:711} WARNING - When checking for Airflow installed in venv got The version of Airflow installed for the /venvs/shotgun/bin/python([2023-03-30T08:15:17.222+0000] {settings.py:267} DEBUG - Setting up DB connection pool (PID 241)

It appears that the result of the following call contains log outputs from the worker:

result = subprocess.check_output(
[self.python, "-c", "from airflow import __version__; print(__version__)"], text=True
)

@boring-cyborg
Copy link

boring-cyborg bot commented Mar 30, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented Mar 30, 2023

Are you sure there are no side effects? Can you take a look at initialize and possibly add a test or two showing that things are working as expected.

There is a VERY strong comment in the code:

# Perform side-effects unless someone has explicitly opted out before import
# WARNING: DO NOT USE THIS UNLESS YOU REALLY KNOW WHAT YOU'RE DOING.
if not os.environ.get("_AIRFLOW__AS_LIBRARY", None):
     settings.initialize()

And I think only looking at what settings.initialize does and adding some related tests can prove that you actually REALLY KNOW WHAT YOU'RE DOING . Changing production behaviour in order to make "DEBUG" settings behave wrongly is generally a recipe for disaster unless you REALLY know what you are doing.

@mathbou
Copy link
Contributor Author

mathbou commented Mar 30, 2023

To give more context, with the debug level logging, here is the output of

result = subprocess.check_output(
[self.python, "-c", "from airflow import __version__; print(__version__)"], text=True
)

[2023-03-30T12:40:46.882+0000] {settings.py:267} DEBUG - Setting up DB connection pool (PID 170)
[2023-03-30T12:40:46.882+0000] {settings.py:372} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=170
2.5.2
[2023-03-30T12:40:46.908+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 170)

The same call with a >=info logging level only returns the version number 2.5.2.


About side effects, from what I test and read in the source code, there shouldn't be any. I saw the warning when digging to understand what Airflow was doing on import, especially to understand why it was initializing itself on its own. From my understanding, there's no point, for a version query and specifically this call, that Airflow initialize itself, this command should only get the value from the module, no more, no less.


I'm not sure to fully understand your point here:

Changing production behaviour in order to make "DEBUG" settings behave wrongly

@potiuk
Copy link
Member

potiuk commented Mar 30, 2023

About side effects, from what I test and read in the source code, there shouldn't be any.

Just to be clear, I also think there might be no side effects. But I just read the code (which I did not write BTW.) and someone who added this feature made it absolutely clear that you have to REALLY know what you are doing. I am just asking are you REALLY sure you know it and are you aware of all the side effects. An example of that is that maybe (just maybe) in case of Python venv operator. in some cases whatever "initialize_settings" (which the variable disable) does, is needed in the next step. From what I know initializing settings for example makes sure that if you have no config file it will generate one - including default values, by disabling it via the variable, this side-efect might be gone (and if something else relies on it, it might be broken).

I'm not sure to fully understand your point here:

Changing production behaviour in order to make "DEBUG" settings behave wrongly

I mean that in order to get "debug" setting works we are changing the behaviour of code that is going to be executed even in "production" mode (i.e. with INFO level). Adding _AIRFLOW__AS_LIBRARY definitely changes the behaviour. No matter how much we think it is risky, it's generally a bad idea.

I think there are better ways of solving it:

  • find or add a way to disable those debug logs (and only them, nothing else) rather than using _AIRFLOW__AS_LIBRARY
  • only add _AIRFLOW__AS_LIBRARY in case DEBUG level is set.

Both are safer and solve the problem

@mathbou
Copy link
Contributor Author

mathbou commented Mar 30, 2023

in some cases whatever "initialize_settings" (which the variable disable) does, is needed in the next step. From what I know initializing settings for example makes sure that if you have no config file it will generate one - including default values, by disabling it via the variable, this side-efect might be gone (and if something else relies on it, it might be broken).

About the initialize side effect you're describing:

  • first off, it would be really bad to depends on the _get_airflow_version_from_target_env call to get the initialization side effect, it's closer to a misuse than anything else.
  • second, by digging deeper I finally found where is the proper initialization call for Airflow
    try:
    from airflow.plugins_manager import integrate_macros_plugins
    integrate_macros_plugins()
    It's at the very top of the script that the External and Virtualenv python operators will run to execute tasks, so the settings.initialize() will happen.

Only with this second point, I would say it's safe to assume that production and debug will behave the same. I already ran some Dags with this fix in both configurations without problems.


find or add a way to disable those debug logs (and only them, nothing else) rather than using _AIRFLOW__AS_LIBRARY

Disabling debug logs while in debug mode would be quite ironic, and as we need them for debug purposes, we would need to disable them only for this very specific case. I don't think this case is special enough to add that much complexity to three methods of the settings.py module.

only add _AIRFLOW__AS_LIBRARY in case DEBUG level is set.

This solution is a nonsense to me, it would add a risk of different behavior between production and debug, and that exactly what we are trying to avoid here.

@potiuk
Copy link
Member

potiuk commented Mar 30, 2023

Disabling debug logs while in debug mode would be quite ironic, and as we need them for debug purposes, we would need to disable them only for this very specific case. I don't think this case is special enough to add that much complexity to three methods of the settings.py module.

Correct me if I am wrong, but you seem to attempt to do exactly this (disable debug logging while in debug mode) but you are trying to use completely unrelated, badly documented and warned not to use_AIRFLOW__AS_LIBRARY variable? Or am I misinterpreting your intentions? Is there other reason why you want to use _AIRFLOW__AS_LIBRARY ? Could you please explain why lead you to choose that variable setting, I am curious ?

@uranusjr
Copy link
Member

This change makes sense to me regardless sine the usage is arguably using Airflow as a library.

@potiuk
Copy link
Member

potiuk commented Mar 30, 2023

This change makes sense to me regardless sine the usage is arguably using Airflow as a library.

If you can confirm that we know the side effects and that the comment there is fine, I can trust your judgment feel free to approve it.

@uranusjr
Copy link
Member

uranusjr commented Mar 31, 2023

The environment variable was introduced in #25832 to avoid import time side effects. Since reading the __version__ variable should not be affected by side effects (we are in big trouble if it does), using this environ should be safe and future-proof.

@potiuk
Copy link
Member

potiuk commented Mar 31, 2023

The environment variable was introduced in #25832 to avoid import time side effects. Since reading the __version__ variable should not be affected by side effects (we are in big trouble if it does), using this environ should be safe and future-proof.

Sure. No problem with that. As soon as we add a bit more explanation in the comment about the consequences of that and why we are using it (also possibly add comment in the orignal place where it was added, I am good.

@uranusjr
Copy link
Member

the orignal place where it was added

Do you mean the place where __version__ is defined, or where it is accessed via subprocess?

@potiuk
Copy link
Member

potiuk commented Apr 11, 2023

the orignal place where it was added

Do you mean the place where __version__ is defined, or where it is accessed via subprocess?

I'd opt for both. Just to improve the reeadability - why the original one was added (and remove the scary "don't use it if you do not know what you are doing" along the way) and why we think it's good to add _IS_LIBRARY. That's about it. Just to provide the context so that next person looking it will known without having to trace it back to the original commit and guessing the reasons.

I am just thinking of the future self and future "other" maintainer or user who would stumble upon this one.

@uranusjr
Copy link
Member

@mathbou Could you make the changes discussed above?

@mathbou
Copy link
Contributor Author

mathbou commented Apr 13, 2023

@uranusjr I added some comments at both places, are they clear enough ?

@potiuk
Copy link
Member

potiuk commented Apr 13, 2023

@uranusjr I added some comments at both places, are they clear enough ?

Good enough for me.

@github-actions
Copy link

github-actions bot commented Jun 2, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added stale Stale PRs per the .github/workflows/stale.yml policy file and removed stale Stale PRs per the .github/workflows/stale.yml policy file labels Jun 2, 2023
@mathbou mathbou requested review from potiuk and uranusjr June 4, 2023 14:49
@potiuk
Copy link
Member

potiuk commented Jun 4, 2023

@uranusjr ?

@potiuk potiuk force-pushed the fix-log-output-in-airflow-version-check branch 2 times, most recently from d968445 to c6f8ccf Compare June 8, 2023 13:23
@potiuk
Copy link
Member

potiuk commented Jun 8, 2023

The failures here are likely caused by recent change in determining "runs-on" - this one should fix it #31792 , so once we merge it, further rebasing should work

mathbou and others added 6 commits June 8, 2023 19:11
---

When using `ExternalPythonOperator` in an Airflow instance with debug logging level, the airflow version check fails due to the presence of log output in the subprocess result
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@potiuk potiuk force-pushed the fix-log-output-in-airflow-version-check branch from c6f8ccf to 0e0bb6d Compare June 8, 2023 17:11
@potiuk
Copy link
Member

potiuk commented Jun 8, 2023

REbased it after #31792 - 🤞

@potiuk potiuk merged commit 406b486 into apache:main Jun 8, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Jun 8, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@mathbou mathbou deleted the fix-log-output-in-airflow-version-check branch June 8, 2023 22:09
@ephraimbuddy ephraimbuddy added this to the Airlfow 2.6.3 milestone Jul 6, 2023
@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Jul 6, 2023
ephraimbuddy pushed a commit that referenced this pull request Jul 6, 2023
…d debug logging level (#30367)

---------

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
(cherry picked from commit 406b486)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants