Skip to content

Conversation

@blinkseb
Copy link
Contributor

Hello,

I've found an issue with the handling of the DAG file last modified time which can trigger an infinite parsing loop, when file_parsing_sort_mode is set to modified_time (which seems to be the default value).

os.path.getmtime() returns a timestamp, which is converted to a datetime using datetime.fromtimestamp. According to the documentation:

If optional argument tz is None or not specified, the timestamp is converted to the platform’s local date and time, and the returned datetime object is naive.

The resulting datetime is then made aware using timezone.make_aware using the user configured timezone, which is not guarantee to be the same as the platform's timezone. If that's not the case, we can end up with modified time in the future and trigger an infinite parsing loop.

The fix is to use datetime.utcfromtimestamp to convert the timestamp, to force the resulting datetime in UTC, and explicitly specify the UTC timezone to make_aware.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Mar 22, 2023
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. looks good. The mtime should be in UTC indeed (Except som Windows systems which we do not support anyway it seems) so that loooks good.
@ashb @uranusjr ?

@uranusjr
Copy link
Member

uranusjr commented Mar 22, 2023

Makes sense, except calling make_aware is cumbersome since we control the input. It is better to do datetime.fromtimestamp(ts, tz=timezone.utc) instead. This should be preferred over utcfromtimestamp, see Python documentation.

@blinkseb blinkseb force-pushed the fix-dag-mtime-mode branch from 5cae2f0 to 91d4142 Compare March 22, 2023 19:41
@blinkseb
Copy link
Contributor Author

thanks for the feedback! I've updated the PR to use fromtimestamp with the tz parameter

@uranusjr
Copy link
Member

Is it possible to have a test for this?

@eladkal eladkal added this to the Airflow 2.5.3 milestone Mar 23, 2023
@blinkseb blinkseb force-pushed the fix-dag-mtime-mode branch from 91d4142 to b06e7a5 Compare March 23, 2023 10:58
@blinkseb
Copy link
Contributor Author

I've updated the existing test to trigger the issue by forcing the platform timezone to something different than UTC. Running the test reverting the fix fails (as expected) with this error:

>           assert manager._file_path_queue == collections.deque()
E           AssertionError: assert deque(['file_1.py']) == deque([])
E             Left contains one more item: 'file_1.py'
E             Full diff:
E             - deque([])
E             + deque(['file_1.py'])

tests/dag_processing/test_manager.py:485: AssertionError

Let me know if there's a better way to test that.

@blinkseb blinkseb force-pushed the fix-dag-mtime-mode branch from b06e7a5 to 9f303c9 Compare March 23, 2023 11:01
@blinkseb blinkseb force-pushed the fix-dag-mtime-mode branch from 9f303c9 to 4c2f5a5 Compare March 23, 2023 11:21
@potiuk
Copy link
Member

potiuk commented Mar 27, 2023

@uranusjr ?

@uranusjr uranusjr merged commit 7a8b3c4 into apache:main Mar 28, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented Mar 28, 2023

Awesome work, congrats on your first merged pull request!

@ephraimbuddy ephraimbuddy added the type:bug-fix Changelog: Bug Fixes label Apr 11, 2023
shinny-taojiachun added a commit to shinnytech/airflow that referenced this pull request Jul 19, 2023
Co-authored-by: Sébastien Brochet <sebastien.brochet@nielsen.com>

(cherry picked from commit 7a8b3c4)
jedcunningham added a commit to astronomer/airflow that referenced this pull request Jan 2, 2025
This fixture was, in theory, able to reproduce the bug fixed in apache#30243,
but I was unable to now. So, let's toss it.
jedcunningham added a commit to astronomer/airflow that referenced this pull request Jan 2, 2025
This fixture was, in theory, able to reproduce the bug fixed in apache#30243,
but I was unable to now. So, let's toss it.
jedcunningham added a commit that referenced this pull request Jan 7, 2025
This fixture was, in theory, able to reproduce the bug fixed in #30243,
but I was unable to now. So, let's toss it.
HariGS-DB pushed a commit to HariGS-DB/airflow that referenced this pull request Jan 16, 2025
This fixture was, in theory, able to reproduce the bug fixed in apache#30243,
but I was unable to now. So, let's toss it.
got686-yandex pushed a commit to got686-yandex/airflow that referenced this pull request Jan 30, 2025
This fixture was, in theory, able to reproduce the bug fixed in apache#30243,
but I was unable to now. So, let's toss it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants