
Conversation

@dstandish
Contributor

@dstandish dstandish commented Dec 14, 2023

All this does in effect is revert #34412. I created this PR simply to demonstrate that when we remove the code added in #34412, the tests added in that PR still pass, which means that either the tests are wrong or the code is unnecessary.

I am trying to clean up KPO logging logic a bit, and resolve some problems we've encountered with logs hanging. But I don't want to regress #34412. So I need either its tests to work properly, or to conclude that the logic doesn't do anything and we can remove it.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Dec 14, 2023
@troyharvey
Contributor

@dstandish Let me know if we can help test this. We have a long-running KubernetesPodOperator running dbt and we had to turn off get_logs because of timeout issues with the task disconnecting from the pod.

@dstandish
Contributor Author

Hi @troyharvey, my secret master plan is to do a larger refactor of KPO logging.

In short, here's the plan (rough sketch after the list):

  • run a log fetcher in a thread per container
  • each log fetcher passes the log lines to a queue
  • main thread reads from queue and logs the messages
  • separate thread checks pod status periodically to see if it's done. if done, stop everything
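
roughly this shape (just a sketch of the structure, not the actual implementation; read_some_lines and pod_is_done below are hypothetical placeholders for the k8s API calls):

# sketch only -- shape of the refactor, not real code
import logging
import queue
import threading
import time

log = logging.getLogger(__name__)


def fetch_container_logs(container: str, lines: queue.Queue, stop: threading.Event) -> None:
    # one of these runs per container; in the real thing this would stream from the k8s API
    for line in read_some_lines(container):  # hypothetical placeholder for the streaming log read
        if stop.is_set():
            return
        lines.put((container, line))


def watch_pod_phase(stop: threading.Event, interval: float = 5.0) -> None:
    # separate thread: poll pod status; when the pod is done, tell everything to stop
    while not stop.is_set():
        if pod_is_done():  # hypothetical placeholder for a pod status check
            stop.set()
            return
        time.sleep(interval)


def follow_pod(containers: list[str]) -> None:
    stop = threading.Event()
    lines: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=fetch_container_logs, args=(c, lines, stop)) for c in containers]
    threads.append(threading.Thread(target=watch_pod_phase, args=(stop,)))
    for t in threads:
        t.start()
    # main thread drains the queue and emits the log lines
    while not stop.is_set() or not lines.empty():
        try:
            container, line = lines.get(timeout=1)
        except queue.Empty:
            continue
        log.info("[%s] %s", container, line)
    for t in threads:
        t.join()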

it may sound complicated but i think it's ultimately simpler and more reliable than what we have now.

and it should resolve your problem.

however, in order to do this refactor, i want to try to not regress the various fixes people have contributed (such as #34412), so i need to understand what the heck it is doing or at least rely on the tests. but the tests pass even with the code removed. which is why i published this PR, so that the author of #34412 can help clarify the situation, and we can make this better.

@dstandish
Contributor Author

also @troyharvey interestingly we have received reports of this happening with DBT. it always seems to be DBT-related. we actually worked pretty hard to reproduce but could not. but that's part of the motivation for this effort. do you happen to have a reliable repro?

@dstandish
Contributor Author

hey @troyharvey i actually do have something for you to try....

there's actually a read timeout parameter we can use for k8s log reads. you can try it out with this patch:

diff --git a/airflow/providers/cncf/kubernetes/utils/pod_manager.py b/airflow/providers/cncf/kubernetes/utils/pod_manager.py
index 2f8f300096..aa2e8d5bce 100644
--- a/airflow/providers/cncf/kubernetes/utils/pod_manager.py
+++ b/airflow/providers/cncf/kubernetes/utils/pod_manager.py
@@ -627,6 +627,7 @@ class PodManager(LoggingMixin):
                 follow=follow,
                 timestamps=timestamps,
                 _preload_content=False,
+                _request_timeout=(60, 600),
                 **additional_kwargs,
             )
         except BaseHTTPError:

the tuple means (connection_timeout, read_timeout)

you can play with the numbers. seems to work though! let me know what you find
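
if you want to exercise the same read path outside of airflow first, something like this should do it (just a sketch; pod name and namespace are placeholders):

# standalone sketch -- pod name/namespace are placeholders
from kubernetes import client, config
from urllib3.exceptions import ReadTimeoutError

config.load_kube_config()
v1 = client.CoreV1Api()

try:
    resp = v1.read_namespaced_pod_log(
        name="my-dbt-pod",  # placeholder
        namespace="default",  # placeholder
        follow=True,
        timestamps=True,
        _preload_content=False,
        _request_timeout=(60, 600),  # (connection_timeout, read_timeout), in seconds
    )
    for chunk in resp.stream():
        print(chunk.decode(errors="replace"), end="")
except ReadTimeoutError:
    print("read timed out -- no data received within 600s")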

@hussein-awala
Member

Just to make sure it was intentional: this isn't an exact revert of all the changes; for example, you removed the statement that logs the error message:

self.log.error(
    "Error parsing timestamp (no timestamp in message %r). "
    "Will continue execution but won't update timestamp",
    line,
)

For the rest, the main goal is aggregating multiline log messages.

For the following log:

log = (
  "2020-10-08T14:16:17.793417674Z message1 line1\n"
  "message1 line2\n"
  "message1 line3\n"
  "2020-10-08T14:16:18.793417674Z message2 line1\n"
  "message2 line2\n"
  "2020-10-08T14:16:19.793417674Z message3 line1\n"
)

Your PR logs:

[2023-12-16T21:21:10.301+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message1 line2
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message1 line3
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message2 line1
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message2 line2
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message3 line1
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] 

While my code logs:

[2023-12-16T21:22:25.843+0100] {pod_manager.py:435} INFO - [<MagicMock id='4563553616'>] message1 line1
message1 line2
message1 line3
[2023-12-16T21:22:25.844+0100] {pod_manager.py:435} INFO - [<MagicMock id='4563553616'>] message2 line1
message2 line2
[2023-12-16T21:22:25.844+0100] {pod_manager.py:448} INFO - [<MagicMock id='4563553616'>] message3 line1
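
To make the intent concrete, here is a rough sketch of the aggregation rule (not the actual pod_manager code): a line that starts with a timestamp begins a new message, and any line without one is folded into the message before it.

import re

# matches an RFC 3339 timestamp prefix like "2020-10-08T14:16:17.793417674Z "
TIMESTAMP_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)\s")


def aggregate(raw_lines):
    """Yield (timestamp, message) pairs, folding lines without a leading
    timestamp into the previous timestamped message."""
    current_ts = None
    current_msg: list[str] = []
    for line in raw_lines:
        match = TIMESTAMP_PREFIX.match(line)
        if match:
            if current_msg:
                yield current_ts, "\n".join(current_msg)
            current_ts = match.group(1)
            current_msg = [line[match.end():]]
        else:
            current_msg.append(line)
    if current_msg:
        yield current_ts, "\n".join(current_msg)


# with the example log above (fed in via log.splitlines()), this yields three messages:
#   ("2020-10-08T14:16:17.793417674Z", "message1 line1\nmessage1 line2\nmessage1 line3")
#   ("2020-10-08T14:16:18.793417674Z", "message2 line1\nmessage2 line2")
#   ("2020-10-08T14:16:19.793417674Z", "message3 line1")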

I agree that the test doesn't cover this check, but IMHO we need to update the test, not the code, so that it still aggregates the log lines if they really are split for some reason.

@dstandish
Contributor Author

dstandish commented Dec 18, 2023

I agree that the test doesn't cover this check, but IMHO we need to update the test, not the code, so that it still aggregates the log lines if they really are split for some reason.

Like I tried to indicate, my goal was not necessarily to revert your change. I wanted to understand what your change was trying to do, and one way to do that is to revert it and use the tests to compare expected behavior vs actual.

I agree it would be best if you could update the tests with the right "expected" cases that really reveal what the code is supposed to do, both with respect to log.info and progress_callback. This is important because otherwise any future refactor could unknowingly undo your change.

Just to make sure it was intentional: this isn't an exact revert of all the changes; for example, you removed the statement that logs the error message:

It was nontrivial to apply the revert patch; maybe I missed that one. But if we are to keep the code and just update the tests, it's moot.

@github-actions

github-actions bot commented Feb 2, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 2, 2024
@github-actions github-actions bot closed this Feb 7, 2024