
Conversation

@dstandish
Contributor

@dstandish dstandish commented Dec 14, 2023

All this does in effect is revert #34412. I created this PR simply to demonstrate that when we remove the code added in #34412, the tests added in that PR still pass, which means that either the tests are wrong or the code is unnecessary.

I am trying to clean up KPO logging logic a bit, and resolve some problems we've encountered with logs hanging. But I don't want to regress #34412. So I need either its tests to work properly, or to conclude that the logic doesn't do anything and we can remove it.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Dec 14, 2023
@troyharvey
Contributor

@dstandish Let me know if we can help test this. We have a long-running KubernetesPodOperator running dbt and we had to turn off get_logs because of timeout issues with the task disconnecting from the pod.

@dstandish
Contributor Author

Hi @troyharvey, my secret master plan is to do a larger refactor of KPO logging.

In short, here's the plan (rough sketch after the list):

  • run a log fetcher in a thread per container
  • each log fetcher passes the log lines to a queue
  • main thread reads from queue and logs the messages
  • separate thread checks pod status periodically to see if it's done. if done, stop everything
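
roughly this shape (just a sketch of the structure, not the actual implementation; read_some_lines and pod_is_done below are hypothetical placeholders for the k8s API calls):

# sketch only -- shape of the refactor, not real code
import logging
import queue
import threading
import time

log = logging.getLogger(__name__)


def fetch_container_logs(container: str, lines: queue.Queue, stop: threading.Event) -> None:
    # one of these runs per container; in the real thing this would stream from the k8s API
    for line in read_some_lines(container):  # hypothetical placeholder for the streaming log read
        if stop.is_set():
            return
        lines.put((container, line))


def watch_pod_phase(stop: threading.Event, interval: float = 5.0) -> None:
    # separate thread: poll pod status; when the pod is done, tell everything to stop
    while not stop.is_set():
        if pod_is_done():  # hypothetical placeholder for a pod status check
            stop.set()
            return
        time.sleep(interval)


def follow_pod(containers: list[str]) -> None:
    stop = threading.Event()
    lines: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=fetch_container_logs, args=(c, lines, stop)) for c in containers]
    threads.append(threading.Thread(target=watch_pod_phase, args=(stop,)))
    for t in threads:
        t.start()
    # main thread drains the queue and emits the log lines
    while not stop.is_set() or not lines.empty():
        try:
            container, line = lines.get(timeout=1)
        except queue.Empty:
            continue
        log.info("[%s] %s", container, line)
    for t in threads:
        t.join()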

it may sound complicated but i think it's ultimately simpler and more reliable than what we have now.

and it should resolve your problem.

however, in order to do this refactor, i want to try to not regress the various fixes people have contributed (such as #34412), so i need to understand what the heck it is doing or at least rely on the tests. but the tests pass even with the code removed. which is why i published this PR, so that the author of #34412 can help clarify the situation, and we can make this better.

@dstandish
Contributor Author

also @troyharvey interestingly we have received reports of this happening with DBT. it always seems to be DBT-related. we actually worked pretty hard to reproduce but could not. but that's part of the motivation for this effort. do you happen to have a reliable repro?

@dstandish
Contributor Author

hey @troyharvey i actually do have something for you to try....

there's actually a read timeout parameter we can use for k8s log reads. you can try it out with this patch:

diff --git a/airflow/providers/cncf/kubernetes/utils/pod_manager.py b/airflow/providers/cncf/kubernetes/utils/pod_manager.py
index 2f8f300096..aa2e8d5bce 100644
--- a/airflow/providers/cncf/kubernetes/utils/pod_manager.py
+++ b/airflow/providers/cncf/kubernetes/utils/pod_manager.py
@@ -627,6 +627,7 @@ class PodManager(LoggingMixin):
                 follow=follow,
                 timestamps=timestamps,
                 _preload_content=False,
+                _request_timeout=(60, 600),
                 **additional_kwargs,
             )
         except BaseHTTPError:

the tuple means (connection_timeout, read_timeout)

you can play with the numbers. seems to work though! let me know what you find
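
if you want to exercise the same read path outside of airflow first, something like this should do it (just a sketch; pod name and namespace are placeholders):

# standalone sketch -- pod name/namespace are placeholders
from kubernetes import client, config
from urllib3.exceptions import ReadTimeoutError

config.load_kube_config()
v1 = client.CoreV1Api()

try:
    resp = v1.read_namespaced_pod_log(
        name="my-dbt-pod",  # placeholder
        namespace="default",  # placeholder
        follow=True,
        timestamps=True,
        _preload_content=False,
        _request_timeout=(60, 600),  # (connection_timeout, read_timeout), in seconds
    )
    for chunk in resp.stream():
        print(chunk.decode(errors="replace"), end="")
except ReadTimeoutError:
    print("read timed out -- no data received within 600s")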

@hussein-awala
Member

Just to make sure it was intentional: this isn't an exact revert of all the changes; for example, you removed the statement that logs the error message:

self.log.error(
    "Error parsing timestamp (no timestamp in message %r). "
    "Will continue execution but won't update timestamp",
    line,
)

For the rest, the main goal is aggregating multiline log messages.

For the following log:

log = (
  "2020-10-08T14:16:17.793417674Z message1 line1\n"
  "message1 line2\n"
  "message1 line3\n"
  "2020-10-08T14:16:18.793417674Z message2 line1\n"
  "message2 line2\n"
  "2020-10-08T14:16:19.793417674Z message3 line1\n"
)

Your PR logs:

[2023-12-16T21:21:10.301+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message1 line2
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message1 line3
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message2 line1
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message2 line2
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] message3 line1
[2023-12-16T21:21:10.302+0100] {pod_manager.py:427} INFO - [<MagicMock id='4480886608'>] 

While my code logs:

[2023-12-16T21:22:25.843+0100] {pod_manager.py:435} INFO - [<MagicMock id='4563553616'>] message1 line1
message1 line2
message1 line3
[2023-12-16T21:22:25.844+0100] {pod_manager.py:435} INFO - [<MagicMock id='4563553616'>] message2 line1
message2 line2
[2023-12-16T21:22:25.844+0100] {pod_manager.py:448} INFO - [<MagicMock id='4563553616'>] message3 line1
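
To make the intent concrete, here is a rough sketch of the aggregation rule (not the actual pod_manager code): a line that starts with a timestamp begins a new message, and any line without one is folded into the message before it.

import re

# matches an RFC 3339 timestamp prefix like "2020-10-08T14:16:17.793417674Z "
TIMESTAMP_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?)\s")


def aggregate(raw_lines):
    """Yield (timestamp, message) pairs, folding lines without a leading
    timestamp into the previous timestamped message."""
    current_ts = None
    current_msg: list[str] = []
    for line in raw_lines:
        match = TIMESTAMP_PREFIX.match(line)
        if match:
            if current_msg:
                yield current_ts, "\n".join(current_msg)
            current_ts = match.group(1)
            current_msg = [line[match.end():]]
        else:
            current_msg.append(line)
    if current_msg:
        yield current_ts, "\n".join(current_msg)


# with the example log above (fed in via log.splitlines()), this yields three messages:
#   ("2020-10-08T14:16:17.793417674Z", "message1 line1\nmessage1 line2\nmessage1 line3")
#   ("2020-10-08T14:16:18.793417674Z", "message2 line1\nmessage2 line2")
#   ("2020-10-08T14:16:19.793417674Z", "message3 line1")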

I agree that the test doesn't cover this check, but IMHO we need to update the test, not the code, so that it still aggregates the log lines if they really are split for some reason.

@dstandish
Contributor Author

dstandish commented Dec 18, 2023

I agree that the test doesn't cover this check, but IMHO we need to update the test, not the code, so that it still aggregates the log lines if they really are split for some reason.

Like I tried to indicate, my goal was not necessarily to revert your change. I wanted to understand what your change was trying to do, and one way to do that is to revert it and use the tests to compare expected behavior vs actual.

I agree it would be best if you could update the tests with the right "expected" cases that really reveal what the code is supposed to do, both with respect to log.info and progress_callback. This is important because otherwise any future refactor could unknowingly undo your change.

Just to make sure it was intentional: this isn't an exact revert of all the changes; for example, you removed the statement that logs the error message:

It was nontrivial to apply the revert patch; maybe I missed that one. But if we are to keep the code and just update the tests, it's moot.

@github-actions

github-actions bot commented Feb 2, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 2, 2024
@github-actions github-actions bot closed this Feb 7, 2024