Skip to content

Conversation

@vincbeck
Copy link
Contributor

@vincbeck vincbeck commented May 30, 2023

Both system tests example_eks_with_fargate_in_one_step and example_eks_with_fargate_profile are failing for the same reason. When the operator EksPodOperator is used with get_logs=True, the operator tries to get log once the POD started. When doing so, 90% of the time it fails because of:

HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get \\"https://10.0.5.84:10250/containerLogs/default/run-pod-6rjxgqrs/base?follow=true\\u0026timestamps=true\\": remote error: tls: internal error","code":500}\n'

After investigation, it turns out there is delay between when the pod starts and when the CSR is available, signed and approved. If you try to get logs with the command kubectl logs <pod-name> -n <namespace> when the CSR is not available, signed and approved, you'll get the exact same error. If you wait until the CSR is there, you'll get the logs.

Therefore, in order to fix it, I decided to just retry on ApiException which the is the exception we get in such scenario.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues area:providers area:system-tests provider:amazon AWS/Amazon - related issues labels May 30, 2023
@vincbeck
Copy link
Contributor Author

@ferruzzi who made most of the work here

@vincbeck vincbeck changed the title Ferruzzi/system tests/fargate logging Enable Fargate logging for AWS system tests May 30, 2023
@vincbeck vincbeck changed the title Enable Fargate logging for AWS system tests Fix Fargate logging for AWS system tests May 31, 2023
@ferruzzi
Copy link
Contributor

You've reverted everything except adding the exception. Was that intentional?

@vincbeck vincbeck marked this pull request as draft June 1, 2023 14:35
@vincbeck vincbeck marked this pull request as ready for review June 1, 2023 19:52
@potiuk potiuk merged commit def4b53 into apache:main Jun 4, 2023
@vincbeck vincbeck deleted the ferruzzi/system-tests/fargate-logging branch June 5, 2023 22:39
@dstandish
Copy link
Contributor

@vincbeck can you share an example traceback that caused you to add tenacity to consume_logs?

I am just digging into KPO logging a bit again, and finding that the whole thing has become extremely messy and complicated, and I'm trying to chip away at some of the mess.

And as part of that, the behavior of consume_logs is such that it is already called in a loop if there's an error. So it's a bit odd to also wrap it with tenacity, a kind of mixing of two different retry strategies that makes it a bit confusing.

Now what can be done.... Well please observe that consume_logs calls read_pod_logs, which I believe is where the APIException that you catch is actually encountered, and which already retries using tenacity. So my thought is we can just update that function to handle your case, and thus move the retry logic closer to where it is needed, and reduce the tenacity wrappers from 2 to 1. But to do this I need more information about specifically where that exception is raised. My suspicion is that it would be raised at logs = self._client.read_namespaced_pod_log in read_pod_logs. But I recognize it's also possible that it is not raised until the logs consumer is iterated in at for raw_line in logs: in consume_logs. Thanks for the help.

@vincbeck
Copy link
Contributor Author

vincbeck commented Nov 6, 2023

Hey @dstandish . I removed the tenacity wrapper around consume_logs locally and ran the system test to get the stack trace. The system test ran successfully twice. So I guess, it should be safe to remove it?

@dstandish
Copy link
Contributor

Hey, thanks @vincbeck , kind of you to check that.

Ok, i'll make a PR to remove it for now. And then if it comes back, we can revisit.

Thanks

dstandish added a commit to astronomer/airflow that referenced this pull request Nov 7, 2023
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache#31622 (comment).
dstandish added a commit that referenced this pull request Nov 7, 2023
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at #31622 (comment).
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Nov 10, 2023
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache#31622 (comment).
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 18, 2024
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2024
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 8, 2024
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 5, 2025
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 25, 2025
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2025
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 18, 2025
There are many overlapping layers and strategies of retrying in this area of code.  It appears this particular layer may be unnecessary.  See discussion starting at apache/airflow#31622 (comment).

GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers area:system-tests provider:amazon AWS/Amazon - related issues provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants