Fix Fargate logging for AWS system tests #31622

vincbeck · 2023-05-30T20:39:40Z

Both system tests, example_eks_with_fargate_in_one_step and example_eks_with_fargate_profile, are failing for the same reason. When EksPodOperator is used with get_logs=True, the operator tries to fetch the logs as soon as the pod has started. About 90% of the time, this fails with:

HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Get \\"https://10.0.5.84:10250/containerLogs/default/run-pod-6rjxgqrs/base?follow=true\\u0026timestamps=true\\": remote error: tls: internal error","code":500}\n'

After investigation, it turns out there is a delay between when the pod starts and when the CSR (certificate signing request) is available, signed, and approved. If you try to get logs with the command kubectl logs <pod-name> -n <namespace> before the CSR is available, signed, and approved, you get the exact same error. If you wait until the CSR is there, you get the logs.

Therefore, to fix it, I decided to simply retry on ApiException, which is the exception raised in this scenario.
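
The retry idea described above can be sketched as follows. This is a minimal, standard-library-only illustration and not the code merged in this PR: ApiException here is a local stand-in for the Kubernetes client's exception class, and retry_on, fetch_logs, and the attempt and delay values are hypothetical names and numbers chosen for the example. The failure is simulated by raising on the first two calls, as if the CSR had not been approved yet.

```python
import functools
import time


class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (assumption)."""


def retry_on(exc_type, attempts=5, delay=0.0):
    """Retry the wrapped call when exc_type is raised, up to `attempts` tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)  # wait before trying again
        return wrapper
    return decorator


# Counts how often the simulated log fetch was attempted.
calls = {"count": 0}


@retry_on(ApiException, attempts=5, delay=0.0)
def fetch_logs():
    """Hypothetical log fetch that fails until the simulated CSR is approved."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ApiException("remote error: tls: internal error")
    return ["log line 1", "log line 2"]


logs = fetch_logs()
```

In a real task the delay would be non-zero so that the retries span the window in which the CSR is still pending.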


vincbeck · 2023-05-30T20:40:21Z

@ferruzzi, who did most of the work here

ferruzzi · 2023-05-31T16:34:02Z

You've reverted everything except adding the exception. Was that intentional?

dstandish · 2023-11-03T21:27:18Z

@vincbeck can you share an example traceback that caused you to add tenacity to consume_logs?

I am just digging into KPO logging a bit again, and finding that the whole thing has become extremely messy and complicated, and I'm trying to chip away at some of the mess.

And as part of that, the behavior of consume_logs is such that it is already called in a loop if there's an error. So it's a bit odd to also wrap it with tenacity, a kind of mixing of two different retry strategies that makes it a bit confusing.
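
To illustrate why stacking two retry layers is confusing, here is a small sketch under stated assumptions. The names consume_logs and read_pod_logs are borrowed from the discussion, but the bodies are simplified stand-ins and not the provider's code; ApiException, with_retries, outer_loop, and the retry counts are hypothetical. It shows that when an outer retry wraps an inner retry, the number of attempts against the failing call multiplies.

```python
class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (assumption)."""


# Counts how many times the innermost call was actually attempted.
attempts_made = 0


def read_pod_logs():
    """Simplified inner call that always fails in this sketch."""
    global attempts_made
    attempts_made += 1
    raise ApiException("remote error: tls: internal error")


def with_retries(func, tries):
    """Call func up to `tries` times, re-raising the last ApiException."""
    for attempt in range(tries):
        try:
            return func()
        except ApiException:
            if attempt == tries - 1:
                raise


def consume_logs():
    # Inner layer: retry the read itself.
    return with_retries(read_pod_logs, tries=3)


def outer_loop():
    # Outer layer: retry consume_logs, which already retries internally.
    try:
        return with_retries(consume_logs, tries=3)
    except ApiException:
        return None


outer_loop()
print(attempts_made)
```

With three tries at each layer, the failing call is attempted nine times, which is harder to reason about than a single retry placed next to the call that fails.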

Now, what can be done? Please observe that consume_logs calls read_pod_logs, which I believe is where the ApiException that you catch is actually encountered, and which already retries using tenacity. So my thought is that we can just update that function to handle your case, and thus move the retry logic closer to where it is needed and reduce the tenacity wrappers from 2 to 1. But to do this I need more information about specifically where that exception is raised. My suspicion is that it would be raised at logs = self._client.read_namespaced_pod_log in read_pod_logs. But I recognize it's also possible that it is not raised until the logs consumer is iterated, at for raw_line in logs: in consume_logs. Thanks for the help.
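
The question of where the exception surfaces matters because a streaming call can fail either when it is made or only once the returned stream is iterated. The sketch below illustrates that distinction in plain Python; it makes no claim about which case applies to the Kubernetes client. ApiException is a local stand-in, and open_log_stream_eager, open_log_stream_lazy, and where_does_it_fail are hypothetical names.

```python
class ApiException(Exception):
    """Local stand-in for the Kubernetes client's ApiException (assumption)."""


def open_log_stream_eager():
    """Fails immediately, like an error raised when the request is made."""
    raise ApiException("raised at call time")


def open_log_stream_lazy():
    """Generator function: its body does not run until it is iterated."""
    raise ApiException("raised on first iteration")
    yield  # unreachable, but makes this a generator function


def where_does_it_fail(open_stream):
    """Report whether the error appears at the call or during iteration."""
    try:
        stream = open_stream()
    except ApiException:
        return "call"
    try:
        for _ in stream:
            pass
    except ApiException:
        return "iteration"
    return "no error"


eager_site = where_does_it_fail(open_log_stream_eager)
lazy_site = where_does_it_fail(open_log_stream_lazy)
print(eager_site, lazy_site)
```

A retry wrapped only around the call would cover the first case but not the second, which is why a traceback is needed before moving the retry logic.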

vincbeck · 2023-11-06T16:04:29Z

Hey @dstandish. I removed the tenacity wrapper around consume_logs locally and ran the system test to get the stack trace. The system test ran successfully twice. So I guess it should be safe to remove it?

dstandish · 2023-11-07T15:21:14Z

Hey, thanks @vincbeck, kind of you to check that.

OK, I'll make a PR to remove it for now. And then if it comes back, we can revisit.

Thanks

There are many overlapping layers and strategies of retrying in this area of code. It appears this particular layer may be unnecessary. See discussion starting at apache#31622 (comment).

There are many overlapping layers and strategies of retrying in this area of code. It appears this particular layer may be unnecessary. See discussion starting at apache/airflow#31622 (comment). GitOrigin-RevId: d6c79ce340dd4cd088edfa92ed052d643ae3587d

ferruzzi and others added 5 commits May 16, 2023 15:07

Adds Fargate Logging support

3b5b57d

Adds resolve_wait_for_completion and related unit tests, and CR fixes

63ef15e

docstring deextrafication

782c2e2

Merge branch 'main' into ferruzzi/system-tests/fargate-logging

e9454f7

Retry on ApiException

78dc49a

vincbeck requested review from eladkal, jedcunningham and o-nikolas as code owners May 30, 2023 20:39

boring-cyborg bot added the provider:cncf-kubernetes, area:providers, area:system-tests, and provider:amazon labels May 30, 2023

vincbeck changed the title ~~Ferruzzi/system tests/fargate logging~~ Enable Fargate logging for AWS system tests May 30, 2023

vincbeck added 3 commits May 30, 2023 17:04

Fix tests

212a528

Remove logging enablement

040e202

Revert eks.py

e7e2b9b

vincbeck changed the title ~~Enable Fargate logging for AWS system tests~~ Fix Fargate logging for AWS system tests May 31, 2023

vincbeck marked this pull request as draft June 1, 2023 14:35

Add sleep statement to have time to read logs

b3f014c

vincbeck marked this pull request as ready for review June 1, 2023 19:52

Use tenacity to retry on ApiException

982e114

o-nikolas approved these changes Jun 1, 2023


Typo

84da5b2

ferruzzi approved these changes Jun 1, 2023


potiuk approved these changes Jun 4, 2023


potiuk merged commit def4b53 into apache:main Jun 4, 2023

vincbeck deleted the ferruzzi/system-tests/fargate-logging branch June 5, 2023 22:39

eladkal mentioned this pull request Jun 20, 2023

Status of testing Providers that were prepared on June 20, 2023 #32030

Closed

86 tasks

dstandish mentioned this pull request Nov 7, 2023

Remove tenacity on KPO logs inner func consume_logs #35504

Merged
