Follow logs using event channel #6614
|
Hi @jgallucci32. Thanks for your PR. I'm waiting for a containers member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test |
|
LGTM |
|
Test failure is one I've seen elsewhere - I'm assuming it's a flake? Restarted. |
|
Code LGTM |
|
@mheon I ran the portion of the failed test by hand and confirm it works on the command line at least. I've seen a couple of these tests flake out, but this one keeps failing at the same spot several times in a row. I kicked it off again. If this keeps failing, one thing I suspect is the test was written in such a way to follow the log of the container it is testing and to exit since previously using |
@edsantiago WDYT? |
|
Unfortunately there's no indication of what the failure was. I've just filed #6626 for one flake which has been causing huge problems the last few days; I suspect it's related to recent changes on podman log. There's another one, in the "test signal handling in containers" system test, which I suspect is related but I haven't fully looked into - that one is next on my list. |
|
cirrus-flake-analyze confirms that the flake is in the signal-handling test:
special_testing_rootless fedora-32
  2020-06-15T19:43:43 system_test
  2020-06-16T03:52:24 system_test
  2020-06-16T03:52:36 system_test
  2020-06-16T08:32:08 system_test
  2020-06-16T09:43:06 system_test
  2020-06-16T10:17:47 system_test
  2020-06-16T11:40:00 system_test
  2020-06-16T11:48:49 system_test
  2020-06-16T12:47:38 system_test
test fedora-31 fedora-31
  2020-06-15T19:52:16 system_test
test ubuntu-19 ubuntu-19
  2020-06-15T19:50:05 system_test
  2020-06-16T03:59:47 system_test
  2020-06-16T04:01:10 system_test
  2020-06-16T08:40:52 system_test
  2020-06-16T09:49:33 system_test
test ubuntu-20 ubuntu-20
  2020-06-15T19:50:40 system_test
Previously cirrus-flake-analyze had only seen this on Ubuntu - now it's everywhere. |
|
I'm going to guess that |
You are correct that This results in |
...at least sometimes. My hunch is that it does not always do so, but I really don't know and won't have time to poke into it today. I just think the possibility of a race condition is worth looking into. |
I agree. I suspected that PR #6591 introduced a race condition. It initially failed some of the tests, but after a few re-runs it passed. This PR changes the logic from a loop with a 1-second sleep to an event handler that exits in real time. I suspect the way the tests were written they accommodated the fact
|
@baude, @jwhonce, @QiWang19, @giuseppe, @haircommander - we need a formal definition of what
Obviously I have my strong preference, but my preference does not matter. What does matter is that the podman behavior should be clearly and unambiguously documented. |
|
It should be: |
|
@rhatdan I created PR #6632 which uses |
|
@edsantiago Even with the change to |
|
/hold I'm going to test a PR that reverts the log functionality to its state of a week ago, to see whether that clears up all the failed tests. |
|
@rhatdan @QiWang19 @edsantiago @TomSweeneyRedHat The fix for log following has been rebased and is back to failing the same checks it failed previously. This should give us a good starting point for troubleshooting so it can be implemented correctly. Please review what is being proposed; I'm open to suggestions as to where this may be failing. |
|
|
|
@mheon @TomSweeneyRedHat @QiWang19 @edsantiago PTAL. I was able to determine the root cause of the test failures. The issue was the configuration of tailLog was set to I also added better error handling to ensure |
|
@baude I remember we moved to poll around a year and a half, two years ago, to resolve a bug, but I have absolutely no recollection of what said bug was - do you remember? |
|
grepping git commits shows #3162 as a possible candidate |
|
I think that was a separate issue. I'm recalling a specific bug in the Golang inotify implementation that caused us to not get some events, maybe? |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jgallucci32, rhatdan. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@rhatdan Yes, but that is for event polling. I tried to reproduce the steps described in #6664 but for following logs (have multiple terminals following logs) and it did not seem to be an issue. Additionally I will test to see if changing event reading from inotify to polling will break anything here. |
|
@rhatdan I just did a local compile including the changes from #6677 and ran through a whole slew of tests I ran to verify this PR including /hold cancel |
@jwhonce Does this conflict with your changes to events for APIv2?
I did a rebase with the changes just merged for events for APIv2. Running through checks now.
Changes method for following logs to use an event channel rather than a sleep timer to help prevent race conditions. Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Co-authored-by: Qi Wang <qiwan@redhat.com> Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
|
/hold After the rebase we're back to failing one of the e2e tests. It's failing in the same place it did before when using the sleep statements in PR #6591, which makes sense: looking for events using a poll is equivalent to looping through a sleep statement while checking container state. The challenge here is that we have enough test cases where going too fast (or too slow) breaks things, so the timing has to be precise. There are some options here:
|
|
/Close This has been refactored into #6702 to use timers rather than event channels and has passed all checks. |
|
@jgallucci32: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This incorporates code from PR containers#6591 and containers#6614 but, instead of using event channels to detect container state, uses timers with a defined wait duration before calling t.StopAtEOF() to ensure the last log entry is output before a container exits. The polling interval is set to 250 milliseconds, based on the polling interval defined in hpcloud/tail: https://github.com/hpcloud/tail/blob/v1.0.0/watch/polling.go#L117 Co-authored-by: Qi Wang <qiwan@redhat.com> Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Changes method for following logs to use an event
channel rather than a sleep timer to help prevent
race conditions.
Close #6531
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>