Follow logs using event channel by jgallucci32 · Pull Request #6614 · containers/podman

jgallucci32 · 2020-06-16T00:58:36Z

Changes method for following logs to use an event
channel rather than a sleep timer to help prevent
race conditions.

Signed-off-by: jgallucci32 john.gallucci.iv@gmail.com

openshift-ci-robot · 2020-06-16T00:58:44Z

Hi @jgallucci32. Thanks for your PR.

I'm waiting for a containers member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

giuseppe · 2020-06-16T08:06:41Z

/ok-to-test

rhatdan · 2020-06-16T15:20:54Z

LGTM

mheon · 2020-06-16T17:15:46Z

Test failure is one I've seen elsewhere - I'm assuming it's a flake? Restarted.

mheon · 2020-06-16T17:17:38Z

Code LGTM

jgallucci32 · 2020-06-16T18:30:49Z

@mheon I ran the portion of the failed test by hand and confirm it works on the command line at least. I've seen a couple of these tests flake out, but this one keeps failing at the same spot several times in a row. I kicked it off again.

If this keeps failing, one thing I suspect is the test was written in such a way to follow the log of the container it is testing and to exit since previously using -f would keep the log tailing indefinitely. Because this fix allows the container to stop tailing immediately upon container exit, the test may not be able to handle this improvement. Just my thoughts, i'll keep digging.

QiWang19 · 2020-06-16T18:45:31Z

If this keeps failing, one thing I suspect is the test was written in such a way to follow the log of the container it is testing and to exit since previously using -f would keep the log tailing indefinitely. Because this fix allows the container to stop tailing immediately upon container exit, the test may not be able to handle this improvement. Just my thoughts, i'll keep digging.

@edsantiago WDYT?

edsantiago · 2020-06-16T18:50:34Z

Unfortunately there's no indication of what the failure was. I've just filed #6626 for one flake which has been causing huge problems the last few days; I suspect it's related to recent changes on podman log. There's another one, in the "test signal handling in containers" system test, which I suspect is related but I haven't fully looked into - that one is next on my list.

edsantiago · 2020-06-16T18:53:18Z

cirrus-flake-analyze confirms that the flake is in the signal-handling test:

special_testing_rootless fedora-32

2020-06-15T19:43:43 system_test

47 podman kill - test signal handling in containers

2020-06-16T03:52:24 system_test

47 podman kill - test signal handling in containers

2020-06-16T03:52:36 system_test

47 podman kill - test signal handling in containers

2020-06-16T08:32:08 system_test

47 podman kill - test signal handling in containers

2020-06-16T09:43:06 system_test

47 podman kill - test signal handling in containers

2020-06-16T10:17:47 system_test

47 podman kill - test signal handling in containers

2020-06-16T11:40:00 system_test

47 podman kill - test signal handling in containers

2020-06-16T11:48:49 system_test

47 podman kill - test signal handling in containers

2020-06-16T12:47:38 system_test

47 podman kill - test signal handling in containers

test fedora-31 fedora-31

2020-06-15T19:52:16 system_test

47 podman kill - test signal handling in containers

test ubuntu-19 ubuntu-19

2020-06-15T19:50:05 system_test

47 podman kill - test signal handling in containers

2020-06-16T03:59:47 system_test

47 podman kill - test signal handling in containers

2020-06-16T04:01:10 system_test

47 podman kill - test signal handling in containers

2020-06-16T08:40:52 system_test

47 podman kill - test signal handling in containers

2020-06-16T09:49:33 system_test

47 podman kill - test signal handling in containers

test ubuntu-20 ubuntu-20

2020-06-15T19:50:40 system_test

47 podman kill - test signal handling in containers

Previously cirrus-flake-analyze had only seen this on Ubuntu - now it's everywhere.

edsantiago · 2020-06-16T18:55:14Z

I'm going to guess that podman logs -f is halting when the container stops, but it is not emitting the last line of container output. If this is the case, I would consider it a serious bug.

jgallucci32 · 2020-06-16T19:15:55Z

I'm going to guess that podman logs -f is halting when the container stops, but it is not emitting the last line of container output. If this is the case, I would consider it a serious bug.

You are correct that podman logs -f halts when the container stop. It does print the last line before exiting. You can test it by re-running the flake test manually:

containerID=$(podman run -d quay.io/libpod/alpine_labels:latest sh -c 'for i in 1 2 3 4 5 6 8 10 12 13 14 15 16 20 21 22 23 24 25 26 64; do trap "echo got: $i" $i; done; echo READY; while ! test -e /stop; do sleep 0.05; done; echo DONE')
podman logs -f $containerID &
podman exec $containerID touch /stop

This results in DONE being printed to the command line which is the last line before the container exits.

edsantiago · 2020-06-16T19:18:03Z

It does print the last line before exiting

...at least sometimes. My hunch is that it does not always do so, but I really don't know and won't have time to poke into it today. I just think the possibility of a race condition is worth looking into.

jgallucci32 · 2020-06-16T19:24:47Z

It does print the last line before exiting

...at least sometimes. My hunch is that it does not always do so, but I really don't know and won't have time to poke into it today. I just think the possibility of a race condition is worth looking into.

I agree. I suspected a race condition was created with PR #6591. It initially failed some of the tests but after a few re-runs it worked. This PR changes the logic from a loop with a 1 second sleep to use an event handler to exit in real-time. I suspect the way the tests were written they accommodated the fact logs -f would hang rather than exit. The tests may have to be re-written to expect logs -f to immediately exit now (which is the behavior of Docker btw)

edsantiago · 2020-06-16T19:39:40Z

@baude, @jwhonce, @QiWang19, @giuseppe, @haircommander - we need a formal definition of what podman logs -f should do when a container exits (without --rm).

podman logs -f is guaranteed to print all output from the container, up to and including the final byte emitted by the container (unless container is removed).
podman logs -f does not guarantee that it will print all output from the container, and it is up to each individual user to run into that problem, troubleshoot it, and figure out complicated ways to deal with it.

Obviously I have my strong preference, but my preference does not matter. What does matter is that the podman behavior should be clearly and unambiguously documented.

rhatdan · 2020-06-16T19:55:30Z

It should be:
podman logs -f is guaranteed to print all output from the container, up to and including the final byte emitted by the container (unless container is removed).

jgallucci32 · 2020-06-16T20:39:53Z

@rhatdan I created PR #6632 which uses StopAtEOF() which should address this use case. Looking at the hpcloud code it seems to want to print an error 100% of the time so I added a case to not print an error message when using this versus the current Stop() function. Waiting to see if it passes all the checks first try.

jgallucci32 · 2020-06-16T23:31:22Z

@edsantiago Even with the change to StopAtEOF() it is failing at test/system/130-kill.bats for what appears to be random tests. I'm seeing both Fedora and Ubuntu randomly fail this, but it's the same test at least on each one. Having trouble reproducing the issue running the test script locally. It succeeds every time.

jgallucci32 · 2020-06-17T14:04:41Z

/hold

I'm going to test a PR to revert the log functionality back to a week to see if that clears up all the failed tests.

jgallucci32 · 2020-06-17T20:09:20Z

@rhatdan @QiWang19 @edsantiago @TomSweeneyRedHat The fix for log following has been rebased and is back to the original spot of failing the same checks it was previously. This should give us a good starting point for troubleshooting so it can be implemented correctly.

Please review what is being proposed and I'm open to suggestions as to where this may be failing.

mheon · 2020-06-18T14:25:35Z

make validate is what CI uses

jgallucci32 · 2020-06-18T15:24:02Z

@mheon @TomSweeneyRedHat @QiWang19 @edsantiago PTAL

I was able to determine the root cause of the test failures. The issue was the configuration of tailLog was set to Poll: true meaning it would poll for logs rather than use inotify for detecting entries to the log. Setting it to Poll: false uses inotify watchers and prevents the race condition where the container would exit before tailLog finished polling for the final entry.

I also added better error handling to ensure StopAtEOF() is called if there are container errors in addition to it being stopped. There was a rare condition I discovered where the container could be removed before telling the tailing to stop causing it to hang indefinitely (the problem we're trying to fix here)

mheon · 2020-06-18T15:35:45Z

@baude I remember we moved to poll around a year and a half, two years ago, to resolve a bug, but I have absolutely no recollection of what said bug was - do you remember?

edsantiago · 2020-06-18T16:33:31Z

grepping git commits shows #3162 as a possible candidate

mheon · 2020-06-18T16:55:25Z

I think that was a separate issue. I'm recalling a specific bug in the Golang inotify implementation that caused us to not get some events, maybe?

jgallucci32 · 2020-06-18T16:57:10Z

grepping git commits shows #3162 as a possible candidate

It looks like #3162 was somewhat undone a few months later by #3933 which switched it back to using inotify to be notified of the container exit instead of polling continuously the runtime.

rhatdan · 2020-06-19T09:54:16Z

/approve

openshift-ci-robot · 2020-06-19T09:54:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jgallucci32, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [rhatdan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rhatdan · 2020-06-19T09:55:10Z

Isn't there another active PR in here, where @jwhonce or @baude are changing from Inotify to polling?

jgallucci32 · 2020-06-19T14:06:09Z

Isn't there another active PR in here, where @jwhonce or @baude are changing from Inotify to polling?

@rhatdan Yes, but that is for event polling. I tried to reproduce the steps described in #6664 but for following logs (have multiple terminals following logs) and it did not seem to be an issue.

Additionally I will test to see if changing event reading from inotify to polling will break anything here.

jgallucci32 · 2020-06-19T14:27:14Z

@rhatdan I just did a local compile including the changes from #6677 and ran through a whole slew of tests I ran to verify this PR including 035-logs.bats and 130-kill.bats which were the ones which typically failed and all resulted in success. Since there doesn't appear to be a conflict so I'm going to release the hold here.

/hold cancel

mheon · 2020-06-19T14:38:05Z

@jwhonce Does this conflict with your changes to events for APIv2?

I did a rebase with the changes just merged for events for APIv2. Running through checks now.

Changes method for following logs to use an event channel rather than a sleep timer to help prevent race conditions. Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Co-authored-by: Qi Wang <qiwan@redhat.com> Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

jgallucci32 · 2020-06-19T19:34:47Z

/hold

After the rebase we're back to failing one of the e2e tests. It's failing in the same place it did before when using the sleep statements in PR #6591 which makes sense since looking for events using a poll is the equivalent of looping through a sleep statement looking for container state. The challenge here we have enough tests cases where going too fast (or too slow) breaks things so the timing has to be precise.

There are some options here:

See if there is a way to fix APIv2: No events when deleting image #6664 without using polling for events (revert back to using inotify)
See if there is a way to detect if the container has exited to stop following logs which does not involve looking at events (which is now polling) to avoid delays.

jgallucci32 · 2020-06-20T18:01:51Z

/Close

This has been refactored into #6702 to use timers rather than event channels and has passed all checks.

openshift-ci-robot · 2020-06-20T18:01:54Z

@jgallucci32: Closed this PR.

Details

In response to this:

/Close

This has been refactored into #6702 to use timers rather than event channels and has passed all checks.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This incorporates code from PR containers#6591 and containers#6614 but does not use event channels to detect container state and rather uses timers with a defined wait duration before calling t.StopAtEOF() to ensure the last log entry is output before a container exits. The polling interval is set to 250 milliseconds based on polling interval defined in hpcloud/tail here: https://github.com/hpcloud/tail/blob/v1.0.0/watch/polling.go#L117 Co-authored-by: Qi Wang <qiwan@redhat.com> Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

openshift-ci-robot requested review from jwhonce and mheon June 16, 2020 00:58

openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 16, 2020

jgallucci32 mentioned this pull request Jun 16, 2020

Merged request to fix -f to stop following logs #6591

Merged

openshift-ci-robot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 16, 2020

QiWang19 reviewed Jun 16, 2020

View reviewed changes

Comment thread libpod/container_log.go Outdated

edsantiago mentioned this pull request Jun 16, 2020

Fix podman-remote images #6624

Merged

jgallucci32 mentioned this pull request Jun 16, 2020

Use StopAtEOF() to avoid early termination of following logs #6632

Closed

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2020

jgallucci32 mentioned this pull request Jun 17, 2020

Revert #6591 to fix issue with failed tests #6644

Merged

jgallucci32 requested a review from QiWang19 June 18, 2020 15:19

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2020

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2020

mheon reviewed Jun 19, 2020

View reviewed changes

jgallucci32 and others added 6 commits June 19, 2020 09:51

Follow logs using event channel

a51edc2

Changes method for following logs to use an event channel rather than a sleep timer to help prevent race conditions. Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Fix formatting of ReadOptions

dd01f05

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Use StopAtEOF() to ensure end of log is reached

f0042ff

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Add follow logs test case lost during rebase

eeb3bfc

Co-authored-by: Qi Wang <qiwan@redhat.com> Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Use inotify instead of polling for file changes with tail

60c1906

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

Add better error handling to stop tailing logs

2ec4e8b

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2020

jgallucci32 mentioned this pull request Jun 20, 2020

Stop following logs using timers #6702

Merged

openshift-ci-robot closed this Jun 20, 2020

github-actions Bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023

github-actions Bot locked as resolved and limited conversation to collaborators Sep 24, 2023

Conversation

jgallucci32 commented Jun 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Jun 16, 2020

Uh oh!

giuseppe commented Jun 16, 2020

Uh oh!

rhatdan commented Jun 16, 2020

Uh oh!

mheon commented Jun 16, 2020

Uh oh!

mheon commented Jun 16, 2020

Uh oh!

jgallucci32 commented Jun 16, 2020

Uh oh!

QiWang19 commented Jun 16, 2020

Uh oh!

edsantiago commented Jun 16, 2020

Uh oh!

edsantiago commented Jun 16, 2020

special_testing_rootless fedora-32

test fedora-31 fedora-31

test ubuntu-19 ubuntu-19

test ubuntu-20 ubuntu-20

Uh oh!

edsantiago commented Jun 16, 2020

Uh oh!

jgallucci32 commented Jun 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edsantiago commented Jun 16, 2020

Uh oh!

jgallucci32 commented Jun 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edsantiago commented Jun 16, 2020

Uh oh!

Uh oh!

rhatdan commented Jun 16, 2020

Uh oh!

jgallucci32 commented Jun 16, 2020

Uh oh!

jgallucci32 commented Jun 16, 2020

Uh oh!

jgallucci32 commented Jun 17, 2020

Uh oh!

jgallucci32 commented Jun 17, 2020

Uh oh!

mheon commented Jun 18, 2020

Uh oh!

jgallucci32 commented Jun 18, 2020

Uh oh!

mheon commented Jun 18, 2020

Uh oh!

edsantiago commented Jun 18, 2020

Uh oh!

mheon commented Jun 18, 2020

Uh oh!

jgallucci32 commented Jun 18, 2020

Uh oh!

rhatdan commented Jun 19, 2020

Uh oh!

openshift-ci-robot commented Jun 19, 2020

Uh oh!

rhatdan commented Jun 19, 2020

Uh oh!

jgallucci32 commented Jun 19, 2020

Uh oh!

jgallucci32 commented Jun 19, 2020

Uh oh!

mheon Jun 19, 2020

Choose a reason for hiding this comment

Uh oh!

jgallucci32 Jun 19, 2020

Choose a reason for hiding this comment

Uh oh!

jgallucci32 commented Jun 19, 2020

Uh oh!

jgallucci32 commented Jun 20, 2020

jgallucci32 commented Jun 16, 2020 •

edited

Loading

jgallucci32 commented Jun 16, 2020 •

edited

Loading

jgallucci32 commented Jun 16, 2020 •

edited

Loading