Skip to content

Follow logs using event channel#6614

Closed
jgallucci32 wants to merge 6 commits into
containers:masterfrom
jgallucci32:patch-2
Closed

Follow logs using event channel#6614
jgallucci32 wants to merge 6 commits into
containers:masterfrom
jgallucci32:patch-2

Conversation

@jgallucci32
Copy link
Copy Markdown
Contributor

@jgallucci32 jgallucci32 commented Jun 16, 2020

Changes method for following logs to use an event
channel rather than a sleep timer to help prevent
race conditions.

Close #6531

Signed-off-by: jgallucci32 john.gallucci.iv@gmail.com

@openshift-ci-robot
Copy link
Copy Markdown
Collaborator

Hi @jgallucci32. Thanks for your PR.

I'm waiting for a containers member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 16, 2020
@giuseppe
Copy link
Copy Markdown
Member

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 16, 2020
@rhatdan
Copy link
Copy Markdown
Member

rhatdan commented Jun 16, 2020

LGTM

@mheon
Copy link
Copy Markdown
Member

mheon commented Jun 16, 2020

Test failure is one I've seen elsewhere - I'm assuming it's a flake? Restarted.

@mheon
Copy link
Copy Markdown
Member

mheon commented Jun 16, 2020

Code LGTM

@jgallucci32
Copy link
Copy Markdown
Contributor Author

@mheon I ran the portion of the failed test by hand and confirm it works on the command line at least. I've seen a couple of these tests flake out, but this one keeps failing at the same spot several times in a row. I kicked it off again.

If this keeps failing, one thing I suspect is the test was written in such a way to follow the log of the container it is testing and to exit since previously using -f would keep the log tailing indefinitely. Because this fix allows the container to stop tailing immediately upon container exit, the test may not be able to handle this improvement. Just my thoughts, i'll keep digging.

@QiWang19
Copy link
Copy Markdown
Member

If this keeps failing, one thing I suspect is the test was written in such a way to follow the log of the container it is testing and to exit since previously using -f would keep the log tailing indefinitely. Because this fix allows the container to stop tailing immediately upon container exit, the test may not be able to handle this improvement. Just my thoughts, i'll keep digging.

@edsantiago WDYT?

@edsantiago
Copy link
Copy Markdown
Member

Unfortunately there's no indication of what the failure was. I've just filed #6626 for one flake which has been causing huge problems the last few days; I suspect it's related to recent changes on podman log. There's another one, in the "test signal handling in containers" system test, which I suspect is related but I haven't fully looked into - that one is next on my list.

@edsantiago
Copy link
Copy Markdown
Member

cirrus-flake-analyze confirms that the flake is in the signal-handling test:

special_testing_rootless fedora-32

2020-06-15T19:43:43 system_test

2020-06-16T03:52:24 system_test

2020-06-16T03:52:36 system_test

2020-06-16T08:32:08 system_test

2020-06-16T09:43:06 system_test

2020-06-16T10:17:47 system_test

2020-06-16T11:40:00 system_test

2020-06-16T11:48:49 system_test

2020-06-16T12:47:38 system_test

test fedora-31 fedora-31

2020-06-15T19:52:16 system_test

test ubuntu-19 ubuntu-19

2020-06-15T19:50:05 system_test

2020-06-16T03:59:47 system_test

2020-06-16T04:01:10 system_test

2020-06-16T08:40:52 system_test

2020-06-16T09:49:33 system_test

test ubuntu-20 ubuntu-20

2020-06-15T19:50:40 system_test

Previously cirrus-flake-analyze had only seen this on Ubuntu - now it's everywhere.

@edsantiago
Copy link
Copy Markdown
Member

I'm going to guess that podman logs -f is halting when the container stops, but it is not emitting the last line of container output. If this is the case, I would consider it a serious bug.

@jgallucci32
Copy link
Copy Markdown
Contributor Author

jgallucci32 commented Jun 16, 2020

I'm going to guess that podman logs -f is halting when the container stops, but it is not emitting the last line of container output. If this is the case, I would consider it a serious bug.

You are correct that podman logs -f halts when the container stop. It does print the last line before exiting. You can test it by re-running the flake test manually:

containerID=$(podman run -d quay.io/libpod/alpine_labels:latest sh -c 'for i in 1 2 3 4 5 6 8 10 12 13 14 15 16 20 21 22 23 24 25 26 64; do trap "echo got: $i" $i; done; echo READY; while ! test -e /stop; do sleep 0.05; done; echo DONE')
podman logs -f $containerID &
podman exec $containerID touch /stop

This results in DONE being printed to the command line which is the last line before the container exits.

@edsantiago
Copy link
Copy Markdown
Member

It does print the last line before exiting

...at least sometimes. My hunch is that it does not always do so, but I really don't know and won't have time to poke into it today. I just think the possibility of a race condition is worth looking into.

@jgallucci32
Copy link
Copy Markdown
Contributor Author

jgallucci32 commented Jun 16, 2020

It does print the last line before exiting

...at least sometimes. My hunch is that it does not always do so, but I really don't know and won't have time to poke into it today. I just think the possibility of a race condition is worth looking into.

I agree. I suspected a race condition was created with PR #6591. It initially failed some of the tests but after a few re-runs it worked. This PR changes the logic from a loop with a 1 second sleep to use an event handler to exit in real-time. I suspect the way the tests were written they accommodated the fact logs -f would hang rather than exit. The tests may have to be re-written to expect logs -f to immediately exit now (which is the behavior of Docker btw)

@edsantiago
Copy link
Copy Markdown
Member

@baude, @jwhonce, @QiWang19, @giuseppe, @haircommander - we need a formal definition of what podman logs -f should do when a container exits (without --rm).

  • podman logs -f is guaranteed to print all output from the container, up to and including the final byte emitted by the container (unless container is removed).
  • podman logs -f does not guarantee that it will print all output from the container, and it is up to each individual user to run into that problem, troubleshoot it, and figure out complicated ways to deal with it.

Obviously I have my strong preference, but my preference does not matter. What does matter is that the podman behavior should be clearly and unambiguously documented.

Comment thread libpod/container_log.go Outdated
@rhatdan
Copy link
Copy Markdown
Member

rhatdan commented Jun 16, 2020

It should be:
podman logs -f is guaranteed to print all output from the container, up to and including the final byte emitted by the container (unless container is removed).

@jgallucci32
Copy link
Copy Markdown
Contributor Author

@rhatdan I created PR #6632 which uses StopAtEOF() which should address this use case. Looking at the hpcloud code it seems to want to print an error 100% of the time so I added a case to not print an error message when using this versus the current Stop() function. Waiting to see if it passes all the checks first try.

@jgallucci32
Copy link
Copy Markdown
Contributor Author

@edsantiago Even with the change to StopAtEOF() it is failing at test/system/130-kill.bats for what appears to be random tests. I'm seeing both Fedora and Ubuntu randomly fail this, but it's the same test at least on each one. Having trouble reproducing the issue running the test script locally. It succeeds every time.

@jgallucci32
Copy link
Copy Markdown
Contributor Author

/hold

I'm going to test a PR to revert the log functionality back to a week to see if that clears up all the failed tests.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2020
@jgallucci32
Copy link
Copy Markdown
Contributor Author

@rhatdan @QiWang19 @edsantiago @TomSweeneyRedHat The fix for log following has been rebased and is back to the original spot of failing the same checks it was previously. This should give us a good starting point for troubleshooting so it can be implemented correctly.

Please review what is being proposed and I'm open to suggestions as to where this may be failing.

@mheon
Copy link
Copy Markdown
Member

mheon commented Jun 18, 2020

make validate is what CI uses

@jgallucci32 jgallucci32 requested a review from QiWang19 June 18, 2020 15:19
@jgallucci32
Copy link
Copy Markdown
Contributor Author

@mheon @TomSweeneyRedHat @QiWang19 @edsantiago PTAL

I was able to determine the root cause of the test failures. The issue was the configuration of tailLog was set to Poll: true meaning it would poll for logs rather than use inotify for detecting entries to the log. Setting it to Poll: false uses inotify watchers and prevents the race condition where the container would exit before tailLog finished polling for the final entry.

I also added better error handling to ensure StopAtEOF() is called if there are container errors in addition to it being stopped. There was a rare condition I discovered where the container could be removed before telling the tailing to stop causing it to hang indefinitely (the problem we're trying to fix here)

@mheon
Copy link
Copy Markdown
Member

mheon commented Jun 18, 2020

@baude I remember we moved to poll around a year and a half, two years ago, to resolve a bug, but I have absolutely no recollection of what said bug was - do you remember?

@edsantiago
Copy link
Copy Markdown
Member

grepping git commits shows #3162 as a possible candidate

@mheon
Copy link
Copy Markdown
Member

mheon commented Jun 18, 2020

I think that was a separate issue. I'm recalling a specific bug in the Golang inotify implementation that caused us to not get some events, maybe?

@jgallucci32
Copy link
Copy Markdown
Contributor Author

grepping git commits shows #3162 as a possible candidate

It looks like #3162 was somewhat undone a few months later by #3933 which switched it back to using inotify to be notified of the container exit instead of polling continuously the runtime.

@rhatdan
Copy link
Copy Markdown
Member

rhatdan commented Jun 19, 2020

/approve

@openshift-ci-robot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jgallucci32, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2020
@rhatdan
Copy link
Copy Markdown
Member

rhatdan commented Jun 19, 2020

Isn't there another active PR in here, where @jwhonce or @baude are changing from Inotify to polling?

@jgallucci32
Copy link
Copy Markdown
Contributor Author

Isn't there another active PR in here, where @jwhonce or @baude are changing from Inotify to polling?

@rhatdan Yes, but that is for event polling. I tried to reproduce the steps described in #6664 but for following logs (have multiple terminals following logs) and it did not seem to be an issue.

Additionally I will test to see if changing event reading from inotify to polling will break anything here.

@jgallucci32
Copy link
Copy Markdown
Contributor Author

@rhatdan I just did a local compile including the changes from #6677 and ran through a whole slew of tests I ran to verify this PR including 035-logs.bats and 130-kill.bats which were the ones which typically failed and all resulted in success. Since there doesn't appear to be a conflict so I'm going to release the hold here.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2020
Comment thread libpod/container_log.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwhonce Does this conflict with your changes to events for APIv2?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a rebase with the changes just merged for events for APIv2. Running through checks now.

jgallucci32 and others added 6 commits June 19, 2020 09:51
Changes method for following logs to use an event
channel rather than a sleep timer to help prevent
race conditions.

Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Co-authored-by: Qi Wang <qiwan@redhat.com>
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
@jgallucci32
Copy link
Copy Markdown
Contributor Author

/hold

After the rebase we're back to failing one of the e2e tests. It's failing in the same place it did before when using the sleep statements in PR #6591 which makes sense since looking for events using a poll is the equivalent of looping through a sleep statement looking for container state. The challenge here we have enough tests cases where going too fast (or too slow) breaks things so the timing has to be precise.

There are some options here:

  • See if there is a way to fix APIv2: No events when deleting image #6664 without using polling for events (revert back to using inotify)
  • See if there is a way to detect if the container has exited to stop following logs which does not involve looking at events (which is now polling) to avoid delays.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2020
@jgallucci32
Copy link
Copy Markdown
Contributor Author

/Close

This has been refactored into #6702 to use timers rather than event channels and has passed all checks.

@openshift-ci-robot
Copy link
Copy Markdown
Collaborator

@jgallucci32: Closed this PR.

Details

In response to this:

/Close

This has been refactored into #6702 to use timers rather than event channels and has passed all checks.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mheon pushed a commit to mheon/libpod that referenced this pull request Jun 24, 2020
This incorporates code from PR containers#6591 and containers#6614 but does not use
event channels to detect container state and rather uses timers
with a defined wait duration before calling t.StopAtEOF() to
ensure the last log entry is output before a container exits.

The polling interval is set to 250 milliseconds based on polling
interval defined in hpcloud/tail here:
https://github.com/hpcloud/tail/blob/v1.0.0/watch/polling.go#L117

Co-authored-by: Qi Wang <qiwan@redhat.com>
Signed-off-by: jgallucci32 <john.gallucci.iv@gmail.com>
@github-actions github-actions Bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. ok-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

podman logs -f continues indefinitely on a stopped container

8 participants