rootless: Rearrange setup of rootless containers ***CIRRUS: TEST IMAGES***#3310
Conversation
|
Hi @gabibeyer. Thanks for your PR. I'm waiting for a containers or openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test |
|
Can one of the admins verify this patch?
|
|
If we're doing this, do we need to retain the old @giuseppe WDYT? |
There was a problem hiding this comment.
I feel like we've entirely defeated the point of postConfigureNetNS with this change. If we're unconditionally doing it before the container is created, there's no further point to the code - we should remove it and simplify the logic.
There was a problem hiding this comment.
Would we want Podman to create a network namespace for runc as well, and pass the netns path? Or keep the logic of passing the emtpy netns path, and have a more appropriately named variable to differentiate the logic?
There was a problem hiding this comment.
I like the idea of simplifying and having a consistent flow.
I would weigh in more of it was clear (to me) the scenario/use-case around postConfigureNetNS - not to nit, but can someone describe the comment on that? Happy to send a PR to add here:
https://github.com/containers/libpod/blob/master/libpod/container.go#L396-L398
There was a problem hiding this comment.
I'd make creating the networking namespace and passing it into the runtime unconditional.
There was a problem hiding this comment.
I am not sure how easily we can take rid of postConfigureNetNS. It is also needed when a user namespace is created by the OCI runtime. If we create the network namespace before the OCI runtime runs and configures the user namespace, then the network namespace is owned by the wrong user namespace
There was a problem hiding this comment.
Keeping a consistent flow for runc and kata sounds good to me, easier to understand, maintain and debug.
While you are at it @gabibeyer, while not strictly related to this PR, it will be useful to add the above explanation from @giuseppe as a doc comment for the postConfigureNetNS flag
There was a problem hiding this comment.
I am not sure how easily we can take rid of postConfigureNetNS. It is also needed when a user namespace is created by the OCI runtime. If we create the network namespace before the OCI runtime runs and configures the user namespace, then the network namespace is owned by the wrong user namespace
There was a problem hiding this comment.
Keeping a consistent flow for runc and kata sounds good to me, easier to understand, maintain and debug.
While you are at it @gabibeyer, while not strictly related to this PR, it will be useful to add the above explanation from @giuseppe as a doc comment for the postConfigureNetNS flag
There was a problem hiding this comment.
I suppose you will be dropping the if condition for PostConfigureNetNS here and everywhere else @gabibeyer
There was a problem hiding this comment.
I would move the pipe creation above before the command-line args for slirp4netns are created, so that we know what file-descriptors "3" and "4" passed to the slirp process are.
|
Per @giuseppe we might still need |
There was a problem hiding this comment.
@giuseppe Do you know the reasoning for appending the container PID with api-socket? I couldn't figure out the reasoning behind it, but some of the port mapping tests are failing and I want to make sure its not because of removing it.
|
After some debugging of the failed pod tests, it seems the infra pod either isn't storing or configuring the network namespace correctly. When attemping to execute slirp4netns the netns is nil: |
d8c1c99 to
e4e3835
Compare
|
@mheon @giuseppe I made some changes that I believe will satisfy the postConfiguration logic for OCI specified namespaces. I am having a little trouble with a healthcheck unit test; when I ran them locally I was getting this failure: However, when I attempted to run the tests against the current master and it seems to be failing as well, so I'm guessing that may be my environment. I'm having a hard time finding in the CI logs what the error reported is. Thank you for the help! |
|
@baude Can you take a look at the healthcheck test issues here? |
|
@baude Do you have any ideas on how I can proceed further, I'm a little stuck. Thank you! |
|
i dont see the failures here ... is this only local? can you push so i can see what you are talking about? |
|
@baude ah sorry, yeah I have been attempting to find where the failures in the "special_testing_rootless" tests are happening. I was attempting to run the Instead it seems to be something with seccomp and OCI runtime, maybe: |
|
i can help you debug them if you force push your code |
|
@gabibeyer slirp4netns 0.4.0-beta.2 should land in updates-testing soon |
|
lgtm @gabibeyer |
|
@mheon @vrothberg @cevich PTAL |
|
|
||
| // Setup rootless networking, requires c.state.NetNS to be set | ||
| if rootless.IsRootless() { | ||
| rootlessSetupErr = c.runtime.setupRootlessNetNS(c) |
There was a problem hiding this comment.
Question: why isn't this fatal? Are we expecting a lot of spurious errors here? I would expect a slirp failure to prevent container creation
There was a problem hiding this comment.
Because it is a go routine, it it fatal once the other go routine completes, and the error is checked.
| cmd := exec.Command(path, cmdArgs...) | ||
| cmdArgs = append(cmdArgs, "-c", "-e", "3", "-r", "4") | ||
| if !ctr.config.PostConfigureNetNS { | ||
| ctr.rootlessSlirpSyncR, ctr.rootlessSlirpSyncW, err = os.Pipe() |
There was a problem hiding this comment.
Don't we need defer Close... on these two in here?
| return errors.Wrapf(err, "failed to create rootless network sync pipe") | ||
| } | ||
| } else { | ||
| defer errorhandling.CloseQuiet(ctr.rootlessSlirpSyncR) |
There was a problem hiding this comment.
So we have two places where we initialize the pipes for slirp, and two places where we do a defer close on the pipes, and those places are not the same... Why is this? It doesn't seem to make sense.
There was a problem hiding this comment.
Right, this took me a minute to figure out. Slirp4netns is connected to conmon via this pipe; that's how it knows to die (when the container + conmon stops, the pipe closes at the conmon end, and slirp4netns stops). So, one end of the pipe must be in conmon command, and the other in the slirp4netns command. Depending on whether the new default logic is ran (setup network -> create container) or the PostConfigureNetNS logic (create container -> setup network), the timing of creating and closing the pipe changes.
I looked a little bit into creating the pipe somewhere before both operations, and then closing after both, but didn't see a good place in the code that popped out. I'm definitely open to suggestions!
| stopSignal = uint(syscall.SIGTERM) | ||
| } | ||
|
|
||
| defer func() { |
There was a problem hiding this comment.
This is not needed or desired. The cleanup process should take care of it.
There was a problem hiding this comment.
In the case of restart, the previous network namespaces are not getting cleaned up. I am slightly baffled as to where the removal of the netns bind mount is occurring in the stop call in general. It is probably horribly obvious, but I see directly after the call to c.StopWithTimeout the bind mount to the netns is gone /run/user/1000/netns/.... However, when I throw a debug statement at the end of the c.StopWithTimeout function it still exists.
I'll continue to look for the appropriate place to get this cleaned up.
There was a problem hiding this comment.
stop itself does not remove them. On the container stopping, the cleanup process (podman container cleanup, spawned by conmon) will fire, removing the network namespace, mounts, etc. On restart, we've just been reusing the old network namespace, instead of configuring a new one.
There was a problem hiding this comment.
With the new flow, Podman is now creating the network namespace instead of the runtime, and it is being bind mounted so that it can be passed to slirp and the runtime. I'm storing this in c.state.NetNS, which wasn't being set/used with the previous rootless flow. So, there needs to be an extra step of explicitly cleaning up the podman created and persisted NetNS. I traced the podman container cleanup call, specifically the CleanupContainers function in the pkg/adapter/containers.go file, and I don't see it being called in either model with a restart cmd, so I'm not entirely sure how to how to move forward.
There was a problem hiding this comment.
Does restart need to clean up the old namespace? We've been reusing them for containers running as root.
| defer func() { | ||
| if err := origNS.Set(); err != nil { | ||
| logrus.Errorf("unable to set namespace: %q", err) | ||
| logrus.Warnf("unable to set namespace: %q", err) |
There was a problem hiding this comment.
Does this always trigger on rootless?
There was a problem hiding this comment.
As far as I have seen, since it is attempting to set it back to the golang thread at /proc/...
|
|
||
| // UnmountNS unmounts the NS held by the netns object | ||
| func UnmountNS(ns ns.NetNS) error { | ||
| nsRunDir, err := getNSRunDir() |
There was a problem hiding this comment.
Hm. Where is this used? It seems like it ought to be complaining that this is unused...
There was a problem hiding this comment.
Yeah, its giving an |
|
@AkihiroSuda disregard that...it's giving an |
|
@gabibeyer could you open an issue (or PR) in https://github.com/rootless-containers/slirp4netns ? |
giuseppe
left a comment
There was a problem hiding this comment.
sorry for the late review. Great work! LGTM
|
/lgtm |
|
Woah woah woah. I am pretty convinced that the deferred removal of network namespaces in stop is not a good thing. This is potentially breaking cleanup under the covers. |
|
I don't feel comfortable with this in 1.5.0 as such. We need to either make a release branch and revert in there, or figure out if there are blockers to reusing network namespaces for rootless containers during restart. @giuseppe @gabibeyer Will slirp4netns handle this properly, assuming we don't configure the network namespace twice - we just keep using it. |
|
@mheon from what I hacked around with, it seems possible. I can submit a PR to slirp4netns, and have @AkihiroSuda review to see if it is the appropriate way of handling. |
|
Ack, works for me. For reference, we're going to make a new branch on this repo for 1.5.0, and revert these patches in the branch (not from master). Once this is fixed we can cut a 1.5.1 release from master, including these patches, for testing. |
|
We just had to revert on master - CI was completely broken (generic 'slirp4netns failed' errors) |
|
Right, it requires a version of slirp4netns only released within the test repo
|
In order to run Podman with VM-based runtimes unprivileged, the
network must be set up prior to the container creation. Therefore
this commit modifies Podman to:
to create the tap inferface
enter the netns
Closes #2897
Signed-off-by: Gabi Beyer gabrielle.n.beyer@intel.com