Skip to content

update gateway to add ability to run and exec into containers#1627

Merged
AkihiroSuda merged 19 commits into
moby:masterfrom
coryb:gateway-exec-proto
Oct 4, 2020
Merged

update gateway to add ability to run and exec into containers#1627
AkihiroSuda merged 19 commits into
moby:masterfrom
coryb:gateway-exec-proto

Conversation

@coryb
Copy link
Copy Markdown
Collaborator

@coryb coryb commented Aug 5, 2020

more updates for #749, this pr adds the gateway client and server implementation with proto updates.

Here is a sequence diagram for the grpc message flow:
image

Comment thread frontend/gateway/gateway.go Outdated
}
}

func addDefaultEnvvar(env []string, k, v string) []string {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was copied from ops/exec.go, we can move it to a common util package if desired, was not sure where to put it.

@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Aug 5, 2020

The cross compile failed 3 times with this error:

#116 FROM docker.io/tonistiigi/binfmt:buildkit
#116 sha256:90fda517d0186166992332ac17ed66c6e1f3c0b78be072165351cc2aea215ca9
#116 ERROR: pull access denied, repository does not exist or may require authorization: server message: invalid_token: authorization failed

#33 FROM docker.io/tonistiigi/binfmt:buildkit
#33 sha256:d4959a5a882759bba7601923e8ae2aa875841b7cad47faec82eb0b059698c6a3
#33 ERROR: pull access denied, repository does not exist or may require authorization: server message: invalid_token: authorization failed

not sure what this issue could be from.

Copy link
Copy Markdown
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get some tests for this please 🤞

For the CI error, I think it is token expiring showing up after #1622 . I switched CI to v0.7.2 that should unblock this (but introduces another issue). I'll try to get a proper fix soon but requires custom containerd authorizer implementation.

Comment thread solver/pb/ops.proto Outdated
Comment thread frontend/gateway/pb/gateway.proto Outdated
Copy link
Copy Markdown
Collaborator

@hinshun hinshun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a few potential race conditions, I may find more but submitting this for now.

Comment thread frontend/gateway/gateway.go Outdated
Comment thread frontend/gateway/gateway.go Outdated
Comment thread frontend/gateway/gateway.go Outdated
Comment thread frontend/gateway/gateway.go Outdated
@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Aug 5, 2020

@tonistiigi not sure how to create tests for this one. I could not find any unit tests for the gateway, and the integration tests (ie ./client/build_test.go) I think will require modifications to the gateway.Client interface (suggested some changes here). Do you have an example how I might test these changes without the gateway.Client changes? Otherwise, it seems like I could either add tests for this along with the next PR for client changes or try to update this PR with client changes + tests. Other thoughts?

Comment thread frontend/gateway/pb/gateway.proto Outdated
@coryb coryb force-pushed the gateway-exec-proto branch from bb7adb8 to 5962481 Compare August 6, 2020 17:34
Comment thread frontend/gateway/gateway.go Outdated
Comment thread frontend/gateway/gateway.go Outdated
Comment thread frontend/gateway/gateway.go Outdated
@coryb coryb force-pushed the gateway-exec-proto branch from f6744d5 to 1732086 Compare August 6, 2020 18:50
@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Aug 6, 2020

After reading Tonis' comment here I realized I have the starting message out of order. I kept thinking the started channel was to indicate IO ready to receive, but it really means container is ready to exec into if necessary. I need to reorder the Started message and delay sending any async Fd messages until the Started has been sent. Will update ...

@tonistiigi
Copy link
Copy Markdown
Member

@coryb Sorry, got confused as well. We do not need a started channel to signal if container is ready for exec, we know that from the create() returning. But that only works if create does not contain meta/process-info. This PR looks correct in that aspect but the client-side proposal didn't.

@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Aug 6, 2020

I see, yeah. There is still an unlikely race with my code since the server returns the StartedMessage in parallel with executor.Run. So if client receives StartedMessage and then immediately sends another InitMessage (which should call executor.Exec) then the second InitMessage might result int "container not found" because pid1 might not have updated the running map yet due to the race. It should be pretty easy to fix this.

@coryb coryb force-pushed the gateway-exec-proto branch from 57d512b to b90aaee Compare August 7, 2020 07:14
@coryb coryb marked this pull request as draft August 11, 2020 20:44
@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Aug 11, 2020

Converted to Draft until I can get the client side of this integrated as per discussion in #749

@coryb coryb force-pushed the gateway-exec-proto branch from ef6b41a to 14ffe33 Compare August 17, 2020 21:46
@coryb coryb force-pushed the gateway-exec-proto branch 2 times, most recently from 3448537 to df7d944 Compare August 31, 2020 20:30
@coryb coryb marked this pull request as ready for review August 31, 2020 20:54
@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Aug 31, 2020

Sorry for the delay, life has not gone to plan for a few weeks now. I have reset the PR to include the client code and added some tests. Here is a gist for sample main:
https://gist.github.com/coryb/e9a74902f172c57cefe258c525735b94

Some issues that came up I am not sure how to deal with. In the sample main, the status displayCh is not synchronized with the container startup so output can get messed up then Stdin goes into Raw mode. It does not cause problems, just looks ugly. Ideally we would be able to flush the displayCh before going into Raw mode.

Another odd issue I am not sure how to deal with is that the integration tests fail under worker=oci-rootless. Runc is getting a panic: panic: cannot statfs cgroup root when run in rootless mode. For now I skip the tests in rootless mode.

Comment thread executor/executor.go

// ExitError will be returned from Run and Exec when the container process exits with
// a non-zero exit code.
type ExitError struct {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved ExitError to errdefs package to avoid circular package dependencies

Signed-off-by: Cory Bennett <cbennett@netflix.com>
@coryb coryb force-pushed the gateway-exec-proto branch from df7d944 to 355e937 Compare August 31, 2020 21:14
Copy link
Copy Markdown
Collaborator

@hinshun hinshun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed a few times before, only small nit for now.


lbf.ctrsMu.Lock()
// ensure we are not clobbering a dup container id request
if _, ok := lbf.ctrs[in.ContainerID]; ok {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check this in the beginning of this function instead?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to move it up we would have to put most of the function under the mutex lock. It is isolate at the bottom after the container object has been created so that we can check for duplication at the same time we do the assignment and to minimize the timing and code run under the lock. If we move the dup check up, we would either have to check again anyway inside the lock (since concurrency is allowed) or lock before we create the container which could take time with creating volumes via the cache manager.

I have tweaked the code a bit to tighten up the release and unlock handling, does that look better?

Signed-off-by: Cory Bennett <cbennett@netflix.com>
@coryb coryb force-pushed the gateway-exec-proto branch from c4223b4 to 8df17c8 Compare September 1, 2020 22:45
@tonistiigi
Copy link
Copy Markdown
Member

Another odd issue I am not sure how to deal with is that the integration tests fail under worker=oci-rootless. Runc is getting a panic: panic: cannot statfs cgroup root when run in rootless mode. For now I skip the tests in rootless mode.****

@AkihiroSuda any ideas?

@tonistiigi
Copy link
Copy Markdown
Member

Haven't reviewed yet but one thing I noticed was that there was no new api cap. Client-side should also detect missing cap.

I assume integration to ExecState is planned as a follow-up?

Comment thread client/build_test.go
func testClientGatewayContainerPID1Exit(t *testing.T, sb integration.Sandbox) {
if sb.Rootless() {
// TODO fix this
// We get `panic: cannot statfs cgroup root` when running this test
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you open an issue in runc repo

Copy link
Copy Markdown
Member

@tonistiigi tonistiigi Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it appear in exec and not in regular builds?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created issue, with stacktrace from runc: opencontainers/runc#2573

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/sys/fs/cgroup isn't mounted?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the /sys mounts are trimmed out here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears to work if we bind mount in /sys, but not sure is this is the right thing to do here. After diffing the config.json from runc spec and runc spec --rootless. I added this to that ToRootless function and it worked for my tests, but not sure if there are other complications:

	spec.Mounts = append(spec.Mounts, specs.Mount{
		Destination: "/sys",
		Type:        "none",
		Source:      "/sys",
		Options:     []string{"rbind", "nosuid", "noexec", "nodev", "ro"},
	})

Does this seem like the right solution?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't notice this comment.

{"rbind", "ro"} doesn't make submounts recursively read-only and can result in container breakout.
So we should mount /sys and /sys/fs/cgroup separately as {"bind", "ro"}.

In the long term, we should fix opencontainers/runc#2573

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I have updated with this: 4b51fbd

It seems to work, hopefully does not cause other issues.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change ended up causing several other integration tests to fail with oci-rootless, like: TestIntegration/TestDefaultEnvWithArgs so I am going to revert it. I am leaning towards leaving my "skip rootless" hack in place for this PR and then investigating/fixing the rootless + /sys issue later.

…d.Solve

Signed-off-by: Cory Bennett <cbennett@netflix.com>
@coryb coryb force-pushed the gateway-exec-proto branch from c09d63d to 03e1c19 Compare October 2, 2020 01:59
coryb added 2 commits October 2, 2020 05:36
…be per process

Signed-off-by: Cory Bennett <cbennett@netflix.com>
Signed-off-by: Cory Bennett <cbennett@netflix.com>
@coryb coryb force-pushed the gateway-exec-proto branch from 349a0fd to 095a919 Compare October 2, 2020 06:25
coryb added 3 commits October 2, 2020 06:33
Signed-off-by: Cory Bennett <cbennett@netflix.com>
Signed-off-by: Cory Bennett <cbennett@netflix.com>
Signed-off-by: Cory Bennett <cbennett@netflix.com>
@coryb coryb force-pushed the gateway-exec-proto branch from 5845b3c to d1a14ae Compare October 2, 2020 07:37
@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Oct 2, 2020

Sorry for the delays, I was out on a much needed vacation for almost 2 weeks.

I think I have most of the items fixed. 2 open issue:

@AkihiroSuda
Copy link
Copy Markdown
Member

Sorry I didn't notice your question, please take a look at #1627 (comment)

@coryb coryb force-pushed the gateway-exec-proto branch from 4b51fbd to d1a14ae Compare October 2, 2020 17:56
Signed-off-by: Cory Bennett <cbennett@netflix.com>
@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Oct 2, 2020

I think this PR is ready now. There are several follow-up issues that I will create to track:

Comment thread frontend/gateway/grpcclient/client.go Outdated
},
exit.Error,
)
if exit.Code == containerd.UnknownExitStatus {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't quite understand this special case handling. Why isn't this just

err = grpcerrors.FromGRPC(status.ErrorProto(exit.Error)
if exit.Code != containerd.UnknownExitStatus {
   err = &errdefs.ExitError{ExitCode: exit.Code, Err: err}
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes more sense. exit.Error is an *rpc.Status so has to be converted to *spb.Status first for status.ErrorProto, unless I am missing some other helper func that already does this? Anyway, updated per your outline here: 398593b

Signed-off-by: Cory Bennett <cbennett@netflix.com>
@hinshun hinshun requested a review from AkihiroSuda October 3, 2020 06:01
@tonistiigi
Copy link
Copy Markdown
Member

@coryb I started to try to use this and discovered the signal part was left unimplemented

// TODO Signal(ctx context.Context, sig os.Signal)
. That unfortunately makes it quite unusable for anything interactive (and means we still need to wait for another buildkit release for everyone 😢 ). Do you have plans to finish it?

@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Oct 15, 2021

Hey @tonistiigi, I will try to look into it next week. We have been using the gateway exec functionality fairly extensively for debugging failing builds, not having signals has caused a few annoyances for us but generally it has been very usable for us. The interactive terminal allows things like ctrl-c to go through, so just curious where the missing signals are causing problems for you.

@tonistiigi
Copy link
Copy Markdown
Member

The interactive terminal allows things like ctrl-c to go through, so just curious where the missing signals are causing problems for you.

Hmm. It did not work for me.

@coryb
Copy link
Copy Markdown
Collaborator Author

coryb commented Oct 15, 2021

@tonistiigi not sure what is wrong. I have a crud client main that will exec into a container and run bash with interactive tty here: https://gist.github.com/coryb/8d733b7f1f2398a7b85a640d1f38fdc0

If you go run main.go it will get you a prompt (formatting likely off with the default progress output). You can then run tools like top and ctrl-c out of it. At least that worked for me.

@tonistiigi
Copy link
Copy Markdown
Member

@coryb That indeed seems to work in most cases. Need to test more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants