
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF #2064

@eoshea-cmt

Contributing guidelines

I've found a bug and checked that ...

  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

During docker (compose) builds, we occasionally see this error in our CI:
failed to receive status: rpc error: code = Unavailable desc = error reading from server: EOF

This can happen at various stages in docker builds, including:

  • importing cache manifest from ...
  • load build context
  • RUN pip install --upgrade pip

We used our instance monitoring to investigate whether there was any correlation with resource usage. We looked into network, memory, and CPU utilization, and none of these spiked in correlation with the errors.

This error can kill multiple builds happening in parallel on our CI nodes, but it also happens to single builds.
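Not part of the original report, but a hedged way to capture more context around the failure: run the build with plain progress output so the full step log survives, and pull the daemon logs from around the failure time (this assumes a systemd-managed dockerd; with the default `docker` driver, buildkitd runs inside dockerd, so its errors land in the daemon journal):

```shell
# Sketch: gather extra diagnostics around the EOF failure.
# BUILDKIT_PROGRESS=plain keeps the full step log instead of the collapsed TUI output.
BUILDKIT_PROGRESS=plain docker compose build 2>&1 | tee build.log

# Pull the docker daemon's logs from the last 10 minutes (assumes systemd + journalctl).
journalctl -u docker.service --since "10 minutes ago" --no-pager
```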

Expected behaviour

docker compose builds progress and complete successfully

Actual behaviour

docker compose builds fail

Buildx version

github.com/docker/buildx v0.11.2 9872040

Docker info

+ docker system info
Client: Docker Engine - Community
 Version:    24.0.6
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 16
  Running: 16
  Paused: 0
  Stopped: 0
 Images: 16
 Server Version: 24.0.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: v1.1.8-0-g82f18fe
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.15.0-1044-aws
 Operating System: Ubuntu 20.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.67GiB
 Name: ip-10-10-15-71
 ID: 8d7a5a77-4225-4887-a2c3-419a6c5ab76e
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: cmtlouis
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 172.17.0.0/12, Size: 20
   Base: 192.168.0.0/16, Size: 24

Builders list

+ docker buildx ls
NAME/NODE DRIVER/ENDPOINT STATUS  BUILDKIT             PLATFORMS
default * docker                                       
  default default         running v0.11.6+616c3f613b54 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

Configuration

We are not able to reproduce the issue consistently. We are building multiple images with multiple stages using docker compose, which may be relevant.

We also run multiple jobs on the same instances in our CI, so multiple docker compose builds happen in parallel at times. Furthermore, the error can hit multiple docker compose builds running in parallel on the same node at the same time.
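Since the failures are transient and not reproducible, the CI-side mitigation we have considered (not a fix, and not from the upstream docs) is retrying the build a couple of times; with layer caching, retried builds are cheap. A minimal sketch, assuming a POSIX shell in the CI job:

```shell
# Minimal retry wrapper: rerun a command up to a maximum number of attempts,
# with a short backoff between attempts. Useful for transient
# "rpc error: code = Unavailable ... EOF" failures in CI.
retry() {
  max=$1; shift
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "command failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep $((n * 5))   # backoff: 10s, then 15s, ...
  done
}

# Usage in a CI job:
# retry 3 docker compose build
```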

Build logs

No response

Additional info

This seems like it could be a similar error to:
microsoft/vscode-remote-release#7958

I'm wondering if it is some other race condition that only happens occasionally.
It does not seem correlated to resource usage.
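One hedged isolation step (an assumption on my part, not something the report tried): running builds on a dedicated `docker-container` builder separates buildkitd from dockerd, and the builder container's own logs may show why the gRPC stream dropped. A sketch, where the builder name `ci` is made up and the `buildx_buildkit_ci0` container name follows buildx's default `buildx_buildkit_<name>0` naming:

```shell
# Create an isolated BuildKit instance in its own container and make it the default.
docker buildx create --name ci --driver docker-container --use
docker buildx inspect --bootstrap   # starts the buildkitd container

# If a build then fails with the EOF error, buildkitd's logs are available
# from its container:
docker logs buildx_buildkit_ci0
```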
