Intermittent crash in gRPC client setup when connecting to vminitd (EBADF / invalid connected socket) #572

@1amageek

Description

What did you do?

I am using containerization indirectly from a macOS app that starts Linux containers through ContainerManager.

The crash happens intermittently during VM/container startup, when the guest agent (vminitd) is being connected over vsock. It does not happen every time, which makes it look like a race rather than a deterministic configuration error.

In my case the path is roughly:

  • initialize ContainerManager
  • start the VM
  • wait for the guest agent
  • construct Vminitd
  • Vminitd.Client creates a gRPC ClientConnection with .connectedSocket(connection.fileDescriptor)

What did you expect?

I expected the guest agent connection to either succeed or fail with a normal thrown error.

It should not terminate the process with a fatal precondition.

What happened instead?

The process crashed in SwiftNIO while gRPC was creating a channel from an already connected socket.

The stack points to fcntl(F_SETNOSIGPIPE, ...) inside NIO socket setup, and the failure appears consistent with an invalid file descriptor (EBADF).

Relevant stack frames:

preconditionIsNotUnacceptableErrno(err:where:) at NIOPosix/System.swift:264
syscall<Int32>(blocking:where:_:) at NIOPosix/System.swift:328
Posix.fcntl(descriptor:command:value:) at NIOPosix/System.swift:619
BaseSocketProtocol.ignoreSIGPIPE(descriptor:) at NIOPosix/SocketProtocols.swift:113
BaseSocket.init(socket:) at NIOPosix/BaseSocket.swift:253
Socket.init(socket:setNonBlocking:) at NIOPosix/Socket.swift:86
SocketChannel.init(eventLoop:socket:) at NIOPosix/SocketChannel.swift:87
ClientBootstrap.withConnectedSocket(_:)
ClientBootstrapProtocol.connect(to:) at GRPC/ClientConnection.swift:603
DefaultChannelProvider.makeChannel(...) at GRPC/ConnectionManagerChannelProvider.swift:306
ConnectionManager.startConnecting(...) at GRPC/ConnectionManager.swift:1120

Analysis

Tracing the code path through the library:

  1. VZVirtualMachineInstance.start() calls waitForAgent(), which retries vsock connection to port 1024 up to 150 times (20 ms apart, ~3 s total).

  2. VZVirtioSocketConnection.dupHandle() (VZVirtualMachine+Helpers.swift:142) duplicates the file descriptor via dup(), then calls self.close() on the original VZVirtioSocketConnection, and returns FileHandle(fileDescriptor: fd, closeOnDealloc: false).

  3. Vminitd.Client.init(connection:group:) (Vminitd.swift:555) extracts connection.fileDescriptor and stores it in a ClientConnection.Configuration as .connectedSocket(fd).

  4. gRPC connects lazily. ClientConnection(configuration:) does not use the FD immediately. The actual NIO channel creation happens later, when ConnectionManager.startConnecting() runs on the event loop, typically triggered by the first RPC call (e.g. agent.standardSetup() → agent.up(name: "lo")).

  5. By the time NIO calls fcntl(F_SETNOSIGPIPE, fd), the FD is no longer valid → EBADF → preconditionIsNotUnacceptableErrno terminates the process.

The most likely root cause is that VZVirtioSocketConnection.close() invalidates not just its own file descriptor but the underlying vsock transport, which also affects the dup()-ed FD. With a regular POSIX close() on one descriptor of a dup() pair, the other descriptor stays usable. VZVirtioSocketConnection.close() may instead tear down the kernel-level vsock socket object itself, leaving the duplicated FD pointing at a destroyed resource.
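For comparison, here is a minimal sketch of the plain POSIX behaviour this hypothesis is measured against. It uses an ordinary pipe rather than a vsock connection, so it says nothing about what VZVirtioSocketConnection.close() does internally. The helper names descriptorErrno and demonstrateDupSemantics are made up for illustration and do not exist in containerization.

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

/// Hypothetical helper: returns nil if `fd` is open, otherwise the errno
/// reported by fcntl(F_GETFD). F_GETFD only reads descriptor flags.
func descriptorErrno(_ fd: Int32) -> Int32? {
    if fcntl(fd, F_GETFD) == -1 { return errno }
    return nil
}

/// Duplicates one end of a pipe, closes the original, then closes the duplicate,
/// probing the duplicate after each step.
func demonstrateDupSemantics() -> (duplicateValidAfterOriginalClosed: Bool,
                                   errnoAfterBothClosed: Int32?) {
    var fds: [Int32] = [-1, -1]
    precondition(pipe(&fds) == 0, "pipe() failed")
    let original = fds[0]
    let duplicate = dup(original)
    precondition(duplicate >= 0, "dup() failed")

    // Under plain POSIX semantics, closing the original does not affect the duplicate.
    close(original)
    let stillValid = descriptorErrno(duplicate) == nil

    // Only once the duplicate itself is closed does its number become invalid.
    close(duplicate)
    let finalErrno = descriptorErrno(duplicate)

    close(fds[1]) // release the write end of the pipe
    return (stillValid, finalErrno)
}

let result = demonstrateDupSemantics()
print("duplicate valid after closing original:", result.duplicateValidAfterOriginalClosed)
print("EBADF after closing duplicate:", result.errnoAfterBothClosed == EBADF)
```

If the dup()-ed vsock FD behaved like this pipe, closing the original connection alone would not produce EBADF. Running the same probe against the FD returned by dupHandle() would be one way to confirm or rule out the hypothesis.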

A secondary contributing factor is the time window between FD capture and FD use: gRPC's ClientConnection defers actual channel creation to the event loop, so even a short-lived race can cause the FD to go stale before NIO touches it.

Suggested fixes

  1. In dupHandle(): defer self.close() until after the gRPC channel has been successfully created, or remove the close entirely and let VZVirtioSocketConnection be kept alive alongside the gRPC client.

  2. In Vminitd.Client.init: validate the FD before passing it to gRPC (e.g. fcntl(fd, F_GETFD) → if it returns -1/EBADF, throw a recoverable error instead of letting NIO hit a fatal precondition).

  3. Consider eagerly creating the NIO channel (via ClientBootstrap.withConnectedSocket) at init time rather than deferring it to the first RPC, to minimize the window in which the FD can become invalid.
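Suggested fix 2 could look roughly like the sketch below. InvalidDescriptorError and validateDescriptor are hypothetical names, not existing containerization or grpc-swift API. The idea is only that Vminitd.Client.init could call something like this before building the .connectedSocket configuration.

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

/// Hypothetical error type for illustration; not part of any existing library.
struct InvalidDescriptorError: Error, Equatable {
    let descriptor: Int32
    let code: Int32 // errno reported by fcntl(F_GETFD), e.g. EBADF
}

/// Throws a recoverable error if `fd` is not an open descriptor in this process.
func validateDescriptor(_ fd: Int32) throws {
    // F_GETFD reads the descriptor flags without modifying the descriptor.
    if fcntl(fd, F_GETFD) == -1 {
        throw InvalidDescriptorError(descriptor: fd, code: errno)
    }
}

// Usage: an obviously invalid descriptor yields a thrown error, not a crash.
do {
    try validateDescriptor(-1)
    print("descriptor is open")
} catch let error as InvalidDescriptorError {
    print("descriptor \(error.descriptor) rejected, EBADF:", error.code == EBADF)
} catch {
    print("unexpected error:", error)
}
```

A check like this only narrows the window. The FD can still become invalid between the check and the moment NIO touches it, so it complements fixes 1 and 3 rather than replacing them.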

Environment

  • macOS 26 (beta), Apple Silicon
  • containerization 0.26.x
  • Swift 6.2
  • grpc-swift (whatever version is resolved by containerization's Package.swift)
