tls: add read deadline to containers/image registry connections #777

rphillips wants to merge 2 commits into containers:main
Conversation
rphillips force-pushed from 5662998 to 43f29f7
mtrmac left a comment
Thanks,
I’m not too happy about adding all this extra infrastructure in, essentially, application-level software (and about hard-coding timeouts, not that making it tunable would make it any better for end users).
What is the theory of the network under which this helps? If this is for pulls, presumably the sender is going to be sending packets and automatically retrying unless it receives ACKs.
And the receiver already has `KeepAlive: 30` set. So the connection should only stay alive if the two endpoints are fully live; it’s just that the sender is choosing not to send anything.
We also have reports of registries stalling at/around(?) EOF, reportedly because some security scan is running. A hard-coded timeout does not scale to large images for such operations.
```go
			return c.sys.DockerProxy(request.URL)
		}
	}
	tr.ResponseHeaderTimeout = 2 * time.Minute
```
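For reference, the `KeepAlive: 30` mentioned above and the `ResponseHeaderTimeout` line in the diff map to standard `net/http` transport knobs. A minimal sketch, with illustrative values rather than the exact containers/image wiring:

```go
package example

import (
	"net"
	"net/http"
	"time"
)

// newTransport sketches the two timeouts under discussion. TCP keepalive
// probes only detect a peer that is fully dead; a live-but-silent sender
// keeps the connection (and a blocked Read) alive indefinitely.
func newTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second, // the existing "KeepAlive: 30"
	}
	return &http.Transport{
		DialContext:           dialer.DialContext,
		ResponseHeaderTimeout: 2 * time.Minute, // caps the wait for headers, not the body
	}
}
```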
Correct, this is on client pulls. Claude evaluated the Node logs within a job and says there is a network "hiccup" right before the long-running pulls start. I suspect we can generalize this to a client-side issue: if the network bounces for some reason, the client socket read stalls.
Updated the PR with a config option.
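For illustration, a minimal sketch of the per-read deadline approach the PR describes; the `deadlineConn` name comes from the PR description, and the real implementation may differ in detail:

```go
package example

import (
	"net"
	"time"
)

// deadlineConn wraps a net.Conn and re-arms a read deadline before every
// Read. A stalled peer then surfaces as a net.Error with Timeout() == true
// instead of blocking the pull goroutine forever.
type deadlineConn struct {
	net.Conn
	readTimeout time.Duration
}

func (c *deadlineConn) Read(p []byte) (int, error) {
	if c.readTimeout > 0 { // zero means no deadline, matching the PR's default
		if err := c.Conn.SetReadDeadline(time.Now().Add(c.readTimeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(p)
}
```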
rphillips force-pushed from 43f29f7 to ddd4b53
A stalled TLS connection to a container registry (e.g. quay.io) can cause image pulls to hang indefinitely. The HTTP response body read blocks forever in `tls.Conn.Read` with no timeout, starving the entire pull pipeline and leaving pods stuck in ContainerCreating for hours.

Wrap the HTTP transport dialer with a `deadlineConn` that enforces a 5-minute read deadline via `SetReadDeadline` on every `Read` call. When triggered, `bodyReader` treats the timeout the same as ECONNRESET and attempts a Range-based reconnect to resume the download. Also add a 2-minute `ResponseHeaderTimeout` to the transport.

Ref: https://redhat.atlassian.net/browse/OCPBUGS-79544

Signed-off-by: Ryan Phillips <rphillips@redhat.com>

add

Signed-off-by: Ryan Phillips <rphillips@redhat.com>
rphillips force-pushed from ddd4b53 to 9333c62
@mtrmac I added a `DockerReadTimeout` to the `SystemContext`, which still defaults to unlimited. It should allow cri-o to configure a max read timeout.
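A hedged sketch of the caller-side opt-in; `DockerReadTimeout` exists only on this PR's branch and is not part of a released containers/image API:

```go
package example

import (
	"time"

	"github.com/containers/image/v5/types"
)

// newSystemContext shows how a caller such as cri-o might enable the
// per-read deadline added in this PR. Zero (the default) leaves reads
// unbounded, preserving current behavior.
func newSystemContext() *types.SystemContext {
	return &types.SystemContext{
		DockerReadTimeout: 5 * time.Minute,
	}
}
```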
I don’t know what a “hiccup” means. Why did the existing …
Great question. I'm not sure if this is a server-side issue not sending data, or a client-side issue with the keepalive. The client should be able to tell the TCP socket not to block forever, though. This could be a quay registry issue; I am not sure.
mtrmac left a comment
We have too much on our plate to take on the long-term cost of maintaining a feature+option to handle an unknown situation (which implies there is no way to reliably reproduce it), with code that shouldn’t be necessary and can break some users.
We had a discussion today on the Node Team. We can defer this PR, perhaps to the next release of OpenShift. I do not necessarily agree this is an unknown situation: socket reads block indefinitely without a deadline attached to them, and the client code currently assumes the server is doing the right thing and sending data.
- Add a `deadlineConn` wrapper that sets a per-read `SetReadDeadline` on every `Read()` call to the underlying registry connection, preventing indefinite stalls in `tls.Conn.Read`
- Add a `DockerReadTimeout` field to `SystemContext` so callers can configure the per-read deadline. When zero (the default), no deadline is enforced
- Treat a read timeout in `bodyReader.Read()` as a reconnectable condition (alongside `ECONNRESET` and `ErrUnexpectedEOF`), triggering the existing Range-based resume logic
- Add `ResponseHeaderTimeout = 2m` to the HTTP transport
- Tests for `isRetryableNetworkError` and `deadlineConn`

Fixes a class of image pull hangs where a registry TLS connection stalls mid-transfer and never returns data. The pull goroutine blocks forever in `crypto/tls.(*Conn).Read` → `docker.(*bodyReader).Read`, leaving pods stuck in ContainerCreating indefinitely. Context cancellation alone cannot interrupt a blocked TLS read syscall.

With this change, callers set `SystemContext.DockerReadTimeout` (e.g. `5 * time.Minute`) to enable stall detection. When a read exceeds the timeout, it returns a `net.Error` with `Timeout() == true`, the body is closed, and `bodyReader` reconnects with a `Range: bytes=N-` header to resume the download from where it left off.

Generated by Claude.
Reviewed by @rphillips
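To make the resume path concrete, here is a sketch of the error classification and Range-based reconnect outlined above; `isRetryable` and `resumeBlob` are hypothetical helpers, not the `bodyReader` implementation:

```go
package example

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"syscall"
)

// isRetryable treats a read-deadline expiry (a net.Error with Timeout() ==
// true) like ECONNRESET or an unexpected EOF: all three justify reconnecting.
func isRetryable(err error) bool {
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return true
	}
	return errors.Is(err, syscall.ECONNRESET) || errors.Is(err, io.ErrUnexpectedEOF)
}

// resumeBlob re-requests a blob from byte offset onward with a Range header,
// expecting 206 Partial Content so the download continues where it stalled.
func resumeBlob(ctx context.Context, client *http.Client, url string, offset int64) (io.ReadCloser, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusPartialContent {
		resp.Body.Close()
		return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
	}
	return resp.Body, nil
}
```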