With two different CNI plugins (and for two very different problems with them) we have seen situations where a CNI operation hangs and never returns, so it never releases the CNI lock. I think it would be helpful to add a timeout to the attempts to acquire the lock and surface a meaningful error to the kubelet.
Today when this happens the CRI status call from the kubelet will time out with this log:
Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
The kubelet will then mark the node as NotReady with the reason "container runtime is down", which is misleading because crictl pods and crictl ps continue to work perfectly (crictl info does not, because it also calls the CRI Status method).
The containerd call in the Status method is here: https://github.com/containerd/cri/blob/master/pkg/server/status.go#L44-L49
And the libcni.Status method is here: https://github.com/containerd/go-cni/blob/master/cni.go#L124-L132
I think adding a timeout to https://github.com/containerd/go-cni/blob/master/cni.go#L126 and returning a different meaningful error would be helpful and make debugging easier when this happens.
The default kubelet timeout for this call is 2 minutes: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L447-L453
So if we decide to do this, we should probably use a shorter timeout (90s?).
I don't think sync.RWMutex supports timeouts, so we'd need to use a different lock implementation. I'd be more than happy to help with a PR if you think it makes sense.
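To make the idea concrete, here is a rough sketch of what a lock with a timeout could look like. It is not go-cni code: timedMutex, LockWithTimeout, ErrLockTimeout and status are all hypothetical names, and the lock is exclusive only (built on a one-slot channel), so a real change would still need a read/write equivalent of what sync.RWMutex provides today.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrLockTimeout is a hypothetical sentinel error; go-cni does not define it.
var ErrLockTimeout = errors.New("timed out waiting for CNI lock")

// timedMutex is an exclusive lock built on a one-slot buffered channel.
// Holding the lock means having placed a value in the slot.
type timedMutex struct {
	slot chan struct{}
}

func newTimedMutex() *timedMutex {
	return &timedMutex{slot: make(chan struct{}, 1)}
}

// LockWithTimeout acquires the lock, or returns ErrLockTimeout if it is
// still held by someone else after d has elapsed.
func (m *timedMutex) LockWithTimeout(d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case m.slot <- struct{}{}:
		return nil
	case <-timer.C:
		return ErrLockTimeout
	}
}

// Unlock releases the lock. It must only be called by the current holder;
// calling it on an unlocked mutex blocks.
func (m *timedMutex) Unlock() {
	<-m.slot
}

// status stands in for a Status-style check that needs the lock. Instead of
// blocking forever, it reports a lock-specific error to the caller.
func status(m *timedMutex, timeout time.Duration) error {
	if err := m.LockWithTimeout(timeout); err != nil {
		return fmt.Errorf("cni status: %w", err)
	}
	defer m.Unlock()
	return nil
}

func main() {
	m := newTimedMutex()

	// Simulate a hung CNI operation: take the lock and never release it.
	if err := m.LockWithTimeout(time.Second); err != nil {
		panic(err)
	}

	// The status check now fails fast with an error that names the lock.
	err := status(m, 100*time.Millisecond)
	fmt.Println(err)
	fmt.Println(errors.Is(err, ErrLockTimeout))
}
```

With something along these lines, the status check would return an error that points at the CNI lock before the kubelet's own deadline expires, instead of surfacing only as DeadlineExceeded.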