
[BUG] When running on EC2, an iptables segfault leads to OpenShift pods trapped in a CrashLoopBackOff cycle #296

@ianzhang366

Description


What happened:

After installing MicroShift on EC2 (RHEL 8.4), I'm seeing the OpenShift pods in a CrashLoopBackOff state with hundreds of restarts.

Note: the NetworkManager cloud setup service was turned off, per the known-issues doc.

What you expected to happen:

OpenShift pods should not restart.

How to reproduce it (as minimally and precisely as possible):

  1. Spin up an EC2 t2.xlarge instance.
  2. Turn off the NetworkManager cloud setup service and reboot:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
  3. Install MicroShift:
    curl -sfL https://raw.githubusercontent.com/redhat-et/microshift/main/install.sh | bash
  4. Wait a few minutes, then run:
    kubectl get all -A --context microshift

You will see many restarts on the OpenShift pods.

Anything else we need to know?:

@rootfs identified that the issue is caused by an iptables segfault:

[root@ip-172-31-85-30 ec2-user]# journalctl |grep iptables
Sep 21 19:12:51 ip-172-31-85-30.ec2.internal microshift[1297]: I0921 19:12:51.860442    1297 kubelet_network_linux.go:56] Initialized IPv4 iptables rules.
Sep 21 19:12:54 ip-172-31-85-30.ec2.internal microshift[1297]: I0921 19:12:54.399365    1297 server_others.go:185] Using iptables Proxier.
Sep 21 19:13:50 ip-172-31-85-30.ec2.internal kernel: iptables[2438]: segfault at 88 ip 00007feaf5dc0e47 sp 00007fff6f2fea08 error 4 in libnftnl.so.11.3.0[7feaf5dbc000+16000]
Sep 21 19:13:50 ip-172-31-85-30.ec2.internal systemd-coredump[2442]: Process 2438 (iptables) of user 0 dumped core.
Sep 21 20:35:57 ip-172-31-85-30.ec2.internal microshift[1297]: E0921 20:35:57.914558    1297 remote_runtime.go:143] StopPodSandbox "1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_service-ca-64547678c6-2nxnp_openshift-service-ca_6236deba-fc5f-4915-817d-f8699a4accfc_0(1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475): error removing pod openshift-service-ca_service-ca-64547678c6-2nxnp from CNI network "crio": running [/usr/sbin/iptables -t nat -D POSTROUTING -s 10.42.0.3 -j CNI-d5d0edec163ce01e4591c1c4 -m comment --comment name: "crio" id: "1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475" --wait]: exit status 2: iptables v1.8.4 (nf_tables): Chain 'CNI-d5d0edec163ce01e4591c1c4' does not exist
Sep 21 20:35:57 ip-172-31-85-30.ec2.internal microshift[1297]: Try `iptables -h' or 'iptables --help' for more information.
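For anyone hitting this, the segfault is easy to confirm by grepping the journal, as above. A minimal sketch — shown here against a sample line copied from the log above, since on the affected host the same grep would be fed from `journalctl` instead:

```shell
# Sample journal line copied from the report above; on the affected host,
# replace the printf with `journalctl` piped into the same grep.
sample='Sep 21 19:13:50 ip-172-31-85-30.ec2.internal kernel: iptables[2438]: segfault at 88 ip 00007feaf5dc0e47 sp 00007fff6f2fea08 error 4 in libnftnl.so.11.3.0[7feaf5dbc000+16000]'

# Match an iptables segfault that lands inside libnftnl.
if printf '%s\n' "$sample" | grep -q 'iptables\[[0-9]*\]: segfault .* in libnftnl'; then
  echo "iptables/libnftnl segfault found"
fi
```

If the grep matches, you are likely hitting this same libnftnl crash rather than an unrelated CNI failure.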

Then @rootfs suggested a workaround:

kubectl delete ds -n kube-system kube-flannel-ds

then restart all OpenShift pods.

This was verified in my environment.
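Spelled out, the full workaround sequence looks roughly like the sketch below. The pod-restart step and the `openshift-*` namespace names are my assumptions, not part of the original report — adjust them to whatever `kubectl get ns` shows on your host:

```shell
# Workaround sketch: the commands are collected into a variable and printed,
# not executed, so the sequence is explicit without needing a live cluster.
# Deleting the pods causes MicroShift to recreate (i.e. restart) them.
workaround=$(cat <<'EOF'
kubectl delete ds -n kube-system kube-flannel-ds
kubectl delete pods --all -n openshift-ingress
kubectl delete pods --all -n openshift-service-ca
kubectl delete pods --all -n openshift-dns
EOF
)
printf '%s\n' "$workaround"
```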

Environment:

  • Microshift version (use microshift version): Microshift Version: 4.7.0-0.microshift-2021-08-31-224727
  • Hardware configuration: t2.xlarge
  • OS (e.g: cat /etc/os-release): PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
  • Kernel (e.g. uname -a):
    Linux ip-172-31-41-204.ec2.internal 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

Relevant Logs

The ingress pod logged the following while the restarts happened:

[ec2-user@ip-172-31-41-204 ~]$ kubectl logs -n openshift-ingress router-default-6d8c9d8f57-8bphk
I0921 17:36:17.801664       1 template.go:433] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 9cc0c8fc\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2021-06-11T16:32:09Z\n"
I0921 17:36:17.803371       1 metrics.go:154] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"="0.0.0.0:1936"
I0921 17:36:17.810815       1 router.go:191] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I0921 17:36:17.810872       1 router.go:270] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I0921 17:36:17.811332       1 router.go:332] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I0921 17:36:17.811391       1 router.go:262] router "msg"="router is including routes in all namespaces"
E0921 17:36:17.914638       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0921 17:36:17.948417       1 router.go:579] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0921 17:38:57.445655       1 template.go:690] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
W0921 17:39:02.274166       1 reflector.go:436] github.com/openshift/router/pkg/router/template/service_lookup.go:33: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

Labels

kind/bug — Categorizes issue or PR as related to a bug.
