
[BUG] When running on EC2, an iptables segfault leads to OpenShift pods trapped in a CrashLoopBackOff cycle #296

@ianzhang366

Description


What happened:

After installing MicroShift on EC2 (RHEL 8.4), I'm seeing the OpenShift pods in a CrashLoopBackOff state with hundreds of restarts.

Note: the NetworkManager cloud setup service was turned off, per the known-issues doc.

What you expected to happen:

OpenShift pods should not restart.

How to reproduce it (as minimally and precisely as possible):

  1. Spin up an EC2 t2.xlarge instance.
  2. Turn off the NetworkManager cloud setup service and reboot:
systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
reboot
  3. Install MicroShift:
    curl -sfL https://raw.githubusercontent.com/redhat-et/microshift/main/install.sh | bash
  4. Wait a few minutes, then run:
    kubectl get all -A --context microshift

You will see many restarts on the OpenShift pods.

Anything else we need to know?:

@rootfs identified that the issue is caused by an iptables segfault:

[root@ip-172-31-85-30 ec2-user]# journalctl |grep iptables
Sep 21 19:12:51 ip-172-31-85-30.ec2.internal microshift[1297]: I0921 19:12:51.860442    1297 kubelet_network_linux.go:56] Initialized IPv4 iptables rules.
Sep 21 19:12:54 ip-172-31-85-30.ec2.internal microshift[1297]: I0921 19:12:54.399365    1297 server_others.go:185] Using iptables Proxier.
Sep 21 19:13:50 ip-172-31-85-30.ec2.internal kernel: iptables[2438]: segfault at 88 ip 00007feaf5dc0e47 sp 00007fff6f2fea08 error 4 in libnftnl.so.11.3.0[7feaf5dbc000+16000]
Sep 21 19:13:50 ip-172-31-85-30.ec2.internal systemd-coredump[2442]: Process 2438 (iptables) of user 0 dumped core.
Sep 21 20:35:57 ip-172-31-85-30.ec2.internal microshift[1297]: E0921 20:35:57.914558    1297 remote_runtime.go:143] StopPodSandbox "1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_service-ca-64547678c6-2nxnp_openshift-service-ca_6236deba-fc5f-4915-817d-f8699a4accfc_0(1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475): error removing pod openshift-service-ca_service-ca-64547678c6-2nxnp from CNI network "crio": running [/usr/sbin/iptables -t nat -D POSTROUTING -s 10.42.0.3 -j CNI-d5d0edec163ce01e4591c1c4 -m comment --comment name: "crio" id: "1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475" --wait]: exit status 2: iptables v1.8.4 (nf_tables): Chain 'CNI-d5d0edec163ce01e4591c1c4' does not exist
Sep 21 20:35:57 ip-172-31-85-30.ec2.internal microshift[1297]: Try `iptables -h' or 'iptables --help' for more information.
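For anyone hitting this, the segfault is easy to confirm by grepping the journal, as above. A minimal sketch — shown here against a sample line copied from the log above, since on the affected host the same grep would be fed from `journalctl` instead:

```shell
# Sample journal line copied from the report above; on the affected host,
# replace the printf with `journalctl` piped into the same grep.
sample='Sep 21 19:13:50 ip-172-31-85-30.ec2.internal kernel: iptables[2438]: segfault at 88 ip 00007feaf5dc0e47 sp 00007fff6f2fea08 error 4 in libnftnl.so.11.3.0[7feaf5dbc000+16000]'

# Match an iptables segfault that lands inside libnftnl.
if printf '%s\n' "$sample" | grep -q 'iptables\[[0-9]*\]: segfault .* in libnftnl'; then
  echo "iptables/libnftnl segfault found"
fi
```

If the grep matches, you are likely hitting this same libnftnl crash rather than an unrelated CNI failure.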

Then @rootfs suggested a workaround:

kubectl delete ds -n kube-system kube-flannel-ds

then restart all OpenShift pods.

This was verified in my environment.
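Spelled out, the full workaround sequence looks roughly like the sketch below. The pod-restart step and the `openshift-*` namespace names are my assumptions, not part of the original report — adjust them to whatever `kubectl get ns` shows on your host:

```shell
# Workaround sketch: the commands are collected into a variable and printed,
# not executed, so the sequence is explicit without needing a live cluster.
# Deleting the pods causes MicroShift to recreate (i.e. restart) them.
workaround=$(cat <<'EOF'
kubectl delete ds -n kube-system kube-flannel-ds
kubectl delete pods --all -n openshift-ingress
kubectl delete pods --all -n openshift-service-ca
kubectl delete pods --all -n openshift-dns
EOF
)
printf '%s\n' "$workaround"
```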

Environment:

  • Microshift version (use microshift version): Microshift Version: 4.7.0-0.microshift-2021-08-31-224727
  • Hardware configuration: t2.xlarge
  • OS (e.g: cat /etc/os-release): PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
  • Kernel (e.g. uname -a):
    Linux ip-172-31-41-204.ec2.internal 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

Relevant Logs

The ingress pod logged the following while the restarts happened:

[ec2-user@ip-172-31-41-204 ~]$ kubectl logs -n openshift-ingress router-default-6d8c9d8f57-8bphk
I0921 17:36:17.801664       1 template.go:433] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 9cc0c8fc\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2021-06-11T16:32:09Z\n"
I0921 17:36:17.803371       1 metrics.go:154] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"="0.0.0.0:1936"
I0921 17:36:17.810815       1 router.go:191] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I0921 17:36:17.810872       1 router.go:270] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I0921 17:36:17.811332       1 router.go:332] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I0921 17:36:17.811391       1 router.go:262] router "msg"="router is including routes in all namespaces"
E0921 17:36:17.914638       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0921 17:36:17.948417       1 router.go:579] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0921 17:38:57.445655       1 template.go:690] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
W0921 17:39:02.274166       1 reflector.go:436] github.com/openshift/router/pkg/router/template/service_lookup.go:33: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

Labels

kind/bug — Categorizes issue or PR as related to a bug.
