Description
We had a production outage some time ago, but reporting it here got lost among other tasks.
I pinpointed the issue to a large number of running coredump-composer processes that never finished processing their coredump. This in turn led to the crashed processes being kept alive (they can't be killed until the coredump handling has finished), which in turn made containerd very unhappy about pods not terminating.
During the outage I unfortunately didn't debug any further where the composer was stuck, but I could clearly see the processes running for a long time without doing any active work.
My suggestion would be to extend the coredump composer with an upper bound on its processing time, after which it terminates itself, so that stuck processes don't stay around. In addition, the different parts of the file creation process could be guarded with their own timeouts. I'm imagining crictl being stuck after we have already written out the dump itself; in that case I'd rather have the dump without lots of extra info than not have it at all.
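
A minimal sketch of both ideas in Rust, using only the standard library. Everything here is an assumption for illustration and not the composer's actual code: `arm_watchdog` and `run_with_timeout` are hypothetical helpers, the 60 second and 200 millisecond limits are arbitrary, and `sleep 5` stands in for a hanging crictl call.

```rust
use std::io::{self, Read};
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical global upper bound: a background thread that ends the whole
/// process once `limit` has elapsed, no matter where the main thread is stuck.
fn arm_watchdog(limit: Duration) {
    thread::spawn(move || {
        thread::sleep(limit);
        eprintln!("processing exceeded {:?}, terminating", limit);
        std::process::exit(1);
    });
}

/// Hypothetical per-step guard: runs `cmd` and returns its stdout, or `None`
/// if the command had to be killed because it did not exit within `timeout`.
fn run_with_timeout(mut cmd: Command, timeout: Duration) -> io::Result<Option<Vec<u8>>> {
    let mut child = cmd.stdout(Stdio::piped()).stderr(Stdio::null()).spawn()?;

    // Drain stdout on a separate thread so a chatty child cannot block on a
    // full pipe while we are polling for its exit.
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let reader = thread::spawn(move || {
        let mut buf = Vec::new();
        stdout.read_to_end(&mut buf).map(|_| buf)
    });

    let deadline = Instant::now() + timeout;
    loop {
        if child.try_wait()?.is_some() {
            let bytes = reader.join().expect("reader thread panicked")?;
            return Ok(Some(bytes));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // the child may have exited in the meantime
            child.wait()?; // reap it so no zombie is left behind
            return Ok(None); // the reader thread is detached and ends with the pipe
        }
        thread::sleep(Duration::from_millis(50));
    }
}

fn main() -> io::Result<()> {
    arm_watchdog(Duration::from_secs(60));

    // The dump itself would already be on disk at this point; only the
    // optional metadata collection is guarded here.
    let mut metadata_cmd = Command::new("sleep"); // stand-in for a stuck crictl call
    metadata_cmd.arg("5");

    match run_with_timeout(metadata_cmd, Duration::from_millis(200))? {
        Some(bytes) => println!("collected {} bytes of extra info", bytes.len()),
        None => println!("extra info timed out, keeping the dump without it"),
    }
    Ok(())
}
```

The per-step guard lets the composer degrade gracefully (dump without extra info), while the watchdog is the last resort for anything that is not covered by a step timeout. Neither can help if a thread is stuck in an uninterruptible kernel wait, so this is a mitigation and not a guarantee.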