Possible bug - coredump hangs #110

@timbuchwaldt

We had a production outage some time ago, but reporting it here got lost among other tasks.
I traced the issue to a large number of running coredump-composer processes that never finished processing their coredumps. This in turn led to the crashed processes being kept alive (they can't be killed until coredump handling has finished), which left containerd very unhappy about pods not terminating.

During the outage I unfortunately didn't debug further where the composer was stuck, but I could clearly see the processes running for a long time without doing any active work.

My suggestion would be to extend the coredump composer with an upper bound on processing time, after which it terminates itself so that stuck processes don't linger. It could also guard the individual parts of the file creation process with timeouts. I'm imagining crictl being stuck: at that point the dump itself has already been written out, and I'd rather have it without lots of extra info than not have it at all.
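To make the suggestion a bit more concrete, here is a rough sketch of the idea in Python. It is not the composer's actual code and says nothing about how the composer is implemented. The function names `run_step` and `collect_extras`, the `steps` mapping, and the timeout values are all made up for illustration. Each optional metadata step (for example a crictl call) gets its own timeout, and the whole collection phase shares one overall budget, so a stuck helper costs the extra info but never blocks the handler indefinitely.

```python
import subprocess
import sys
import time


def run_step(cmd, timeout_s):
    """Run one optional metadata command (hypothetical helper).

    Returns the command's stdout, or None if it times out, cannot be
    started, or exits with a non-zero status.
    """
    try:
        # subprocess.run kills the child when the timeout expires.
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return done.stdout if done.returncode == 0 else None


def collect_extras(steps, step_timeout_s, total_budget_s):
    """Collect optional extras without exceeding an overall time budget.

    `steps` maps a label to an argv list. Steps that fail or time out are
    skipped, so the core dump that was already written is still kept.
    """
    deadline = time.monotonic() + total_budget_s
    extras = {}
    for name, cmd in steps.items():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall budget used up: stop collecting extras
        output = run_step(cmd, min(step_timeout_s, remaining))
        if output is not None:
            extras[name] = output
    return extras
```

With something like this, a hanging step would just leave its entry out of the result, and the handler could still finish and exit with the dump in place.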
