
COS-2781: CephCapDropPanic on 4.12.54+, 4.13.36+, 4.14.14+, 4.15.0+#5204

Merged
openshift-merge-bot[bot] merged 3 commits into openshift:master from petr-muller:COS-2781-CephFSCapDropPanic
May 7, 2024

Conversation

@petr-muller
Member

Clusters that use CephFS volumes are at risk until a fixed kernel is provided. This PR sets up conditional risks for update edges from unaffected to affected versions. The regexes are a bit tricky, so here are the regex101 links I used for testing:

4.12: https://regex101.com/r/wqaLUv/1
4.13: https://regex101.com/r/LpY9h1/1
4.14: https://regex101.com/r/jIfQ2g/1
4.15: https://regex101.com/r/XfSOPa/1
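
As a local cross-check alongside the regex101 links, the 4.12 `from` pattern from this PR can be exercised with a short Python sketch. It assumes the pattern has to match the whole version string (hence `re.fullmatch`). The helper name `is_unaffected_source` and the `+amd64` suffix on the sample versions are illustrative and not taken from the PR.

```python
import re

# `from` pattern for the 4.12 risk, as quoted in this PR's diff.
FROM_4_12 = re.compile(r"4[.](11[.].*|12[.]([1-4]?[0-9]|5[0123]))[+].*")


def is_unaffected_source(version: str) -> bool:
    """Return True when the whole version string matches the `from` pattern."""
    return FROM_4_12.fullmatch(version) is not None


# Sample versions; the "+amd64" build suffix is an assumption for illustration.
samples = [
    "4.11.59+amd64",
    "4.12.9+amd64",
    "4.12.53+amd64",
    "4.12.54+amd64",
    "4.12.100+amd64",
    "4.13.0+amd64",
]
for version in samples:
    print(version, is_unaffected_source(version))
```

Under that full-match assumption, 4.11.z and 4.12.0 through 4.12.53 are accepted as sources, while 4.12.54 and later are rejected.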

I created a file for the lowest affected patch release of each minor version and generated the rest by copying it, like this (fish):

for r4 in (seq 15 24)
    cp blocked-edges/4.14.14-RHELKernelCephFSCapDropPanic.yaml blocked-edges/4.14.$r4-RHELKernelCephFSCapDropPanic.yaml
    sed -i "s|to: 4\.14\.14|to: 4\.14\.$r4|" blocked-edges/4.14.$r4-RHELKernelCephFSCapDropPanic.yaml
end
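
For readers who do not use fish, the same copy-and-retarget step can be sketched in Python. This is a rough equivalent under stated assumptions, not the script used for the PR: the helper name `clone_risk_files` is hypothetical, and the demo runs in a throwaway directory with placeholder file content instead of the real `blocked-edges/` files.

```python
import tempfile
from pathlib import Path

RISK = "RHELKernelCephFSCapDropPanic"  # risk name used in the file names above


def clone_risk_files(directory: Path, base: str, patches) -> list:
    """Copy <base>-<RISK>.yaml for each patch level, retargeting its `to:` line."""
    minor = base.rsplit(".", 1)[0]  # "4.14.14" -> "4.14"
    template = (directory / f"{base}-{RISK}.yaml").read_text()
    created = []
    for patch in patches:
        version = f"{minor}.{patch}"
        target = directory / f"{version}-{RISK}.yaml"
        # Same substitution as the sed call: point `to:` at the new version.
        target.write_text(template.replace(f"to: {base}", f"to: {version}"))
        created.append(target)
    return created


# Demo in a temporary directory; the template content is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    edges = Path(tmp)
    (edges / f"4.14.14-{RISK}.yaml").write_text("to: 4.14.14\nname: placeholder\n")
    created = clone_risk_files(edges, "4.14.14", range(15, 25))  # like `seq 15 24`
    demo_names = [path.name for path in created]
    demo_first = created[0].read_text()

print(len(demo_names), demo_names[0])
```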

openshift-ci-robot added the jira/valid-reference label on May 7, 2024
@openshift-ci-robot

openshift-ci-robot commented May 7, 2024

@petr-muller: This pull request references COS-2781 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.


@petr-muller
Member Author

I only tested the PromQL on clusters without Ceph; I don't have one using Ceph at hand.

openshift-ci Bot requested review from LalatenduMohanty and wking on May 7, 2024 16:56
openshift-ci Bot added the approved label on May 7, 2024
# 4.12 before 4.12.54 and 4.11
from: 4[.](11[.].*|12[.]([1-4]?[0-9]|5[0123]))[+].*
url: https://issues.redhat.com/browse/COS-2705
name: RHELKernelCephFSCapDropPanic
Member


nit: our only Ceph declaration so far is:

$ git grep -oh 'name: .*Ceph.*' | sort | uniq -c
     13 name: CephParallelFsync

We didn't call out RHELKernel or FS then. If we follow that pattern here, it would be CephCapDropPanic. But 🤷, no need to be particularly consistent vs. old risk names, if you prefer the additional length/context.

Member Author


I followed RHELKernelHighLoadIOWait, but I kinda like shorter names, and maybe we don't want to give RHEL bad publicity.

from: 4[.](11[.].*|12[.]([1-4]?[0-9]|5[0123]))[+].*
url: https://issues.redhat.com/browse/COS-2705
name: RHELKernelCephFSCapDropPanic
message: "Nodes in clusters running workloads that mount Ceph volumes may experience kernel panics due to a CephFS client bug"
Member


nit: no need for the quotes, although they're fine if you want to keep them. You're also missing a trailing period, which we usually include in our messages.

$ for X in blocked-edges/*.yaml; do yaml2json < "${X}" | jq '(.message // "-")[-1:]'; done | sort | uniq -c | sort -n
      1 "z"
      2 "O"
      4 "e"
     12 "n"
    117 "-"
    652 "."
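
The tally above can be approximated in Python once the risk files are parsed into dictionaries. The sketch below assumes that parsing has already happened and works on plain dicts; the helper name `last_char_tally` is hypothetical. The first sample message is the one from this PR's diff, and the second is made up for illustration.

```python
from collections import Counter


def last_char_tally(risks):
    """Count the final character of each risk's message, "-" when it is absent."""
    tally = Counter()
    for risk in risks:
        message = risk.get("message")
        if message is None:
            message = "-"  # mirrors the `// "-"` fallback in the jq filter
        tally[message[-1:]] += 1
    return tally


sample_risks = [
    # Message from this PR's diff, which lacks the trailing period.
    {"message": "Nodes in clusters running workloads that mount Ceph volumes "
                "may experience kernel panics due to a CephFS client bug"},
    # Hypothetical message that follows the trailing-period convention.
    {"message": "Example risk message that ends with a period."},
    # Hypothetical entry without a message at all.
    {"name": "NoMessageExample"},
]
tally = last_char_tally(sample_risks)
print(tally)
```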

Member Author


That's what I get for copying the RHELKernelHighLoadIOWait files.

topk(1,
label_replace(group(ceph_health_status), "ceph", "yes", "", "")
or
label_replace(0 * group(cluster_version), "ceph", "no", "", "")
Member


nit: can you add #3591's _id="" for HyperShift-compatibility to cluster_version and ceph_health_status?
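
One way to picture the requested change is to assemble the query with an `_id=""` matcher on both metrics. The Python sketch below does only that string assembly. The helper names are hypothetical, the closing parenthesis is assumed because the quoted diff hunk ends before it, and the exact form that was merged is not shown in this conversation.

```python
def hypershift_selector(metric: str) -> str:
    """Append the `_id=""` matcher suggested in the review to a metric name."""
    return f'{metric}{{_id=""}}'


def ceph_presence_query() -> str:
    """Assemble the query from the diff with the matcher applied to both metrics."""
    ceph = hypershift_selector("ceph_health_status")
    cluster = hypershift_selector("cluster_version")
    return (
        "topk(1,\n"
        f'  label_replace(group({ceph}), "ceph", "yes", "", "")\n'
        "  or\n"
        f'  label_replace(0 * group({cluster}), "ceph", "no", "", "")\n'
        ")"  # closing parenthesis assumed; the quoted diff hunk ends before it
    )


query = ceph_presence_query()
print(query)
```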

petr-muller force-pushed the COS-2781-CephFSCapDropPanic branch 2 times, most recently from 694d996 to b81419d, on May 7, 2024 17:29
petr-muller changed the title from "COS-2781: RHELKernelCephFSCapDropPanic on 4.12.54+, 4.13.36+, 4.14.14+, 4.15.0+" to "COS-2781: CephCapDropPanic on 4.12.54+, 4.13.36+, 4.14.14+, 4.15.0+" on May 7, 2024
petr-muller force-pushed the COS-2781-CephFSCapDropPanic branch from b81419d to d7d8cb6 on May 7, 2024 17:41
petr-muller force-pushed the COS-2781-CephFSCapDropPanic branch from d7d8cb6 to ebef68f on May 7, 2024 17:44
Member

@wking wking left a comment


/lgtm

openshift-ci Bot added the lgtm label on May 7, 2024
@sdodson
Member

sdodson commented May 7, 2024

/retest-required

This release never shipped / was tombstoned
openshift-ci Bot removed the lgtm label on May 7, 2024
@sdodson
Member

sdodson commented May 7, 2024

/lgtm

openshift-ci Bot added the lgtm label on May 7, 2024
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, sdodson, wking


@petr-muller
Member Author

I think we'll need to drop 4.15.4 too

- 4.15.2
- 4.15.3
- 4.15.5
- 4.15.6
- 4.15.7
- 4.15.8

openshift-ci Bot removed the lgtm label on May 7, 2024
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2024

New changes are detected. LGTM label has been removed.

petr-muller added the lgtm label on May 7, 2024
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2024

@petr-muller: all tests passed!


openshift-merge-bot Bot merged commit 3c45d0f into openshift:master on May 7, 2024
petr-muller deleted the COS-2781-CephFSCapDropPanic branch on May 7, 2024 19:06

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
