COS-2781: CephCapDropPanic on 4.12.54+, 4.13.36+, 4.14.14+, 4.15.0+ #5204
|
@petr-muller: This pull request references COS-2781 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set. |
|
I only tested the PromQL on clusters without Ceph; I don't have one using Ceph at hand. |
| # 4.12 before 4.12.54 and 4.11 | ||
| from: 4[.](11[.].*|12[.]([1-4]?[0-9]|5[0123]))[+].* | ||
| url: https://issues.redhat.com/browse/COS-2705 | ||
| name: RHELKernelCephFSCapDropPanic |
nit: our only Ceph declaration so far is:
$ git grep -oh 'name: .*Ceph.*' | sort | uniq -c
13 name: CephParallelFsync
We didn't call out RHELKernel or FS then. If we follow that pattern here, it would be CephCapDropPanic. But 🤷, no need to be particularly consistent vs. old risk names, if you prefer the additional length/context.
I followed RHELKernelHighLoadIOWait, but I kinda like shorter names, and maybe we don't want to give RHEL bad publicity.
| from: 4[.](11[.].*|12[.]([1-4]?[0-9]|5[0123]))[+].* | ||
| url: https://issues.redhat.com/browse/COS-2705 | ||
| name: RHELKernelCephFSCapDropPanic | ||
| message: "Nodes in clusters running workloads that mount Ceph volumes may experience kernel panics due to a CephFS client bug" |
nit: no need for the quotes, although they're fine if you want to keep them. You're also missing a trailing period, which we usually include in our messages.
$ for X in blocked-edges/*.yaml; do yaml2json < "${X}" | jq '(.message // "-")[-1:]'; done | sort | uniq -c | sort -n
1 "z"
2 "O"
4 "e"
12 "n"
117 "-"
652 "."
That's what I get by copying RHELKernelHighLoadIOWait files
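The trailing-character tally from the shell pipeline above can also be sketched in Python. This is a minimal illustration, not part of the repository tooling: the function name `last_char_counts` is hypothetical, and the sample list contains the message from this PR plus one made-up message and one missing message.

```python
from collections import Counter

def last_char_counts(messages):
    """Count the final character of each message.

    A missing message (None) is counted as "-", mirroring the
    jq expression `(.message // "-")[-1:]` used in the review.
    """
    return Counter((m if m is not None else "-")[-1:] for m in messages)

# Sample input: the PR's message (no trailing period), a hypothetical
# message that follows the convention, and a risk without a message.
samples = [
    "Nodes in clusters running workloads that mount Ceph volumes may "
    "experience kernel panics due to a CephFS client bug",
    "A hypothetical message that ends with a period.",
    None,
]

for char, count in sorted(last_char_counts(samples).items()):
    print(count, repr(char))
```

Any key other than "." or "-" in the result points at a message that lacks the usual trailing period.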
| topk(1, | ||
| label_replace(group(ceph_health_status), "ceph", "yes", "", "") | ||
| or | ||
| label_replace(0 * group(cluster_version), "ceph", "no", "", "") |
nit: can you add #3591's _id="" for HyperShift-compatibility to cluster_version and ceph_health_status?
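To make the intent of the query fragment easier to follow, here is a small Python model of how `topk(1, A or B)` resolves. It is only a sketch of the fragment quoted in this review (which is truncated and may differ from the final query); the function name `ceph_risk_sample` and its integer arguments are hypothetical stand-ins for the number of series each metric returns.

```python
def ceph_risk_sample(ceph_series, cluster_version_series):
    """Model the result of topk(1, A or B) for the quoted fragment.

    A = label_replace(group(ceph_health_status), "ceph", "yes", "", "")
    B = label_replace(0 * group(cluster_version), "ceph", "no", "", "")
    Returns (labels, value), or None when neither metric has series.
    """
    candidates = []
    if ceph_series > 0:
        # group() collapses all series into one sample with value 1.
        candidates.append(({"ceph": "yes"}, 1))
    if cluster_version_series > 0:
        # 0 * group(...) yields value 0; the label set differs from A,
        # so `or` keeps this sample alongside A's.
        candidates.append(({"ceph": "no"}, 0))
    if not candidates:
        return None
    # topk(1, ...) keeps the sample with the highest value.
    return max(candidates, key=lambda sample: sample[1])

print(ceph_risk_sample(3, 1))  # cluster exposing Ceph metrics
print(ceph_risk_sample(0, 1))  # cluster without Ceph metrics
print(ceph_risk_sample(0, 0))  # neither metric available
```

In this model a cluster with Ceph metrics evaluates to 1 (exposed), a cluster without them falls back to 0, and a cluster missing both metrics produces no sample at all.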
Force-pushed from 694d996 to b81419d.
Title changed from "RHELKernelCephFSCapDropPanic on 4.12.54+, 4.13.36+, 4.14.14+, 4.15.0+" to "CephCapDropPanic on 4.12.54+, 4.13.36+, 4.14.14+, 4.15.0+".
Force-pushed from b81419d to d7d8cb6.
Force-pushed from d7d8cb6 to ebef68f.
|
/retest-required |
This release never shipped / was tombstoned
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: petr-muller, sdodson, wking |
|
I think we'll need to drop 4.15.4 too: cincinnati-graph-data/channels/candidate-4.15.yaml, lines 58 to 63 in 261733a. |
|
New changes are detected. LGTM label has been removed. |
|
@petr-muller: all tests passed! |
Clusters that use CephFS volumes are at risk until a fixed kernel is provided. Set up risks for edges between unaffected and affected versions. The regexes are a bit tricky; here are the regex101 links I used for testing:
4.12: https://regex101.com/r/wqaLUv/1
4.13: https://regex101.com/r/LpY9h1/1
4.14: https://regex101.com/r/jIfQ2g/1
4.15: https://regex101.com/r/XfSOPa/1
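As a local complement to the regex101 links, the 4.12 `from` regex quoted in the review can be checked with a few lines of Python. This is a minimal sketch under two assumptions: that the regex has to match the whole version string (hence `re.fullmatch`), and that versions carry an architecture suffix such as `+amd64`. The helper name `matches_from` and the sample versions are illustrative only.

```python
import re

# `from` regex for "4.12 before 4.12.54 and 4.11", copied from the diff.
FROM_4_12 = r"4[.](11[.].*|12[.]([1-4]?[0-9]|5[0123]))[+].*"

def matches_from(version):
    """Return True if the full version string matches the `from` regex.

    Assumption: the complete string, including the "+arch" suffix,
    must match, so fullmatch is used rather than search.
    """
    return re.fullmatch(FROM_4_12, version) is not None

# Illustrative version strings; the "+amd64" suffix is an assumption.
samples = [
    "4.11.58+amd64",
    "4.12.0+amd64",
    "4.12.49+amd64",
    "4.12.53+amd64",
    "4.12.54+amd64",
    "4.12.100+amd64",
    "4.13.0+amd64",
]

for version in samples:
    print(version, matches_from(version))
```

The boundary cases are the interesting ones: 4.12.53 should match as the last source version before the fix boundary, while 4.12.54 and three-digit patch levels should not.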
I have created a file for the lowest of each minor version and copied the rest like this (fish):