OCPBUGS-56876: gather: collect logs & analyze node-image-pull#9761
patrickdillon wants to merge 2 commits into openshift:main
As part of the overlay node image, a new service was introduced in 60c63bb to pull the node image. This commit updates the installer's gather and analyze commands to collect and analyze that service's logs.
@patrickdillon: This pull request references Jira Issue OCPBUGS-56876, which is invalid:
/jira refresh
@patrickdillon: This pull request references Jira Issue OCPBUGS-56876, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
tthvo left a comment
Nice, with this change I can now see the journal logs for the node-image-pull service under bootstrap/journals/node-image-pull.log 😄
However, the installer could not analyze the bundle (i.e. openshift-install analyze) for such image-pull errors. I believe the service record for node-image-pull is missing:
$ ls -la <log-bundle-dir>/bootstrap/services/
total 4
drwxr-xr-x. 1 thvo thvo 40 May 29 16:35 .
drwxr-xr-x. 1 thvo thvo 94 May 29 16:35 ..
Looking at the template for the node-image-pull script, it appears to be missing the crucial ". /usr/local/bin/bootstrap-service-record.sh" line that records the service (see here).
Adding ". /usr/local/bin/bootstrap-service-record.sh" at the top of the template file seems to record the service phases, and the installer can then analyze the failed service.
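To illustrate why the missing record matters, here is a minimal Go sketch. It assumes the record script leaves a JSON array of phase entries per service under bootstrap/services/ and that the analyzer has nothing to work with when that file is absent. The recordEntry fields and the summarizeRecord helper are hypothetical names for illustration, not the installer's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// recordEntry is a hypothetical shape for one entry of a service record.
type recordEntry struct {
	Phase        string `json:"phase"`
	Result       string `json:"result,omitempty"`
	ErrorMessage string `json:"errorMessage,omitempty"`
}

// summarizeRecord reports what an analyzer could conclude from the raw
// record bytes of one service. Empty input models the empty
// bootstrap/services/ directory shown above.
func summarizeRecord(service string, raw []byte) (string, error) {
	if len(raw) == 0 {
		return service + ": no service record, nothing to analyze", nil
	}
	var entries []recordEntry
	if err := json.Unmarshal(raw, &entries); err != nil {
		return "", fmt.Errorf("parsing record for %s: %w", service, err)
	}
	if len(entries) == 0 {
		return service + ": no service record, nothing to analyze", nil
	}
	// Only the most recent entry is inspected in this sketch.
	last := entries[len(entries)-1]
	if last.Result == "failure" {
		return fmt.Sprintf("%s: failed: %s", service, last.ErrorMessage), nil
	}
	return fmt.Sprintf("%s: last recorded phase %q", service, last.Phase), nil
}
```

Calling summarizeRecord("node-image-pull", nil) returns the "no service record" message, which is roughly the situation before the record script is sourced.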
	check    func(analysis) bool
	optional bool
}{
	{name: "node-image-pull", check: checkReleaseImageDownload, optional: false},
I was thinking about adding unit test cases for node-image-pull in:
But the release-image and node-image-pull services are handled the same way. Let's just rename the existing test cases to node-image-pull instead, to avoid duplicates and reflect the service that is "actually being used"?
Update analyze command to check for the failed node-image-pull service, so that users are presented with a helpful error message if they have a bad pull secret.
	{name: "release-image", check: checkReleaseImageDownload, optional: false},
	{name: "node-image-pull", check: checkNodeImagePull, optional: false},
	{name: "bootkube", check: checkBootkubeService, optional: false},
Suggested change
-	{name: "release-image", check: checkReleaseImageDownload, optional: false},
-	{name: "node-image-pull", check: checkNodeImagePull, optional: false},
-	{name: "bootkube", check: checkBootkubeService, optional: false},
+	{name: "node-image-pull", check: checkNodeImagePull, optional: false},
+	{name: "release-image", check: checkReleaseImageDownload, optional: false},
+	{name: "bootkube", check: checkBootkubeService, optional: false},
I think the order matters, right? See #4751 (comment).
IIUC, node-image-pull starts before the other two 🤔 since release-image never seemed to start while node-image-pull was throwing errors... Though I have no idea how that works, because the service unit files don't define such dependencies 😞
$ cat log-bundle-20251027132247/bootstrap/journals/node-image-pull.log
...output-omitted...
Oct 27 19:48:33 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 19:48:43 ip-10-0-160-222 ostree-containe[2243]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bfa8d59154df08085bb75510454b98aa0fda51e
Oct 27 19:48:44 ip-10-0-160-222 node-image-pull.sh[2243]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest sha256:c7ba2a9638c369c24f9d564f9bfa8d59154df08085bb75510454b98aa0fda51e in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
...output-omitted...
$ cat log-bundle-20251027132247/bootstrap/journals/release-image.log
-- No entries --
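To make the ordering concern concrete, here is a minimal Go sketch. It assumes the analyzer walks the check list in order and reports only the first service that did not start or that failed, which matches the behavior described in this thread but is not taken from the installer source. The serviceState and orderedCheck types and the firstProblem helper are hypothetical.

```go
package main

// serviceState is a hypothetical, simplified view of what was gathered
// for one service.
type serviceState struct {
	started bool
	failed  bool
}

// orderedCheck is a hypothetical stand-in for one entry of the check list.
type orderedCheck struct {
	name     string
	optional bool
}

// firstProblem walks the checks in order and returns a message for the
// first service that did not start or that failed. It returns "" when
// every check passes.
func firstProblem(checks []orderedCheck, states map[string]serviceState) string {
	for _, c := range checks {
		st, ok := states[c.name]
		switch {
		case !ok || !st.started:
			if c.optional {
				continue // optional services may legitimately be absent
			}
			return c.name + ".service did not execute"
		case st.failed:
			return c.name + ".service failed"
		}
	}
	return ""
}
```

With only node-image-pull started and failing, listing release-image first hides the real failure behind a "did not execute" message, while listing node-image-pull first surfaces it.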
With the current change, we will only ever see the output below, which is not what we want, right?
$ openshift-install analyze --file=log-bundle-20251027132247.tar.gz
ERROR The bootstrap machine did not execute the release-image.service systemd unit
If I change the order as in the comment above, we now see:
$ openshift-install analyze --file=log-bundle-20251027132247.tar.gz
ERROR Node image pull failed on the bootstrap machine
INFO
I noticed the empty INFO line, which is supposed to print the last 3 lines of the service logs but does not here. It seems the node-image-pull service loops on the bootstrap machine and never exits, so its error is never captured.
$ systemctl status node-image-pull
● node-image-pull.service - Node Image Pull
Loaded: loaded (/etc/systemd/system/node-image-pull.service; static)
Active: activating (start) since Mon 2025-10-27 19:47:56 UTC; 1h 9min ago
Process: 1943 ExecStartPre=chcon --reference=/usr/bin/ostree /usr/local/bin/node-image-pull.sh (code=exited, status=0/SUCCESS)
Main PID: 1949 (node-image-pull)
Tasks: 2 (limit: 99952)
Memory: 608.0M
CPU: 1min 10.703s
CGroup: /system.slice/node-image-pull.service
├─1949 /bin/bash /usr/local/bin/node-image-pull.sh
└─7897 sleep 10
Oct 27 20:57:05 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 20:57:15 ip-10-0-160-222 ostree-containe[7814]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bf>
Oct 27 20:57:15 ip-10-0-160-222 node-image-pull.sh[7814]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest s>
Oct 27 20:57:15 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 20:57:25 ip-10-0-160-222 ostree-containe[7826]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bf>
Oct 27 20:57:26 ip-10-0-160-222 node-image-pull.sh[7826]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest s>
I think we can also improve the UX a bit by checking whether the error message is present. If it is not, we can direct the user to the log file. That seems like the simplest way. WDYT, @patrickdillon?
installer/pkg/gather/service/analyze.go
Lines 188 to 192 in d7dc751
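A rough Go sketch of that fallback, for discussion only. It assumes the analyzer has a (possibly empty) tail of captured service error output and prints it at INFO level after the ERROR headline. The failureLines helper and the wording of the fallback message are hypothetical. The journal path follows the bootstrap/journals/node-image-pull.log location seen in the log bundle.

```go
package main

import (
	"fmt"
	"strings"
)

// failureLines builds the messages to show for a failed service.
// logTail holds the captured tail of the service's error output, which
// may be empty or blank when the service is still retrying.
func failureLines(service, headline string, logTail []string) []string {
	lines := []string{"ERROR " + headline}
	captured := 0
	for _, l := range logTail {
		if strings.TrimSpace(l) == "" {
			continue // skip blank lines so no empty INFO line is printed
		}
		lines = append(lines, "INFO "+l)
		captured++
	}
	if captured == 0 {
		// Nothing useful was captured: point the user at the journal instead.
		lines = append(lines, fmt.Sprintf(
			"INFO No error output was captured; see bootstrap/journals/%s.log in the log bundle",
			service))
	}
	return lines
}
```

For the looping node-image-pull case above, where no error is captured, this would replace the empty INFO line with a pointer to the journal file.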
/jira refresh
@gpei: This pull request references Jira Issue OCPBUGS-56876, which is invalid:
/jira refresh
@gpei: This pull request references Jira Issue OCPBUGS-56876, which is valid. 3 validation(s) were run on this bug
As part of the overlay node image, a new service was introduced to pull the node image in 60c63bb. This commit updates the installer gather and analyze to collect these logs and analyze them.
Still testing this...