openshift/os: fix user issues on validate job #28790
openshift-merge-robot merged 1 commit into openshift:master
Conversation
(force-pushed 2b01bcc to 97dcf63)
(force-pushed 97dcf63 to 4ca23e4)
/retest

/test ci/rehearse/openshift/os/c9s/validate
@miabbott: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest

/test pj-rehearse

2 similar comments

/test pj-rehearse

/test pj-rehearse
(force-pushed 52a1452 to df64658)
/test pj-rehearse
(force-pushed df64658 to 50f3327)
/test pj-rehearse
(force-pushed 50f3327 to ddace0f)
/test pj-rehearse
(force-pushed 9454f59 to 844fb2d)
The `validate` job started to fail with:
```
fatal: unsafe repository ('/go/src/github.com/openshift/os' is owned by someone else)
To add an exception for this directory, call:
git config --global --add safe.directory /go/src/github.com/openshift/os
```
Applying this suggestion as part of the `ci/validate.sh` script
(see openshift/os#802) fails with the error:
```
+ git config --global --add safe.directory /go/src/github.com/openshift/os
error: could not lock config file //.gitconfig: Permission denied
```
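The double slash in `//.gitconfig` suggests `$HOME` resolves to empty or `/` inside the pod, so `git` tries to write its global config at the filesystem root. A minimal sketch of a workaround (hypothetical; not the literal contents of any script in this repo) is to point `HOME` at a writable directory before adding the exception:

```shell
# Sketch: give git a writable location for its global config, then add
# the safe.directory exception from the error message above.
export HOME="$(mktemp -d)"
git config --global --add safe.directory /go/src/github.com/openshift/os
# Confirm the exception landed in $HOME/.gitconfig:
git config --global --get safe.directory
```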
I suspect this is related to how the random user ID is configured in
OCP pods, similar to what is described in openshift/os#781, so I tried
using the `ci/set-openshift-user.sh` script as part of the `validate`
job.
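For context, OpenShift runs pods with a random high UID that usually has no `/etc/passwd` entry, which is what scripts like `ci/set-openshift-user.sh` typically compensate for. The following is a hedged sketch of that general pattern only, not the script's actual contents; the temp file stands in for `/etc/passwd` so the sketch runs without root:

```shell
# If the current UID has no passwd entry, synthesize one so tools that
# look up the user (git, ssh, etc.) behave. A temp file stands in for
# /etc/passwd here; a real container entrypoint would append to the
# real file (or use nss_wrapper).
uid="$(id -u)"
passwd_file="$(mktemp)"
if ! getent passwd "$uid" > /dev/null 2>&1; then
  echo "ci-user:x:${uid}:0:CI user:/tmp:/bin/bash" >> "$passwd_file"
fi
export HOME=/tmp
echo "uid=${uid} HOME=${HOME}"
```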
Through trial and error, I found that using the `fcos-buildroot`
container based on F36 would not work with this change and had to
switch to using the `cosa:latest` container. Going further down the
rabbit hole, I found that I didn't need to use the
`ci/set-openshift-user.sh` script at all and just the `cosa:latest`
container was enough to get the `validate` job to pass. I don't claim
to fully understand why this is the case, but it does effectively
unblock the `validate` job.
(force-pushed 844fb2d to c497602)
cc: @cheesesashimi Wanted to make you aware of this change and see if the error I encountered rang any bells.
```diff
      ./ci/validate.sh
    container:
-     from: src
+     from: coreos_coreos-assembler_latest
```
Changing this to `coreos_coreos-assembler_latest` would probably break because `src` points to what's defined under `.images.build_root`, and I'm not sure if the COSA image has a `./ci/validate.sh` script inside it.
Thinking about this and re-reading your description of this change, we may want to change this to `build-test-qemu-img` instead, since that image will be the result of https://github.com/openshift/os/blob/master/ci/Dockerfile, which is the COSA image plus the contents of the openshift/os repo. Doing that will include both the `./ci/validate.sh` and `./ci/set-openshift-user.sh` scripts.
> Changing this to `coreos_coreos-assembler_latest` would probably break because `src` points to what's defined under `.images.build_root` and I'm not sure if the COSA image has a `./ci/validate.sh` script inside it.
I'm confused about this, since the pj-rehearse jobs appear to have passed. If you look at the history for the job on this PR, you can see some failures where, during some iterations, the scripts weren't found due to incorrect paths, etc. From what I can tell, the openshift/os repo is present (magic!) and the `ci/validate.sh` script executes successfully.
See the latest job where I stuck some debug output as part of the job - https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/28790/rehearse-28790-pull-ci-openshift-os-master-validate/1529165858037829632/artifacts/test/build-log.txt
Now I'm a little confused as well! But I think I have an understanding of what's going on here and can explain the source of my confusion (and hopefully make you less confused as well!):
- The Dockerfile for `openshift/os` is in `ci/Dockerfile` and is what becomes `build-test-qemu-img`. We probably should make this the `.build_root` part of the CI config, but that's a separate concern and not relevant right now. The resulting image is what the COSA build scripts use to run. It has the scripts and the layering test binary from `openshift/os` layered on top of the `coreos-assembler` image. This is why I thought we should use that image to run `./ci/validate.sh`.
- The `src` image is `registry.ci.openshift.org/coreos/fcos-buildroot:testing-devel`, produced by the `coreos/fedora-coreos-config` repo. Interestingly, the `validate` step is the only part of the `openshift/os` CI config that uses this image. None of the other image builds or tests directly consume or use this image.
- The "magic" is that before the `validate` test runs, a bunch of init containers are run. Amongst those is a `cloneRefs` step (`$ curl -s 'https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/28790/rehearse-28790-pull-ci-openshift-os-master-validate/1529165858037829632/artifacts/ci-operator-step-graph.json' | jq -r '.[1].manifests[0].spec.initContainers[].name'`) which clones `openshift/os` (and the `fedora-coreos-config` submodule).
So in a nutshell, what's happening in the `validate` step is we're taking `registry.ci.openshift.org/coreos/coreos-assembler:latest`, cloning `openshift/os` (and the `fedora-coreos-config` submodule) into it, and running `./ci/validate.sh`. My confusion came from forgetting about the `cloneRefs` step and thinking that since `coreos-assembler:latest` doesn't have the `./ci/validate.sh` script present, it would fail.
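The `jq` query quoted above can be tried against any step-graph JSON. Here is a self-contained illustration using a made-up miniature of the artifact (the structure is assumed from the query itself, not copied from the real file):

```shell
# Fabricated stand-in for a ci-operator step graph: only the shape that
# the jq path '.[1].manifests[0].spec.initContainers[].name' needs.
cat > /tmp/step-graph.json <<'EOF'
[
  {"manifests": []},
  {"manifests": [{"spec": {"initContainers": [
    {"name": "clonerefs"},
    {"name": "initupload"}
  ]}}]}
]
EOF
# Prints each init container name on its own line.
jq -r '.[1].manifests[0].spec.initContainers[].name' /tmp/step-graph.json
```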
Thanks for the detailed information, Zack! It certainly helps improve my understanding of the flow of the jobs.
> Interestingly, the `validate` step is the only part of the `openshift/os` CI config that uses this image. None of the other image builds or tests directly consume or use this image.
Should we drop the use of the `fcos-buildroot` image as part of this PR? Would we need to reconfigure `build_root` to point to another image?
Is the `cloneRefs` step logged in any of the test artifacts?
If you think the PR is good to go as is, please drop an /approve if you can.
> Should we drop the use of the `fcos-buildroot` image as part of this PR? Would we need to reconfigure `build_root` to point to another image?
We could configure it to point to `ci/Dockerfile` and build that. We'd then need to replace `build-test-qemu-img` with `src`. There's no rush in doing that, so if we want to, let's handle it as a separate PR.
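If that follow-up happens, the ci-operator config change might look roughly like this (a hypothetical sketch; the key names follow ci-operator's `build_root.project_image` convention and should be checked against the actual config):

```yaml
# Hypothetical: build the build root from the repo's own ci/Dockerfile
# instead of pulling the external fcos-buildroot image.
build_root:
  project_image:
    dockerfile_path: ci/Dockerfile
```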
> Is the `cloneRefs` step logged in any of the test artifacts?
Unfortunately, it's not. From what I can tell, the container used purposely does not create logs (although I don't know why). You might be able to catch it while the job is running if you click through to the console for the CI namespace. The only place I was really able to find it was in the job's test-steps artifact, and even then, I had to know that it was an init container.
> If you think the PR is good to go as is, please drop an /approve if you can.
Will do.
(force-pushed bc94f3c to c497602)
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cheesesashimi, miabbott
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@miabbott: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@miabbott: Updated the following 3 configmaps:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.