Skip to content

pkg/steps: emit detailed build errors in debug log#3385

Merged
openshift-ci[bot] merged 6 commits intoopenshift:masterfrom
bbguimaraes:build_log
Apr 24, 2023
Merged

pkg/steps: emit detailed build errors in debug log#3385
openshift-ci[bot] merged 6 commits intoopenshift:masterfrom
bbguimaraes:build_log

Conversation

@bbguimaraes
Copy link
Copy Markdown
Contributor

It can be difficult to identify when builds are reused based on the main
output:

…
INFO[2023-04-20T13:18:59Z] Building failure
INFO[2023-04-20T13:19:31Z] Build failure failed, printing logs:
…
STEP 3/5: RUN false
error: build error: error building at STEP "RUN false": error while running runtime: exit status 1
INFO[2023-04-20T13:19:32Z] Ran for 44s
ERRO[2023-04-20T13:19:32Z] Some steps failed:
ERRO[2023-04-20T13:19:32Z]
  * could not run steps: step failure failed: error occurred handling build failure: the build failure failed after 32s with reason DockerBuildFailed: Dockerfile build strategy has failed.
…

Other than a few accidental log messages from the build environment, there is
no temporal information in the build logs. This is especially problematic for
permanent failures which will never succeed without the deletion of the build,
either directly or via the deletion of the test namespace.

Identifying this type of scenario requires searching the artifacts for the
generated Build objects, which is laborious at best but can also be
impossible in the infamous case of the reuse of a build scheduled on a node
which has since been removed from the cluster.

To make it easier to identify these cases, more information is now emitted:

  • a message in the main log informs whether builds are created or reused
  • a message in the debug log details failure conditions
INFO[2023-04-20T13:23:57Z] Building failure
INFO[2023-04-20T13:23:57Z] Found existing build "failure"
INFO[2023-04-20T13:23:57Z] Build failure failed, printing logs:
$ jq --raw-output '[.time,.msg]|join(" ")' ci-operator.log
2023-04-20T13:23:50Z Building failure
…
2023-04-20T13:23:57Z Building failure
2023-04-20T13:23:57Z Found existing build "failure"
2023-04-20T13:23:57Z Waiting for build to be complete.
2023-04-20T13:23:57Z Build failure failed, printing logs:
2023-04-20T13:23:58Z Build "failure" (created at 2023-04-20 13:20:45 +0000 UTC) classified as legitimate failure, will not be retried
…

For nonexistent pods:

$ ci-operator …
…
INFO[2023-04-20T13:26:54Z] Building failure
INFO[2023-04-20T13:26:55Z] Found existing build "failure"
INFO[2023-04-20T13:26:55Z] Build failure failed, printing logs:
WARN[2023-04-20T13:26:55Z] Unable to retrieve logs from failed build     error=pod "failure-build" not found
…
$ jq … ci-operator.log
2022-08-15T10:21:58Z unset version 0
…
2023-04-20T13:26:54Z Building failure
2023-04-20T13:26:55Z Found existing build "failure"
2023-04-20T13:26:55Z Waiting for build to be complete.
2023-04-20T13:26:55Z Build failure failed, printing logs:
2023-04-20T13:26:55Z Unable to retrieve logs from failed build
2023-04-20T13:26:55Z Build "failure" (created at 2023-04-20 13:20:45 +0000 UTC) classified as legitimate failure, will not be retried
…

https://issues.redhat.com/browse/DPTP-2836

@openshift-ci openshift-ci Bot requested review from droslean and smg247 April 20, 2023 14:00
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2023
This parameter is only ever set in tests.
It can be difficult to identify when builds are reused based on the main
output:

```
…
INFO[2023-04-20T13:18:59Z] Building failure
INFO[2023-04-20T13:19:31Z] Build failure failed, printing logs:
…
STEP 3/5: RUN false
error: build error: error building at STEP "RUN false": error while running runtime: exit status 1
INFO[2023-04-20T13:19:32Z] Ran for 44s
ERRO[2023-04-20T13:19:32Z] Some steps failed:
ERRO[2023-04-20T13:19:32Z]
  * could not run steps: step failure failed: error occurred handling build failure: the build failure failed after 32s with reason DockerBuildFailed: Dockerfile build strategy has failed.
…
```

Other than a few accidental log messages from the build environment, there is
no temporal information in the build logs.  This is especially problematic for
permanent failures which will never succeed without the deletion of the build,
either directly or via the deletion of the test namespace.

Identifying this type of scenario requires searching the artifacts for the
generated `Build` objects, which is laborious at best but can also be
impossible in the infamous case of the reuse of a build scheduled on a node
which has since been removed from the cluster.

To make it easier to identify these cases, more information is now emitted:

- a message in the main log informs whether builds are created or reused
- a message in the debug log details failure conditions

```
INFO[2023-04-20T13:23:57Z] Building failure
INFO[2023-04-20T13:23:57Z] Found existing build "failure"
INFO[2023-04-20T13:23:57Z] Build failure failed, printing logs:
```

```
$ jq --raw-output '[.time,.msg]|join(" ")' ci-operator.log
2023-04-20T13:23:50Z Building failure
…
2023-04-20T13:23:57Z Building failure
2023-04-20T13:23:57Z Found existing build "failure"
2023-04-20T13:23:57Z Waiting for build to be complete.
2023-04-20T13:23:57Z Build failure failed, printing logs:
2023-04-20T13:23:58Z Build "failure" (created at 2023-04-20 13:20:45 +0000 UTC) classified as legitimate failure, will not be retried
…
```

For nonexistent pods:

```
$ ci-operator …
…
INFO[2023-04-20T13:26:54Z] Building failure
INFO[2023-04-20T13:26:55Z] Found existing build "failure"
INFO[2023-04-20T13:26:55Z] Build failure failed, printing logs:
WARN[2023-04-20T13:26:55Z] Unable to retrieve logs from failed build     error=pod "failure-build" not found
…
```

```
$ jq … ci-operator.log
2022-08-15T10:21:58Z unset version 0
…
2023-04-20T13:26:54Z Building failure
2023-04-20T13:26:55Z Found existing build "failure"
2023-04-20T13:26:55Z Waiting for build to be complete.
2023-04-20T13:26:55Z Build failure failed, printing logs:
2023-04-20T13:26:55Z Unable to retrieve logs from failed build
2023-04-20T13:26:55Z Build "failure" (created at 2023-04-20 13:20:45 +0000 UTC) classified as legitimate failure, will not be retried
…
```
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 24, 2023
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 24, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbguimaraes, droslean

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [bbguimaraes,droslean]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

/retest-required

Remaining retests: 0 against base HEAD 11d22d5 and 2 for PR HEAD 411ddc2 in total

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 24, 2023

@bbguimaraes: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci Bot merged commit 52d273e into openshift:master Apr 24, 2023
@bbguimaraes bbguimaraes deleted the build_log branch April 24, 2023 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants