Skip to content

OTA-1159: [3/3] Propagate reconciliation errors via ReconciliationIssues condition#1033

Closed
petr-muller wants to merge 2 commits intoopenshift:masterfrom
petr-muller:ota-1159-result-of-work-testing
Closed

OTA-1159: [3/3] Propagate reconciliation errors via ReconciliationIssues condition#1033
petr-muller wants to merge 2 commits intoopenshift:masterfrom
petr-muller:ota-1159-result-of-work-testing

Conversation

@petr-muller
Copy link
Copy Markdown
Member

@petr-muller petr-muller commented Feb 2, 2024

SyncWorkerStatus: Add Failures field with all failures

Previously, the errors that happened during reconciliation were flattened by summarizeTaskGraphErrors method call in consistentReporter.Errors() before they were sent to status reconciliation method syncStatus, which propagates this flattened error to the ClusterVersion status field.

Rename the Failure field to FailureSummary, it still caries the flattened error. Add a new Failures field to carry the full slice of the original errors.

The error processing is overall a bit weird because the errors get flattened first at the resource reconciling loop side, and then the status reporting loop side uses the flattened error to somehow reconstruct what actually happened, but cleaning up that code would be a much larger effort, so this commit simply makes original errors to be passed to status reporting loop. It is not easy to do all error processing at a single side; it would make sense to do that on status reporting level, but the initial flattening uses some task graph / resource reconciliation data that are not available on the status reporting side.


status sync: propagate reconciliation errors via RRI condition

Fulfilling OTA-1159, this change adds a JSON-serialized form of all reconciliation errors encountered during a single reconciliation loop run to the Message field of the ReconciliationIssues-type condition, present only when ReconciliationIssuesCondition feature gate is enabled (=in tech preview).

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 2, 2024
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Feb 2, 2024

@petr-muller: This pull request references OTA-1159 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

  • cvo: read enabled feature flags from cluster
  • cvo: set ResourceReconciliationIssues condition
  • SyncWorkerStatus: Add Failures field with all failures
  • status sync: propagate reconciliation errors via RRI condition

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 2, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2024
@petr-muller petr-muller force-pushed the ota-1159-result-of-work-testing branch from 242ab7a to cd475fc Compare February 2, 2024 10:05
@petr-muller
Copy link
Copy Markdown
Member Author

/test all

@petr-muller petr-muller changed the title WIP: OTA-1159: Propagate reconciliation errors via RRI condition WIP: OTA-1159: [3/3] Propagate reconciliation errors via RRI condition Feb 2, 2024
@petr-muller petr-muller force-pushed the ota-1159-result-of-work-testing branch from cd475fc to f968631 Compare March 25, 2024 16:52
@petr-muller
Copy link
Copy Markdown
Member Author

/test all

@petr-muller petr-muller force-pushed the ota-1159-result-of-work-testing branch 3 times, most recently from 862f473 to f9f59ce Compare March 28, 2024 16:29
@petr-muller petr-muller changed the title WIP: OTA-1159: [3/3] Propagate reconciliation errors via RRI condition OTA-1159: [3/3] Propagate reconciliation errors via RRI condition Mar 28, 2024
@petr-muller petr-muller marked this pull request as ready for review March 28, 2024 16:31
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2024
@petr-muller
Copy link
Copy Markdown
Member Author

 --- FAIL: TestRunGraph (1.04s)
    --- FAIL: TestRunGraph/mid-task_cancellation_with_work_in_queue_does_not_deadlock

Known flake
/test unit

@petr-muller petr-muller changed the title OTA-1159: [3/3] Propagate reconciliation errors via RRI condition OTA-1159: [3/3] Propagate reconciliation errors via ReconciliationIssues condition Mar 28, 2024
Previously, the errors that happened during reconciliation were flattened by `summarizeTaskGraphErrors` method call in `consistentReporter.Errors()` before they were sent to status reconciliation method `syncStatus`, which propagates this flattened error to the ClusterVersion status field.

Rename the `Failure` field to `FailureSummary`, it still caries the flattened error. Add a new `Failures` field to carry the full slice of the original errors.

The error processing is overall a bit weird because the errors get flattened first at the resource reconciling loop side, and then the status reporting loop side uses the flattened error to somehow reconstruct what actually happened, but cleaning up that code would be a much larger effort, so this commit simply makes original errors to be passed to status reporting loop. It is not easy to do all error processing at a single side; it would make sense to do that on status reporting level, but the initial flattening uses some task graph / resource reconciliation data that are not available on the status reporting side.
Fulfilling OTA-1159, this change adds a JSON-serialized form of all reconciliation errors encountered during a single reconciliation loop run to the `Message` field of the `ReconciliationIssues`-type condition, present only when `ReconciliationIssuesCondition` feature gate is enabled (=in tech preview).
@petr-muller petr-muller force-pushed the ota-1159-result-of-work-testing branch from f9f59ce to 7e000f0 Compare March 28, 2024 16:55
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 28, 2024

@petr-muller: This pull request references OTA-1159 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

SyncWorkerStatus: Add Failures field with all failures

Previously, the errors that happened during reconciliation were flattened by summarizeTaskGraphErrors method call in consistentReporter.Errors() before they were sent to status reconciliation method syncStatus, which propagates this flattened error to the ClusterVersion status field.

Rename the Failure field to FailureSummary, it still caries the flattened error. Add a new Failures field to carry the full slice of the original errors.

The error processing is overall a bit weird because the errors get flattened first at the resource reconciling loop side, and then the status reporting loop side uses the flattened error to somehow reconstruct what actually happened, but cleaning up that code would be a much larger effort, so this commit simply makes original errors to be passed to status reporting loop. It is not easy to do all error processing at a single side; it would make sense to do that on status reporting level, but the initial flattening uses some task graph / resource reconciliation data that are not available on the status reporting side.


status sync: propagate reconciliation errors via RRI condition

Fulfilling OTA-1159, this change adds a JSON-serialized form of all reconciliation errors encountered during a single reconciliation loop run to the Message field of the ReconciliationIssues-type condition, present only when ReconciliationIssuesCondition feature gate is enabled (=in tech preview).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

1 similar comment
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

openshift-ci-robot commented Mar 28, 2024

@petr-muller: This pull request references OTA-1159 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

SyncWorkerStatus: Add Failures field with all failures

Previously, the errors that happened during reconciliation were flattened by summarizeTaskGraphErrors method call in consistentReporter.Errors() before they were sent to status reconciliation method syncStatus, which propagates this flattened error to the ClusterVersion status field.

Rename the Failure field to FailureSummary, it still caries the flattened error. Add a new Failures field to carry the full slice of the original errors.

The error processing is overall a bit weird because the errors get flattened first at the resource reconciling loop side, and then the status reporting loop side uses the flattened error to somehow reconstruct what actually happened, but cleaning up that code would be a much larger effort, so this commit simply makes original errors to be passed to status reporting loop. It is not easy to do all error processing at a single side; it would make sense to do that on status reporting level, but the initial flattening uses some task graph / resource reconciliation data that are not available on the status reporting side.


status sync: propagate reconciliation errors via RRI condition

Fulfilling OTA-1159, this change adds a JSON-serialized form of all reconciliation errors encountered during a single reconciliation loop run to the Message field of the ReconciliationIssues-type condition, present only when ReconciliationIssuesCondition feature gate is enabled (=in tech preview).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Mar 28, 2024

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn 7e000f0 link true /test e2e-agnostic-ovn
ci/prow/e2e-agnostic-ovn-upgrade-out-of-change 7e000f0 link true /test e2e-agnostic-ovn-upgrade-out-of-change
ci/prow/e2e-hypershift 7e000f0 link true /test e2e-hypershift
ci/prow/e2e-agnostic-ovn-upgrade-into-change 7e000f0 link true /test e2e-agnostic-ovn-upgrade-into-change
ci/prow/e2e-hypershift-conformance 7e000f0 link true /test e2e-hypershift-conformance

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Comment thread pkg/cvo/sync_worker.go

Failure error
FailureSummary error
Failures []error
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid having to pass these in parallel. I've floated #1041 with a sketch showing Unwrap() and Errors() to walk the nested errors inside Failure. It probably needs more filling out with the structured information we store for each node in the tree, but it will avoid us having to worry about things like "what if the actor populating SyncWorkerStatus puts something in FailureSummary without updating Failures to hold the appropriate children?".

@petr-muller
Copy link
Copy Markdown
Member Author

/close

We did #1041 instead

@openshift-merge-robot
Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2024
@openshift-ci openshift-ci Bot closed this Apr 9, 2024
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 9, 2024

@petr-muller: Closed this PR.

Details

In response to this:

/close

We did #1041 instead

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants