e2e Kubevirt disk timeout issues #2183

@openshift-cherrypick-robot: The following test **failed**, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/4.22-e2e-test-kubevirt-aws | b739e831c4f9509342105b591ac27b426a6db46a | [link](https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oadp-operator/2182/pull-ci-openshift-oadp-operator-oadp-1.6-4.22-e2e-test-kubevirt-aws/2049876625738174464) | true | `/test 4.22-e2e-test-kubevirt-aws`

[Full PR test history](https://prow.ci.openshift.org/pr-history?org=openshift&repo=oadp-operator&pr=2182). [Your PR dashboard](https://prow.ci.openshift.org/pr?query=is:pr+state:open+author:openshift-cherrypick-robot).

<details>

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md).  If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
</details>


_Originally posted by @openshift-ci[bot] in https://github.com/openshift/oadp-operator/issues/2182#issuecomment-4355026387_
            


[Claude analysis](https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_oadp-operator/2182/pull-ci-openshift-oadp-operator-oadp-1.6-4.22-e2e-test-kubevirt-aws/2049876625738174464#1:build-log.txt%3A4508)

> # OADP E2E Test Failure Analysis
> *Generated by Claude via Vertex AI on 2026-04-30 17:40:00 UTC*
> ## Executive Summary
> - **Total Tests**: 50
> - **Failed Tests**: 1
> - **Known Flakes**: 0 (failure does not match known flake patterns)
> - **Critical Issues**: 0 (real bugs requiring immediate attention)
> - **Environmental Issues**: 1 (DataVolume provisioning delay)
> ## Failed Tests Analysis
> ### 1. todolist CSI backup and restore, in a Fedora VM [ENVIRONMENTAL]
> **Root Cause**: DataVolume provisioning timeout - VM failed to start within the 10-minute test timeout due to slow PVC cloning
> **Evidence**:
> ```
> junit_report.xml: "context deadline exceeded" (test timeout at 17:38:18, duration: 648.028s ~10.8 minutes)
> CDI logs (openshift-cnv/cdi-deployment):
> - 17:28:18: DataVolume "fedora-todolist-disk" created, cloning from openshift-virtualization-os-images/fedora-1217dcc8c58d scheduled
> - 17:28:18: PVC "fedora-todolist-disk" in "Pending" state (not bound)
> virt-handler logs (openshift-cnv/virt-handler):  
> - 17:38:21: VMI fedora-todolist first appeared - "VMI is in phase: Scheduled | Domain does not exist"
> - 17:38:22: VMI reached Running state - "VMI is in phase: Running | Domain status: Running, reason: Unknown"
> Test code (virt_backup_restore_suite_test.go:101-105):
> - wait.PollUntilContextTimeout with 10-minute timeout waiting for VM to reach "Running" status
> - Failure at line 105: gomega.Expect(err).ToNot(gomega.HaveOccurred())
> ```
> **Diagnosis**: 
> The test created the Fedora VM at ~17:28:18, which required cloning a 30Gi DataVolume from the source `fedora-1217dcc8c58d` in the `openshift-virtualization-os-images` namespace. The test code waits 10 minutes for the VM to reach "Running" status (virt_backup_restore_suite_test.go:101-105).
> The PVC `fedora-todolist-disk` remained in "Pending" state and wasn't bound quickly enough. The VM launcher pod could not start until the DataVolume was fully provisioned. The VM actually reached "Running" state at 17:38:22, which was **4 seconds after** the test timed out at 17:38:18.
> This is a timing issue where:
> 1. DataVolume cloning took just over 10 minutes (likely due to the 30Gi size and cluster storage performance)
> 2. The VM reached "Running" 4 seconds after the timeout
> 3. The 10-minute VM startup poll left no headroom for the DataVolume provisioning operation
> **Likely Cause**: Environmental - Slow cluster storage performance causing DataVolume clone operation to exceed the 10-minute test timeout. The VM successfully started immediately after the DataVolume completed, indicating no functional issue.
> **Recommended Actions**:
> 1. **Increase timeout** - The test already has a 45-minute `BackupTimeout`, but the VM startup poll is hardcoded to 10 minutes. Consider increasing the VM readiness timeout to 15-20 minutes for tests involving large DataVolume cloning operations (virt_backup_restore_suite_test.go:101).
> 2. **Add DataVolume readiness check** - Before starting the VM startup wait, explicitly wait for the DataVolume to reach "Succeeded" phase. This would provide better error messages when DataVolume provisioning is slow.
> 3. **Investigate cluster storage performance** - The 30Gi clone taking >10 minutes suggests potential storage backend slowness. Review AWS EBS performance metrics for the test cluster.
> 4. **Consider smaller test images** - If feasible, use a smaller Fedora VM image for E2E tests to reduce provisioning time.
