
Automate network #78184

Open

SachinNinganure wants to merge 53 commits into openshift:main from SachinNinganure:automate-network

Conversation


@SachinNinganure SachinNinganure commented Apr 22, 2026

Summary by CodeRabbit

  • New Features

    • Added many new perf/scale CI pipelines and scheduled nightly tests across multiple OCP variants and cloud platforms, plus new network-expansion, max-node scaling, and egress-IP scaling test workflows.
  • Bug Fixes

    • Fixed IBM Cloud script environment export and corrected must-gather Dockerfile references.
  • Chores

    • Expanded presubmit/periodic job matrix and added governance/metadata for new CI steps and workflows.


coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds numerous CI configs, scheduled tests, step-registry refs/workflows, scripts, OWNERS/metadata, and must-gather Dockerfile path fixes across perfscale and stolostron CI for OCP 4.17–4.22; also extensively updates periodic and presubmit job manifests.

Changes

Cohort / File(s) Summary
Perfscale CI configs
ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/...__4.20-nightly-datapath.yaml, ...__4.21-nightly-datapath.yaml, ...__4.22-nightly-datapath.yaml, ...__aws-4.17-nightly-x86.yaml, ...__aws-4.18-nightly-x86.yaml, ...__aws-4.22-nightly-x86.yaml
Added new ci-operator pipeline YAMLs registering build_root, release candidates, global resource defaults and multiple scheduled tests (IBM/Nutanix/GCP/AWS variants) with platform-specific env, cluster_profile, workflows, and timeouts.
Periodic jobs (large update)
ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-periodics.yaml
Extensive rescheduling and retargeting of periodic jobs: renames/label changes, cron/cluster updates, timeouts added/removed, expanded container specs, extra_refs and Slack reporter_config added across many entries.
Presubmits (adds)
ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml
Added multiple presubmit job entries for AWS variants (notably aws-4.17/4.18) with contexts, triggers, ci-operator container specs, decoration timeouts and required secret mounts.
Must-gather Dockerfile path updates
ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.{11,12,13,14,15,16,17}.yaml
Replaced images.items[].dockerfile_path value from build/Dockerfile to Dockerfile across seven release manifests.
AWS OVN custom-network workflow & metadata
ci-operator/step-registry/ipi/aws/ovn-custom-network/OWNERS, ...-workflow.metadata.json, ...-workflow.yaml
New workflow ipi-aws-ovn-custom-network composing pre (ipi-conf-aws, ipi-install-config-network-custom, ovn-conf, ipi-install) and post steps plus OWNERS and metadata.
Install-config network-custom step
ci-operator/step-registry/ipi/install-config/network-custom/OWNERS, ...-ref.metadata.json, ...-commands.sh, ...-ref.yaml
New step ipi-install-config-network-custom with script that patches ${SHARED_DIR}/install-config.yaml via yq using env-driven CIDRs/host-prefix/network-type (a sketch of this patch appears after this change list); includes resource requests, env defaults, docs, OWNERS and metadata.
OCP-61589 network expansion test
ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/OWNERS, ...-commands.sh, ...-ref.metadata.json, ...-ref.yaml
New test step that expands clusterNetwork CIDR, verifies operator health, scales three non-infra machinesets to large replica counts, polls node readiness, gathers artifacts and validates readiness/timeouts.
OCPBUGS-77510 TCP RST test
ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/OWNERS, ...-commands.sh, ...-ref.metadata.json, ...-ref.yaml
New E2E test script/ref for TCP RST detection that supports scale modes (small/medium/large/progressive), deploys workloads, runs tcpdump on a node, triggers OVN/API-server restarts or baseline traffic, analyzes RST counts and emits reports/artifacts.
OCPBUGS-45891 EgressIP scale test
ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh, ...-ref.yaml
New script/ref that creates many EgressIP resources, labels nodes, waits for assignments, validates assigned counts and load-balancing (max-min ≤2), records CloudPrivateIPConfig and egressip artifacts, and fails on violated validations.
AWS network-expansion installer workflow
ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/OWNERS, ...-workflow.metadata.json, ...-workflow.yaml
New workflow openshift-qe-installer-aws-network-expansion that sequences pre steps (including ipi-install-config-network-custom) and post, plus OWNERS and metadata.
Network perf ref env addition
ci-operator/step-registry/openshift-qe/network-perf/openshift-qe-network-perf-ref.yaml
Added PROFILE_TYPE env var (default "both") to network-perf ref.
Misc OWNERS/metadata
multiple ci-operator/step-registry/.../OWNERS, *.metadata.json files
Added approver/reviewer entries (perfscale-ocp-approvers/perfscale-ocp-reviewers) for several new step-registry directories.
IBM Cloud perfscale script fix
ci-operator/step-registry/openshift-qe/perfscale-ibm-data-path-sg/openshift-qe-perfscale-ibm-data-path-sg-commands.sh
Fixed typo `expore REGION` → `export REGION` and ensured a final newline at EOF.
Small tests added to AWS 4.22 config
ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/...__aws-4.22-nightly-x86.yaml
Added network-expansion-520nodes and egress-ip-scale-2xlnodes tests with specific env, timeouts and workflows.
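
For illustration, a minimal sketch of the env-driven install-config patch described in the ipi-install-config-network-custom entry above; the variable names follow the summary, while the defaults and the yq (v4) invocations here are assumptions rather than the step's actual contents:

#!/bin/bash
# Hedged sketch of the install-config networking patch; the real step's
# variable names, defaults, and yq flavor may differ.
set -euo pipefail

CONFIG="${SHARED_DIR}/install-config.yaml"
CLUSTER_NETWORK_CIDR="${CLUSTER_NETWORK_CIDR:-10.128.0.0/20}"   # assumed default
HOST_PREFIX="${HOST_PREFIX:-23}"                                # assumed default
NETWORK_TYPE="${NETWORK_TYPE:-OVNKubernetes}"

# Patch the networking stanza in place (mikefarah yq v4 syntax assumed).
yq -i ".networking.networkType = \"${NETWORK_TYPE}\"" "$CONFIG"
yq -i ".networking.clusterNetwork[0].cidr = \"${CLUSTER_NETWORK_CIDR}\"" "$CONFIG"
yq -i ".networking.clusterNetwork[0].hostPrefix = ${HOST_PREFIX}" "$CONFIG"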

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~240 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive
Explanation: The title 'Automate network' is vague and does not clearly summarize the specific changes in this changeset, which adds multiple CI configurations, workflows, and test scripts across several OpenShift components.
Resolution: Consider using a more specific title that reflects the main changes, such as 'Add network expansion and egress IP scaling tests for OCP 4.22' or 'Add perfscale CI configurations and network test automation'.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 84.21% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR contains CI/CD configuration and test orchestration scripts, not Ginkgo test definitions. The check for stable Ginkgo test names is not applicable.
Test Structure And Quality ✅ Passed This PR contains no Ginkgo test code; it consists only of OpenShift CI infrastructure configuration files, making the Ginkgo-focused check not applicable.
Microshift Test Compatibility ✅ Passed PR adds only CI configuration files (YAML), bash scripts, and workflow definitions. No Go test files or Ginkgo e2e test syntax are present.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The pull request does not add any Ginkgo e2e tests. Changes consist exclusively of CI configuration YAML files, shell scripts, JSON metadata, and OWNERS files.
Topology-Aware Scheduling Compatibility ✅ Passed This PR contains only OpenShift CI infrastructure configuration files (ci-operator configs, step registry entries, job definitions) for test automation. No Kubernetes pod/deployment manifests with topology-sensitive scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed PR contains only CI configuration YAML files, shell scripts, JSON metadata, and OWNERS files with no Go source code or OTE binaries.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds only CI infrastructure files (YAML, bash scripts, workflows) without new Go Ginkgo e2e test patterns, making this check not applicable.




Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from chaitanyaenr and chentex April 22, 2026 13:15

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

🧹 Nitpick comments (1)
ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml (1)

3-6: Consider parameterizing the CLI image tag.

from_image is pinned to ocp/4.21:cli. If this step is invoked against a newer OCP nightly (e.g., 4.22+ as seen elsewhere in this PR), a hard-pinned 4.21 oc client may be older than the server by more than the supported skew and could produce warnings or fail on newer APIs. Consider using a floating tag or the release payload's cli image via release:latest / stable to keep the client matched to the cluster under test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml`
around lines 3 - 6, The from_image block currently pins the oc client to a fixed
image (from_image with namespace: ocp, name: "4.21", tag: cli) which can lead to
client/server skew; change this to accept a variable or floating tag instead of
hardcoding "4.21" — e.g., expose a parameter or template for the name/tag (or
use release:latest / stable / the release payload's cli image) so the oc client
image is derived from the test run or payload and stays compatible with the
cluster under test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@ci-operator/config/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main.yaml`:
- Line 29: The job-level timeout value `timeout: 8h0m0s` in
openshift-qe-network-performance-tests-main.yaml conflicts with the OCP-61589
step which uses an 8h step timeout plus a 30m grace period; increase the job
timeout (the `timeout` field) to exceed the step total (for example to `8h30m0s`
or larger) so the job cannot be preempted before the step’s grace/cleanup
completes, or alternatively reduce the step timeout to fit under the current job
timeout; update the `timeout` key accordingly.
- Around line 20-28: The job uses the openshift-qe-installer-aws workflow which
does not consume CLUSTER_NETWORK_CIDR so the cluster is installed with the
default /14 and the expansion test is skipped; fix by adding a pre-install
reference step that applies the /20 network (use the existing
ipi-install-config-network-custom-ref/ipi-install-config-network-custom ref) or
swap to a customized installer workflow that chains into
ipi-install-config-network-custom; specifically, insert a step before the
installer (in the steps: sequence for this job) that references
ipi-install-config-network-custom-ref (or the ipi-install-config-network-custom
ref) and set CLUSTER_NETWORK_CIDR to the desired /20 value so the installer
receives the correct network CIDR before openshift-qe-installer-aws runs.

In
`@ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml`:
- Around line 997-1020: The job definitions mount a volume named
"gcs-credentials" (seen in volumeMounts) but the corresponding volumes entry is
missing; for each affected job (e.g., where "gcs-credentials" appears in
volumeMounts) add a matching volumes[] item named "gcs-credentials" under the
job spec with the appropriate secret reference (secretName: gcs-credentials) so
every volumeMount name has a corresponding volumes.name; update the volumes list
near other entries like "manifest-tool-local-pusher" and "ci-pull-credentials"
to include this new secret-mounted volume.

In
`@ci-operator/jobs/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main-periodics.yaml`:
- Around line 61-63: The pod spec is mounting a volume named "gcs-credentials"
at /secrets/gcs (mountPath) but no corresponding volume is declared, causing the
job to fail; add a volumes entry to the Job's podSpec that declares a volume
named "gcs-credentials" (e.g., a secretVolume referencing the GCS credentials
secret) so the container mount (name: gcs-credentials) can be satisfied; update
the Job spec where the container mounts /secrets/gcs (and repeat the same fix
for the other job(s) around the 74-92 area) to reference the same
"gcs-credentials" volume name.

In
`@ci-operator/step-registry/ipi/aws/ovn-custom-network/ipi-aws-ovn-custom-network-workflow.yaml`:
- Around line 23-24: The workflow documentation incorrectly claims it "includes
monitoring PVC setup and RBAC configuration"; update the YAML so the description
matches implemented phases by either adding the missing chains (e.g., include
the monitoring/RBAC chain like create-infra-move-ingress-monitoring-registry
into the pre/post lists alongside ipi-conf-aws,
ipi-install-config-network-custom, ovn-conf, ipi-install, gather-core-dump,
ipi-aws-post) or remove/replace the monitoring/RBAC claim in the human-readable
description to accurately reflect only the listed chains; locate references to
the workflow description in ipi-aws-ovn-custom-network-workflow.yaml and adjust
the text or the pre/post chain arrays (names: ipi-conf-aws,
ipi-install-config-network-custom, ovn-conf, ipi-install, gather-core-dump,
ipi-aws-post, and the suggested create-infra-move-ingress-monitoring-registry)
so doc and implementation are consistent.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`:
- Around line 146-156: The script is scaling machinesets using absolute totals
and then validating total cluster nodes against MAX_NODES_TARGET (which excludes
control-plane nodes), causing off-by-control-plane failures; change the scaling
logic so each machineset replica count is computed from the worker-only target
(distribute WORKER_NODES_TARGET across the machinesets array and set replicas
per machineset accordingly where the oc scale commands are used), and change the
validation (the oc get nodes check that uses MAX_NODES_TARGET) to count only
worker nodes (filter by node-role.kubernetes.io/worker) and compare that worker
count against WORKER_NODES_TARGET instead of comparing total cluster nodes to
MAX_NODES_TARGET.
- Around line 196-199: The pipeline currently fails if there are zero NotReady
nodes because the command oc get nodes --no-headers | grep "NotReady" >
"$RESULTS_DIR/final-notready-nodes.txt" will exit non‑zero under
errexit/pipefail; change that line so it does not cause script exit when there
are no matches (for example, append || true to the pipeline or use grep -q/grep
-c and redirect output accordingly) while still writing an empty or zero-count
result file to RESULTS_DIR/final-notready-nodes.txt so later warning logic
(lines ~251-255) can proceed.
- Around line 35-37: The timeout parser using sed+bc is incorrect for compound
Kubernetes durations and should be replaced with a proper parser that sums each
time component (h, m, s) rather than multiplying; update the logic that assigns
timeout_seconds (the line creating timeout_seconds from $timeout) to iterate
over regex matches for number+unit, convert each unit to seconds (h→3600, m→60,
s→1) and accumulate into timeout_seconds so values like "8h0m0s" and "1h30m"
produce the correct totals; ensure the new parser handles missing components and
empty units and preserves the behavior of start_time and subsequent timeout
usage (a parser sketch appears at the end of this section).

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml`:
- Around line 30-32: The TEST_TIMEOUT default is inconsistent between ref.yaml
(TEST_TIMEOUT="2h") and the commands script (uses
TEST_TIMEOUT="${TEST_TIMEOUT:-6h}"); pick one canonical default and make both
places match — either change ref.yaml's TEST_TIMEOUT default to "6h" to match
the script or update the commands.sh fallback to "2h" so both sources of truth
agree; ensure the exported env name TEST_TIMEOUT and the documentation string
are updated accordingly.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/openshift-qe-ocpbugs-77510-e2e-test-commands.sh`:
- Around line 605-607: The progressive branch currently calls
run_progressive_test and returns without producing the test artifact; update the
progressive path so it also generates the same report as the non-progressive
path by invoking create_test_report (or ensuring run_progressive_test creates
/tmp/ocpbugs-77510-test.log) after run_progressive_test completes, so the
scheduled progressive job produces the expected /tmp/ocpbugs-77510-test.log
artifact.
- Around line 165-170: The test setup creates deployments/services named
test-app-* but the fallback traffic path expects a traffic-client pod and
services named test-svc-*, so add creation of a traffic-client pod/deployment
(e.g., create a pod or deployment named traffic-client using an image with curl)
and create services named test-svc-$i that point to the test-app-$i pods (or
create Service objects with the test-svc-$i names mapping to the test-app-$i
selectors) so the fallback exec into traffic-client and curl test-svc-* actually
generates traffic; apply the same fix for the duplicate block around the 275-282
area.
- Around line 532-574: The progressive-summary logic incorrectly treats any
completed run as a detection because it relies only on run_single_scale_test's
exit code (which always returns 0) and analyze_results' 0-for-no-storm behavior;
change the contract so run_single_scale_test (or analyze_results called inside
it) returns a non-zero/distinct exit code or sets/echoes a detection flag when
an RST storm is actually detected, then update the caller loop that builds
overall_results (the block using run_single_scale_test, overall_results+=, and
later detection_count calculation) to inspect that explicit detection indicator
(e.g., a returned status code or a printed token) instead of plain exit 0 before
marking PASS and incrementing detection_count; ensure the final summary and
detection_count logic use that indicator so only true detections count as PASS.
- Around line 522-528: The loop currently overwrites the global NAMESPACE
(scale_namespace / NAMESPACE) so earlier generated namespaces are left behind
because the EXIT trap only removes the last one; to fix, save the previous
namespace into a temporary variable (e.g., namespace_for_cleanup) before setting
NAMESPACE to the new scale_namespace, and at the end of the iteration (before
the loop advances) explicitly delete that previous namespace (e.g., oc delete
namespace "$namespace_for_cleanup" --ignore-not-found) or call the existing
cleanup function for that namespace so no intermediate namespaces/pods are
leaked; ensure you reference and update scale_namespace, NAMESPACE, and
namespace_for_cleanup accordingly (a cleanup sketch appears at the end of this
section).
- Around line 137-142: The pipeline that sets WORKER_NODE can fail under set
-euo pipefail when grep finds no matches, making the fallback unreachable;
change the assignment so the first pipeline cannot cause a non-zero exit (e.g.,
append "|| true" to the grep pipeline or replace the pipeline with a single awk
that matches "(worker|compute)" and exits after the first hit) so that when no
worker|compute node is found the command returns empty and the subsequent
fallback block (the second WORKER_NODE assignment) runs; update the code where
WORKER_NODE is computed (the oc get nodes | grep -E "(worker|compute)" | head -1
| awk '{print $1}' pipeline) accordingly.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/openshift-qe-ocpbugs-77510-e2e-test-ref.yaml`:
- Around line 24-36: The documentation block in the
openshift-qe-ocpbugs-77510-e2e-test YAML is ambiguous about pass criteria;
update the documentation string so it explicitly states two distinct expected
outcomes: for unfixed clusters a successful run is detecting a sufficient number
of TCP RST packets (indicating the bug), and for fixed/healthy clusters a
successful run is observing no or only low/acceptable RST counts. Edit the
"documentation" field text to replace the current sentences that conflate
success with RST capture and validating fixed clusters so the pass condition for
fixed clusters is clearly described as "no/low RSTs" being the expected success
result.

---

Nitpick comments:
In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml`:
- Around line 3-6: The from_image block currently pins the oc client to a fixed
image (from_image with namespace: ocp, name: "4.21", tag: cli) which can lead to
client/server skew; change this to accept a variable or floating tag instead of
hardcoding "4.21" — e.g., expose a parameter or template for the name/tag (or
use release:latest / stable / the release payload's cli image) so the oc client
image is derived from the test run or payload and stays compatible with the
cluster under test.
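
For the compound-duration comment above (the timeout parser item), a minimal sketch of a parser that sums h/m/s components instead of piping the whole string through sed+bc; the function name and the fallback default are illustrative:

# Sum each number+unit pair: "8h0m0s" -> 28800, "1h30m" -> 5400.
parse_duration() {
    local input="$1" total=0
    while [[ "$input" =~ ^([0-9]+)([hms]) ]]; do
        case "${BASH_REMATCH[2]}" in
            h) total=$((total + BASH_REMATCH[1] * 3600)) ;;
            m) total=$((total + BASH_REMATCH[1] * 60))   ;;
            s) total=$((total + BASH_REMATCH[1]))        ;;
        esac
        input="${input:${#BASH_REMATCH[0]}}"
    done
    # Bare trailing digits with no unit are treated as seconds.
    if [[ "$input" =~ ^[0-9]+$ ]]; then
        total=$((total + input))
    fi
    echo "$total"
}

timeout_seconds="$(parse_duration "${TEST_TIMEOUT:-6h}")"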
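Likewise, for the namespace-leak comment (the item about lines 522-528), a sketch of per-iteration cleanup so the EXIT trap only has to cover the namespace still in use; the loop values and helper name come from the comment and are otherwise assumptions:

for scale in small medium large; do
    NAMESPACE="ocpbugs-77510-${scale}"   # assumed naming scheme
    run_single_scale_test "$scale"       # helper assumed from the script
    # Delete this iteration's namespace before the loop advances, so
    # earlier namespaces are not left behind for the single EXIT trap.
    oc delete namespace "$NAMESPACE" --ignore-not-found --wait=false
done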
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f5a86571-a378-486d-a796-565181a8b692

📥 Commits

Reviewing files that changed from the base of the PR and between 4323d3f and cbaa964.

📒 Files selected for processing (34)
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.20-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.21-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.17-nightly-x86.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.18-nightly-x86.yaml
  • ci-operator/config/openshift-qe/network-performance-tests/OWNERS
  • ci-operator/config/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.11.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.12.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.13.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.14.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.15.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.16.yaml
  • ci-operator/config/stolostron/must-gather/stolostron-must-gather-release-2.17.yaml
  • ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-periodics.yaml
  • ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml
  • ci-operator/jobs/openshift-qe/network-performance-tests/OWNERS
  • ci-operator/jobs/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main-periodics.yaml
  • ci-operator/step-registry/ipi/aws/ovn-custom-network/OWNERS
  • ci-operator/step-registry/ipi/aws/ovn-custom-network/ipi-aws-ovn-custom-network-workflow.metadata.json
  • ci-operator/step-registry/ipi/aws/ovn-custom-network/ipi-aws-ovn-custom-network-workflow.yaml
  • ci-operator/step-registry/ipi/install-config/network-custom/OWNERS
  • ci-operator/step-registry/ipi/install-config/network-custom/ipi-install-config-network-custom-commands.sh
  • ci-operator/step-registry/ipi/install-config/network-custom/ipi-install-config-network-custom-ref.metadata.json
  • ci-operator/step-registry/ipi/install-config/network-custom/ipi-install-config-network-custom-ref.yaml
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/OWNERS
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.metadata.json
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml
  • ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/OWNERS
  • ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/openshift-qe-ocpbugs-77510-e2e-test-commands.sh
  • ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/openshift-qe-ocpbugs-77510-e2e-test-ref.metadata.json
  • ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/openshift-qe-ocpbugs-77510-e2e-test-ref.yaml
  • ci-operator/step-registry/openshift-qe/perfscale-ibm-data-path-sg/openshift-qe-perfscale-ibm-data-path-sg-commands.sh

Comment on lines +20 to +28
  steps:
    cluster_profile: aws-perfscale-qe
    env:
      EXPECTED_READY_NODES: "510"
      MAX_NODES_TARGET: "520"
      TEST_TIMEOUT: 2h
    test:
    - ref: openshift-qe-ocp-61589-max-nodes-network-expansion
    workflow: openshift-qe-installer-aws

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Description: Inspect installer workflow/step registry references for AWS installer network CIDR support.
# Expected: `openshift-qe-installer-aws` or one of its chains/refs should consume CLUSTER_NETWORK_CIDR
# before cluster installation. If not, this job should use a customized installer workflow or set
# the supported installer env var for the initial /20 network.

rg -n -C3 'as:\s+openshift-qe-installer-aws|workflow:\s+openshift-qe-installer-aws|CLUSTER_NETWORK_CIDR|CLUSTER_NETWORK_EXPANDED_CIDR|clusterNetwork|machineNetwork|serviceNetwork'

Repository: openshift/release

Length of output: 50374


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the openshift-qe-installer-aws workflow definition
echo "=== Searching for openshift-qe-installer-aws workflow ==="
fd -e yaml | xargs grep -l "openshift-qe-installer-aws" | head -20

# Find the test ref definition
echo ""
echo "=== Searching for ocp-61589-max-nodes-network-expansion ref ==="
fd -e yaml | xargs grep -l "ocp-61589-max-nodes-network-expansion" | head -20

Repository: openshift/release

Length of output: 2134


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look specifically in ci-operator/step-registry for installer workflow
echo "=== Looking for openshift-qe-installer-aws in step registry ==="
find ci-operator/step-registry -name "*.yaml" -type f | xargs grep -l "as:\s*openshift-qe-installer-aws\|name:\s*openshift-qe-installer-aws" 2>/dev/null | head -5

# Look for the test ref
echo ""
echo "=== Looking for ocp-61589-max-nodes-network-expansion ref ==="
find ci-operator/step-registry -name "*.yaml" -type f | xargs grep -l "ocp-61589-max-nodes-network-expansion" 2>/dev/null | head -5

Repository: openshift/release

Length of output: 682


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for step registry definitions with CLUSTER_NETWORK patterns
echo "=== Search for CLUSTER_NETWORK usage in step registry ==="
rg -t yaml "CLUSTER_NETWORK_CIDR|CLUSTER_NETWORK_EXPANDED_CIDR" ci-operator/step-registry 2>/dev/null | head -30

Repository: openshift/release

Length of output: 853


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== openshift-qe-installer-aws-workflow.yaml ==="
cat ci-operator/step-registry/openshift-qe/installer/aws/openshift-qe-installer-aws-workflow.yaml

echo ""
echo "=== openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml ==="
cat ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml

Repository: openshift/release

Length of output: 2423


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for any install-config related steps that might consume CLUSTER_NETWORK_CIDR
echo "=== Searching for install-config steps in openshift-qe step registry ==="
find ci-operator/step-registry/openshift-qe -name "*install-config*" -type f | head -10

# Search for ipi-install-config steps that handle custom CIDR
echo ""
echo "=== ipi-install-config-network-custom-ref.yaml ==="
cat ci-operator/step-registry/ipi/install-config/network-custom/ipi-install-config-network-custom-ref.yaml

Repository: openshift/release

Length of output: 1585


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== ipi-aws-pre chain ==="
find ci-operator/step-registry -name "*ipi-aws-pre*" -type f | head -5

Repository: openshift/release

Length of output: 615


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check what's in the ipi-aws-pre chain
echo "=== Content of ipi-aws-pre chain ==="
cat ci-operator/step-registry/ipi/aws/pre/chain-ipi-aws-pre.yaml 2>/dev/null || echo "File not found, searching..."

# Alternative location
find ci-operator/step-registry -path "*ipi*aws*pre*" -name "*.yaml" | xargs cat 2>/dev/null | head -100

Repository: openshift/release

Length of output: 4279


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for ipi-aws-pre more carefully
rg -t yaml "as:\s+ipi-aws-pre|name:\s+ipi-aws-pre" ci-operator/step-registry 2>/dev/null | head -20

Repository: openshift/release

Length of output: 1245


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== ipi-aws-pre base chain ==="
cat ci-operator/step-registry/ipi/aws/pre/ipi-aws-pre-chain.yaml

Repository: openshift/release

Length of output: 607


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== ipi-conf-aws chain ==="
cat ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-chain.yaml

Repository: openshift/release

Length of output: 415


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== ipi-conf ref ==="
cat ci-operator/step-registry/ipi/conf/ipi-conf-ref.yaml

echo ""
echo "=== ipi-conf-aws ref ==="
cat ci-operator/step-registry/ipi/conf/aws/ipi-conf-aws-ref.yaml

Repository: openshift/release

Length of output: 9353


The installer workflow cannot consume CLUSTER_NETWORK_CIDR; the /20 pre-expansion network will not be configured.

The openshift-qe-installer-aws workflow chains through ipi-conf-aws, which includes the ipi-conf and ipi-conf-aws refs. Neither of these refs accepts the CLUSTER_NETWORK_CIDR environment variable. Although ipi-install-config-network-custom-ref exists and can set custom network CIDRs, it is not part of the installer chain used by this job.

As a result, the cluster will install with the default /14 network, causing the network expansion test to be skipped (no expansion occurs if already at /14).

Add a reference step to configure the initial /20 network before installation, or use a customized installer workflow:

Possible solution

Option 1: Add network-custom step to the installer chain (requires workflow modification)

Option 2: Pass CLUSTER_NETWORK_CIDR to a step that supports it before the installer runs

Option 3: Use a customized installer workflow that chains to ipi-install-config-network-custom

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/config/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main.yaml`
around lines 20 - 28, The job uses the openshift-qe-installer-aws workflow which
does not consume CLUSTER_NETWORK_CIDR so the cluster is installed with the
default /14 and the expansion test is skipped; fix by adding a pre-install
reference step that applies the /20 network (use the existing
ipi-install-config-network-custom-ref/ipi-install-config-network-custom ref) or
swap to a customized installer workflow that chains into
ipi-install-config-network-custom; specifically, insert a step before the
installer (in the steps: sequence for this job) that references
ipi-install-config-network-custom-ref (or the ipi-install-config-network-custom
ref) and set CLUSTER_NETWORK_CIDR to the desired /20 value so the installer
receives the correct network CIDR before openshift-qe-installer-aws runs.
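
As a quick way to confirm the failure mode described in this thread on a live cluster, a hedged sketch using the standard network.config.openshift.io API (the /20 expectation comes from the comment above):

# If the installer never consumed CLUSTER_NETWORK_CIDR, the cluster reports
# the default /14 here and the expansion test has nothing to expand.
installed_cidr="$(oc get network.config cluster \
    -o jsonpath='{.status.clusterNetwork[0].cidr}')"
echo "installed clusterNetwork CIDR: ${installed_cidr}"
if [[ "$installed_cidr" != */20 ]]; then
    echo "WARNING: expected a /20 pre-expansion network, got ${installed_cidr}" >&2
fi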

    test:
    - ref: openshift-qe-ocp-61589-max-nodes-network-expansion
    workflow: openshift-qe-installer-aws
  timeout: 8h0m0s

⚠️ Potential issue | 🟠 Major

Give the job timeout headroom over the step timeout.

The referenced OCP-61589 step already has an 8h0m0s timeout plus 30m0s grace period, so an overall job timeout of 8h0m0s can preempt the step during install/test/grace handling. Consider increasing the job timeout or reducing the step timeout so artifacts and cleanup have room to finish.

⏱️ Proposed timeout adjustment
-  timeout: 8h0m0s
+  timeout: 10h0m0s
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:
-  timeout: 8h0m0s
+  timeout: 10h0m0s
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/config/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main.yaml`
at line 29, The job-level timeout value `timeout: 8h0m0s` in
openshift-qe-network-performance-tests-main.yaml conflicts with the OCP-61589
step which uses an 8h step timeout plus a 30m grace period; increase the job
timeout (the `timeout` field) to exceed the step total (for example to `8h30m0s`
or larger) so the job cannot be preempted before the step’s grace/cleanup
completes, or alternatively reduce the step timeout to fit under the current job
timeout; update the `timeout` key accordingly.

Comment on lines +997 to +1020
        - mountPath: /secrets/gcs
          name: gcs-credentials
          readOnly: true
        - mountPath: /secrets/manifest-tool
          name: manifest-tool-local-pusher
          readOnly: true
        - mountPath: /etc/pull-secret
          name: pull-secret
          readOnly: true
        - mountPath: /etc/report
          name: result-aggregator
          readOnly: true
      serviceAccountName: ci-operator
      volumes:
      - name: boskos
        secret:
          items:
          - key: credentials
            path: credentials
          secretName: boskos-credentials
      - name: ci-pull-credentials
        secret:
          secretName: ci-pull-credentials
      - name: manifest-tool-local-pusher

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify added Prow job PodSpecs do not mount volumes that are missing from spec.volumes.
# Expected result after the fix: no output.

python3 - <<'PY'
from pathlib import Path

path = Path("ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml")
text = path.read_text().splitlines()

jobs = []
current = None

for idx, line in enumerate(text, start=1):
    if line.startswith("  - agent: kubernetes"):
        if current:
            jobs.append(current)
        current = {"start": idx, "lines": []}
    if current:
        current["lines"].append((idx, line))

if current:
    jobs.append(current)

for job in jobs:
    mounts = set()
    volumes = set()
    in_mounts = False
    in_volumes = False
    current_section_indent = None

    for lineno, line in job["lines"]:
        stripped = line.strip()

        if stripped == "volumeMounts:":
            in_mounts = True
            in_volumes = False
            current_section_indent = len(line) - len(line.lstrip())
            continue

        if stripped == "volumes:":
            in_volumes = True
            in_mounts = False
            current_section_indent = len(line) - len(line.lstrip())
            continue

        indent = len(line) - len(line.lstrip())
        if current_section_indent is not None and indent <= current_section_indent and stripped:
            in_mounts = False
            in_volumes = False

        if (in_mounts or in_volumes) and stripped.startswith("- name: "):
            name = stripped.removeprefix("- name: ").strip()
            if in_mounts:
                mounts.add(name)
            if in_volumes:
                volumes.add(name)

    missing = mounts - volumes
    if missing:
        name = next((line.strip().removeprefix("name: ") for _, line in job["lines"] if line.strip().startswith("name: pull-ci-")), f"job starting line {job['start']}")
        print(f"{path}:{job['start']}: {name} mounts missing volumes: {sorted(missing)}")
PY

Repository: openshift/release

Length of output: 43


🏁 Script executed:

# Read the specific line ranges mentioned in the review to check for gcs-credentials volume
cat -n ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml | sed -n '1000,1040p'

Repository: openshift/release

Length of output: 1585


🏁 Script executed:

# Check if gcs-credentials appears anywhere near the job definitions
rg "gcs-credentials" ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml -B 2 -A 2

Repository: openshift/release

Length of output: 38033


🏁 Script executed:

# Get a complete picture of one of the jobs mentioned (around line 1020) to see all volumes
cat -n ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml | sed -n '980,1050p'

Repository: openshift/release

Length of output: 2892


Add the missing gcs-credentials volume for every new job.

Each new job mounts gcs-credentials (lines 997-998) but the volumes section is missing the corresponding volume declaration. Kubernetes requires every volumeMounts[].name to have a matching volumes[].name entry; this omission will cause pod creation to fail.

Proposed fix pattern

Add this to each job's volumes list:

      - name: ci-pull-credentials
        secret:
          secretName: ci-pull-credentials
+     - name: gcs-credentials
+       secret:
+         secretName: gcs-credentials
      - name: manifest-tool-local-pusher

Also applies to: 1083-1106, 1168-1191, 1253-1276, 1338-1361, 1424-1447, 1509-1532, 1595-1618, 1681-1704, 1766-1789

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-presubmits.yaml`
around lines 997 - 1020, The job definitions mount a volume named
"gcs-credentials" (seen in volumeMounts) but the corresponding volumes entry is
missing; for each affected job (e.g., where "gcs-credentials" appears in
volumeMounts) add a matching volumes[] item named "gcs-credentials" under the
job spec with the appropriate secret reference (secretName: gcs-credentials) so
every volumeMount name has a corresponding volumes.name; update the volumes list
near other entries like "manifest-tool-local-pusher" and "ci-pull-credentials"
to include this new secret-mounted volume.

Comment on lines +61 to +63
      - mountPath: /secrets/gcs
        name: gcs-credentials
        readOnly: true

⚠️ Potential issue | 🔴 Critical

Declare the gcs-credentials volume used by the container.

Line 61 mounts gcs-credentials, but the pod only declares boskos, ci-pull-credentials, manifest-tool-local-pusher, pull-secret, and result-aggregator. This job will fail before ci-operator can upload results.

🐛 Proposed fix
     - name: ci-pull-credentials
       secret:
         secretName: ci-pull-credentials
+    - name: gcs-credentials
+      secret:
+        secretName: gcs-credentials
     - name: manifest-tool-local-pusher
       secret:
         secretName: manifest-tool-local-pusher

Also applies to: 74-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/jobs/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main-periodics.yaml`
around lines 61 - 63, The pod spec is mounting a volume named "gcs-credentials"
at /secrets/gcs (mountPath) but no corresponding volume is declared, causing the
job to fail; add a volumes entry to the Job's podSpec that declares a volume
named "gcs-credentials" (e.g., a secretVolume referencing the GCS credentials
secret) so the container mount (name: gcs-credentials) can be satisfied; update
the Job spec where the container mounts /secrets/gcs (and repeat the same fix
for the other job(s) around the 74-92 area) to reference the same
"gcs-credentials" volume name.

Comment on lines +23 to +24
It includes monitoring PVC setup and RBAC configuration for extended
cluster testing scenarios. No newline at end of file

⚠️ Potential issue | 🟡 Minor

Documentation overstates workflow scope.

The documentation claims the workflow "includes monitoring PVC setup and RBAC configuration for extended cluster testing scenarios," but the pre/post phases only run ipi-conf-aws, ipi-install-config-network-custom, ovn-conf, ipi-install, gather-core-dump, and ipi-aws-post — none of which set up monitoring PVCs or RBAC. Either add the relevant chain (e.g., create-infra-move-ingress-monitoring-registry) or trim the documentation to accurately reflect what this workflow does, to avoid misleading consumers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/ipi/aws/ovn-custom-network/ipi-aws-ovn-custom-network-workflow.yaml`
around lines 23 - 24, The workflow documentation incorrectly claims it "includes
monitoring PVC setup and RBAC configuration"; update the YAML so the description
matches implemented phases by either adding the missing chains (e.g., include
the monitoring/RBAC chain like create-infra-move-ingress-monitoring-registry
into the pre/post lists alongside ipi-conf-aws,
ipi-install-config-network-custom, ovn-conf, ipi-install, gather-core-dump,
ipi-aws-post) or remove/replace the monitoring/RBAC claim in the human-readable
description to accurately reflect only the listed chains; locate references to
the workflow description in ipi-aws-ovn-custom-network-workflow.yaml and adjust
the text or the pre/post chain arrays (names: ipi-conf-aws,
ipi-install-config-network-custom, ovn-conf, ipi-install, gather-core-dump,
ipi-aws-post, and the suggested create-infra-move-ingress-monitoring-registry)
so doc and implementation are consistent.

Comment on lines +165 to +170
    # Create services with potential for serviceUpdateNotNeeded() bug
    # EXACT copy of working verification script approach
    for i in $(seq 1 $SERVICE_COUNT); do
        oc create deployment test-app-$i --image=nginx:alpine --replicas=$PODS_PER_SERVICE -n "$NAMESPACE"
        oc expose deployment test-app-$i --port=80 --target-port=80 -n "$NAMESPACE"
        log "   Created service $i/$SERVICE_COUNT with $PODS_PER_SERVICE pods"

⚠️ Potential issue | 🟠 Major

Create the resources that the baseline traffic path actually uses.

The fallback path execs into traffic-client and curls test-svc-*, but the setup only creates test-app-* services and never creates traffic-client; fallback runs therefore silently generate no traffic because of || true.

🔧 Proposed fix
     for i in $(seq 1 $SERVICE_COUNT); do
-        oc create deployment test-app-$i --image=nginx:alpine --replicas=$PODS_PER_SERVICE -n "$NAMESPACE"
-        oc expose deployment test-app-$i --port=80 --target-port=80 -n "$NAMESPACE"
+        oc create deployment "test-app-$i" --image=nginx:alpine --replicas="$PODS_PER_SERVICE" -n "$NAMESPACE"
+        oc expose deployment "test-app-$i" --name="test-svc-$i" --port=80 --target-port=80 -n "$NAMESPACE"
         log "   Created service $i/$SERVICE_COUNT with $PODS_PER_SERVICE pods"
     done
+
+    oc run traffic-client --image=curlimages/curl:8.10.1 --restart=Never -n "$NAMESPACE" -- sleep 3600
+    oc wait --for=condition=Ready pod/traffic-client -n "$NAMESPACE" --timeout=120s

Also applies to: 275-282

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 168-168: Double quote to prevent globbing and word splitting.

(SC2086)


[info] 169-169: Double quote to prevent globbing and word splitting.

(SC2086)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-77510-e2e-test/openshift-qe-ocpbugs-77510-e2e-test-commands.sh`
around lines 165 - 170, The test setup creates deployments/services named
test-app-* but the fallback traffic path expects a traffic-client pod and
services named test-svc-*, so add creation of a traffic-client pod/deployment
(e.g., create a pod or deployment named traffic-client using an image with curl)
and create services named test-svc-$i that point to the test-app-$i pods (or
create Service objects with the test-svc-$i names mapping to the test-app-$i
selectors) so the fallback exec into traffic-client and curl test-svc-* actually
generates traffic; apply the same fix for the duplicate block around the 275-282
area.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml`:
- Around line 60-73: The gcp-data-path-9nodes variant is missing a post-install
day-2 scaling step; update the steps.test array for the gcp-data-path-9nodes
configuration (the block containing COMPUTE_NODE_REPLICAS: "9" and ref:
openshift-qe-perfscale-gcp-data-path-fw) to include the workers-scale step
(e.g., add either "- ref: workers-scale" or "- chain: workers-scale" following
the pattern used by IBM and Nutanix), so the job performs a workers-scale
operation after install to match the other platform variants.

In
`@ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.18-nightly-x86.yaml`:
- Around line 210-234: The two new presubmit jobs ocpbugs-77510-rst-validation
and ocpbugs-77510-rst-etcd-encryption lack scheduling flags and will run on
every PR; update each job block (ocpbugs-77510-rst-validation,
ocpbugs-77510-rst-etcd-encryption) to either add a cron: "<cron schedule>" for
periodic execution or set always_run: false to make them gated/on-demand (choose
one approach consistent with other jobs in this file) and keep the existing
timeout/env/test settings unchanged.
- Line 217: Remove the unused environment variable declaration "TIMEOUT: +5
hours" from the job config because the step registry only declares TEST_SCALE
and the step script reads a hardcoded TIMEOUT_MINUTES variable (not TIMEOUT);
the job already uses the correct step-level timeout "timeout: 7h0m0s" and "+5
hours" is an invalid duration format, so simply delete the TIMEOUT line to avoid
misleading/unused config.
- Line 50: The VERSION value is incorrect for the 4.18 job(s): update the
VERSION key used in the job definitions named payload-control-plane-6nodes and
udn-density-l3-24nodes from "4.21" to "4.18" so the Orion metadata
(es_metadata_index/es_benchmark_index) correctly reflects release 4.18; locate
the VERSION: "4.21" lines in those job blocks and change them to VERSION:
"4.18".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f783c1b3-0783-416d-951c-483211b597dc

📥 Commits

Reviewing files that changed from the base of the PR and between cbaa964 and 207662e.

📒 Files selected for processing (7)
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.20-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.21-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.17-nightly-x86.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.18-nightly-x86.yaml
  • ci-operator/config/openshift-qe/network-performance-tests/OWNERS
  • ci-operator/config/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main.yaml
✅ Files skipped from review due to trivial changes (3)
  • ci-operator/config/openshift-qe/network-performance-tests/OWNERS
  • ci-operator/config/openshift-qe/network-performance-tests/openshift-qe-network-performance-tests-main.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.20-nightly-datapath.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.21-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.17-nightly-x86.yaml

Comment on lines +60 to +73
- as: gcp-data-path-9nodes
  cron: 0 12 5,20 * *
  steps:
    allow_skip_on_success: true
    cluster_profile: gcp-perfscale-qe
    env:
      COMPUTE_NODE_REPLICAS: "9"
      COMPUTE_NODE_TYPE: n1-standard-8
      CONTROL_PLANE_NODE_TYPE: n1-standard-8
      OPENSHIFT_INFRA_NODE_INSTANCE_TYPE: n1-standard-16
    test:
    - ref: openshift-qe-perfscale-gcp-data-path-fw
    - chain: openshift-qe-data-path-tests
    workflow: openshift-qe-installer-gcp-ipi-ovn-etcd-encryption-fips

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Compare how other *-data-path-9nodes GCP variants structure their test steps.
rg -n -C2 'gcp-data-path-9nodes' ci-operator/config 2>/dev/null

Repository: openshift/release

Length of output: 2449


🏁 Script executed:

cat ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml | head -80

Repository: openshift/release

Length of output: 2167


GCP gcp-data-path-9nodes variant lacks workers-scale step — deviates from IBM and Nutanix patterns.

The GCP test (lines 60–73) does not include a workers-scale step, unlike the IBM variant (- ref: workers-scale, line 30) and Nutanix variant (- chain: workers-scale, line 43). The GCP variant sizes exclusively at install time via COMPUTE_NODE_REPLICAS: "9". If the intent is to exercise day-2 scale-out across all three platforms, consider adding a corresponding workers-scale step to the GCP variant for consistency. If install-time-only sizing is intentional for GCP, no action is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml`
around lines 60 - 73, The gcp-data-path-9nodes variant is missing a post-install
day-2 scaling step; update the steps.test array for the gcp-data-path-9nodes
configuration (the block containing COMPUTE_NODE_REPLICAS: "9" and ref:
openshift-qe-perfscale-gcp-data-path-fw) to include the workers-scale step
(e.g., add either "- ref: workers-scale" or "- chain: workers-scale" following
the pattern used by IBM and Nutanix), so the job performs a workers-scale
operation after install to match the other platform variants.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (6)
ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-ref.yaml (2)

1-52: Consider declaring an explicit timeout (and grace_period) on the ref.

The sibling ref openshift-qe-ocp-61589-max-nodes-network-expansion-ref.yaml in this PR declares timeout: 8h0m0s and grace_period: 30m0s explicitly. This ref relies on the step-registry default, which means a runaway test (e.g., wait_for_condition mis-loop) can burn the step's default budget rather than the intended ~30 min TEST_TIMEOUT. Note that the TEST_TIMEOUT env var only controls the in-script poll loop — it has no effect on the overall step timeout.

♻️ Suggested addition
   commands: openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh
+  timeout: 1h0m0s
+  grace_period: 10m0s
   env:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-ref.yaml`
around lines 1 - 52, The ref does not set an explicit step timeout/grace_period
so the job can exceed the intended TEST_TIMEOUT; update the ref definition for
openshift-qe-ocpbugs-45891-egress-ip-scale to include explicit timeout and
grace_period fields (e.g. timeout: 8h0m0s and grace_period: 30m0s) alongside the
existing as/from_image/commands/env entries so the CI step budget is enforced
independent of the TEST_TIMEOUT environment variable.

38-38: Avoid personal attribution in user-facing docs.

"Jean's validation methodology" will age poorly as contributors rotate and isn't meaningful to readers outside the immediate team. Prefer a neutral phrasing (e.g., "the reference validation methodology" or a link to the source doc/JIRA).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-ref.yaml`
at line 38, The phrase "Jean's validation methodology" is a personal
attribution; replace it with a neutral, enduring reference such as "the
reference validation methodology" or "the validation methodology" and optionally
add a pointer to the source (e.g., a doc or JIRA link) so readers can find the
original procedure—update the sentence that currently contains "Jean's
validation methodology" in the YAML to use the neutral wording.
ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.22-nightly-x86.yaml (1)

309-342: New perfscale test entries look well-formed.

Both entries are correctly wired to their respective refs and use the standard openshift-qe-installer-aws workflow. egress-ip-scale-2xlnodes follows the same pattern as conc-builds-3nodes/compact-cp-3nodes (no openshift-qe-workers-scale ref, relying on COMPUTE_NODE_REPLICAS: "2" + COMPUTE_NODE_TYPE: m6a.16xlarge at install time), which is consistent. The TEST_TIMEOUT values on both tests match the ref defaults.

One minor nit: EXPECTED_ASSIGNED_EGRESSIPS: "98", TOTAL_EGRESSIP_OBJECTS: "100", WORKER_VM_TYPE: m6a.16xlarge, and EXPECTED_WORKER_NODES: "2" all duplicate defaults already declared in the ref. Leaving them explicit is fine (makes the job self-documenting), but if you expect to retune only in the ref going forward, dropping them here avoids drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.22-nightly-x86.yaml`
around lines 309 - 342, The egress-ip-scale-2xlnodes job duplicates defaults
already declared in the ref openshift-qe-ocpbugs-45891-egress-ip-scale; remove
the redundant env vars EXPECTED_ASSIGNED_EGRESSIPS, TOTAL_EGRESSIP_OBJECTS,
WORKER_VM_TYPE and EXPECTED_WORKER_NODES from the job's env block so the ref
defaults are used (leave COMPUTE_NODE_REPLICAS and COMPUTE_NODE_TYPE as-is).
ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh (3)

150-170: Batch the oc apply to cut ~100 API round-trips.

Each iteration spawns a new oc apply process and issues a separate API call, which is unnecessarily slow when you already know all objects up front. Stream all manifests through a single oc apply -f -:

♻️ Proposed refactor
-for i in $(seq 0 $((TOTAL_EGRESSIP_OBJECTS - 1))); do
-    egressip_name="${EGRESSIP_NAME_PREFIX}-${i}"
-
-    cat <<EOF | oc apply -f -
-apiVersion: k8s.ovn.org/v1
-kind: EgressIP
-metadata:
-  name: $egressip_name
-spec:
-  egressIPs: []
-  namespaceSelector: {}
-  podSelector:
-    matchLabels:
-      egress-test: "$egressip_name"
-EOF
-
-    # Log progress every 20 objects
-    if [ $((i % 20)) -eq 0 ]; then
-        echo "Created $((i + 1))/$TOTAL_EGRESSIP_OBJECTS EgressIP objects..."
-    fi
-done
+{
+  for i in $(seq 0 $((TOTAL_EGRESSIP_OBJECTS - 1))); do
+    egressip_name="${EGRESSIP_NAME_PREFIX}-${i}"
+    cat <<EOF
+---
+apiVersion: k8s.ovn.org/v1
+kind: EgressIP
+metadata:
+  name: $egressip_name
+spec:
+  egressIPs: []   # TODO: replace with a real IP per object (see other comment)
+  namespaceSelector: {}
+  podSelector:
+    matchLabels:
+      egress-test: "$egressip_name"
+EOF
+  done
+} | oc apply -f -
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh`
around lines 150 - 170, Current loop calls oc apply -f - inside each iteration
causing many API calls; instead accumulate all EgressIP manifests into a single
multi-document stream and call oc apply -f - once. Modify the for loop that
builds egressip_name (uses EGRESSIP_NAME_PREFIX and TOTAL_EGRESSIP_OBJECTS) to
append each generated YAML to a buffer or temp file (separated by ---) rather
than piping to oc each time, keep the progress logging, and after the loop run a
single oc apply -f - (or oc apply -f <tempfile>) to submit all manifests in one
API round-trip.

248-271: Minor: avoid the 999 sentinel and the duplicated per-node query.

min_assignments=999 is a magic value that happens to be safe today (max 100 objects) but is easy to break later. Also, this loop re-runs oc get egressip -o jsonpath=... per node (same query as line 210 and the summary loop at 293), which shells out oc 5+ times on the same data. Pull the assignment list once and reduce locally:

♻️ Suggested refactor
-total_assignments=0
-max_assignments=0
-min_assignments=999
-
-for node in "${worker_nodes[@]}"; do
-    node_assignments=$(oc get egressip -o jsonpath='{range .items[*]}{.status.items[0].node}{"\n"}{end}' | grep "^$node$" | wc -l)
-    total_assignments=$((total_assignments + node_assignments))
-
-    if [ "$node_assignments" -gt "$max_assignments" ]; then
-        max_assignments=$node_assignments
-    fi
-
-    if [ "$node_assignments" -lt "$min_assignments" ]; then
-        min_assignments=$node_assignments
-    fi
-done
+assignments=$(oc get egressip -o jsonpath='{range .items[*]}{.status.items[0].node}{"\n"}{end}' | grep -v '^$')
+declare -a counts=()
+for node in "${worker_nodes[@]}"; do
+  counts+=( "$(printf '%s\n' "$assignments" | grep -c "^$node$" || true)" )
+done
+max_assignments=$(printf '%s\n' "${counts[@]}" | sort -n | tail -1)
+min_assignments=$(printf '%s\n' "${counts[@]}" | sort -n | head -1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh`
around lines 248 - 271, Replace the per-node repeated oc query and the magic
sentinel by capturing the egressip node assignment list once (e.g., store output
of oc get egressip -o jsonpath... into a shell variable or array) and then
iterate over worker_nodes to count occurrences from that local data; initialize
min_assignments in a safe way (for example set min_assignments to
total_assignments after computing all counts or set it to the first node's
count) instead of using 999, and update total_assignments, max_assignments,
min_assignments and compute load_balance_diff exactly as before using the local
counts so you avoid shelling out multiple times and remove the magic constant.

30-34: sed | bc timeout parser is fragile for composite durations.

echo "1h30m" | sed 's/h/*3600/g; s/m/*60/g; s/s//g' | bc evaluates 1*360030*60 → 21,601,800 rather than 5400. Today's inputs (30m, 2h) happen to parse correctly, but the helper is wait_for_condition-generic and the default TEST_TIMEOUT doc (ref L26) doesn't advertise a restriction. Either document the restriction or use a parser that handles h/m/s concatenation:

♻️ More robust parser
-    timeout_seconds=$(echo "$timeout" | sed 's/h/*3600/g; s/m/*60/g; s/s//g' | bc)
+    # Accepts forms like "30m", "2h", "1h30m", "90s", or a bare number of seconds.
+    timeout_seconds=0
+    remainder="$timeout"
+    while [[ "$remainder" =~ ^([0-9]+)([hms])(.*)$ ]]; do
+      case "${BASH_REMATCH[2]}" in
+        h) timeout_seconds=$((timeout_seconds + BASH_REMATCH[1] * 3600));;
+        m) timeout_seconds=$((timeout_seconds + BASH_REMATCH[1] * 60));;
+        s) timeout_seconds=$((timeout_seconds + BASH_REMATCH[1]));;
+      esac
+      remainder="${BASH_REMATCH[3]}"
+    done
+    [[ -n "$remainder" ]] && timeout_seconds=$((timeout_seconds + remainder))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh`
around lines 30 - 34, The current timeout parsing using sed|bc to set
timeout_seconds from timeout is fragile for concatenated durations (e.g.,
"1h30m"); replace it with a parser that extracts each number+unit pair from the
timeout string and sums their contributions to seconds (e.g., multiply hours by
3600, minutes by 60, seconds by 1) before assigning timeout_seconds, or
alternatively document the restriction in TEST_TIMEOUT; update the code paths
that use timeout (e.g., wait_for_condition) and keep the existing variables
start_time and timeout_seconds names so callers remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 17d88b9d-e34a-4da8-bbf4-f3db0433cf0c

📥 Commits

Reviewing files that changed from the base of the PR and between 207662e and 750897b.

📒 Files selected for processing (5)
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.22-nightly-x86.yaml
  • ci-operator/jobs/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main-periodics.yaml
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh
  • ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh
  • ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-ref.yaml

Comment on lines +44 to +47
local ready_count
ready_count=$(oc get nodes --no-headers | grep " Ready " | wc -l)
local notready_count
notready_count=$(oc get nodes --no-headers | grep "NotReady" | wc -l)

⚠️ Potential issue | 🔴 Critical

Avoid grep-based counts here under pipefail.

When there are zero NotReady nodes, Line 47 returns a non-zero status and the loop exits immediately because the script is running with errexit + pipefail. That makes a healthy cluster fail before any waiting happens. The same counting pattern should be fixed in the later final-count paths too.
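A minimal repro of the failure mode, under the same shell options the script sets:

#!/bin/bash
set -o errexit
set -o pipefail
# Zero matches: grep exits 1, pipefail propagates it through wc -l,
# and the non-zero exit status of the assignment triggers errexit.
notready_count=$(printf 'node1   Ready   worker   10m   v1.31.0\n' | grep "NotReady" | wc -l)
echo "never reached: $notready_count"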

🐛 Proposed fix
-        local ready_count
-        ready_count=$(oc get nodes --no-headers | grep " Ready " | wc -l)
-        local notready_count
-        notready_count=$(oc get nodes --no-headers | grep "NotReady" | wc -l)
+        local ready_count
+        ready_count=$(oc get nodes --no-headers | awk '$2 ~ /^Ready/ {c++} END {print c+0}')
+        local notready_count
+        notready_count=$(oc get nodes --no-headers | awk '$2 ~ /^NotReady/ {c++} END {print c+0}')
🧰 Tools
🪛 Shellcheck (0.11.0)

[style] 45-45: Consider using 'grep -c' instead of 'grep|wc -l'.

(SC2126)


[style] 47-47: Consider using 'grep -c' instead of 'grep|wc -l'.

(SC2126)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`
around lines 44 - 47, The grep-based pipelines used to set ready_count and
notready_count (and the later final-count occurrences) can fail under
errexit+pipefail when there are zero matches; replace those pipelines with a
safe counting command that always exits 0, e.g. use awk to count matches:
replace `oc get nodes --no-headers | grep " Ready " | wc -l` and `oc get nodes
--no-headers | grep "NotReady" | wc -l` with `oc get nodes --no-headers | awk '/
Ready /{c++}END{print c+0}'` and `oc get nodes --no-headers | awk
'/NotReady/{c++}END{print c+0}'` (apply the same change to the final-count paths
that reference ready_count/notready_count) so the script won’t exit when counts
are zero.

Comment on lines +61 to +65
if [[ $((elapsed % 1800)) -eq 0 && $elapsed -gt 0 ]]; then  # Every 30 minutes
    echo "$(date): Saving 30-minute snapshot" | tee -a "$RESULTS_DIR/network-expansion-timeline.txt"
    oc get nodes -o wide > "$RESULTS_DIR/nodes-snapshot-${elapsed}s.txt"
    oc get machineset -n openshift-machine-api -o wide > "$RESULTS_DIR/machinesets-snapshot-${elapsed}s.txt"
fi

⚠️ Potential issue | 🟡 Minor

The 30-minute snapshot trigger will usually never fire.

elapsed includes command runtime plus the 60-second sleep, so it rarely lands on an exact multiple of 1800. In practice these snapshots will often be skipped entirely.

🐛 Proposed fix
-    while true; do
+    local next_snapshot=1800
+    while true; do
@@
-        if [[ $((elapsed % 1800)) -eq 0 && $elapsed -gt 0 ]]; then  # Every 30 minutes
+        if [[ $elapsed -ge $next_snapshot ]]; then
             echo "$(date): Saving 30-minute snapshot" | tee -a "$RESULTS_DIR/network-expansion-timeline.txt"
             oc get nodes -o wide > "$RESULTS_DIR/nodes-snapshot-${elapsed}s.txt"
             oc get machineset -n openshift-machine-api -o wide > "$RESULTS_DIR/machinesets-snapshot-${elapsed}s.txt"
+            next_snapshot=$((next_snapshot + 1800))
         fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`
around lines 61 - 65, The 30-minute snapshot if-condition using [[ $((elapsed %
1800)) -eq 0 ]] rarely fires because elapsed includes the 60s sleep and won't
hit exact multiples; change the logic in the block that references elapsed to
instead track the last snapshot time (e.g., last_snapshot) or a next_snapshot
threshold and trigger when elapsed - last_snapshot >= 1800 (or elapsed >=
next_snapshot), then update last_snapshot=elapsed (or next_snapshot+=1800)
before taking the oc get nodes/machineset snapshots so snapshots run reliably
every 30 minutes.

Comment on lines +172 to +173
# Wait for nodes to reach expected capacity
wait_for_nodes "$EXPECTED_READY_NODES" "$TEST_TIMEOUT"

⚠️ Potential issue | 🟠 Major

Don’t let a timeout skip all final diagnostics.

If wait_for_nodes times out, errexit stops the script at Line 173 and Step 5/6 never run. That drops the final node snapshots, operator state, and NotReady diagnostics right when the job fails.

🐛 Proposed fix
-# Wait for nodes to reach expected capacity
-wait_for_nodes "$EXPECTED_READY_NODES" "$TEST_TIMEOUT"
+# Wait for nodes to reach expected capacity, but keep collecting diagnostics on timeout
+wait_for_nodes_failed=0
+if ! wait_for_nodes "$EXPECTED_READY_NODES" "$TEST_TIMEOUT"; then
+    wait_for_nodes_failed=1
+fi

Then fail after the summaries and diagnostics are written, for example:

if [[ $wait_for_nodes_failed -ne 0 ]]; then
    echo "❌ FAIL: Timed out waiting for expected ready node count"
    exit 1
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`
around lines 172 - 173, The script currently calls wait_for_nodes
"$EXPECTED_READY_NODES" "$TEST_TIMEOUT" which, under errexit, will abort the
script on timeout and skip the final diagnostics; change the call to capture its
exit status (e.g., call wait_for_nodes ...; wait_for_nodes_failed=$?), then
always run the remaining Step 5/6 diagnostics (node snapshots, operator state,
NotReady diagnostics) regardless of that status, and after those summaries if
wait_for_nodes_failed is non-zero exit with a non-zero code (for example echo a
FAIL message and exit 1). Ensure references to wait_for_nodes and the captured
status variable (e.g., wait_for_nodes_failed) are used so diagnostics always
execute before failing the step.

Comment on lines +150 to +170
for i in $(seq 0 $((TOTAL_EGRESSIP_OBJECTS - 1))); do
    egressip_name="${EGRESSIP_NAME_PREFIX}-${i}"

    cat <<EOF | oc apply -f -
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: $egressip_name
spec:
  egressIPs: []
  namespaceSelector: {}
  podSelector:
    matchLabels:
      egress-test: "$egressip_name"
EOF

    # Log progress every 20 objects
    if [ $((i % 20)) -eq 0 ]; then
        echo "Created $((i + 1))/$TOTAL_EGRESSIP_OBJECTS EgressIP objects..."
    fi
done

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the actual script file
find . -name "*ocpbugs-45891-egress-ip-scale*" -type f

Repository: openshift/release

Length of output: 707


🏁 Script executed:

# Search for "Jean" or references to original scripts in the codebase
rg -i "jean" --type sh --type md -A 3 -B 3

Repository: openshift/release

Length of output: 2435


🏁 Script executed:

# Search for EgressIP-related documentation or comments
rg -i "egressip" --type sh --type md -C 5

Repository: openshift/release

Length of output: 50376


🏁 Script executed:

cat -n ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh

Repository: openshift/release

Length of output: 14170


🏁 Script executed:

# Search for other EgressIP creation patterns in the codebase
rg "kind: EgressIP" -A 10 --type sh --type yaml | head -100

Repository: openshift/release

Length of output: 4622


🏁 Script executed:

# Look for any documentation or comments about EgressIP spec requirements
rg -i "egressip.*spec|spec.*egressip" -B 2 -A 5 --type sh --type md | head -80

Repository: openshift/release

Length of output: 7672


🏁 Script executed:

# Check if there's any reference to auto-allocation or empty egressIPs handling
rg -i "empty.*egress|auto.*alloc.*egress" --type sh --type md

Repository: openshift/release

Length of output: 43


🌐 Web query:

OpenShift OVN-Kubernetes EgressIP spec egressIPs field required

💡 Result:

Yes, the egressIPs field in the OpenShift OVN-Kubernetes EgressIP spec is required. Official Red Hat and OKD documentation for OpenShift Container Platform versions 4.12 through 4.19, as well as the OVN-Kubernetes project site, consistently describe it as mandatory. The CRD schema explicitly lists egressIPs under required fields in .spec, alongside namespaceSelector. All example YAML configurations include at least one IP address in the egressIPs array, described as "An array of one or more IP addresses" or "the list of egress IP addresses requested. Can be IPv4 and/or IPv6. This field is mandatory."



Critical: egressIPs: [] violates CRD requirements — the field is mandatory and must contain IP addresses.

Per official OpenShift OVN-Kubernetes documentation (versions 4.12–4.19), the spec.egressIPs field is required and must contain at least one IP address. The CRD schema explicitly enforces this. Creating EgressIP objects with an empty array leaves no addresses for the controller to bind to egress-assignable nodes, resulting in:

  • Zero assignments across all 100 EgressIP objects
  • Zero CloudPrivateIPConfig objects created
  • Test timeout at line 182 (wait condition never reaches threshold)
  • Assertion failures at lines 233 and 241 (expected ~98 vs. actual 0 assignments)

STEP 2 correctly reads the node's cloud.network.openshift.io/egress-ipconfig annotation (line 102–114) but STEP 4 never uses it. The codebase's reference implementation (ovn-sdn-migration-feature-migration-config-egress-ip) demonstrates the correct approach: derive IP addresses from the egress-ipconfig subnet and populate spec.egressIPs with actual addresses.

Fix pattern from reference implementation
+# Derive usable EgressIPs from node egress-ipconfig (like sdn-migration reference)
+declare -a EGRESS_IP_POOL
+for node in "${worker_nodes[@]}"; do
+    egress_config=$(oc get node "$node" -o jsonpath='{.metadata.annotations.cloud\.network\.openshift\.io/egress-ipconfig}')
+    subnet=$(echo "$egress_config" | jq -r '.[0].ifaddr.ipv4')
+    # Derive unique IPs from subnet for use in EGRESS_IP_POOL[@]
+done
+
 for i in $(seq 0 $((TOTAL_EGRESSIP_OBJECTS - 1))); do
     egressip_name="${EGRESSIP_NAME_PREFIX}-${i}"
     cat <<EOF | oc apply -f -
 apiVersion: k8s.ovn.org/v1
 kind: EgressIP
 metadata:
   name: $egressip_name
 spec:
-  egressIPs: []
+  egressIPs:
+  - ${EGRESS_IP_POOL[$i]}
   namespaceSelector: {}
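One way to fill in the elided "derive unique IPs" step, as a sketch only (the helper name, the python3 dependency, and the reserved-address offset are assumptions, not taken from the PR):

# Hypothetical helper: expand an IPv4 CIDR from the egress-ipconfig annotation
# (e.g. "10.0.128.0/18") into N usable host addresses. Python's .hosts() already
# excludes the network/broadcast addresses; 'skip' leaves room for IPs the cloud
# provider may reserve at the start of the subnet.
derive_egress_ips() {
    local cidr="$1" count="$2" skip="${3:-4}"
    python3 - "$cidr" "$count" "$skip" <<'PY'
import ipaddress, sys
net = ipaddress.ip_network(sys.argv[1], strict=False)
hosts = list(net.hosts())[int(sys.argv[3]):]
for ip in hosts[:int(sys.argv[2])]:
    print(ip)
PY
}
# Usage sketch:
# mapfile -t EGRESS_IP_POOL < <(derive_egress_ips "$subnet" "$TOTAL_EGRESSIP_OBJECTS")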
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh`
around lines 150 - 170, The loop creating EgressIP objects currently writes
spec.egressIPs: [] which violates the CRD; change the creation to populate
spec.egressIPs with real IPs derived from the node annotation
cloud.network.openshift.io/egress-ipconfig (read earlier in the script) instead
of an empty array. Parse the egress-ipconfig subnet, allocate a distinct usable
IP per EgressIP (e.g., increment the host portion for each index), and
substitute that IP into the cat <<EOF block for each egressip_name in the for
loop that uses EGRESSIP_NAME_PREFIX and TOTAL_EGRESSIP_OBJECTS so spec.egressIPs
contains at least one valid address per object and addresses stay within the
subnet. Ensure uniqueness and valid host addresses (skip network/broadcast) so
the controller can bind them and downstream waits/assertions succeed.

Comment on lines +180 to +183
wait_for_condition \
    "EgressIP assignments to stabilize" \
    "[ \$(oc get egressip -o jsonpath='{range .items[*]}{.status.items[0].node}{\"\n\"}{end}' | grep -v '^\$' | wc -l) -ge $((EXPECTED_ASSIGNED_EGRESSIPS - 5)) ]" \
    "$TEST_TIMEOUT"

⚠️ Potential issue | 🟠 Major

Wait-condition slack (−5) is looser than the pass/fail slack (±2).

The wait returns success as soon as assigned ≥ EXPECTED_ASSIGNED_EGRESSIPS - 5 (i.e., 93), but STEP 7 (line 233) only passes if assigned ∈ [EXPECTED-2, EXPECTED+2] (i.e., 96–100). If the system stabilizes at 93–95 assignments, the wait completes "successfully" and then STEP 7 immediately fails without giving the controller any more time to converge. Use the same tolerance — or better, wait for stabilization (same count observed N consecutive polls) rather than a threshold.

♻️ Minimum-churn fix: align the thresholds
 wait_for_condition \
     "EgressIP assignments to stabilize" \
-    "[ \$(oc get egressip -o jsonpath='{range .items[*]}{.status.items[0].node}{\"\n\"}{end}' | grep -v '^\$' | wc -l) -ge $((EXPECTED_ASSIGNED_EGRESSIPS - 5)) ]" \
+    "[ \$(oc get egressip -o jsonpath='{range .items[*]}{.status.items[0].node}{\"\n\"}{end}' | grep -v '^\$' | wc -l) -ge $((EXPECTED_ASSIGNED_EGRESSIPS - 2)) ]" \
     "$TEST_TIMEOUT"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-commands.sh`
around lines 180 - 183, The current wait_for_condition uses a looser slack
(EXPECTED_ASSIGNED_EGRESSIPS - 5) than the later strict STEP 7 check, so change
the wait to use the same tolerance or implement a stabilization check: replace
the condition expression passed to wait_for_condition so it requires assigned
count >= $((EXPECTED_ASSIGNED_EGRESSIPS - 2)) (matching the ±2 tolerance), or
(preferably) modify wait_for_condition invocation to loop until the observed
assigned count is identical for N consecutive polls (e.g., 3) before returning
success; reference the wait_for_condition call, EXPECTED_ASSIGNED_EGRESSIPS, and
TEST_TIMEOUT when making this change.

@SachinNinganure (Contributor Author)

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes

@openshift-merge-bot (Contributor)

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@SachinNinganure (Contributor Author)

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes

@openshift-merge-bot (Contributor)

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`:
- Around line 18-27: The check_cluster_health function's failure is lost because
the while loop runs in a subshell via a pipe; change the loop to run in the
current shell (e.g., use process substitution: while read name _ available
progressing degraded _ _; do ... done < <(oc get co --no-headers)) and ensure
failures propagate by setting a local flag or returning immediately from the
function (not the subshell) when a non-healthy operator is detected; update
references inside check_cluster_health (the oc get co invocation and the while
loop) so the function actually returns non-zero on error and preserves the
existing error message format.
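For reference, a minimal sketch of the process-substitution form described above (the column handling assumes the standard oc get co output of NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE; the real function's message format may differ):

check_cluster_health() {
    local unhealthy=0
    while read -r name _ available _ degraded _; do
        if [[ "$available" != "True" || "$degraded" != "False" ]]; then
            echo "ERROR: cluster operator $name is unhealthy (Available=$available, Degraded=$degraded)"
            unhealthy=1
        fi
    done < <(oc get co --no-headers)   # runs in the current shell, so the flag and return survive
    return "$unhealthy"
}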

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0443e0db-4e82-406b-8bbc-f0835630972e

📥 Commits

Reviewing files that changed from the base of the PR and between 750897b and e8c0045.

📒 Files selected for processing (1)
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh

@SachinNinganure (Contributor Author)

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes

@openshift-merge-bot (Contributor)

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 832d49fc-a852-4237-9fdf-cd16de7ed385

📥 Commits

Reviewing files that changed from the base of the PR and between e8c0045 and 6a762cc.

📒 Files selected for processing (11)
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.20-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.21-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.22-nightly-x86.yaml
  • ci-operator/step-registry/ipi/install-config/network-custom/ipi-install-config-network-custom-ref.yaml
  • ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/OWNERS
  • ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/openshift-qe-installer-aws-network-expansion-workflow.metadata.json
  • ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/openshift-qe-installer-aws-network-expansion-workflow.yaml
  • ci-operator/step-registry/openshift-qe/network-perf/openshift-qe-network-perf-ref.yaml
  • ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh
  • ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-ref.yaml
✅ Files skipped from review due to trivial changes (7)
  • ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/OWNERS
  • ci-operator/step-registry/openshift-qe/network-perf/openshift-qe-network-perf-ref.yaml
  • ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/openshift-qe-installer-aws-network-expansion-workflow.metadata.json
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.22-nightly-x86.yaml
  • ci-operator/step-registry/openshift-qe/installer/aws/network-expansion/openshift-qe-installer-aws-network-expansion-workflow.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.21-nightly-datapath.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.20-nightly-datapath.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • ci-operator/step-registry/openshift-qe/ocpbugs-45891-egress-ip-scale/openshift-qe-ocpbugs-45891-egress-ip-scale-ref.yaml
  • ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__4.22-nightly-datapath.yaml

Comment on lines +3 to +7
from_image:
  namespace: ocp
  name: "4.22"
  tag: cli
commands: ipi-install-config-network-custom-commands.sh

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 1) What image do nearby install-config refs use for the same kind of yaml-patching work?
fd -t f -e yaml . ci-operator/step-registry/ipi/install-config | xargs -I{} sh -c 'echo "=== {} ==="; sed -n "1,20p" {}'

# 2) Confirm that the commands script relies on yq (so the image must provide it).
fd -t f 'ipi-install-config-network-custom-commands.sh' --exec cat {}

# 3) Find any other refs pinned to ocp/4.22:cli to see if this is an established pattern or a one-off.
rg -nP --type=yaml -C2 'namespace:\s*ocp' ci-operator/step-registry | rg -nP -C3 '"4\.22"'

Repository: openshift/release

Length of output: 6517


🌐 Web query:

Does the OpenShift CI ocp/<version>:cli image include the yq binary, or only oc?

💡 Result:

The OpenShift CI ocp/<version>:cli image includes only the oc binary (and symlinks like kubectl), not the yq binary.



🌐 Web query:

Does the OpenShift ocp/<version>:upi-installer image include the yq binary?

💡 Result:

Yes. OpenShift’s UPI installer image includes the yq binary. Evidence: the OpenShift installer repository’s UPI installer container build explicitly downloads/copies yq into the image (e.g., it adds a curl step that writes the amd64 yq binary to /bin/yq-go, and later changes copy yq3/yq4 into the image under /bin/yq-go and /bin/yq-v4). Separate PRs also describe adding yq-v4/yq3 into the upi-installer image so CI/installation steps don’t need to download it at runtime.



This step will fail at runtime: ocp/4.22:cli does not include yq, but the commands script requires it.

The script uses yq eval-all to merge the network configuration patch into install-config.yaml. However, the OpenShift CLI image (ocp/<version>:cli) ships only the oc binary and symlinks like kubectl—it does not include yq. Switch to ocp/4.22:upi-installer, which includes yq and is already used by other installer-related steps in the registry.

Additionally, avoid hard-pinning a version-agnostic YAML-patching step to OCP 4.22. This ref is already reused by ipi-aws-ovn-custom-network-workflow.yaml and will constrain all future consumers (4.23+, earlier branches) to a stale CLI image. Once the image tag is corrected, consider whether version pinning is necessary.

🔧 Fix
   from_image:
     namespace: ocp
     name: "4.22"
-    tag: cli
+    tag: upi-installer
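For context, the yq dependency comes from a merge along these lines (file names and patch contents here are illustrative, not from the PR; note the upi-installer image ships the binary as yq-go/yq-v4, so the exact invocation may need to match):

cat > /tmp/network-patch.yaml <<'EOF'
networking:
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
EOF
# yq v4 deep-merge of the patch into the install config; this is the call
# that fails on ocp/<version>:cli because that image carries no yq at all.
yq eval-all 'select(fileIndex == 0) * select(fileIndex == 1)' \
    "${SHARED_DIR}/install-config.yaml" /tmp/network-patch.yaml > /tmp/install-config-merged.yaml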
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/ipi/install-config/network-custom/ipi-install-config-network-custom-ref.yaml`
around lines 3 - 7, The step's base image lacks yq; update the from_image to use
the installer image by changing the tag from "cli" to "upi-installer" (i.e. keep
namespace: ocp and name: "4.22" but set tag: upi-installer) so the script can
use yq, and to avoid unnecessarily hard-pinning the step for all consumers
consider removing or generalizing the version pin (the name: "4.22" field) so
future workflows can use newer OCP branches instead of being locked to 4.22.

Comment on lines +37 to +46
local timeout_seconds
if [[ "$timeout" =~ ([0-9]+)h ]]; then
    timeout_seconds=$((${BASH_REMATCH[1]} * 3600))
elif [[ "$timeout" =~ ([0-9]+)m ]]; then
    timeout_seconds=$((${BASH_REMATCH[1]} * 60))
elif [[ "$timeout" =~ ([0-9]+)s ]]; then
    timeout_seconds=${BASH_REMATCH[1]}
else
    timeout_seconds=7200 # Default 2 hours
fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "*openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh" -type f

Repository: openshift/release

Length of output: 203


🏁 Script executed:

cat -n ./ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh

Repository: openshift/release

Length of output: 18662


🏁 Script executed:

cd ./ci-operator/step-registry && grep -r "TEST_TIMEOUT" . --include="*.sh" | head -20

Repository: openshift/release

Length of output: 3417


🏁 Script executed:

grep -r "TEST_TIMEOUT" ./ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/ --include="*.sh" -B2 -A2

Repository: openshift/release

Length of output: 1797


🏁 Script executed:

find ./ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion -type f | xargs cat | grep -i timeout

Repository: openshift/release

Length of output: 781


🏁 Script executed:

find ./ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion -type f \( -name "*.yaml" -o -name "*.yml" \) | xargs cat 2>/dev/null

Repository: openshift/release

Length of output: 1999


🏁 Script executed:

python3 << 'EOF'
# Verify bash regex behavior for compound timeouts
import subprocess

test_cases = [
    "1h",
    "30m", 
    "1h30m",
    "2h",
    "8h0m0s",
    "1h30m15s"
]

bash_script = """
test_timeout() {
    local timeout="$1"
    local timeout_seconds
    
    if [[ "$timeout" =~ ([0-9]+)h ]]; then
        timeout_seconds=$((${BASH_REMATCH[1]} * 3600))
    elif [[ "$timeout" =~ ([0-9]+)m ]]; then
        timeout_seconds=$((${BASH_REMATCH[1]} * 60))
    elif [[ "$timeout" =~ ([0-9]+)s ]]; then
        timeout_seconds=${BASH_REMATCH[1]}
    else
        timeout_seconds=7200
    fi
    
    echo "$timeout_seconds"
}

for tc in "$@"; do
    result=$(test_timeout "$tc")
    echo "$tc -> $result seconds"
done
"""

# Write script to temp file and execute
with open('/tmp/test_timeout.sh', 'w') as f:
    f.write(bash_script)

cmd = ['bash', '/tmp/test_timeout.sh'] + test_cases
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)
EOF

Repository: openshift/release

Length of output: 191


🏁 Script executed:

grep -r "TEST_TIMEOUT" ./ci-operator/step-registry/openshift-qe/ --include="*.md" -B2 -A2

Repository: openshift/release

Length of output: 43


🏁 Script executed:

find ./ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion -type f \( -name "README*" -o -name "*.md" \)

Repository: openshift/release

Length of output: 43


Compound timeouts like 1h30m are parsed incorrectly.

The elif chain stops at the first matching unit. For TEST_TIMEOUT="1h30m" you get 1*3600 = 3600s (the 30m is dropped); 8h0m0s happens to work only because the first branch captures 8h. The default 6h also works today, but any compound value supplied by callers will silently short-circuit the wait and fall through to Step 5 with an under-provisioned cluster.

Current test configuration uses only single-unit formats (6h default in script, 2h in YAML, 8h0m0s at step registry level), but the function is reusable and parameterizable, making this a latent bug that should be fixed to prevent issues when compound timeouts are eventually passed.

🐛 Proposed fix — accumulate all unit matches
-    local timeout_seconds
-    if [[ "$timeout" =~ ([0-9]+)h ]]; then
-        timeout_seconds=$((${BASH_REMATCH[1]} * 3600))
-    elif [[ "$timeout" =~ ([0-9]+)m ]]; then
-        timeout_seconds=$((${BASH_REMATCH[1]} * 60))
-    elif [[ "$timeout" =~ ([0-9]+)s ]]; then
-        timeout_seconds=${BASH_REMATCH[1]}
-    else
-        timeout_seconds=7200  # Default 2 hours
-    fi
+    local timeout_seconds=0
+    local remaining="$timeout"
+    if [[ "$remaining" =~ ^([0-9]+)h ]]; then
+        timeout_seconds=$((timeout_seconds + BASH_REMATCH[1] * 3600))
+        remaining="${remaining#${BASH_REMATCH[0]}}"
+    fi
+    if [[ "$remaining" =~ ^([0-9]+)m ]]; then
+        timeout_seconds=$((timeout_seconds + BASH_REMATCH[1] * 60))
+        remaining="${remaining#${BASH_REMATCH[0]}}"
+    fi
+    if [[ "$remaining" =~ ^([0-9]+)s ]]; then
+        timeout_seconds=$((timeout_seconds + BASH_REMATCH[1]))
+        remaining="${remaining#${BASH_REMATCH[0]}}"
+    fi
+    if [[ -n "$remaining" || "$timeout_seconds" -eq 0 ]]; then
+        echo "ERROR: Unsupported timeout format: $timeout" >&2
+        timeout_seconds=7200
+    fi
🧰 Tools
🪛 Shellcheck (0.11.0)

[style] 39-39: $/${} is unnecessary on arithmetic variables.

(SC2004)


[style] 41-41: $/${} is unnecessary on arithmetic variables.

(SC2004)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`
around lines 37 - 46, The current elif chain only captures the first time unit;
instead parse and accumulate all unit matches from the timeout string (e.g.,
"1h30m") by looping with a regex like ([0-9]+)([hms]) and summing into
timeout_seconds: use while [[ $timeout =~ ([0-9]+)([hms]) ]]; do
value=${BASH_REMATCH[1]}; unit=${BASH_REMATCH[2]}; case $unit in h)
timeout_seconds=$((timeout_seconds + value*3600)) ;; m)
timeout_seconds=$((timeout_seconds + value*60)) ;; s)
timeout_seconds=$((timeout_seconds + value)) ;; esac; truncate the matched
prefix from timeout (e.g., timeout=${timeout#*${BASH_REMATCH[0]}}) and repeat
until no match; initialize timeout_seconds=0 before the loop so all units
(hours, minutes, seconds) are accumulated correctly.

notready_count=$(oc get nodes --no-headers | grep "NotReady" | wc -l)
local total_count
total_count=$(oc get nodes --no-headers | wc -l)
local pending_count=$((target_ready + 10 - total_count)) # Estimate pending

⚠️ Potential issue | 🟡 Minor

pending_count formula is meaningless and can go negative.

pending_count=$((target_ready + 10 - total_count)) has no grounding: once total_count > target_ready + 10 it becomes negative, and it double-counts NotReady nodes (which are already part of total_count). The CSV column this feeds is misleading for later analysis.

Either drop the column, or compute the actual machine-vs-node delta, e.g.:

♻️ Proposed fix
-        local pending_count=$((target_ready + 10 - total_count))  # Estimate pending
+        local desired_machines
+        desired_machines=$(oc get machineset -n openshift-machine-api --no-headers \
+            -o custom-columns=DESIRED:.spec.replicas 2>/dev/null \
+            | awk '{s+=$1} END {print s+0}')
+        local pending_count=$(( desired_machines > total_count ? desired_machines - total_count : 0 ))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`
at line 58, The pending_count calculation is incorrect and can go negative;
replace the meaningless formula pending_count=$((target_ready + 10 -
total_count)) with a real machine-vs-node delta: compute the current machine
count (e.g. machines_count) minus total_count to represent how many machines
lack corresponding nodes, clamp the result to zero (no negatives) before writing
the CSV, or simply drop the column; update usages of pending_count accordingly
(look for pending_count, total_count, target_ready in the script).

Comment on lines +253 to +256
notready_node=$(oc get nodes --no-headers | grep "NotReady" | head -1 | awk '{print $1}')
if [[ -n "$notready_node" ]]; then
    echo "Sample error from $notready_node:"
    oc describe node "$notready_node" | tee "$RESULTS_DIR/notready-node-details-$notready_node.txt"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

wc -l ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh

Repository: openshift/release

Length of output: 205


🏁 Script executed:

head -30 ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh

Repository: openshift/release

Length of output: 1185


🏁 Script executed:

sed -n '240,270p' ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh

Repository: openshift/release

Length of output: 1831


Replace pipeline with awk to avoid SIGPIPE under pipefail.

The script enables both set -o errexit and set -o pipefail (lines 2–3). When head -1 exits after its first line and closes the read end of the pipe, the upstream grep can receive SIGPIPE while writing a second matching line, causing the entire pipeline to fail and abort script execution before the NotReady diagnostics are collected.
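A quick illustration of the race (whether it fires depends on timing and buffering, which is what makes this flaky rather than consistently broken):

set -o pipefail
seq 1 200000 | grep 0 | head -1 > /dev/null
echo "pipeline status: $?"   # frequently 141 (128 + SIGPIPE): grep keeps writing after head exits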

Proposed fix
-    notready_node=$(oc get nodes --no-headers | grep "NotReady" | head -1 | awk '{print $1}')
+    notready_node=$(oc get nodes --no-headers \
+        | awk '$2 ~ /^NotReady/ {print $1; exit}')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift-qe/ocp-61589-max-nodes-network-expansion/openshift-qe-ocp-61589-max-nodes-network-expansion-commands.sh`
around lines 253 - 256, The pipeline that sets notready_node using `oc get nodes
--no-headers | grep "NotReady" | head -1 | awk '{print $1}'` can trigger SIGPIPE
under `set -o pipefail`; replace that pipeline with a single awk invocation to
find the first NotReady node (e.g., use `oc get nodes --no-headers | awk '$0 ~
/NotReady/ {print $1; exit}'`) so the variable assignment in notready_node and
subsequent logic (the echo and `oc describe node "$notready_node" | tee ...`
blocks) will not be aborted by a broken pipe.

@SachinNinganure (Contributor Author)

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes

@openshift-merge-bot (Contributor)

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot (Contributor)

@SachinNinganure, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not get step registry differences: could not find registry component in registry graph: ref/openshift-qe-ovn-service-sync-perf
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-merge-bot
Contributor

@SachinNinganure, pj-rehearse: unable to determine affected jobs ERROR:

could not determine changed registry steps: could not get step registry differences: could not find registry component in registry graph: ref/openshift-qe-ovn-service-sync-perf

If the problem persists, please contact Test Platform.
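
For context on the ref/openshift-qe-ovn-service-sync-perf failure: step-registry components resolve by a strict directory-and-filename convention, so a ref with that name is expected at roughly the following layout (a sketch of the convention, not the PR's actual contents):

ci-operator/step-registry/openshift-qe/ovn-service-sync-perf/
├── openshift-qe-ovn-service-sync-perf-ref.yaml
├── openshift-qe-ovn-service-sync-perf-commands.sh
└── OWNERS

A config that references the ref while these files are missing, misnamed, or present only on an unrebased branch leaves the registry graph unable to resolve it, which matches the error above.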

@SachinNinganure
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes

@openshift-merge-bot
Contributor

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@SachinNinganure
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes

@openshift-merge-bot
Contributor

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@SachinNinganure
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.22-nightly-x86-ovn-service-sync-perf-5nodes

@openshift-merge-bot
Contributor

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Apr 29, 2026
@openshift-merge-bot
Contributor

@SachinNinganure, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

couldn't prepare candidate: couldn't rebase candidate onto d47ab7ef01f6469a00cbc0ac814e4d93d9ce6c0b due to conflicts
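
A typical way to clear this locally (a sketch; the upstream remote name and base branch are assumptions based on the PR metadata, and automate-network is the PR branch):

# Rebase the PR branch onto the tip pj-rehearse tried to use, then force-push.
git fetch upstream
git rebase upstream/main    # resolve conflicts as git reports them, then: git rebase --continue
git push --force-with-lease origin automate-network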

@SachinNinganure
Contributor Author

/pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.22-nightly-x86-ovn-service-sync-perf-5nodes

…ation

  - Add EgressIP scale test (OCPBUGS-45891) with m6a.16xlarge instances
  - Add OVN Service Sync performance test for GCP 4.22 CI
  - Fix network expansion workflow order to prevent CIDR override
  - Reduce EgressIP test scale from 100 to 30 objects with debugging
  - Fix OVN service sync timestamp parsing and container detection
  - Add missing step parameter declarations for CI validation
  - Remove malformed duplicate configuration file
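
The workflow-order fix in the list above comes down to step sequencing: the install-config network customization must run before the OVN configuration so the custom CIDRs are not overwritten. Roughly, in step-registry workflow form (a sketch using the PR's workflow name; whether each entry is a ref or a chain is approximated from registry conventions):

workflow:
  as: ipi-aws-ovn-custom-network
  steps:
    pre:
    - ref: ipi-conf-aws
    - ref: ipi-install-config-network-custom  # patch install-config CIDRs first
    - ref: ovn-conf                           # OVN config afterwards, so it cannot clobber them
    - chain: ipi-install
    post:
    - chain: ipi-aws-post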
@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@SachinNinganure: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-compact-cp-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-conc-builds-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-control-plane-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-egress-ip-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-node-density-cni-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.18-nightly-x86-router-perf-24nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-nad-network-expansion-3nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes openshift-eng/ocp-qe-perfscale-ci presubmit Presubmit changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.23-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Ci-operator config changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-5.0-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Ci-operator config changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.22-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Ci-operator config changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.23-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-metal-x86-data-path-6nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-metal-x86-virt-datapath openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.22-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.23-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.23-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.23-nightly-x86-data-path-ipsec-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa_hcp-4.21-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-metal-4.19-nightly-x86-weekly-6nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.22-nightly-x86-data-path-9nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-metal-4.19-nightly-x86-netperf-external-sno openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed
pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-metal-4.19-nightly-x86-daily-virt-6nodes openshift-eng/ocp-qe-perfscale-ci presubmit Registry content changed

A total of 143 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here


@SachinNinganure
Contributor Author

/pj-rehearse pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.22-nightly-x86-ovn-service-sync-perf-5nodes pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-nad-network-expansion-3nodes

@openshift-merge-bot
Contributor

@SachinNinganure: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

@SachinNinganure: job(s): either don't exist or were not found to be affected, and cannot be rehearsed

@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

@SachinNinganure: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/rehearse/periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes 8c347c8 link unknown /pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-egress-ip-scale-2xlnodes
ci/rehearse/periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes 8c347c8 link unknown /pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes
ci/rehearse/periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.22-nightly-x86-ovn-service-sync-perf-5nodes 8ca5f7c link unknown /pj-rehearse periodic-ci-openshift-eng-ocp-qe-perfscale-ci-main-gcp-4.22-nightly-x86-ovn-service-sync-perf-5nodes
ci/rehearse/openshift-eng/ocp-qe-perfscale-ci/main/aws-4.22-nightly-x86-nad-network-expansion-3nodes 8ca5f7c link unknown /pj-rehearse pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-nad-network-expansion-3nodes
ci/prow/release-controller-config 8ca5f7c link true /test release-controller-config
ci/rehearse/openshift-eng/ocp-qe-perfscale-ci/main/aws-4.22-nightly-x86-network-expansion-520nodes 8ca5f7c link unknown /pj-rehearse pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-4.22-nightly-x86-network-expansion-520nodes

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci Bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 11, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 11, 2026

PR needs rebase.

