PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario)#630
PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario)#630james-nesbitt wants to merge 3 commits into
Conversation
## Smoke test Added TestUpgradeLegacyToModern (test/smoke/upgrade_test.go): - Provisions RHEL8/Rocky8/Ubuntu22, installs MCR stable-25.0 / MKE 3.8.8, then upgrades in place to MCR stable-29.2 / MKE 3.9.2 via a second Apply(). - runUpgradeTest() helper mirrors runSmokeTest() structure (defer destroy, resource tagging, temp SSH dir). - bumpVersions() unmarshals Terraform-generated launchpad_yaml, updates spec.mcr.channel and spec.mke.version, re-marshals — preserving host addresses, SANs, LB names, and install flags verbatim. - make smoke-upgrade target (90m timeout). - smoke-upgrade CI job in .github/workflows/smoke-tests.yaml, gated by smoke-upgrade or smoke-test PR label. CI result: PASS (run 25721416884, 1320s). ## Documentation AGENTS.md (replaces CLAUDE.md + AI_AGENTS.md): - Consolidated into the AGENTS.md open standard (Linux Foundation, supported by Claude Code, Cursor, Windsurf, Codex, Gemini CLI, Aider, and others). - Covers: project overview, non-negotiable rules, build/test commands, phase manager architecture, config schema (v1.6 with example), smoke test reference table, contributing guidelines, multi-engineer workflow guidance, and documentation index. docs/development/smoke-tests.md (new): - Complete authoring guide for new smoke tests: framework mechanics, all 14 available platforms, runSmokeTest / runUpgradeTest / bumpVersions usage with annotated examples, Windows-specific requirements, CI wiring (Makefile + workflow + PR label), timeout guidance, Reset best-effort rationale, and a pre-submission checklist. docs/development/workflow.md: - Replace stale smoke-small/smoke-full with actual four targets. - Add --tags testing note for unit tests. - Remove non-existent make build-release / make sign-release. - Add current apiVersion to schema safety guideline. docs/specifications/architecture.md: - Fix package path: pkg/product/mke/api -> pkg/product/mke/config. - Add current apiVersion and full spec structure. - Add abridged Apply and Reset phase sequences. - Document UninstallMKE swarm dissolution fallback (PRODENG-3442). Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
|
This is waiting for input from @trifo13 |
4df8031 to
03683d2
Compare
…enario) Adds TestAirgappedMultiHopUpgrade, a smoke test that exercises the full upgrade chain similar as to what CSO EMEA observed in a specific customer scenario: install with MCR 25.0 / MKE 3.8.8 / MSR 2.9.27, then upgrade through 3.8.11 → 3.8.12 (MCR 29.2) → 3.9.2 → latest MKE 3.x / MCR 29.x. All post-install upgrade steps pull images from an internal DTR exposed on a non-standard port (4443) via an NLB, simulating an airgapped registry configuration. Key design decisions -------------------- - Image preload strategy: rather than pushing images to DTR (which requires namespace provisioning and hits DTR auth edge cases), all upgrade images are pulled from docker.io/mirantis on every node and tagged with the DTR registry address. Launchpad's "Pull MKE images" phase runs docker image inspect before docker pull; finding the image locally it skips the pull entirely. This exercises Launchpad's imageRepo feature without requiring actual DTR push/pull. - SSH key compatibility: Terraform's tls_private_key emits OpenSSH-format ed25519 keys that golang.org/x/crypto/ssh.ParsePrivateKey rejects. All remote commands use the system ssh binary (sshRun/sshRunScript) to avoid Go-side key parsing. - DTR image listing: UCP bootstrapper uses "images --list"; DTR 2.x uses "images" (the --list flag is unrecognised and causes help text on stdout with exit 0). The preload script filters output for valid image-reference patterns and falls back to the plain "images" subcommand automatically. - Dynamic latest-version step: fetchLatestMKEVersion queries Docker Hub tags for mirantis/ucp; fetchLatestMCRChannel probes the Mirantis apt repository (channels are non-sequential so the probe scans the full range rather than stopping at the first 404). The dynamic step is appended only when it differs from the last fixed step. Supporting changes ------------------ - examples/terraform/aws-simple: add msr_port variable (default 443) so the NLB can expose DTR on a non-standard port; all other smoke tests are unaffected. --dtr-external-url appends :PORT only when port != 443. - test/platforms.go: fix RHEL8 MCR install — disable the container-tools module stream before installing MCR to prevent the system runc from conflicting with Mirantis's containerd.io-runc. - Makefile: add smoke-airgapped-multi-hop target (-timeout 200m). - .github/workflows/smoke-tests.yaml: add smoke-airgapped-multi-hop CI job triggered by the smoke-test or smoke-airgapped-multi-hop PR labels. Tested: two passing runs (2537s with 3 fixed steps; 3329s with 4 steps including the dynamic latest-version step MKE 3.9.3 / MCR stable-29.4). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
03683d2 to
6740adc
Compare
| timeout-minutes: 205 | ||
| if: | | ||
| github.event_name == 'push' || | ||
| contains(github.event.pull_request.labels.*.name, 'smoke-test') || |
There was a problem hiding this comment.
remove this from the general smoke-test list, and make it comprehensive (remove this line)
| // Disable the container-tools module stream before MCR install. RHEL8 | ||
| // AppStream pulls in system runc as a container-selinux dependency; that | ||
| // package conflicts with Mirantis's containerd.io-runc at install time. | ||
| UserData: "sudo dnf module disable container-tools -y; sudo firewall-cmd --permanent --add-port=2377/tcp --add-port=7946/tcp --add-port=7946/udp --add-port=4789/udp --add-port=10250/tcp; sudo firewall-cmd --reload", |
There was a problem hiding this comment.
there should be a way to pass in custom userdata in the nodegroup variables instead of changing the default. does it make sense to always use this override for all tests? if so, does this need to be applied to the other rhel versions instead of just rhel 8 ?
There was a problem hiding this comment.
Pull request overview
Adds a new smoke-test scenario to exercise a customer-style “airgapped” multi-hop upgrade path, plus supporting Terraform/CI/doc updates so the scenario can be provisioned and run in automation.
Changes:
- Added
TestAirgappedMultiHopUpgradesmoke test that provisions MSR behind a non-standard external port and performs multi-step MKE/MCR upgrades usingimageRepooverrides. - Extended the shared AWS Terraform smoke module with an
msr_portinput and updated MSR ingress/--dtr-external-urlgeneration. - Updated CI workflow and developer/agent documentation for smoke-test authoring and execution.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
test/smoke/airgapped_multi_hop_upgrade_test.go |
New multi-hop “airgapped” upgrade smoke test (custom registry port + sequential upgrade chain). |
test/platforms.go |
Adjusts RHEL8 userdata to avoid MCR install conflicts by disabling container-tools module stream. |
Makefile |
Adds smoke-airgapped-multi-hop make target with extended timeout. |
examples/terraform/aws-simple/variables.tf |
Introduces msr_port variable (default 443). |
examples/terraform/aws-simple/launchpad.tf |
Wires msr_port into MSR NLB ingress and --dtr-external-url. |
examples/terraform/aws-simple/.terraform.lock.hcl |
Updates provider constraint/hashes after Terraform init/upgrade. |
docs/specifications/architecture.md |
Refreshes architecture/spec details (schema, phases, flows). |
docs/development/workflow.md |
Expands testing guidance and smoke-test workflow documentation. |
docs/development/smoke-tests.md |
New smoke-test authoring guide and CI wiring checklist. |
CLAUDE.md |
Removed (superseded by AGENTS.md standard). |
AI_AGENTS.md |
Removed (superseded by AGENTS.md standard). |
AGENTS.md |
New consolidated agent instructions (AGENTS.md open standard). |
.github/workflows/smoke-tests.yaml |
Adds smoke-airgapped-multi-hop CI job with long timeout and label gate. |
Files not reviewed (1)
- examples/terraform/aws-simple/.terraform.lock.hcl: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // (2.9.27) on port 4443, installs the baseline software, pre-loads all MKE | ||
| // and MSR upgrade images into DTR, then drives three sequential upgrades with | ||
| // mke.imageRepo and msr.imageRepo pointing to DTR throughout. | ||
| // | ||
| // Upgrade chain: | ||
| // | ||
| // install: MCR stable-25.0 / MKE 3.8.8 / MSR 2.9.27 (images from docker.io/mirantis) | ||
| // step 1: MCR stable-25.0 / MKE 3.8.11 (images from DTR :4443) | ||
| // step 2: MCR stable-29.2 / MKE 3.8.12 (images from DTR :4443) | ||
| // step 3: MCR stable-29.2 / MKE 3.9.2 (images from DTR :4443) | ||
| // | ||
| // What this test validates: | ||
| // - Launchpad correctly uses mke.imageRepo and msr.imageRepo when set to an | ||
| // internal registry address that includes a non-standard port. | ||
| // - The full 3.8.8 → 3.8.11 → 3.8.12 → 3.9.2 upgrade chain completes when | ||
| // all MKE bootstrapper images are served from an internal registry. | ||
| // - DTR exposed on a non-standard port (4443) is reachable and usable as an | ||
| // image registry for both Docker operations and Launchpad imageRepo config. |
| // - DTR exposed on a non-standard port (4443) is reachable and usable as an | ||
| // image registry for both Docker operations and Launchpad imageRepo config. |
| resp, err := http.Get(pageURL) | ||
| require.NoError(t, err, "query Docker Hub tags for mirantis/ucp (page %d)", page+1) | ||
| var p hubPage | ||
| require.NoError(t, json.NewDecoder(resp.Body).Decode(&p), | ||
| "decode Docker Hub tags response (page %d)", page+1) | ||
| resp.Body.Close() | ||
|
|
| const ( | ||
| probeBase = "https://repos.mirantis.com/ubuntu/dists/jammy" | ||
| probeArch = "binary-amd64/Packages" | ||
| maxMinor = 20 | ||
| ) | ||
|
|
||
| last := "" | ||
| for minor := 1; minor <= maxMinor; minor++ { | ||
| channel := fmt.Sprintf("stable-%d.%d", major, minor) | ||
| url := fmt.Sprintf("%s/%s/%s", probeBase, channel, probeArch) | ||
| resp, err := http.Head(url) | ||
| if resp != nil { | ||
| resp.Body.Close() | ||
| } | ||
| // Do NOT break on 404 — channels are non-sequential; a gap does not | ||
| // mean higher minors are absent. | ||
| if err == nil && resp != nil && resp.StatusCode == http.StatusOK { | ||
| last = channel | ||
| } | ||
| } |
| err = upgradeProduct.Apply(true, true, 3, true) | ||
| assert.NoError(t, err, "upgrade Apply() for step %d", i+1) | ||
| if err != nil { | ||
| t.Logf("upgrade step %d failed; stopping upgrade chain", i+1) | ||
| break | ||
| } |
| provider "registry.terraform.io/hashicorp/aws" { | ||
| version = "6.43.0" | ||
| constraints = ">= 6.28.0, >= 6.29.0" | ||
| constraints = ">= 6.28.0, >= 6.33.0" | ||
| hashes = [ |
| | Make target | Test | Timeout | Description | | ||
| |---|---|---|---| | ||
| | `smoke-modern` | `TestModernCluster` | 50m | RHEL9/Ubuntu24/Rocky9, MCR stable-29.2, MKE 3.9.2 | | ||
| | `smoke-legacy` | `TestLegacyCluster` | 50m | RHEL8/Rocky8/Ubuntu22, MCR stable-25.0, MKE 3.8.8 | | ||
| | `smoke-windows` | `TestWindowsCluster` | 60m | Ubuntu24 manager + Windows 2019/2022/2025 workers | | ||
| | `smoke-upgrade` | `TestUpgradeLegacyToModern` | 90m | Install MCR stable-25.0/MKE 3.8.8, upgrade to stable-29.2/MKE 3.9.2 | | ||
|
|
||
| ```bash | ||
| # Run a specific smoke test | ||
| make smoke-modern | ||
| make smoke-upgrade | ||
| ``` | ||
|
|
||
| All smoke-test AWS resources are tagged `launchpad-smoke-test: true` for cost tracking. CI smoke jobs are gated by PR labels (`smoke-test`, `smoke-modern`, `smoke-legacy`, `smoke-windows`, `smoke-upgrade`). |
|
|
||
| CI jobs are gated by PR labels: `smoke-test` (all jobs), or individual labels `smoke-modern`, `smoke-legacy`, `smoke-windows`, `smoke-upgrade`. |
| Timeout guidance: | ||
| - Install-only tests: **50m** | ||
| - Windows tests: **60m** (WinRM setup and Windows image pull are slower) | ||
| - Upgrade tests: **90m** (two full apply cycles) |
| runAirgappedMultiHopUpgradeTest(t, airgapUpgradeConfig{ | ||
| Base: smokeConfig{ | ||
| Name: "airgappedup", | ||
| MCRChannel: "stable-25.0", | ||
| MKEVersion: "3.8.8", | ||
| MSRVersion: "2.9.27", | ||
| SSHKeyAlgorithm: "ed25519", |
Jira: https://mirantis.jira.com/browse/PRODENG-3446
Summary
Extends the smoke test suite to cover additional customer-specific deployment scenarios. Two engineers are collaborating on this branch; each engineer adds tests in a dedicated file under
test/smoke/to avoid merge conflicts.What is here so far
test/smoke/upgrade_test.go—TestUpgradeLegacyToModern: installs MCRstable-25.0/ MKE3.8.8on RHEL8/Rocky8/Ubuntu22, then upgrades in place to MCRstable-29.2/ MKE3.9.2. CI verified passing (run 25721416884, 1320s).make smoke-upgradeMakefile target (90m timeout).smoke-upgradeCI job in.github/workflows/smoke-tests.yaml.AGENTS.md— consolidated fromCLAUDE.md+AI_AGENTS.mdinto the cross-agent open standard; updated with multi-engineer workflow guidance.docs/development/smoke-tests.md— new authoring guide covering the full framework, platform registry, helper usage, CI wiring, and a pre-submission checklist.docs/development/workflow.md,docs/specifications/architecture.md— corrected stale references and added current schema/phase sequence detail.Adding tests to this PR
Read
docs/development/smoke-tests.mdfor the complete guide. In short:test/smoke/<scenario>_test.goin packagesmoke_test.runSmokeTest(t, smokeConfig{...})orrunUpgradeTest(t, upgradeConfig{...}).Makefiletarget and a.github/workflows/smoke-tests.yamljob.gh label create smoke-<scenario>).Constraints
stable-29.2), not explicit version.SSHKeyAlgorithm: "rsa"and a Linux manager.smoke_test.goorupgrade_test.go.