PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario)#630

Open: james-nesbitt wants to merge 3 commits into main from PRODENG-3446-smoke-test-vocalink

@james-nesbitt
Collaborator

Jira: https://mirantis.jira.com/browse/PRODENG-3446

Summary

Extends the smoke test suite to cover additional customer-specific deployment scenarios. Two engineers are collaborating on this branch; each engineer adds tests in a dedicated file under test/smoke/ to avoid merge conflicts.

What is here so far

  • test/smoke/upgrade_test.go: TestUpgradeLegacyToModern installs MCR stable-25.0 / MKE 3.8.8 on RHEL8/Rocky8/Ubuntu22, then upgrades in place to MCR stable-29.2 / MKE 3.9.2. CI verified passing (run 25721416884, 1320s).
  • make smoke-upgrade Makefile target (90m timeout).
  • smoke-upgrade CI job in .github/workflows/smoke-tests.yaml.
  • AGENTS.md — consolidated from CLAUDE.md + AI_AGENTS.md into the cross-agent open standard; updated with multi-engineer workflow guidance.
  • docs/development/smoke-tests.md — new authoring guide covering the full framework, platform registry, helper usage, CI wiring, and a pre-submission checklist.
  • docs/development/workflow.md, docs/specifications/architecture.md — corrected stale references and added current schema/phase sequence detail.

Adding tests to this PR

Read docs/development/smoke-tests.md for the complete guide. In short:

  1. Create test/smoke/<scenario>_test.go in package smoke_test.
  2. Call runSmokeTest(t, smokeConfig{...}) or runUpgradeTest(t, upgradeConfig{...}).
  3. Add a Makefile target and a .github/workflows/smoke-tests.yaml job.
  4. Create a PR label (gh label create smoke-<scenario>).
  5. Push and add your label to trigger only your test.

Constraints

  • No customer names in any code, comment, commit message, or resource tag.
  • MCR must be specified by channel (stable-29.2), not explicit version.
  • Windows clusters require SSHKeyAlgorithm: "rsa" and a Linux manager.
  • One file per engineer / scenario — do not modify smoke_test.go or upgrade_test.go.
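
The machine-checkable constraints above (MCR given as a channel, Windows clusters needing an RSA key and a Linux manager) could be expressed as a small pre-flight check. This is a minimal sketch only: the `scenario` type and `validate` function are hypothetical and are not part of the smoke-test framework, and the "starts with a digit" test for an explicit version is an assumption about how the two forms differ.

```go
package main

import "fmt"

// scenario is a hypothetical stand-in for the framework's smokeConfig;
// it models only the fields needed to express the constraints.
type scenario struct {
	MCRChannel      string // channel form, e.g. "stable-29.2"
	SSHKeyAlgorithm string // e.g. "rsa" or "ed25519"
	HasWindows      bool   // true if any node in the cluster runs Windows
	ManagerIsLinux  bool
}

// validate returns the first violated constraint, or nil if none is violated.
func validate(s scenario) error {
	// Assumption: an explicit version starts with a digit ("29.2.1"),
	// while a channel starts with a name ("stable-29.2").
	if s.MCRChannel == "" || (s.MCRChannel[0] >= '0' && s.MCRChannel[0] <= '9') {
		return fmt.Errorf("MCR must be given as a channel such as stable-29.2, got %q", s.MCRChannel)
	}
	if s.HasWindows && s.SSHKeyAlgorithm != "rsa" {
		return fmt.Errorf("cluster has Windows nodes: SSHKeyAlgorithm must be \"rsa\", got %q", s.SSHKeyAlgorithm)
	}
	if s.HasWindows && !s.ManagerIsLinux {
		return fmt.Errorf("cluster has Windows nodes: the manager must run Linux")
	}
	return nil
}

func main() {
	cases := []scenario{
		{MCRChannel: "stable-29.2", SSHKeyAlgorithm: "ed25519", ManagerIsLinux: true},
		{MCRChannel: "29.2.1", SSHKeyAlgorithm: "rsa", ManagerIsLinux: true},
		{MCRChannel: "stable-29.2", SSHKeyAlgorithm: "ed25519", HasWindows: true, ManagerIsLinux: true},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validate(c))
	}
}
```

The naming and file-ownership constraints are about review discipline and are not covered by this check.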

## Smoke test

Added TestUpgradeLegacyToModern (test/smoke/upgrade_test.go):
- Provisions RHEL8/Rocky8/Ubuntu22, installs MCR stable-25.0 / MKE 3.8.8,
  then upgrades in place to MCR stable-29.2 / MKE 3.9.2 via a second Apply().
- runUpgradeTest() helper mirrors runSmokeTest() structure (defer destroy,
  resource tagging, temp SSH dir).
- bumpVersions() unmarshals Terraform-generated launchpad_yaml, updates
  spec.mcr.channel and spec.mke.version, re-marshals — preserving host
  addresses, SANs, LB names, and install flags verbatim.
- make smoke-upgrade target (90m timeout).
- smoke-upgrade CI job in .github/workflows/smoke-tests.yaml, gated by
  smoke-upgrade or smoke-test PR label.
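
The bumpVersions() behaviour described above (change two fields, leave everything else untouched) can be sketched with a generic nested map. This is not the helper from the PR: `setPath` and `bump` are hypothetical names, and JSON stands in for the decoded launchpad_yaml so the example needs only the standard library. The sample document is illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setPath walks nested maps along path and sets the final key to value.
// It fails instead of creating missing intermediate maps, so a typo in
// the path cannot silently add a new section to the document.
func setPath(doc map[string]any, value any, path ...string) error {
	if len(path) == 0 {
		return fmt.Errorf("empty path")
	}
	cur := doc
	for _, key := range path[:len(path)-1] {
		next, ok := cur[key].(map[string]any)
		if !ok {
			return fmt.Errorf("%q is missing or not a mapping", key)
		}
		cur = next
	}
	cur[path[len(path)-1]] = value
	return nil
}

// bump updates only spec.mcr.channel and spec.mke.version; every other
// key in doc is left as it was decoded.
func bump(doc map[string]any, channel, version string) error {
	if err := setPath(doc, channel, "spec", "mcr", "channel"); err != nil {
		return err
	}
	return setPath(doc, version, "spec", "mke", "version")
}

func main() {
	// Illustrative document; JSON is used here in place of YAML.
	raw := `{"spec":{"hosts":[{"ssh":{"address":"10.0.0.1"}}],
		"mcr":{"channel":"stable-25.0"},
		"mke":{"version":"3.8.8","installFlags":["--san=lb.example"]}}}`

	var doc map[string]any
	if err := json.Unmarshal([]byte(raw), &doc); err != nil {
		panic(err)
	}
	if err := bump(doc, "stable-29.2", "3.9.2"); err != nil {
		panic(err)
	}
	out, err := json.Marshal(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Editing the decoded document in place, rather than regenerating it, is what keeps host addresses and install flags identical between the install and upgrade runs.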

CI result: PASS (run 25721416884, 1320s).

## Documentation

AGENTS.md (replaces CLAUDE.md + AI_AGENTS.md):
- Consolidated into the AGENTS.md open standard (Linux Foundation, supported
  by Claude Code, Cursor, Windsurf, Codex, Gemini CLI, Aider, and others).
- Covers: project overview, non-negotiable rules, build/test commands,
  phase manager architecture, config schema (v1.6 with example), smoke test
  reference table, contributing guidelines, multi-engineer workflow guidance,
  and documentation index.

docs/development/smoke-tests.md (new):
- Complete authoring guide for new smoke tests: framework mechanics,
  all 14 available platforms, runSmokeTest / runUpgradeTest / bumpVersions
  usage with annotated examples, Windows-specific requirements, CI wiring
  (Makefile + workflow + PR label), timeout guidance, Reset best-effort
  rationale, and a pre-submission checklist.

docs/development/workflow.md:
- Replace stale smoke-small/smoke-full with actual four targets.
- Add --tags testing note for unit tests.
- Remove non-existent make build-release / make sign-release.
- Add current apiVersion to schema safety guideline.

docs/specifications/architecture.md:
- Fix package path: pkg/product/mke/api -> pkg/product/mke/config.
- Add current apiVersion and full spec structure.
- Add abridged Apply and Reset phase sequences.
- Document UninstallMKE swarm dissolution fallback (PRODENG-3442).

Signed-off-by: James Nesbitt <jnesbitt@mirantis.com>
@james-nesbitt
Copy link
Copy Markdown
Collaborator Author

This is waiting for input from @trifo13

@trifo13 force-pushed the PRODENG-3446-smoke-test-vocalink branch from 4df8031 to 03683d2 on May 27, 2026 10:59
@trifo13 changed the title from "PRODENG-3446: extend smoke test suite with additional customer scenario tests" to "PRODENG-3446: add airgapped multi-hop upgrade smoke test (customer scenario)" on May 27, 2026
…enario)

Adds TestAirgappedMultiHopUpgrade, a smoke test that exercises the full
upgrade chain, similar to what CSO EMEA observed in a specific customer
scenario: install with MCR 25.0 / MKE 3.8.8 / MSR 2.9.27, then upgrade
through 3.8.11 → 3.8.12 (MCR 29.2) → 3.9.2 → latest MKE 3.x / MCR 29.x.
All post-install upgrade steps pull images from an internal DTR exposed on
a non-standard port (4443) via an NLB, simulating an airgapped registry
configuration.

Key design decisions
--------------------
- Image preload strategy: rather than pushing images to DTR (which requires
  namespace provisioning and hits DTR auth edge cases), all upgrade images
  are pulled from docker.io/mirantis on every node and tagged with the DTR
  registry address. Launchpad's "Pull MKE images" phase runs docker image
  inspect before docker pull; finding the image locally it skips the pull
  entirely. This exercises Launchpad's imageRepo feature without requiring
  actual DTR push/pull.

- SSH key compatibility: Terraform's tls_private_key emits OpenSSH-format
  ed25519 keys that golang.org/x/crypto/ssh.ParsePrivateKey rejects. All
  remote commands use the system ssh binary (sshRun/sshRunScript) to avoid
  Go-side key parsing.

- DTR image listing: UCP bootstrapper uses "images --list"; DTR 2.x uses
  "images" (the --list flag is unrecognised and causes help text on stdout
  with exit 0). The preload script filters output for valid image-reference
  patterns and falls back to the plain "images" subcommand automatically.

- Dynamic latest-version step: fetchLatestMKEVersion queries Docker Hub
  tags for mirantis/ucp; fetchLatestMCRChannel probes the Mirantis apt
  repository (channels are non-sequential so the probe scans the full
  range rather than stopping at the first 404). The dynamic step is
  appended only when it differs from the last fixed step.

Supporting changes
------------------
- examples/terraform/aws-simple: add msr_port variable (default 443) so
  the NLB can expose DTR on a non-standard port; all other smoke tests are
  unaffected. --dtr-external-url appends :PORT only when port != 443.
- test/platforms.go: fix RHEL8 MCR install — disable the container-tools
  module stream before installing MCR to prevent the system runc from
  conflicting with Mirantis's containerd.io-runc.
- Makefile: add smoke-airgapped-multi-hop target (-timeout 200m).
- .github/workflows/smoke-tests.yaml: add smoke-airgapped-multi-hop CI job
  triggered by the smoke-test or smoke-airgapped-multi-hop PR labels.
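
The msr_port rule described above (append :PORT only when the port is not 443) lives in Terraform in the PR. The Go function below is only a sketch of the same rule; `externalURL` is a hypothetical name, and the host-only output format is an assumption.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// externalURL returns host unchanged for the default HTTPS port and
// host:port for any other port, mirroring the msr_port rule.
func externalURL(host string, port int) string {
	if port == 443 {
		return host
	}
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(externalURL("dtr.example.internal", 443))
	fmt.Println(externalURL("dtr.example.internal", 4443))
}
```

Keeping the default-port case free of a suffix is what leaves the other smoke tests unaffected.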

Tested: two passing runs (2537s with 3 fixed steps; 3329s with 4 steps
including the dynamic latest-version step MKE 3.9.3 / MCR stable-29.4).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@trifo13 force-pushed the PRODENG-3446-smoke-test-vocalink branch from 03683d2 to 6740adc on May 27, 2026 11:09
timeout-minutes: 205
if: |
github.event_name == 'push' ||
contains(github.event.pull_request.labels.*.name, 'smoke-test') ||
Collaborator Author

remove this from the general smoke-test list, and make it comprehensive (remove this line)

Comment thread test/platforms.go
// Disable the container-tools module stream before MCR install. RHEL8
// AppStream pulls in system runc as a container-selinux dependency; that
// package conflicts with Mirantis's containerd.io-runc at install time.
UserData: "sudo dnf module disable container-tools -y; sudo firewall-cmd --permanent --add-port=2377/tcp --add-port=7946/tcp --add-port=7946/udp --add-port=4789/udp --add-port=10250/tcp; sudo firewall-cmd --reload",
Collaborator Author

there should be a way to pass in custom userdata in the nodegroup variables instead of changing the default. does it make sense to always use this override for all tests? if so, does this need to be applied to the other rhel versions instead of just rhel 8 ?

Copilot AI left a comment

Pull request overview

Adds a new smoke-test scenario to exercise a customer-style “airgapped” multi-hop upgrade path, plus supporting Terraform/CI/doc updates so the scenario can be provisioned and run in automation.

Changes:

  • Added TestAirgappedMultiHopUpgrade smoke test that provisions MSR behind a non-standard external port and performs multi-step MKE/MCR upgrades using imageRepo overrides.
  • Extended the shared AWS Terraform smoke module with an msr_port input and updated MSR ingress/--dtr-external-url generation.
  • Updated CI workflow and developer/agent documentation for smoke-test authoring and execution.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 12 comments.

| File | Description |
|---|---|
| `test/smoke/airgapped_multi_hop_upgrade_test.go` | New multi-hop “airgapped” upgrade smoke test (custom registry port + sequential upgrade chain). |
| `test/platforms.go` | Adjusts RHEL8 userdata to avoid MCR install conflicts by disabling container-tools module stream. |
| `Makefile` | Adds smoke-airgapped-multi-hop make target with extended timeout. |
| `examples/terraform/aws-simple/variables.tf` | Introduces msr_port variable (default 443). |
| `examples/terraform/aws-simple/launchpad.tf` | Wires msr_port into MSR NLB ingress and --dtr-external-url. |
| `examples/terraform/aws-simple/.terraform.lock.hcl` | Updates provider constraint/hashes after Terraform init/upgrade. |
| `docs/specifications/architecture.md` | Refreshes architecture/spec details (schema, phases, flows). |
| `docs/development/workflow.md` | Expands testing guidance and smoke-test workflow documentation. |
| `docs/development/smoke-tests.md` | New smoke-test authoring guide and CI wiring checklist. |
| `CLAUDE.md` | Removed (superseded by AGENTS.md standard). |
| `AI_AGENTS.md` | Removed (superseded by AGENTS.md standard). |
| `AGENTS.md` | New consolidated agent instructions (AGENTS.md open standard). |
| `.github/workflows/smoke-tests.yaml` | Adds smoke-airgapped-multi-hop CI job with long timeout and label gate. |
Files not reviewed (1)
  • examples/terraform/aws-simple/.terraform.lock.hcl: Language not supported

Comment on lines +6 to +23
// (2.9.27) on port 4443, installs the baseline software, pre-loads all MKE
// and MSR upgrade images into DTR, then drives three sequential upgrades with
// mke.imageRepo and msr.imageRepo pointing to DTR throughout.
//
// Upgrade chain:
//
// install: MCR stable-25.0 / MKE 3.8.8 / MSR 2.9.27 (images from docker.io/mirantis)
// step 1: MCR stable-25.0 / MKE 3.8.11 (images from DTR :4443)
// step 2: MCR stable-29.2 / MKE 3.8.12 (images from DTR :4443)
// step 3: MCR stable-29.2 / MKE 3.9.2 (images from DTR :4443)
//
// What this test validates:
// - Launchpad correctly uses mke.imageRepo and msr.imageRepo when set to an
// internal registry address that includes a non-standard port.
// - The full 3.8.8 → 3.8.11 → 3.8.12 → 3.9.2 upgrade chain completes when
// all MKE bootstrapper images are served from an internal registry.
// - DTR exposed on a non-standard port (4443) is reachable and usable as an
// image registry for both Docker operations and Launchpad imageRepo config.
Comment on lines +22 to +23
// - DTR exposed on a non-standard port (4443) is reachable and usable as an
// image registry for both Docker operations and Launchpad imageRepo config.
Comment on lines +353 to +359
resp, err := http.Get(pageURL)
require.NoError(t, err, "query Docker Hub tags for mirantis/ucp (page %d)", page+1)
var p hubPage
require.NoError(t, json.NewDecoder(resp.Body).Decode(&p),
	"decode Docker Hub tags response (page %d)", page+1)
resp.Body.Close()

Comment on lines +390 to +409
const (
	probeBase = "https://repos.mirantis.com/ubuntu/dists/jammy"
	probeArch = "binary-amd64/Packages"
	maxMinor  = 20
)

last := ""
for minor := 1; minor <= maxMinor; minor++ {
	channel := fmt.Sprintf("stable-%d.%d", major, minor)
	url := fmt.Sprintf("%s/%s/%s", probeBase, channel, probeArch)
	resp, err := http.Head(url)
	if resp != nil {
		resp.Body.Close()
	}
	// Do NOT break on 404 — channels are non-sequential; a gap does not
	// mean higher minors are absent.
	if err == nil && resp != nil && resp.StatusCode == http.StatusOK {
		last = channel
	}
}
Comment on lines +559 to +564
err = upgradeProduct.Apply(true, true, 3, true)
assert.NoError(t, err, "upgrade Apply() for step %d", i+1)
if err != nil {
	t.Logf("upgrade step %d failed; stopping upgrade chain", i+1)
	break
}
Comment on lines 4 to 7
  provider "registry.terraform.io/hashicorp/aws" {
    version     = "6.43.0"
-   constraints = ">= 6.28.0, >= 6.29.0"
+   constraints = ">= 6.28.0, >= 6.33.0"
    hashes = [
Comment on lines +55 to +68
| Make target | Test | Timeout | Description |
|---|---|---|---|
| `smoke-modern` | `TestModernCluster` | 50m | RHEL9/Ubuntu24/Rocky9, MCR stable-29.2, MKE 3.9.2 |
| `smoke-legacy` | `TestLegacyCluster` | 50m | RHEL8/Rocky8/Ubuntu22, MCR stable-25.0, MKE 3.8.8 |
| `smoke-windows` | `TestWindowsCluster` | 60m | Ubuntu24 manager + Windows 2019/2022/2025 workers |
| `smoke-upgrade` | `TestUpgradeLegacyToModern` | 90m | Install MCR stable-25.0/MKE 3.8.8, upgrade to stable-29.2/MKE 3.9.2 |

```bash
# Run a specific smoke test
make smoke-modern
make smoke-upgrade
```

All smoke-test AWS resources are tagged `launchpad-smoke-test: true` for cost tracking. CI smoke jobs are gated by PR labels (`smoke-test`, `smoke-modern`, `smoke-legacy`, `smoke-windows`, `smoke-upgrade`).
Comment thread AGENTS.md
Comment on lines +122 to +123

CI jobs are gated by PR labels: `smoke-test` (all jobs), or individual labels `smoke-modern`, `smoke-legacy`, `smoke-windows`, `smoke-upgrade`.
Timeout guidance:
- Install-only tests: **50m**
- Windows tests: **60m** (WinRM setup and Windows image pull are slower)
- Upgrade tests: **90m** (two full apply cycles)
Comment on lines +618 to +624
runAirgappedMultiHopUpgradeTest(t, airgapUpgradeConfig{
	Base: smokeConfig{
		Name:            "airgappedup",
		MCRChannel:      "stable-25.0",
		MKEVersion:      "3.8.8",
		MSRVersion:      "2.9.27",
		SSHKeyAlgorithm: "ed25519",
3 participants