Merged
34 changes: 34 additions & 0 deletions .github/workflows/e2e.yaml
@@ -72,6 +72,40 @@ jobs:
           path: ginkgo.json
           retention-days: 15

+  # ARM64 GPU E2E test — runs only on merge to main (g5g instances are expensive)
+  e2e-test-arm64:
+    runs-on: linux-amd64-cpu4
+    if: github.ref == 'refs/heads/main'
+    name: E2E Test (arm64)
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+
+      - name: Install Go
+        uses: actions/setup-go@v6
+        with:
+          go-version: 'stable'
+          check-latest: true
+
+      - name: Install dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y make
+
+      - name: Run ARM64 GPU e2e test
+        env:
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          AWS_SSH_KEY: ${{ secrets.AWS_SSH_KEY }}
+          LOG_ARTIFACT_DIR: e2e_logs
+        run: |
+          e2e_ssh_key=$(mktemp)
+          echo "${{ secrets.AWS_SSH_KEY }}" > "$e2e_ssh_key"
+          chmod 600 "$e2e_ssh_key"
+          export E2E_SSH_KEY="$e2e_ssh_key"
+          make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'"
+
Comment on lines +107 to +108
Copilot AI Feb 15, 2026

The ARM64 E2E test job is missing an artifact upload step that exists in the main e2e-test job. The e2e-test job includes an "Archive Ginkgo logs" step (lines 68-73) that uploads ginkgo.json artifacts with 15-day retention. This step should be added after the "Run ARM64 GPU e2e test" step to maintain consistency and ensure test results are preserved for debugging. Note that the test run command in line 107 doesn't generate ginkgo.json (no --json-report flag), so you would either need to add the flag to generate the artifact or adjust the artifact upload to capture different logs.

Suggested change
-          make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64'"
+          make -f tests/Makefile test GINKGO_ARGS="--label-filter='arm64' --json-report=${LOG_ARTIFACT_DIR}/ginkgo.json"
+      - name: Archive Ginkgo logs
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: e2e-ginkgo-logs-arm64
+          path: e2e_logs/ginkgo.json
+          retention-days: 15
+          if-no-files-found: ignore

Copilot uses AI. Check for mistakes.
   integration-test:
     runs-on: linux-amd64-cpu4
     if: ${{ github.event.workflow_run.conclusion == 'success' }} && ${{ github.event.workflow_run.event == 'push' }}
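
The "Run ARM64 GPU e2e test" step already exposes the key as the `AWS_SSH_KEY` environment variable, yet its script interpolates the secret a second time. A minimal sketch of the same key-file setup that reads only from the environment is shown here. `write_ssh_key` is a hypothetical helper name, not something defined in this PR.

```shell
# write_ssh_key (hypothetical helper): copy key material passed as $1 into a
# private temp file and print the path of that file on stdout.
write_ssh_key() {
    key_material="$1"
    key_file="$(mktemp)" || return 1
    # Restrict permissions before any key material is written.
    chmod 600 "$key_file"
    printf '%s\n' "$key_material" > "$key_file"
    printf '%s\n' "$key_file"
}

# Assumed usage inside the workflow step, relying on the env: block:
#   E2E_SSH_KEY="$(write_ssh_key "$AWS_SSH_KEY")"
#   export E2E_SSH_KEY
```

Reading from the environment variable keeps the secret out of the rendered script text; whether that matters for this workflow is a judgment call for the maintainers.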
18 changes: 16 additions & 2 deletions pkg/provisioner/templates/crio.go
@@ -59,18 +59,32 @@ holodeck_progress "$COMPONENT" 2 4 "Adding CRI-O repository"

 CRIO_VERSION="${DESIRED_VERSION}"

+# Default to latest stable CRI-O if no version specified
+if [[ -z "$CRIO_VERSION" ]]; then
+    CRIO_VERSION="v1.33"
+    holodeck_log "INFO" "$COMPONENT" "No version specified, defaulting to ${CRIO_VERSION}"
+fi
+
+# Ensure version starts with 'v' and is in vX.Y format (strip patch if present)
+CRIO_VERSION="${CRIO_VERSION#v}"
+CRIO_VERSION="v$(echo "$CRIO_VERSION" | cut -d. -f1,2)"
+
+# CRI-O migrated from pkgs.k8s.io to download.opensuse.org
+# See: https://github.com/cri-o/packaging#readme
+CRIO_REPO_URL="https://download.opensuse.org/repositories/isv:/cri-o:/stable:/${CRIO_VERSION}"
+
 # Add CRI-O repo (idempotent)
 if [[ ! -f /etc/apt/keyrings/cri-o-apt-keyring.gpg ]]; then
     sudo mkdir -p /etc/apt/keyrings
     holodeck_retry 3 "$COMPONENT" curl -fsSL \
-        "https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/Release.key" | \
+        "${CRIO_REPO_URL}/deb/Release.key" | \
         sudo gpg --dearmor -o /etc/apt/keyrings/cri-o-apt-keyring.gpg
 else
     holodeck_log "INFO" "$COMPONENT" "CRI-O GPG key already present"
 fi

 if [[ ! -f /etc/apt/sources.list.d/cri-o.list ]]; then
-    echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/stable:/${CRIO_VERSION}/deb/ /" | \
+    echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] ${CRIO_REPO_URL}/deb/ /" | \
         sudo tee /etc/apt/sources.list.d/cri-o.list > /dev/null
 else
     holodeck_log "INFO" "$COMPONENT" "CRI-O repository already configured"
Comment on lines +62 to 90
Copilot AI Feb 15, 2026

The CRI-O repository migration changes (switching from pkgs.k8s.io to download.opensuse.org) and version defaulting logic are not mentioned in the PR description and appear unrelated to ARM64 GPU testing. The ARM64 test configuration uses Docker, not CRI-O.

While these changes appear to be fixing a legitimate issue with CRI-O repository availability, they should either:

  1. Be documented in the PR description explaining why they're included
  2. Be split into a separate PR focused on CRI-O repository migration

Including unrelated changes makes it harder to review, understand the scope of changes, and potentially revert specific functionality if issues arise.
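
For reviewers weighing the CRI-O change on its own, the version handling in the crio.go hunk (default to `v1.33`, strip a leading `v`, keep only major.minor, then build the repository URL) can be sketched as two standalone functions. The function names are hypothetical; the default version and URL prefix are copied from the diff.

```shell
# normalize_crio_version (hypothetical name): print the version as vX.Y.
normalize_crio_version() {
    version="$1"
    # Fall back to the default used in the diff when no version is given.
    if [ -z "$version" ]; then
        version="v1.33"
    fi
    # Drop one leading "v", keep major.minor only, then re-add the "v".
    version="${version#v}"
    printf 'v%s\n' "$(printf '%s\n' "$version" | cut -d. -f1,2)"
}

# crio_repo_url (hypothetical name): print the repository URL for a version.
crio_repo_url() {
    printf 'https://download.opensuse.org/repositories/isv:/cri-o:/stable:/%s\n' \
        "$(normalize_crio_version "$1")"
}
```

With these definitions, `normalize_crio_version v1.33.2` and `normalize_crio_version 1.33` both print `v1.33`, so a patch-level version in the environment spec resolves to the same repository path.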

7 changes: 6 additions & 1 deletion pkg/provisioner/templates/docker.go
@@ -149,7 +149,12 @@ holodeck_progress "$COMPONENT" 5 6 "Installing cri-dockerd"

 # Install cri-dockerd (idempotent)
 CRI_DOCKERD_VERSION="0.3.17"
-CRI_DOCKERD_ARCH="amd64"
+CRI_DOCKERD_ARCH="$(uname -m)"
+case "${CRI_DOCKERD_ARCH}" in
+    x86_64|amd64) CRI_DOCKERD_ARCH="amd64" ;;
+    aarch64|arm64) CRI_DOCKERD_ARCH="arm64" ;;
+    *) holodeck_log "ERROR" "$COMPONENT" "Unsupported arch for cri-dockerd: ${CRI_DOCKERD_ARCH}"; exit 1 ;;
+esac

 if [[ ! -f /usr/local/bin/cri-dockerd ]]; then
     CRI_DOCKERD_URL="https://github.com/Mirantis/cri-dockerd/releases/download/v${CRI_DOCKERD_VERSION}/cri-dockerd-${CRI_DOCKERD_VERSION}.${CRI_DOCKERD_ARCH}.tgz"
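
The architecture mapping in this hunk can be exercised off-host with a small sketch. `cri_dockerd_arch` and `cri_dockerd_url` are hypothetical wrappers: the first accepts the machine name as an argument (falling back to `uname -m`), and the second reuses the URL pattern from the diff. Neither checks that a release asset actually exists for the resulting URL.

```shell
# cri_dockerd_arch (hypothetical name): map a machine name to the arch
# suffix used in the download URL. Returns non-zero on unknown input.
cri_dockerd_arch() {
    machine="${1:-$(uname -m)}"
    case "$machine" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)
            echo "Unsupported arch for cri-dockerd: ${machine}" >&2
            return 1
            ;;
    esac
}

# cri_dockerd_url (hypothetical name): build the tarball URL for a version
# ($1) and an optional machine name ($2).
cri_dockerd_url() {
    version="$1"
    arch="$(cri_dockerd_arch "$2")" || return 1
    echo "https://github.com/Mirantis/cri-dockerd/releases/download/v${version}/cri-dockerd-${version}.${arch}.tgz"
}
```

Returning an error instead of calling `exit 1` keeps the helper usable from an interactive shell; the template in the diff exits because it runs as a provisioning script.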
5 changes: 5 additions & 0 deletions tests/aws_test.go
@@ -210,6 +210,11 @@ var _ = DescribeTable("AWS Environment E2E",
 			filePath:    filepath.Join(packagePath, "data", "test_aws_k8s_latest.yml"),
 			description: "Tests AWS environment with Kubernetes tracking master branch",
 		}, Label("k8s-latest")),
+		Entry("ARM64 GPU Test", testConfig{
+			name:        "ARM64 GPU Test",
+			filePath:    filepath.Join(packagePath, "data", "test_aws_arm64.yml"),
+			description: "Tests full GPU stack on ARM64 (g5g Graviton) with architecture inferred from instance type",
+		}, Label("arm64")),
 	)

 // Note: To run tests in parallel, use: ginkgo -p or --procs=N
24 changes: 24 additions & 0 deletions tests/data/test_aws_arm64.yml
@@ -0,0 +1,24 @@
+apiVersion: holodeck.nvidia.com/v1alpha1
+kind: Environment
+metadata:
+  name: holodeck-aws-e2e-test-arm64
+  description: "end-to-end test infrastructure for ARM64 (Graviton + GPU)"
+spec:
+  provider: aws
+  auth:
+    keyName: cnt-ci
+    privateKey: /home/runner/.cache/key
+  instance:
+    type: g5g.xlarge
+    region: us-east-1
Copilot AI Feb 15, 2026

The region is set to 'us-east-1', but all other single-instance E2E test configurations in the tests/data directory use 'us-west-1' (test_aws.yml, test_aws_dra.yml, test_aws_kernel.yml, test_aws_legacy.yml, test_aws_ctk_git.yml, test_aws_k8s_git.yml, test_aws_k8s_kind_git.yml, test_aws_k8s_latest.yml). Using a different region creates inconsistency and could lead to regional quota issues or cleanup problems if the periodic cleanup workflow only targets specific regions. Consider changing to 'us-west-1' to maintain consistency with existing E2E tests.

Suggested change
-    region: us-east-1
+    region: us-west-1
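
The consistency claim in this comment is easy to check locally. The sketch below tallies the `region:` values across a directory of YAML configs. `list_test_regions` is a hypothetical helper, and it assumes each region is written as a plain `region: <value>` line, as in the file under review.

```shell
# list_test_regions (hypothetical name): print "<region> <count>" for every
# distinct region value found in the *.yml files of directory $1.
list_test_regions() {
    dir="$1"
    cat "$dir"/*.yml \
        | sed -n 's/^[[:space:]]*region:[[:space:]]*//p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Assumed usage from the repository root:
#   list_test_regions tests/data
```

A single output line would mean every config agrees on one region; more than one line points at the outliers.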

+    # architecture intentionally omitted to exercise inference from instance type
+  containerRuntime:
+    install: true
+    name: docker
+  nvidiaContainerToolkit:
+    install: true
+  nvidiaDriver:
+    install: true
+  kubernetes:
+    install: true
+    installer: kubeadm
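
This config omits the architecture so that it is inferred from `g5g.xlarge`. The diff does not show how that inference is implemented, so the following is only an illustration based on instance-family naming: Graviton families carry a `g` immediately after the generation digit (`g5g`, `c7gn`, `m7g`), and `a1` is the first-generation exception. `infer_arch_from_instance_type` is a hypothetical name and is not the project's code. An authoritative answer would come from the EC2 DescribeInstanceTypes API, which reports supported architectures per instance type.

```shell
# infer_arch_from_instance_type (hypothetical, name-based heuristic only):
# print "arm64" for Graviton-style families and "amd64" otherwise.
infer_arch_from_instance_type() {
    # Strip the size suffix: "g5g.xlarge" -> "g5g".
    family="${1%%.*}"
    case "$family" in
        # a1, or any family with "g" right after a generation digit.
        a1|*[0-9]g*) echo "arm64" ;;
        *)           echo "amd64" ;;
    esac
}
```

The heuristic misclassifies families that do not follow the naming pattern (for example the ARM-based `mac2`), which is one reason to prefer the API lookup in real code.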