Skip to content

feat(aws): AMI architecture detection and cross-validation#665

Closed
ArangoGutierrez wants to merge 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:fix/arm64-ami-arch-validation
Closed

feat(aws): AMI architecture detection and cross-validation#665
ArangoGutierrez wants to merge 1 commit intoNVIDIA:mainfrom
ArangoGutierrez:fix/arm64-ami-arch-validation

Conversation

@ArangoGutierrez
Copy link
Copy Markdown
Collaborator

Summary

  • Add describeImageArch helper to query AMI architecture from EC2
  • Update resolveImageForNode to return architecture for explicit ImageId (was skipping all arch detection)
  • Add getInstanceTypeArch to query instance type supported architectures
  • Cross-validate AMI arch against instance type arch in DryRun()

Catches arm64 AMI + x86_64 instance type mismatches with a clear error message instead of cryptic boot failures.

Test plan

  • Unit tests for describeImageArch with mock EC2 client
  • Unit tests for getInstanceTypeArch with mock EC2 client
  • Unit test for DryRun() returning error on arch mismatch
  • go test ./pkg/provider/aws/... -v passes
  • CI validation pending

Add describeImageArch helper to query an AMI's architecture from EC2
and getInstanceTypeArch to query instance type supported architectures.
Update resolveImageForNode to return architecture for all code paths
including explicit ImageId. Cross-validate AMI architecture against
instance type supported architectures in DryRun() to catch mismatches
(e.g., arm64 AMI + x86_64 instance type) with a clear error message
before instance creation fails.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings February 14, 2026 17:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds AMI architecture detection and cross-validation to prevent architecture mismatches between AMIs and instance types. Previously, when an explicit AMI ID was provided, architecture detection was skipped, leading to cryptic boot failures when arm64 AMIs were used with x86_64 instance types (or vice versa). The PR implements three new helper functions to query architecture information from EC2 APIs and integrates a validation check in the DryRun() method to catch these mismatches early with clear error messages.

Changes:

  • Added describeImageArch() to query AMI architecture from EC2 DescribeImages API
  • Added getInstanceTypeArch() to query supported architectures from EC2 DescribeInstanceTypes API
  • Added normalizeArchToEC2() to convert architecture aliases (amd64/aarch64) to EC2 canonical forms (x86_64/arm64)
  • Enhanced resolveImageForNode() to detect and return architecture for explicit ImageIds
  • Added architecture cross-validation in DryRun() to catch AMI/instance type mismatches
  • Updated all mock EC2 responses to include ProcessorInfo with architecture data

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/provider/aws/image.go Core implementation: added three new helper functions for architecture detection/normalization, updated resolveImageForNode to query architecture for explicit AMIs, updated ResolvedImage struct to include architecture, added architecture normalization in resolveOSToAMI and setLegacyAMI
pkg/provider/aws/dryrun.go Added architecture compatibility validation that cross-checks AMI architecture against instance type supported architectures and returns a clear error on mismatch
pkg/provider/aws/image_test.go Comprehensive unit tests for new functions including describeImageArch, getInstanceTypeArch, and DryRun scenarios for both architecture match and mismatch cases, plus assertion that architecture is always populated
pkg/provider/aws/mock_ec2_test.go Updated default mock responses for DescribeInstanceTypes to include ProcessorInfo with supported architectures
pkg/provider/aws/aws_test.go Updated mock responses in existing tests to include ProcessorInfo architecture data
pkg/provider/aws/aws_ginkgo_test.go Updated mock responses in Ginkgo tests to include ProcessorInfo architecture data

@coveralls
Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 22021789809

Details

  • 74 of 82 (90.24%) changed or added relevant lines in 2 files are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 47.856%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/provider/aws/dryrun.go 18 21 85.71%
pkg/provider/aws/image.go 56 61 91.8%
Files with Coverage Reduction New Missed Lines %
pkg/provider/aws/image.go 9 83.72%
Totals Coverage Status
Change from base Build 21999063489: 0.0%
Covered Lines: 2567
Relevant Lines: 5364

💛 - Coveralls

@ArangoGutierrez
Copy link
Copy Markdown
Collaborator Author

Closing as superseded. The equivalent fixes were already merged into main via PRs #661-664:

Additionally, these fixes address downstream provisioning issues but do not resolve the actual EC2 RunInstances failure (Unsupported: The requested configuration is currently not supported) reported in https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634. The root cause is missing architecture inference from instance type — when image.architecture is unset, holodeck defaults to x86_64 regardless of the instance type. A new PR will address this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants