Conversation

@dashton90
Contributor

A bug was introduced in the ec2_example DAG when adding the EC2HibernateInstanceOperator. This PR updates the configuration of the example to create an instance which supports hibernation.


related: #35790

@dashton90
Contributor Author

dashton90 commented Nov 24, 2023

@ferruzzi tagging you since you flagged the bug. I did not create a CMK since AWS will use an Amazon managed key to encrypt the instance

@ferruzzi
Contributor

ferruzzi commented Nov 24, 2023

At a glance, it looks right. As mentioned, we'll need to adjust it to accept a KMS key to encrypt the EBS volume. GitHub makes it a bit of a pain to collaborate on a PR like this since the code still lives in your fork; it would be nice to be able to add a commit to your PR myself. Here's what needs to be done next:

Replace Line 41 with the following block:

KMS_KEY_ID_KEY = "KMS_KEY_ID"

sys_test_context_task = (
    SystemTestContextBuilder()
    .add_variable(KMS_KEY_ID_KEY)
    .build()
)

Then add a line between lines 97 and 98:

kms_key_id = test_context[KMS_KEY_ID_KEY]

Then you can use that in the config{}. I'm not positive on this part, but I /think/ you apply it by changing line 113 to

{"DeviceName": "/dev/xvda", "Ebs": {"Encrypted": True, "KmsKeyId": kms_key_id, "DeleteOnTermination": True}}

I would need to test that last part out, but the first few steps are definitely right and will allow us to store the KMS key in SSM Parameter Store and fetch it for the test.
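Putting those pieces together, the BlockDeviceMappings fragment would look something like this sketch (the key ARN is a made-up placeholder; the real value would come from test_context[KMS_KEY_ID_KEY] via SSM Parameter Store, and I haven't verified this is exactly how the config consumes it):

```python
# Sketch only: the key ID below is a placeholder, not a real ARN.
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"

block_device_mappings = [
    {
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "Encrypted": True,        # hibernation requires an encrypted root volume
            "KmsKeyId": kms_key_id,   # omit this field to fall back to the AWS managed key
            "DeleteOnTermination": True,
        },
    }
]
```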

Check out the RDS Export test for a working example of a KMS Key in a system test if I didn't make sense.

Unfortunately, as you mentioned, once you do that it still needs one of us to actually test running it against the real service using the real KMS key. But if we merge this as it is, we'll need yet another PR to make those changes; I'm not sure which is actually more convenient.

@dashton90
Contributor Author

I added you as a collaborator to my fork so you should be free to push any changes you need. But I can make the changes myself tonight or tomorrow if you don't get around to it.

I skipped creating the KMS key because Amazon will default to using an Amazon managed key to encrypt the volume if you don't specify one. Figured it was less overhead than managing our own key, but happy to create one if that's the established practice.

@ferruzzi
Contributor

> Amazon will default to using an Amazon managed key to encrypt the volume if you don't specify one.

AH! I misunderstood, I thought you meant "Amazon" meaning my team specifically would have to add it. If the API call works without specifying one, then by all means let's skip that whole thing and keep the example/test as simple as we can.

@ferruzzi
Contributor

ferruzzi commented Nov 24, 2023

Alright, progress! I ran the proposed changes from this PR (without my KMS stuff) and it failed, but it's getting much further. Here's the current state:

INFO     airflow.task.operators.airflow.providers.amazon.aws.operators.ec2.EC2TerminateInstanceOperator:ec2.py:249 Terminating EC2 instance i-03e2f168f51570fec
ERROR    airflow.models.taskinstance.TaskInstance:taskinstance.py:2696 Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/airflow/providers/amazon/aws/operators/ec2.py", line 251, in execute
    ec2_hook.get_waiter("instance_terminated").wait(
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 375, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter InstanceTerminated failed: Waiter encountered a terminal failure state: For expression "Reservations[].Instances[].State.Name" we matched expected path: "stopping" at least once

@dashton90
Contributor Author

Huh, it looks like boto fails the terminate waiter if the instance is in the stopping state.

No idea if that is a bug or the intended behaviour, since instances which are stopping should be able to be terminated per the AWS docs.

The easiest fix is probably to add wait_for_completion=True to both the Reboot and Hibernate operators. I can do that now.
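The idea behind the fix, sketched with a stubbed state lookup in place of the real EC2 API: keep polling until the instance leaves the transient state, so the terminate step never sees stopping. The function and stub here are illustrative, not the operator's actual implementation.

```python
import time

def wait_for_state(get_state, target="stopped", max_attempts=75, delay=0.0):
    """Poll get_state() until it returns target; give up after max_attempts."""
    for _ in range(max_attempts):
        if get_state() == target:
            return True
        time.sleep(delay)
    return False

# Stub standing in for repeated DescribeInstances calls: the instance
# spends a couple of polls in "stopping" before reaching "stopped".
states = iter(["stopping", "stopping", "stopped"])
print(wait_for_state(lambda: next(states)))  # True
```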

@dashton90
Contributor Author

@ferruzzi
Tried this example myself against the latest version of the providers package and it looks like there are a few things going on.

  1. Instances cannot be hibernated for a few minutes after they are started. This is not a problem per se, but it could cause flakiness in the example. I haven't found any documentation on checking whether an instance is ready to hibernate. A hacky fix would be to sleep before running the hibernate operator.
  2. The instance takes a long time to finish hibernating. This is anecdotal, but hibernation on a t3.micro is taking 10-15 minutes to go from Stopping to Stopped. I'm not sure whether this could be solved by using an AMI with a larger root volume; the one we are using is 2GB versus 0.5GB of memory for the t3.micro, so there should be sufficient space. Again, I can't find any docs on the "expected" amount of time it takes an instance to hibernate. We can add a longer duration for wait_for_completion in the operator and this should be fine. The downside is that the DAG will now take 15-20 minutes to complete. If that's acceptable to the AWS team then we can go this route.
  3. The instance can't be terminated while in the Stopping state. I don't know if this is a botocore bug or expected behaviour: we can terminate a stopping instance through the console, but botocore explicitly disallows it. It could be a limitation of the EC2 API, but if not I can open an issue with botocore. This won't be a problem if we wait for hibernation to finish as outlined above.
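On the third point, this matches how botocore waiters are defined: each waiter is a list of acceptors, and (per my reading of botocore's waiter config) the instance_terminated waiter treats stopping as a terminal failure rather than a retryable state. A simplified model, not botocore's actual code:

```python
def instance_terminated_acceptor(state):
    """Toy model of botocore's instance_terminated waiter acceptors:
    'terminated' succeeds, 'pending'/'stopping' fail immediately,
    anything else (e.g. 'shutting-down') is retried."""
    if state == "terminated":
        return "success"
    if state in ("pending", "stopping"):
        return "failure"  # waiter raises WaiterError without further polling
    return "retry"

print(instance_terminated_acceptor("stopping"))  # failure
```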

@ferruzzi
Contributor

I am actually poking at it right now. When I ran it with the wait_for_completion it ended up with a waiter timeout, so I bumped max_attempts to 999 just to see if that solved it. Still waiting on the result.

If it turns out that this is going to take as long as it appears it'll take, then I'll run it by the team and see what they think. I will propose (I'll have to get back to you on the decision) that we move it into a different system test which we flag not to run with the others. That way the code snippets can still be imported into the docs but it won't bog down the test. We have a couple of other system tests which are flagged as "do not run automatically" for a number of reasons and this seems like a viable use for that, IMHO.

@ferruzzi
Contributor

Following up on my last comment: the consensus is to get this sorted out, then we (I) can split it into two tests and just run them in parallel. We're fine with the run time, but breaking it out might save a bit of time. I'm running into unrelated Docker issues at the moment, but I'm poking at it as I can.

@ferruzzi
Contributor

ferruzzi commented Nov 27, 2023

Alright, I think I got my laptop's Docker issue figured out. I ran the system test as you have it here, but with max_attempts=999 added to hibernate_instance, and it passes! It took 1383.42s (23m 03s), though I didn't get a printout of how many retries it took. That should at least indicate we're on the right track! We just need to narrow that 999 down to something a bit more realistic.

The other thing is purely cosmetic, but please move the wait_for_completion and the new max_retries out of the codeblock so they aren't in the docs. You can see an example of what I mean here.

@ferruzzi
Contributor

Alright, I dialed it in. It passed with max_retries set to 75 and failed at 60, so let's go with 75 for now. We can bump it later as needed, but that seems like a reasonable starting point.
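As a sanity check on those numbers (assuming the waiter polls roughly every 15 seconds, botocore's default delay for the instance-state waiters; I haven't confirmed whether the operator overrides it):

```python
# Assumption: ~15s delay between waiter attempts.
delay_seconds = 15

def budget_minutes(attempts, delay=delay_seconds):
    """Total wall-clock budget for a waiter with the given attempt count."""
    return attempts * delay / 60

print(budget_minutes(60))  # 15.0 minutes: timed out in practice
print(budget_minutes(75))  # 18.75 minutes: passed
```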

@dashton90
Contributor Author

Done in 00e06ce

@ferruzzi
Contributor

I merged main into the branch and that got the CI tests to pass. 🚀

The only other thing I'd like to see is moving the wait_for_completion and the max_retries out of the code snippets, as I mentioned above, so they don't clutter up the docs pages; other than that I'm very happy with it.

Can we get another pair of eyes? @o-nikolas @vincbeck @syedahsn

@syedahsn
Contributor

+1 to moving wait_for_completion and max_retries out of the codeblock, but otherwise LGTM

Comment on lines 53 to 60
root_device_name = "/dev/xvda"

images = boto3.client("ec2").describe_images(
    Filters=[
        {"Name": "description", "Values": [image_prefix]},
-       {"Name": "architecture", "Values": ["arm64"]},
+       {"Name": "architecture", "Values": ["x86_64"]},
        {"Name": "root-device-type", "Values": ["ebs"]},
        {"Name": "root-device-name", "Values": [root_device_name]},
Contributor

Why do we need these changes?

Contributor Author

Non-Graviton instances only support the x86 architecture, and the root device must be EBS-backed in order to support hibernation.

Contributor

Hibernation requires an x86 architecture and encrypted storage; encrypted storage requires EBS; EBS requires manually assigning mount points. It was a bit of a chain of requirements there.
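For example, the filter block with explanatory comments might read something like this sketch (illustrative only, not the actual commit):

```python
# AMI selection is constrained by a chain of hibernation requirements:
#  - x86_64: the non-Graviton instance type used here is x86-only
#  - EBS root device: hibernation needs an encrypted root volume, which requires EBS
#  - explicit root device name: the encrypted block device mapping must name the mount point
filters = [
    {"Name": "architecture", "Values": ["x86_64"]},
    {"Name": "root-device-type", "Values": ["ebs"]},
    {"Name": "root-device-name", "Values": ["/dev/xvda"]},
]
print(len(filters))  # 3
```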

Contributor

Could we add comments? That'd be helpful I think :)

Contributor

Thanks, that's helpful context. I agree with @vincbeck, I'd love some comments here explaining that for future folks.

Contributor Author

Added comments in 4089321

Contributor

Thanks a lot @dashton90

reboot_instance = EC2RebootInstanceOperator(
    task_id="reboot_instace",
    instance_ids=instance_id,
    wait_for_completion=True,
Contributor

Move this and below out of the docs snippet.

Contributor Author

Done in b8e6568

Contributor

Thanks!

@dashton90
Contributor Author

Moved everything out of the codeblock in b8e6568

Contributor

@ferruzzi ferruzzi left a comment

LGTM. Your persistence is greatly appreciated on this one, it was a bit of a slog.

@vincbeck vincbeck merged commit f83bf93 into apache:main Dec 5, 2023
@vincbeck
Contributor

vincbeck commented Dec 5, 2023

Congrats @dashton90! Thanks to your PR, the system test is now succeeding; see the dashboard here.
