Conversation

@dashton90
Contributor

A bug was introduced in the ec2_example DAG when adding the EC2HibernateInstanceOperator. This PR updates the configuration of the example to create an instance which supports hibernation.


related: #35790

@dashton90
Contributor Author

dashton90 commented Nov 24, 2023

@ferruzzi tagging you since you flagged the bug. I did not create a CMK since AWS will use an Amazon managed key to encrypt the instance

@ferruzzi
Contributor

ferruzzi commented Nov 24, 2023

At a glance, it looks right. As mentioned, we'll need to adjust it to accept a KMS key to encrypt the EBS volume. GitHub makes it a bit of a pain to collaborate on a PR like this since the code still lives in your fork; it would be nice to be able to add a commit to your PR myself. Here's what needs to be done next:

Replace Line 41 with the following block:

KMS_KEY_ID_KEY = "KMS_KEY_ID"

sys_test_context_task = (
    SystemTestContextBuilder()
    .add_variable(KMS_KEY_ID_KEY)
    .build()
)

Then add a line between lines 97 and 98:

kms_key_id = test_context[KMS_KEY_ID_KEY]

Then you can use that in the config{}. I'm not positive on this part, but I /think/ you apply it by changing line 113 to

{"DeviceName": "/dev/xvda", "Ebs": {"Encrypted": True, "KmsKeyId": kms_key_id, "DeleteOnTermination": True}}

I would need to test that last part out, but the first few steps are definitely right and will allow us to store the KMS key in SSM Parameter Store and fetch it for the test.
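Putting those pieces together, the BlockDeviceMappings fragment would look something like this sketch (the key ARN is a made-up placeholder; the real value would come from test_context[KMS_KEY_ID_KEY] via SSM Parameter Store, and I haven't verified this is exactly how the config consumes it):

```python
# Sketch only: the key ID below is a placeholder, not a real ARN.
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"

block_device_mappings = [
    {
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "Encrypted": True,        # hibernation requires an encrypted root volume
            "KmsKeyId": kms_key_id,   # omit this field to fall back to the AWS managed key
            "DeleteOnTermination": True,
        },
    }
]
```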

Check out the RDS Export test for a working example of a KMS Key in a system test if I didn't make sense.

Unfortunately, as you mentioned, once you do that it still needs one of us to actually test running it against the real service using the real KMS key. But if we merge this as it is, we'll need yet another PR to make those changes; I'm not sure which is actually more convenient.

@dashton90
Contributor Author

I added you as a collaborator to my fork so you should be free to push any changes you need. But I can make the changes myself tonight or tomorrow if you don't get around to it.

I skipped creating the KMS key because Amazon will default to using an Amazon managed key to encrypt the volume if you don't specify one. Figured it was less overhead than managing our own key, but happy to create one if that's the established practice.

@ferruzzi
Contributor

> Amazon will default to using an Amazon managed key to encrypt the volume if you don't specify one.

AH! I misunderstood, I thought you meant "Amazon" meaning my team specifically would have to add it. If the API call works without specifying one, then by all means let's skip that whole thing and keep the example/test as simple as we can.

@ferruzzi
Contributor

ferruzzi commented Nov 24, 2023

Alright, progress! I ran the proposed changes from this PR (without my KMS stuff) and it failed, but it's getting much further. Here's the current state:

INFO     airflow.task.operators.airflow.providers.amazon.aws.operators.ec2.EC2TerminateInstanceOperator:ec2.py:249 Terminating EC2 instance i-03e2f168f51570fec
ERROR    airflow.models.taskinstance.TaskInstance:taskinstance.py:2696 Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/airflow/providers/amazon/aws/operators/ec2.py", line 251, in execute
    ec2_hook.get_waiter("instance_terminated").wait(
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/waiter.py", line 375, in wait
    raise WaiterError(
botocore.exceptions.WaiterError: Waiter InstanceTerminated failed: Waiter encountered a terminal failure state: For expression "Reservations[].Instances[].State.Name" we matched expected path: "stopping" at least once

@dashton90
Contributor Author

Huh, it looks like boto fails the terminate waiter if the instance is in the stopping state.

No idea if that is a bug or the intended behaviour, since instances which are stopping should be able to be terminated per the AWS docs.

The easiest fix is probably to add wait_for_completion=True to both the Reboot and Hibernate operators. I can do that now.
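The idea behind the fix, sketched with a stubbed state lookup in place of the real EC2 API: keep polling until the instance leaves the transient state, so the terminate step never sees stopping. The function and stub here are illustrative, not the operator's actual implementation.

```python
import time

def wait_for_state(get_state, target="stopped", max_attempts=75, delay=0.0):
    """Poll get_state() until it returns target; give up after max_attempts."""
    for _ in range(max_attempts):
        if get_state() == target:
            return True
        time.sleep(delay)
    return False

# Stub standing in for repeated DescribeInstances calls: the instance
# spends a couple of polls in "stopping" before reaching "stopped".
states = iter(["stopping", "stopping", "stopped"])
print(wait_for_state(lambda: next(states)))  # True
```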

@dashton90
Contributor Author

@ferruzzi
Tried this example myself against the latest version of the providers package and it looks like there are a few things going on.

  1. Instances cannot be hibernated for a few minutes after they are started. This is not a problem per se, but it could cause flakiness in the example. I haven't found any documentation on checking whether an instance is ready to hibernate. A hacky fix would be to sleep before running the hibernate operator.
  2. The instance takes a long time to finish hibernating. This is anecdotal, but hibernation on a t3.micro is taking 10-15 minutes to go from Stopping to Stopped. I'm not sure whether this could be solved by using an AMI with a larger root volume; the one we are using is 2GB versus 0.5GB of memory for the t3.micro, so there should be sufficient space. Again, I can't find any docs on the "expected" amount of time it takes an instance to hibernate. We can add a longer duration for wait_for_completion in the operator and this should be fine. The downside is that the DAG will now take 15-20 minutes to complete. If that's acceptable to the AWS team then we can go this route.
  3. The instance can't be terminated while in the Stopping state. I don't know if this is a botocore bug or expected behaviour: we can terminate a stopping instance through the console, but botocore explicitly disallows it. It could be a limitation of the EC2 API, but if not I can open an issue with botocore. This won't be a problem if we wait for hibernation to finish as outlined above.
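On the third point, this matches how botocore waiters are defined: each waiter is a list of acceptors, and (per my reading of botocore's waiter config) the instance_terminated waiter treats stopping as a terminal failure rather than a retryable state. A simplified model, not botocore's actual code:

```python
def instance_terminated_acceptor(state):
    """Toy model of botocore's instance_terminated waiter acceptors:
    'terminated' succeeds, 'pending'/'stopping' fail immediately,
    anything else (e.g. 'shutting-down') is retried."""
    if state == "terminated":
        return "success"
    if state in ("pending", "stopping"):
        return "failure"  # waiter raises WaiterError without further polling
    return "retry"

print(instance_terminated_acceptor("stopping"))  # failure
```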

@ferruzzi
Contributor

I am actually poking at it right now. When I ran it with the wait_for_completion it ended up with a waiter timeout, so I bumped max_attempts to 999 just to see if that solved it. Still waiting on the result.

If it turns out that this is going to take as long as it appears it'll take, then I'll run it by the team and see what they think. I will propose (I'll have to get back to you on the decision) that we move it into a different system test which we flag not to run with the others. That way the code snippets can still be imported into the docs but it won't bog down the test. We have a couple of other system tests which are flagged as "do not run automatically" for a number of reasons and this seems like a viable use for that, IMHO.

@ferruzzi
Contributor

Following up on my last comment: the consensus is to get this sorted out, then we (I) can split it into two tests and just run them in parallel. We're fine with the run time, but breaking it out might save a bit of time. I'm running into unrelated Docker issues at the moment, but I'm poking at it as I can.

@ferruzzi
Contributor

ferruzzi commented Nov 27, 2023

Alright, I think I got my laptop's Docker issue figured out. I ran the system test as you have it here, but with max_attempts=999 added to hibernate_instance, and it passes! It took 1383.42s (23m 03s), though I didn't get a printout of how many retries it took. That should at least indicate we're on the right track! We just need to narrow that 999 down to something a bit more realistic.

The other thing is purely cosmetic, but please move the wait_for_completion and the new max_retries out of the codeblock so they aren't in the docs. You can see an example of what I mean here.

@ferruzzi
Contributor

Alright, I dialed it in. It passed with max_retries set to 75 and failed at 60, so let's go with 75 for now. We can bump it later as needed, but that seems like a reasonable starting point.
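As a sanity check on those numbers (assuming the waiter polls roughly every 15 seconds, botocore's default delay for the instance-state waiters; I haven't confirmed whether the operator overrides it):

```python
# Assumption: ~15s delay between waiter attempts.
delay_seconds = 15

def budget_minutes(attempts, delay=delay_seconds):
    """Total wall-clock budget for a waiter with the given attempt count."""
    return attempts * delay / 60

print(budget_minutes(60))  # 15.0 minutes: timed out in practice
print(budget_minutes(75))  # 18.75 minutes: passed
```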

@dashton90
Contributor Author

Done in 00e06ce

@ferruzzi
Contributor

I merged main into the branch and that got the CI tests to pass. 🚀

The only other thing I'd like to see is moving the wait_for_completion and the max_retries out of the code snippets, as I mentioned above, so they don't clutter up the docs pages; other than that I'm very happy with it.

Can we get another pair of eyes? @o-nikolas @vincbeck @syedahsn

@syedahsn
Contributor

+1 to moving wait_for_completion and max_retries out of the codeblock, but otherwise LGTM

Comment on lines 53 to 60
root_device_name = "/dev/xvda"

images = boto3.client("ec2").describe_images(
    Filters=[
        {"Name": "description", "Values": [image_prefix]},
-       {"Name": "architecture", "Values": ["arm64"]},
+       {"Name": "architecture", "Values": ["x86_64"]},
        {"Name": "root-device-type", "Values": ["ebs"]},
        {"Name": "root-device-name", "Values": [root_device_name]},
Contributor

Why do we need these changes?

Contributor Author

Non-Graviton instances only support the x86 architecture, and the root device must be EBS-backed in order to support hibernation.

Contributor

Hibernation requires an x86 architecture and encrypted storage; encrypted storage requires EBS; EBS requires manually assigning mount points. It was a bit of a chain of requirements there.
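For example, the filter block with explanatory comments might read something like this sketch (illustrative only, not the actual commit):

```python
# AMI selection is constrained by a chain of hibernation requirements:
#  - x86_64: the non-Graviton instance type used here is x86-only
#  - EBS root device: hibernation needs an encrypted root volume, which requires EBS
#  - explicit root device name: the encrypted block device mapping must name the mount point
filters = [
    {"Name": "architecture", "Values": ["x86_64"]},
    {"Name": "root-device-type", "Values": ["ebs"]},
    {"Name": "root-device-name", "Values": ["/dev/xvda"]},
]
print(len(filters))  # 3
```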

Contributor

Could we add comments? That'd be helpful I think :)

Contributor

Thanks, that's helpful context. I agree with @vincbeck, I'd love some comments here explaining that for future folks.

Contributor Author

Added comments in 4089321

Contributor

Thanks a lot @dashton90

reboot_instance = EC2RebootInstanceOperator(
    task_id="reboot_instace",
    instance_ids=instance_id,
    wait_for_completion=True,
Contributor

Move this and below out of the docs snippet.

Contributor Author

Done in b8e6568

Contributor

Thanks!

@dashton90
Contributor Author

Moved everything out of the codeblock in b8e6568

Contributor

@ferruzzi ferruzzi left a comment

LGTM. Your persistence is greatly appreciated on this one, it was a bit of a slog.

@vincbeck vincbeck merged commit f83bf93 into apache:main Dec 5, 2023
@vincbeck
Contributor

vincbeck commented Dec 5, 2023

Congrats @dashton90! Thanks to your PR, the system test is now succeeding; see the dashboard here.
