cc_ssh: handle race between cloud-init and sshd-keygen #1015
|
@TheRealFalcon @otubo do you agree with this fix? I am not sure this is the ideal solution. |
|
As suggested by Eduardo (and I also see it as a cleaner solution), we can instead modify By the way, |
|
We already have a Before=sshd-keygen.service . It looks like sshd-keygen is a templated service though. I just checked on a fedora34 instance and see If that's the case, ignoring the error when we race might be the best path forward. |
So the point of having it is just to have 'try to start (at some point, depending on
Actually I think the template gets compiled for each specific distribution, so in RHEL we have a cloudinit.service generated from that. So changing the template could effectively change something. The question is if the removal of |
Eh...kind of? It's specifying to start as close to me as possible while respecting other dependencies. Without it, it could start much later in boot, which may or may not be a problem (we'd have to investigate), but I see no reason to remove it. If
Sorry, I'm not following. Are we referring to the same template? When I say sshd-keygen is a templated service, I'm referring to https://fedoramagazine.org/systemd-template-unit-files/ . On a fedora34 machine: That I think the fact that it is a template is preventing us from being able to specify Last time I looked, I hadn't noticed that target file. I think we could add a |
|
Ok, what you say about the sshd template makes sense. Sorry I misunderstood at the beginning 😄 Anyways, I tried adding So since as you pointed out removing |
Yep, works for me. Can you also add a test where we have a pre-existing key? |
I'm sorry, can you elaborate more about the test? I will be happy to add it, but I am not sure how you want to test this. |
|
Never mind, I think I figured out what you meant. Let me know if the test is what you wanted. |
It looks like, at least in AWS RHEL 9 images, there is
a race between cloud-init and sshd-keygen.
In particular, they both create /etc/ssh/ssh_host_*key*
at first boot, sometimes causing a warning in cloud-init:
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['ssh-keygen', '-t', 'rsa', '-N', '', '-f', '/etc/ssh/ssh_host_rsa_key']
Exit code: 1
Reason: -
Stdout: Generating public/private rsa key pair.
/etc/ssh/ssh_host_rsa_key already exists.
Overwrite (y/n)?
Stderr:
What happens is:
1) cloud-init checks if /etc/ssh/ssh_host_rsa_key exists
2) it does not exist, so it continues the logic in cc_ssh line 234
3) sshd-keygen in the meantime creates /etc/ssh/ssh_host_rsa_key
4) cloud-init issues
'ssh-keygen -t rsa -N '' -f /etc/ssh/ssh_host_rsa_key',
which fails
Masking the service with `systemctl mask sshd-keygen.target` fixes
the bug, but it is not the right solution.
The solution proposed here is to just analyze the error and
avoid throwing a warning if the file has been created in the meantime.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
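To make the proposed behaviour concrete, here is a minimal sketch of "analyze the error and skip the warning if the file appeared in the meantime". It is not the actual cc_ssh code: `generate_host_key`, its return strings, and the injectable `run` parameter are hypothetical names chosen for illustration, and it uses plain `subprocess` rather than cloud-init's own `subp` helper.

```python
import logging
import os
import subprocess

LOG = logging.getLogger(__name__)


def generate_host_key(keyfile, keytype, run=subprocess.run):
    """Create a host key unless one exists; tolerate losing a race.

    Hypothetical helper. Returns "exists", "created", "raced" or "failed".
    """
    if os.path.exists(keyfile):
        return "exists"
    cmd = ["ssh-keygen", "-t", keytype, "-N", "", "-f", keyfile]
    try:
        # stdin is /dev/null so the "Overwrite (y/n)?" prompt cannot block.
        run(cmd, check=True, capture_output=True, text=True,
            stdin=subprocess.DEVNULL)
        return "created"
    except subprocess.CalledProcessError as err:
        if os.path.exists(keyfile):
            # Something else (for example sshd-keygen) created the key
            # between our existence check and our ssh-keygen call.
            LOG.debug("%s was created concurrently, keeping it", keyfile)
            return "raced"
        LOG.warning("Failed generating key type %s to file %s: %s",
                    keytype, keyfile, err.stdout)
        return "failed"
```

The existence check after the failure is what separates a lost race from a genuine ssh-keygen error; only the latter still produces a warning.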
|
hmm weird... Running both |
|
In general, cloud-init already generates ssh keys and manipulates host ssh keys (it removes existing ones if detects it's running on a new instance so that image captured VMs don't retain the same host keys), so cloud-init should conflict with the ssh-keygen service.
Wants tells systemd to add the job to the unit's dependencies; it is an unordered dependency. To influence ordering, one uses After or Before.[1] |
|
@raharper, thanks for the context about Conflicts. That makes sense. Given that the sshd-keygen service is a oneshot, does using Wants vs Conflicts actually matter here, since we also have a Before? If we use Before, shouldn't it wait until the cloud-init service has finished before invoking sshd-keygen, regardless of whether we specify Wants or Conflicts? Does specifying a target change the semantics? |
|
|
I don't think we want to accept ssh host keys generated by ssh-keygen except explicitly by user-config, in which users can already provide their own keys that cloud-init writes. One of the critical use cases that cloud-init handles with ssh key generation is when an instance has been captured and launched as a new instance: we don't want to keep the same host keys once we boot the new instance. ssh-keygen does not have (that I know of) any way to determine if the existing host keys need to be removed and regenerated. |
The problem with that (as I mentioned above), is that it is a templated service now...at least for some distros. Something like
Well, we still delete any keys found at the beginning of the module, so that should prevent us from keeping keys from an older image: https://github.com/canonical/cloud-init/blob/main/cloudinit/config/cc_ssh.py#L193 The change here is specifically about when the two services race. Between the time cloud-init deletes the keys and tries to create new ones, sshd-keygen runs and creates new keys before we have a chance to do it ourselves. It'd be great if we could fix that via systemd, but I haven't found a way to do that with a templated service. Without that, it seems the easiest thing to do is just ignore if sshd-keygen has raced with us. |
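The window described above can be reproduced deterministically with a small stand-in. This is only a sketch under stated assumptions: `regenerate_host_keys` and its `between_delete_and_create` hook are hypothetical, plain file writes stand in for ssh-keygen, and opening with mode `"x"` mimics ssh-keygen refusing to overwrite an existing file.

```python
import glob
import os
import tempfile


def regenerate_host_keys(ssh_dir, between_delete_and_create=None):
    """Delete stale host keys, then create a new one.

    Hypothetical stand-in for the module flow. Returns the name of
    whoever ended up writing the key.
    """
    # Step 1: drop keys inherited from the captured image.
    for path in glob.glob(os.path.join(ssh_dir, "ssh_host_*key*")):
        os.unlink(path)

    # Race window: the hook plays the role of sshd-keygen running here.
    if between_delete_and_create is not None:
        between_delete_and_create()

    # Step 2: create our key; "x" fails if the file already exists.
    keyfile = os.path.join(ssh_dir, "ssh_host_rsa_key")
    try:
        with open(keyfile, "x") as fh:
            fh.write("key written by cloud-init\n")
        return "cloud-init"
    except FileExistsError:
        return "sshd-keygen"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as ssh_dir:
        keyfile = os.path.join(ssh_dir, "ssh_host_rsa_key")

        def fake_sshd_keygen():
            with open(keyfile, "w") as fh:
                fh.write("key written by sshd-keygen\n")

        print(regenerate_host_keys(ssh_dir))
        print(regenerate_host_keys(ssh_dir, fake_sshd_keygen))
```

In both runs the key left over from the previous step is gone before anything new is written, which is the guarantee being discussed: whoever wins the race, the instance does not keep the image's old host key.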
Interesting. Yes, looking at an older centos8 vm, the keygen services run early and cloud-init hasn't started yet, so we end up deleting those keys and regenerating... and as long as that happens you're quite right that it could race as this PR shows. With some effort, I think the following drop-in unit does what I'd like, which is, if cloud-init is going to run, then don't bother starting the keygen template service. This checks the symlink that cloud-init's generator creates if it's going to enable cloud-init. When I run with this, I can see that the condition fails and we skip generating the keys early. And if cloud-init is disabled, it runs as needed. |
|
@raharper Thank you for the explanations and the suggestion! I just tested the drop-in, and it also works as intended on RHEL. So should we just drop the fix I introduced here? |
Great!
I think this depends on what fix goes upstream and whether that goes as-is downstream
I think we want to package this drop-in rather than trying to create it on-the-fly. We'd For Distros which don't use/include ssh-keygen@.service (Ubuntu does not include Thoughts? @esposem For downstream, you could package the file as part of the cloud-init rpm ASAP |
|
@raharper Thank you, I packaged the file downstream. I am closing this PR as the fix won't be needed. |
|
Actually, should we add this fix also upstream? Or do you want to leave cloud-init as it is? Sorry I initially thought it was a downstream fix only. |
I think upstream should take the drop-in config change as well, @TheRealFalcon @blackboxsw do you agree? |
|
Yes, I think it makes sense to include it upstream, though I'm not exactly sure how to accomplish that. I've never gotten around to grokking setup.py, but my understanding was we just listed the units to be dropped in /lib/systemd/system, and then the downstream packaging was responsible for installing/enabling things. If that's true, I'm not sure how we could go about creating a drop-in upstream. |
I'll submit a PR for it shortly. |
Proposed Commit Message
It looks like, at least in AWS RHEL 9 images, there is
a race between cloud-init and sshd-keygen.
In particular, they both create
/etc/ssh/ssh_host_*key* at first boot, sometimes causing a warning in cloud-init.
This is not a critical issue, as a new key is created anyway, either by cloud-init or by sshd.
What happens is:
1) cloud-init checks if /etc/ssh/ssh_host_rsa_key exists
2) it does not exist, so it continues the logic in cc_ssh line 234
3) sshd-keygen in the meantime creates /etc/ssh/ssh_host_rsa_key
4) cloud-init issues ssh-keygen -t rsa -N '' -f /etc/ssh/ssh_host_rsa_key, which fails
Masking the service with
systemctl mask sshd-keygen.target fixes the bug, but it is not the right solution.
The solution proposed here is to just analyze the error and
avoid throwing a warning if the file has been created in the meantime.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
RHBZ: 2002492