Skip to content

cc_ssh: handle race between cloud-init and sshd-keygen#1015

Closed
esposem wants to merge 1 commit into
canonical:mainfrom
esposem:race_sshd
Closed

cc_ssh: handle race between cloud-init and sshd-keygen#1015
esposem wants to merge 1 commit into
canonical:mainfrom
esposem:race_sshd

Conversation

@esposem
Copy link
Copy Markdown
Contributor

@esposem esposem commented Sep 13, 2021

Proposed Commit Message

It looks like that at least in AWS RHEL 9 images there is
a race between cloud-init and sshd-keygen.
In particular, they both create /etc/ssh/ssh_host_*key*
at first boot, causing sometimes warning in cloud-init:

cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['ssh-keygen', '-t', 'rsa', '-N', '', '-f', '/etc/ssh/ssh_host_rsa_key']
Exit code: 1
Reason: -
Stdout: Generating public/private rsa key pair.
        /etc/ssh/ssh_host_rsa_key already exists.
        Overwrite (y/n)?
Stderr:

This is not a critical issue, as a new key is created anyways, either by cloud-init or by sshd.

What happens is:

  1. cloud-init checks if /etc/ssh/ssh_host_rsa_key exist
  2. it does not exist, so it continues the logic in cc_ssh line 234
  3. sshd-keygen in the meanwhile creates /etc/ssh/ssh_host_rsa_key
  4. cloud-init issues ssh-keygen -t rsa -N '' -f /etc/ssh/ssh_host_rsa_key,
    failing

Masking the service with systemctl mask sshd-keygen.target fixes
the bug, but it is not the right solution.

The solution proposed here is to just analyze the error and
avoid throwing a warning if the file has been created in the meanwhile.

Signed-off-by: Emanuele Giuseppe Esposito eesposit@redhat.com
RHBZ: 2002492

Test Steps

# rm -rf /etc/ssh/ssh_host_rsa_key*
# cloud-init clean
# reboot

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 13, 2021

@TheRealFalcon @otubo do you agree with this fix? I am not sure this is the ideal solution.
Also @TheRealFalcon can you reproduce it in Ubuntu or other systems too, or is this something RHEL specific?

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 13, 2021

As suggested by Eduardo (and I also see it as a cleaner solution), we can instead modify cloud-init.service.tmpl and remove the Wants=sshd-keygen.service dependency. It should delay sshd-keygen startup execution. It looks like it is working, but being a race condition I am not 100% sure.
What do you think?

By the way, Before/After define the ordering in the dependencies, but I am not very clear on what Wants actually does here, and why it was introduced in the first time.

@TheRealFalcon
Copy link
Copy Markdown
Contributor

Wants is basically a 'start me at the same time if you can'. Before and After should take precedence though.

We already have a Before=sshd-keygen.service . It looks like sshd-keygen is a templated service though. I just checked on a fedora34 instance and see sshd-keygen@.service. Because of the template, I don't think there's a way for cloud-init to specify a Before/After (or even Wants) dependency on this.

If that's the case, ignoring the error when we race might be the best path forward.

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 14, 2021

Wants is basically a 'start me at the same time if you can'. Before and After should take precedence though.

So the point of having it is just to have 'try to start (at some point, depending on Before/After) the dependencies. If they don't start, it's okay'. So would it be fine to delete it? It looks like nothing happens if Wants is removed

We already have a Before=sshd-keygen.service . It looks like sshd-keygen is a templated service though. I just checked on a fedora34 instance and see sshd-keygen@.service. Because of the template, I don't think there's a way for cloud-init to specify a Before/After (or even Wants) dependency on this.

If that's the case, ignoring the error when we race might be the best path forward.

Actually I think the template gets compiled for each specific distribution, so in RHEL we have a cloudinit.service generate from that. So changing the template could effectively change something. The question is if the removal of Wants actually changes something, or it is just a case that the race does not happen when it gets deleted.

@TheRealFalcon
Copy link
Copy Markdown
Contributor

So the point of having it is just to have 'try to start (at some point, depending on Before/After) the dependencies. If they don't start, it's okay'. So would it be fine to delete it? It looks like nothing happens if Wants is removed

Eh...kind of? It's specifying to start as close to me as possible while respecting other dependencies. Without it, it could start much later in boot, which may or may not be a problem (we'd have to investigate), but I see no reason to remove it. If Before is working, Wants won't cause an issue. If Before isn't working, Wants won't work either. I suspect Before isn't working because of the templating issue.

Actually I think the template gets compiled for each specific distribution, so in RHEL we have a cloudinit.service generate from that. So changing the template could effectively change something. The question is if the removal of Wants actually changes something, or it is just a case that the race does not happen when it gets deleted.

Sorry, I'm not following. Are we referring to the same template? When I say sshd-keygen is a templated service, I'm referring to https://fedoramagazine.org/systemd-template-unit-files/ . On a fedora34 machine:

[root@credible-fowl ssh]# ls -l /lib/systemd/system/sshd-keygen*
-rw-r--r-- 1 root root 247 May 24 12:15 /lib/systemd/system/sshd-keygen@.service
-rw-r--r-- 1 root root 123 May 24 12:15 /lib/systemd/system/sshd-keygen.target

That @ makes it a template. I can say systemctl start sshd-keygen@rsa.service , and the service will generate me an rsa key. That target file in that directory does essentially that for some common key types:

[root@credible-fowl ssh]# cat /lib/systemd/system/sshd-keygen.target 
[Unit]
Wants=sshd-keygen@rsa.service
Wants=sshd-keygen@ecdsa.service
Wants=sshd-keygen@ed25519.service
PartOf=sshd.service

I think the fact that it is a template is preventing us from being able to specify Before or Wants on it. Systemd requires that you specify the actual service, not just a template.

Last time I looked, I hadn't noticed that target file. I think we could add a Before=sshd-keygen.target and Wants=sshd-keygen.target, and that would fix the issue.

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 15, 2021

Ok, what you say about the sshd template makes sense. Sorry I misunderstood at the beginning 😄

Anyways, I tried adding Before=sshd-keygen.target and Wants=sshd-keygen.target, but it doesn't solve the issue.

So since as you pointed out removing Wants=sshd-keygen.service can make it start much later in the boot and could possibly cause issues, and since the race condition does not produce any unwanted effect (regardless of the ordering a new host key is always created) I guess we can go ahead with the fix in this PR?

@TheRealFalcon
Copy link
Copy Markdown
Contributor

I guess we can go ahead with the fix in this PR?

Yep, works for me. Can you also add a test where we have a pre-existing key?

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 16, 2021

I guess we can go ahead with the fix in this PR?

Yep, works for me. Can you also add a test where we have a pre-existing key?

I'm sorry, can you elaborate more about the test? I will be happy to add it, but I am not sure how you want to test this.

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 16, 2021

Never mind, I think I figured what you mean. Let me know if the test is what you wanted.

@esposem esposem marked this pull request as ready for review September 16, 2021 09:58
It looks like that at least in AWS RHEL 9 images there is
a race between cloud-init and sshd-keygen.
In particular, they both create /etc/ssh/ssh_host_*key*
at first boot, causing sometimes warning in cloud-init:

cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['ssh-keygen', '-t', 'rsa', '-N', '', '-f', '/etc/ssh/ssh_host_rsa_key']
Exit code: 1
Reason: -
Stdout: Generating public/private rsa key pair.
        /etc/ssh/ssh_host_rsa_key already exists.
        Overwrite (y/n)?
Stderr:

What happens is:
1) cloud-init checks if /etc/ssh/ssh_host_rsa_key exist
2) it does not exist, so it continues the logic in cc_ssh line 234
3) sshd-keygen in the meanwhile creates /etc/ssh/ssh_host_rsa_key
4) cloud-init issues
   'ssh-keygen -t rsa -N '' -f /etc/ssh/ssh_host_rsa_key',
   failing

Masking the service with `systemctl mask sshd-keygen.target` fixes
the bug, but it is not the right solution.

The solution proposed here is to just analyze the error and
avoid throwing a warning if the file has been created in the meanwhile.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 16, 2021

hmm weird... Running both tox cloudinit/config/tests/test_ssh.py and python -m pytest tests/unittests cloudinit pass locally for me, instead all CI tests fail... @TheRealFalcon any idea on what is going wrong here?

@raharper
Copy link
Copy Markdown
Collaborator

In general, cloud-init already generates ssh keys and manipulates host ssh keys (it removes existing ones if detects it's running on a new instance so that image captured VMs don't retain the same host keys), so cloud-init should conflict with the ssh-keygen service.

but I am not very clear on what Wants actually does here

Wants tells systemd to add the job to the Units dependencies; it is an unordered dep, to influence ordering, one uses After or Before.[1]

  1. https://www.freedesktop.org/software/systemd/man/systemd.unit.html

@TheRealFalcon
Copy link
Copy Markdown
Contributor

@raharper , thanks for the context about Conflicts. That makes sense. Given that the sshd-keygen service is a oneshot, does using Wants vs Conflicts actually matter here, since we also have a Before? If we use Before, shouldn't it wait until the cloud-init service has finished before invoking sshd-keygen, regardless of whether we specify Wants or Conflicts? Does specifying a target change the semantic of it?

@raharper
Copy link
Copy Markdown
Collaborator

@raharper , thanks for the context about Conflicts. That makes sense. Given that the sshd-keygen service is a oneshot, does using Wants vs Conflicts actually matter here, since we also have a Before? If we use Before, shouldn't it wait until the cloud-init service has finished before invoking sshd-keygen, regardless of whether we specify Wants or Conflicts? Does specifying a target change the semantic of it?

  1. I don't think we should include Wants= at all, however, this may be a distro issue in that, if cloud-init packaged in RHEL does not include the cc_ssh module, then nothing would generate ssh keys. Without further investigation in the image itself I believe it's OK to leave a Wants here. In the case where cloud-init was not configured to generate ssh keys then the Wants=ssh-keygen tells systemd that cloud-init service depends on the keygen service (unordered) since cloud-init itself won't generate ssh keys.

  2. Reading further on Conflicts, it's a two-way relationship which could then prevent cloud-init from running.

  3. I believe that ssh-kegen sevice is idempotent, running cloud-init with Before=ssh-keygen should suffice to allow cloud-init to handle keygen if it's scheduled to run (ds-identify activation) and not regenerate a second set of keys after cloud-init has generated a set.

@raharper
Copy link
Copy Markdown
Collaborator

I don't think we want to accept ssh host keys generated by ssh-keygen except explicitly by user-config, in which users can already provide their own keys that cloud-init writes.

One of the critical use-case that cloud-init handles with ssh key generation is when an instance has been captured and launched as a new instance we don't want to keep the same host keys once we boot a new instance. ssh-keygen does not have (that I know of) any way to determine if the existing host-keys need to be removed and regenerated.

@TheRealFalcon
Copy link
Copy Markdown
Contributor

I believe that ssh-kegen sevice is idempotent, running cloud-init with Before=ssh-keygen should suffice to allow cloud-init to handle keygen if it's scheduled to run (ds-identify activation) and not regenerate a second set of keys after cloud-init has generated a set.

The problem with that (as I mentioned above), is that it is a templated service now...at least for some distros. Something like Before=ssh-keygen@ doesn't work for a templated service. @esposem said he tried to specify the sshd-keygen.target as well, but that didn't work either.

I don't think we want to accept ssh host keys generated by ssh-keygen except explicitly by user-config, in which users can already provide their own keys that cloud-init writes.
One of the critical use-case that cloud-init handles with ssh key generation is when an instance has been captured and launched as a new instance we don't want to keep the same host keys once we boot a new instance. ssh-keygen does not have (that I know of) any way to determine if the existing host-keys need to be removed and regenerated.

Well, we still delete any keys found at the beginning of the module, so that should prevent us from keeping keys from an older image: https://github.com/canonical/cloud-init/blob/main/cloudinit/config/cc_ssh.py#L193

The change here is specifically about when the two services race. Between the time cloud-init deletes the keys and tries to create new ones, sshd-keygen runs and creates new keys before we have a chance to do it ourselves. It'd be great if we could fix that via systemd, but I haven't found a way to do that with a templated service. Without that, it seems the easiest thing to do is just ignore if sshd-keygen has raced with us.

@raharper
Copy link
Copy Markdown
Collaborator

The problem with that (as I mentioned above), is that it is a templated service now...at least for some distros.

Interesting.

Yes, looking at an older centos8 vm, the keygen services run early and cloud-init hasn't started yet, so we end up deleting those keys and regenerating.... and as long as that happens you're quite right that it could race as this PR shows...

With some effort, I think the following drop-in unit does what I'd like, which is, if cloud-init is going to run, then don't bother starting the keygen template service.

$ mkdir -p /etc/systemd/system/sshd-keygen\@.service.d
$ cat > /etc/systemd/system/sshd-keygen\@.service.d/disabled-if-cloud-init-active.conf << EOF
[Unit]
ConditionPathExists=!/run/systemd/generator.early/multi-user.target.wants/cloud-init.target
EOF

This checks the symlink that cloud-init's generate creates if it's going to enable cloud-init.

When I run with this, I can see that the condition fails and we skip generating the keys early.

[root@c8-cloud ~]# systemctl status sshd-keygen@ecdsa 
● sshd-keygen@ecdsa.service - OpenSSH ecdsa Server Key Generation
   Loaded: loaded (/usr/lib/systemd/system/sshd-keygen@.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sshd-keygen@.service.d
           └─disabled-if-cloud-init-active.conf
   Active: inactive (dead)
Condition: start condition failed at Thu 2021-09-16 20:00:49 UTC; 58s ago
           └─ ConditionPathExists=!/run/systemd/generator.early/multi-user.target.wants/cloud-init.target was not met

And if cloud-init is disabled, it runs as needed

root@c8-cloud ~]# cloud-init status --long 
status: disabled
detail:
Cloud-init disabled by /etc/cloud/cloud-init.disabled
[root@c8-cloud ~]# systemctl status sshd-keygen@ecdsa 
● sshd-keygen@ecdsa.service - OpenSSH ecdsa Server Key Generation
   Loaded: loaded (/usr/lib/systemd/system/sshd-keygen@.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/sshd-keygen@.service.d
           └─disabled-if-cloud-init-active.conf
   Active: inactive (dead) since Thu 2021-09-16 20:09:41 UTC; 25s ago
  Process: 544 ExecStart=/usr/libexec/openssh/sshd-keygen ecdsa (code=exited, status=0/SUCCESS)
 Main PID: 544 (code=exited, status=0/SUCCESS)

Sep 16 20:09:41 c8-cloud systemd[1]: Starting OpenSSH ecdsa Server Key Generation...
Sep 16 20:09:41 c8-cloud systemd[1]: sshd-keygen@ecdsa.service: Succeeded.
Sep 16 20:09:41 c8-cloud systemd[1]: Started OpenSSH ecdsa Server Key Generation.

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 17, 2021

@raharper Thank you for the explanations and the suggestion! I just tested the drop-in, and it works as intended also in RHEL.

So should we just drop the fix I introduce here?
Also, should (and how can) we add automatic creation of the drop-in in cloud-init?

@raharper
Copy link
Copy Markdown
Collaborator

@raharper Thank you for the explanations and the suggestion! I just tested the drop-in, and it works as intended also in RHEL.

Great!

So should we just drop the fix I introduce here?

I think this depends on what fix goes upstream and whether that goes as-is downstream

Also, should (and how can) we add automatic creation of the drop-in in cloud-init?

I think we want to package this drop-in rather than trying to create it on-the-fly. We'd
need to do this in our generator so it is written soon enough for systemd to apply it to
the target units. That seems needlessly complex when we could just package it.

For Distros which don't use/include ssh-keygen@.service (Ubuntu does not include
this in their cloud images) it isn't needed but doesn't hurt anything to include it.

@TheRealFalcon @blackboxsw

Thoughts?

@esposem For downstream, you could package the file as part of the cloud-init rpm ASAP

@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 20, 2021

@raharper Thank you, I packed the file downstream. I am closing this PR as the fix won't be needed.

@esposem esposem closed this Sep 20, 2021
@esposem
Copy link
Copy Markdown
Contributor Author

esposem commented Sep 20, 2021

Actually, should we add this fix also upstream? Or do you want to leave cloud-init as it is? Sorry I initially thought it was a downstream fix only.

@raharper
Copy link
Copy Markdown
Collaborator

Actually, should we add this fix also upstream? Or do you want to leave cloud-init as it is? Sorry I initially thought it was a downstream fix only.

I think upstream should take the drop-in config change as well, @TheRealFalcon @blackboxsw do you agree?

@TheRealFalcon
Copy link
Copy Markdown
Contributor

Yes, I think it makes sense to include it upstream, though I'm not exactly sure how to accomplish that. I've never gotten around to grokking setup.py, but my understanding was we just listed the units to be dropped in /lib/systemd/system, and then the downstream packaging was responsible installing/enabling things. If that's true, I'm not sure how we could go about creating a drop-in upstream

@raharper
Copy link
Copy Markdown
Collaborator

Yes, I think it makes sense to include it upstream, though I'm not exactly sure how to accomplish that. I've never gotten around to grokking setup.py, but my understanding was we just listed the units to be dropped in /lib/systemd/system, and then the downstream packaging was responsible installing/enabling things. If that's true, I'm not sure how we could go about creating a drop-in upstream

I'll submit a PR for it shortly.

@raharper
Copy link
Copy Markdown
Collaborator

#1028

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants