Wait for apt lock by TheRealFalcon · Pull Request #1034 · canonical/cloud-init

TheRealFalcon · 2021-09-24T21:38:15Z

Proposed Commit Message

Wait for apt lock

Currently any attempt to run an apt command while another process holds
an apt lock will fail. We should instead wait to acquire the apt lock.

LP: #1944611

Additional Context

See https://bugs.launchpad.net/cloud-init/+bug/1944611 . I have also seen this intermittently while testing.

Test Steps

Beyond the unit tests, start an instance with the following cloud-config:

#cloud-config
bootcmd:
  - python3 -c "import fcntl; from time import sleep; x = open('/var/lib/dpkg/lock', 'w'); fcntl.lockf(x, fcntl.LOCK_EX | fcntl.LOCK_NB); sleep(50000)" &
drivers:
  nvidia:
    license-accepted: true

The old behavior is to fail immediately. With this PR it will wait 30 seconds and then fail. Decrease the sleep to an appropriate amount of time if you'd like to see it wait and then proceed once the lock is released.
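For readability, the lock-holding one-liner from the bootcmd above can be written out as a short script. This is only a sketch: the function name hold_lock and its parameters are made up for illustration and are not part of cloud-init.

```python
import fcntl
import time


def hold_lock(path, seconds):
    """Take an exclusive lock on `path` and hold it for `seconds` seconds.

    Hypothetical helper that mirrors the bootcmd one-liner.
    """
    with open(path, "w") as lock_file:
        # LOCK_NB makes lockf raise OSError right away if another process
        # already holds the lock, instead of blocking.
        fcntl.lockf(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        time.sleep(seconds)
    # Leaving the with block closes the file, which releases the lock.
```

Calling hold_lock("/var/lib/dpkg/lock", 20) as root should hold the lock for less than the 30 second wait, so the apt command is expected to wait and then proceed; the value 20 is an arbitrary choice.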

Checklist:

My code follows the process laid out in the documentation
I have updated or added any unit tests accordingly
I have updated or added any documentation accordingly

TheRealFalcon · 2021-09-24T21:43:18Z

Currently, the sleep time is hardcoded, which isn't ideal. I know that we're generally against adding top-level keys, but there's already some precedent for having not-really-documented top-level apt keys. I'd prefer we find a way to document them better, but I'm not sure it would be all that bad to add something like apt_lock_timeout alongside the already existing apt_get_command and apt_get_wrapper.

We could also put it under the apt top-level key, but that feels like adding configuration that belongs to another module, so I don't really like that option.

I'm open to any suggestions.

TheRealFalcon · 2021-09-24T21:46:06Z

+                    kwargs=subp_kwargs,
+                )
+            except subp.ProcessExecutionError as e:
+                if e.exit_code == 100 and 'Could not get lock' in e.stderr:


Does this pose any internationalization concerns?


blackboxsw

Thanks @TheRealFalcon. The only significant question I have is whether our _wait_for_apt_install loop needs to raise on unexpected (non-lock-related) errors. Exit code 100 is also returned in cases like an invalid apt sources.list, which retries won't fix:
E: The repository 'http://archive.ubuntu.com/ubuntu2 xenial Release' does not have a Release file.

blackboxsw · 2021-09-29T23:35:14Z

+                    kwargs=subp_kwargs,
+                )
+            except subp.ProcessExecutionError as e:
+                if e.exit_code == 100 and 'Could not get lock' in e.stderr:


A couple of concerns:

Agree that hard-coding English strings to scope our logging of the exception reason is ineffective without using gettext.

I'm not sure this logging warrants us digging into gettext on the apt and apt-utils domain to do translations on this expected string type for just this log.

This loop looks like it sleeps and retries for 30 seconds on all CalledProcessErrors. Shouldn't we raise rather than retry the CalledProcessErrors if the exit code is != 100, or if _is_apt_lock_available is True yet we still received a CalledProcessError?

I'm not quite understanding what you are expecting to handle in this except clause:

Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining the locks, or

Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?

TheRealFalcon · 2021-09-30T00:57:03Z

This loop looks like it sleeps and retries for 30 seconds on all CalledProcessErrors. Shouldn't we raise rather than retry the CalledProcessErrors if the exit code is != 100, or if _is_apt_lock_available is True yet we still received a CalledProcessError?

Yep, I forgot to add the raise. It's added now!

I'm not quite understanding what you are expecting to handle in this except clause:
Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining the locks, or
Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?

The second option. Since snap is doing things at the same time as us, I don't think it's out of the realm of possibility for us both to attempt to do an update at the same time. I could have removed the lock check entirely and retry on apt failure, but that would produce a lot of log spam and needless apt runs, so I have both here.

Do you think checking for a race condition is unnecessary? Dropping it would remove the exception text checking, which I'm not sure of the best way to approach. Though...I did have fun playing with apt in French, and that particular error message came back in English 😃

github-actions · 2021-10-15T00:02:13Z

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)


TheRealFalcon · 2021-10-18T13:37:37Z

The testing branch this was waiting on has merged, so I moved this out of WIP. I think the one remaining question is whether we're OK checking the text of an apt failure, or whether that seems riskier than leaving that extra check out altogether.

@blackboxsw or @holmanb Any strong opinions?

holmanb · 2021-10-18T19:19:39Z

+                    func=subp.subp,
+                    kwargs=subp_kwargs,
+                )
+            except subp.ProcessExecutionError as e:


There are some semantics I had to look up here, so I'll make a note of the logic/control flow (please comment if I'm wrong on any points):

On failure apt-get always returns 100, so parsing for "Could not get lock" on line 214 is probably necessary, since there are likely other causes of error. If ProcessExecutionError is raised and the exit code is not 100, then it's likely not an apt lock issue. The same is true if the "Could not get lock" message is missing.

This is racy, but better than current state, I think. Steps to race:

self._is_apt_lock_available() -> returns True

other process grabs lock before cloud-init

this process starts apt-get install/upgrade and fails
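The pre-flight check in the first step can be sketched roughly as below. The function name, its signature, and the choice to skip missing files are assumptions made for illustration, not the code in this PR.

```python
import fcntl
import os


def apt_lock_available(lock_files):
    """Return True if every existing lock file can be locked right now.

    Hypothetical sketch of a pre-flight check, not cloud-init's code.
    """
    for path in lock_files:
        if not os.path.exists(path):
            # Assumption: a lock file that does not exist cannot be held.
            continue
        with open(path, "w") as handle:
            try:
                # Non-blocking attempt: raises OSError if the lock is held.
                fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                return False
        # The file is closed here, so the lock is released again. This is
        # the window in which another process can grab it before apt runs.
    return True
```

Because the lock is released as soon as the check returns, a True result only says the lock was free at that instant, which is the source of the race in the steps above.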

holmanb · 2021-10-18T19:30:38Z

@TheRealFalcon I made a comment, but I'm not requesting anything there. I think checking text output is a requirement if we're going to log "Could not obtain apt lock".

It would be nice if apt-get used useful return codes (i.e. returned EAGAIN when unable to get lock) so we didn't have to resort to parsing text.

No objections from me, looks good.

TheRealFalcon · 2021-10-18T21:35:46Z

@blackboxsw @holmanb I pushed another commit that might be a good compromise. I added a large comment with it on the commit, but instead of checking the apt failure text or ignoring the race, we could instead just check the apt lock as soon as apt fails. If we can't acquire the lock, that almost certainly means it's because another application raced us. If we can acquire the lock, that likely means apt failed for unrelated reasons.

It's not a perfect solution, but given our options, I think it's good enough.
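A minimal sketch of that compromise, with the apt call and the lock check passed in as callables so the control flow is visible. All names here are hypothetical, and subprocess.CalledProcessError stands in for cloud-init's subp.ProcessExecutionError.

```python
import subprocess
import time


def wait_for_apt_command(run_apt, lock_available, timeout=30, poll_interval=1.0):
    """Run `run_apt` once the lock looks free, retrying only on lock races.

    Hypothetical sketch: `run_apt` and `lock_available` are caller-supplied
    callables, not functions from cloud-init.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not lock_available():
            time.sleep(poll_interval)
            continue
        try:
            return run_apt()
        except subprocess.CalledProcessError:
            # Re-check the lock instead of parsing apt's error text.
            if lock_available():
                # Lock is free, so apt most likely failed for another reason.
                raise
            # Lock is held, so another process probably raced us: keep waiting.
            time.sleep(poll_interval)
    raise TimeoutError("Could not get apt lock")
```

The re-check is a heuristic: if the other process finishes between the apt failure and the second lock check, the error is raised rather than retried.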

holmanb

+1 - I like this over the previous iteration.

github-actions · 2021-11-02T00:02:16Z

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

blackboxsw

Thanks for handling that gap. I validated that lock-handling races are coped with in this case, and I can see retries amid intermittent apt upgrade call collisions.

blackboxsw · 2021-11-09T05:41:27Z

+                # raced us when we tried to acquire it, so raise the apt
+                # error received. If the lock is unavailable, just keep waiting
+                if self._apt_lock_available():
+                    raise


Yes this works for me, thank you for adjusting this. I think it'll catch the majority of races.

julian-klode · 2021-12-06T16:56:28Z

This change is wrong and breaks parallel apt processes. You need to acquire the frontend-lock before acquiring any other locks, and acquire the locks in the same order as apt.

Since 20.04, apt can wait for a lock.

The apt(8) command automatically waits for a lock for 120 seconds (non-interactive) or infinitely.

The apt-get(8) command can be configured to wait as well by passing -o DPkg::Lock::Timeout=<seconds>, where <seconds> may also be -1 for infinite.

This avoids any races you'd get by doing the lock yourself and then invoking apt.
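As an illustration of the apt-get option mentioned above, a caller could build the command line like this. The helper name and the default of 120 seconds are assumptions for the example, and the option only has an effect on apt versions that support it.

```python
def apt_get_with_lock_timeout(args, lock_timeout=120):
    """Build an apt-get argv that asks apt itself to wait for the lock.

    Hypothetical helper; pass lock_timeout=-1 to wait indefinitely.
    """
    return ["apt-get", "-o", f"DPkg::Lock::Timeout={lock_timeout}", *args]
```

For example, apt_get_with_lock_timeout(["install", "-y", "curl"], lock_timeout=30) produces an argv list that can be handed to a process runner without going through a shell.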

julian-klode · 2021-12-06T17:05:24Z

The basic way to lock things is:

/var/lib/dpkg/lock-frontend
/var/lib/dpkg/lock
/var/cache/apt/archives

Presumably it's enough to just lock the frontend one.

/var/lib/apt/lists is locked independently of that install chain, and only locked during update, so you can acquire it in either order. Also, update does not acquire the dpkg frontend lock.

Presumably prepending /var/lib/dpkg/lock-frontend to the list of locks will cause the correct behavior.

Without locking frontend, another apt process waiting for that lock will run and then fail, causing bad UX.
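The ordering described above could look roughly like the following sketch, which takes the frontend lock first and releases everything in reverse order. The names LOCK_ORDER and hold_locks_in_order are hypothetical, and the archives path uses the /lock file name on the assumption that it is the file being locked.

```python
import contextlib
import fcntl

# Assumed order, frontend lock first, following the list above.
LOCK_ORDER = [
    "/var/lib/dpkg/lock-frontend",
    "/var/lib/dpkg/lock",
    "/var/cache/apt/archives/lock",
]


@contextlib.contextmanager
def hold_locks_in_order(paths):
    """Acquire each lock in the given order and hold them all.

    Hypothetical sketch: raises OSError if any lock is already held, in
    which case the locks taken so far are released again.
    """
    with contextlib.ExitStack() as stack:
        handles = []
        for path in paths:
            handle = stack.enter_context(open(path, "w"))
            fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            handles.append(handle)
        yield handles
        # ExitStack closes the files in reverse order, releasing the locks.
```

Usage would be a with hold_locks_in_order(LOCK_ORDER): block around the protected work, run with sufficient privileges to open the lock files.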

TheRealFalcon · 2021-12-06T17:09:12Z

Thanks @julian-klode ! We're still supporting 18.04, but it's good to know that once that goes EOL the locking code can be removed.

Is there any documentation for this locking behavior? I'm still not entirely sure what each lock is responsible for.

julian-klode · 2021-12-06T17:51:53Z

We only have waiting for parallel install though (it only waits for the frontend lock), not update.

/var/lib/dpkg/lock-frontend ensures nobody else tries to run dpkg while apt runs dpkg (whether it's the user, another apt, or another dpkg frontend). This is important because apt runs dpkg multiple times, so /var/lib/dpkg/lock is released in between - it locks an individual dpkg run. Before the frontend lock, you could end up losing the lock in the middle of an install/upgrade (between two dpkg runs).

/var/lib/apt/lists/lock is locked when metadata is being downloaded. It is not locked during metadata access, which might cause some race conditions. I'm wondering if I should try to acquire that lock when opening the cache, but it might be hard to do. It would also be nice to add waiting for that one into apt.

/var/cache/apt/archives/lock is locked once we start downloading packages and held while packages are being installed.

(Both are the same lock implementation wise, just for different download areas - package lists vs packages).

You'll mostly get by locking just the frontend and lists locks, but some other frontends might behave badly and not support the frontend lock themselves, and I'm not sure where it's all backported to. I think most Ubuntu releases, but I'm not 100% sure. In any case, there's no issue with always acquiring it.

TheRealFalcon · 2021-12-06T18:03:53Z

Thanks! That's very helpful.

TheRealFalcon requested a review from blackboxsw September 24, 2021 21:46

TheRealFalcon force-pushed the oracle-lock branch from d5749ab to d11fe65 Compare September 24, 2021 21:49

TheRealFalcon changed the title ~~Wait for apt lock~~ WIP: Wait for apt lock Sep 28, 2021

TheRealFalcon added the wip Work in progress, do not land label Sep 28, 2021

blackboxsw requested changes Sep 29, 2021

github-actions Bot added the stale-pr Pull request is stale; will be auto-closed soon label Oct 15, 2021

TheRealFalcon and others added 3 commits October 16, 2021 20:16

Wait for apt lock

a71971e

comments

ca04f06

Fix broken unit test

70065ff

TheRealFalcon force-pushed the oracle-lock branch from 276bf2a to 70065ff Compare October 18, 2021 13:33

TheRealFalcon removed wip Work in progress, do not land stale-pr Pull request is stale; will be auto-closed soon labels Oct 18, 2021

holmanb reviewed Oct 18, 2021

Comment thread cloudinit/distros/debian.py Outdated

IOError->OSError

7577ec2

holmanb reviewed Oct 18, 2021

yo dawg. I heard you liked apt locks...

d7ae748

update tests

879f04a

TheRealFalcon changed the title ~~WIP: Wait for apt lock~~ Wait for apt lock Oct 18, 2021

holmanb approved these changes Oct 18, 2021

github-actions Bot added the stale-pr Pull request is stale; will be auto-closed soon label Nov 2, 2021

TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Nov 4, 2021

blackboxsw approved these changes Nov 9, 2021

blackboxsw merged commit 3d15068 into canonical:main Nov 9, 2021

TheRealFalcon deleted the oracle-lock branch December 6, 2021 16:00

TheRealFalcon mentioned this pull request Dec 13, 2021

Include dpkg frontend lock in APT_LOCK_FILES (SC-650) #1153

Merged
