Skip to content

Wait for apt lock#1034

Merged
blackboxsw merged 6 commits into
canonical:mainfrom
TheRealFalcon:oracle-lock
Nov 9, 2021
Merged

Wait for apt lock#1034
blackboxsw merged 6 commits into
canonical:mainfrom
TheRealFalcon:oracle-lock

Conversation

@TheRealFalcon
Copy link
Copy Markdown
Contributor

@TheRealFalcon TheRealFalcon commented Sep 24, 2021

Proposed Commit Message

Wait for apt lock

Currently any attempt to run an apt command while another process holds
an apt lock will fail. We should instead wait to acquire the apt lock.

LP: #1944611

Additional Context

See https://bugs.launchpad.net/cloud-init/+bug/1944611 . I have also seen this intermittently while testing.

Test Steps

Beyond the unit tests, start an instance with the following cloud-config

#cloud-config
bootcmd:
  - python3 -c "import fcntl; from time import sleep; x = open('/var/lib/dpkg/lock', 'w'); fcntl.lockf(x, fcntl.LOCK_EX | fcntl.LOCK_NB); sleep(50000)" &
drivers:
  nvidia:
    license-accepted: true

Old behavior is to fail immediately. With this PR it will wait 30 seconds and then fail. Decrease the sleep an appropriate amount of time if you'd like to see it wait and then proceed when the lock is cleared.

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

TheRealFalcon commented Sep 24, 2021

Currently, the sleep time is hardcoded, which isn't ideal. I know that we're generally against adding top-level keys, but there's already some precedent for having not-really-documented top-level apt keys. I'd prefer we find a way to document them better, but I'm not sure it would be all that bad to add something like apt_lock_timeout alongside the already existing apt_get_command and apt_get_wrapper.

We could also put it under the apt top-level key, but that feels like adding configuration that belongs to another module, so I don't really like that option.

I'm open to any suggestions.

Comment thread cloudinit/distros/debian.py Outdated
kwargs=subp_kwargs,
)
except subp.ProcessExecutionError as e:
if e.exit_code == 100 and 'Could not get lock' in e.stderr:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this pose any internationalization concerns?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of concerns:

  • Agree that hard-coding English strings to scope our logging of the exception reason is ineffective without using gettext.
  • I'm not sure this logging warrants us digging into gettext on the apt and apt-utils domain to do translations on this expected string type for just this log.
  • This loop looks to sleep and retry for 30 seconds on all CalledProcessErrors. Shouldn't we be raising and not retrying the called processerrors if exit is != 100 or if _is_apt_lock_available is True yet we received a CalledProcessError?

I'm not quite understanding what you are expecting to handle in this except clause:

  • Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining- the locks or
  • Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?

@TheRealFalcon TheRealFalcon changed the title Wait for apt lock WIP: Wait for apt lock Sep 28, 2021
@TheRealFalcon TheRealFalcon added the wip Work in progress, do not land label Sep 28, 2021
Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @TheRealFalcon only significant question I have is whether our _wait_for_apt_install loop needs to raise in unexpected (non-lock related) errors? Exit 100 is raised in instances like invalid apt sources.list which won't be fixed by retries:
E: The repository 'http://archive.ubuntu.com/ubuntu2 xenial Release' does not have a Release file.

Comment thread cloudinit/distros/debian.py Outdated
Comment thread cloudinit/distros/debian.py Outdated
Comment thread cloudinit/distros/debian.py Outdated
kwargs=subp_kwargs,
)
except subp.ProcessExecutionError as e:
if e.exit_code == 100 and 'Could not get lock' in e.stderr:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of concerns:

  • Agree that hard-coding English strings to scope our logging of the exception reason is ineffective without using gettext.
  • I'm not sure this logging warrants us digging into gettext on the apt and apt-utils domain to do translations on this expected string type for just this log.
  • This loop looks to sleep and retry for 30 seconds on all CalledProcessErrors. Shouldn't we be raising and not retrying the called processerrors if exit is != 100 or if _is_apt_lock_available is True yet we received a CalledProcessError?

I'm not quite understanding what you are expecting to handle in this except clause:

  • Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining- the locks or
  • Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?

Comment thread cloudinit/distros/debian.py Outdated
@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

This loop looks to sleep and retry for 30 seconds on all CalledProcessErrors. Shouldn't we be raising and not retrying the called processerrors if exit is != 100 or if _is_apt_lock_available is True yet we received a CalledProcessError?

Yep, I forgot to add the raise. It's added now!

I'm not quite understanding what you are expecting to handle in this except clause:
Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining- the locks or
Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?

The second option. Since snap is doing things at the same time as us, I don't think it's out of the realm of possibility for us both to attempt to do an update at the same time. I could have removed the lock check entirely and retry on apt failure, but that would produce a lot of log spam and needless apt runs, so I have both here.

Do you think checking for a race condition is unnecessary? It would remove the exception text checking, which I'm not exactly sure the best way to approach. Though...I did have fun playing with apt in French, and that particular error message came back in English 😃

@github-actions
Copy link
Copy Markdown

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

@github-actions github-actions Bot added the stale-pr Pull request is stale; will be auto-closed soon label Oct 15, 2021
TheRealFalcon and others added 3 commits October 16, 2021 20:16
Currently any attempt to run an apt command while another process holds
an apt lock will fail. We should instead wait to acquire the apt lock.

LP: #1944611
@TheRealFalcon TheRealFalcon removed wip Work in progress, do not land stale-pr Pull request is stale; will be auto-closed soon labels Oct 18, 2021
@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

TheRealFalcon commented Oct 18, 2021

The testing branch this was waiting on has merged, so I moved this out of WIP. I think the one remaining question we have left is are we ok checking the text of an apt failure, or does that seem more risky than leaving that extra check out altogether.

@blackboxsw or @holmanb Any strong opinions?

Comment thread cloudinit/distros/debian.py Outdated
Comment thread cloudinit/distros/debian.py Outdated
func=subp.subp,
kwargs=subp_kwargs,
)
except subp.ProcessExecutionError as e:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some semantics I had to look up here, so I'll make a note of the logic/control flow (please comment if I'm wrong on any points:

  • On failure apt-get always returns 100, parsing for "Could not get apt lock" on line 214 is probably necessary since there are likely other causes for error. If ProcessExecutionError is raised and exit code is not 100, then it's likely not an apt lock issue. The same is true for missing "Could not get apt lock" message.

  • This is racy, but better than current state, I think. Steps to race:

  1. self._is_apt_lock_available() -> returns True
  2. other process grabs lock before cloud-init
  3. this process starts apt-get install/upgrade and fails

@holmanb
Copy link
Copy Markdown
Member

holmanb commented Oct 18, 2021

@TheRealFalcon I made a comment, but I'm not requesting anything there. I think checking text output is a requirement if we're going to log "Could not obtain apt lock".

It would be nice if apt-get used useful return codes (i.e. returned EAGAIN when unable to get lock) so we didn't have to resort to parsing text.

No objections from me, looks good.

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

@blackboxsw @holmanb I pushed another commit that might be a good compromise. I added a large comment with it on the commit, but instead of checking the apt failure text or ignoring the race, we could instead just check the apt lock as soon as apt fails. If we can't acquire the lock, that almost certainly means it's because another application raced us. If we can acquire the lock, that likely means apt failed for unrelated reasons.

It's not a perfect solution, but given our options, I think it's good enough.

@TheRealFalcon TheRealFalcon changed the title WIP: Wait for apt lock Wait for apt lock Oct 18, 2021
Copy link
Copy Markdown
Member

@holmanb holmanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - I like this over the previous iteration.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 2, 2021

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

@github-actions github-actions Bot added the stale-pr Pull request is stale; will be auto-closed soon label Nov 2, 2021
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Nov 4, 2021
Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the gap handling there. Validated lock handling races are coped with in this case and can see retries amid intermittent apt upgrade call collisions.

# raced us when we tried to acquire it, so raise the apt
# error received. If the lock is unavailable, just keep waiting
if self._apt_lock_available():
raise
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this works for me, thank you for adjusting this. I think it'll catch the majority of races.

@blackboxsw blackboxsw merged commit 3d15068 into canonical:main Nov 9, 2021
@TheRealFalcon TheRealFalcon deleted the oracle-lock branch December 6, 2021 16:00
@julian-klode
Copy link
Copy Markdown

julian-klode commented Dec 6, 2021

This change is wrong and breaks parallel apt processes. You need to acquire the frontend-lock before acquiring any other locks, and acquire the locks in the same order as apt.

Since 20.04, apt can wait for a lock.

The apt(8) command automatically waits for a lock for 120 seconds (non-interactive) or infinitely.

The apt-get(8) command can be configured to wait as well by passing the -o DPkg::Lock::Timeout=, where may also be -1 for infinite.

This avoids any races you'd get by doing the lock yourself and then invoking apt.

@julian-klode
Copy link
Copy Markdown

julian-klode commented Dec 6, 2021

The basic way to lock things is:

  1. /var/lib/dpkg/lock-frontend
  2. /var/lib/dpkg/lock
  3. /var/cache/apt/archives

Presumably it's enough to just lock the frontend one.

/var/lib/apt/lists is locked independently of that install chain, and only locked during update, so you can acquire it either order. Also update does not acquire the dpkg frontend lock.

Presumably prepending /var/lib/dpkg/lock-frontend to the list of locks will cause the correct behavior.

Without locking frontend, another apt process waiting for that lock will run and then fail, causing bad UX.

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

Thanks @julian-klode ! We're still supporting 18.04, but it's good to know that once that goes EOL the locking code can be removed.

Is there any documentation for this locking behavior? I'm still not entirely sure what each lock is responsible for.

@julian-klode
Copy link
Copy Markdown

julian-klode commented Dec 6, 2021

We only have waiting for parallel install though (it only waits for the frontend lock), not update.

/var/lib/dpkg/lock-frontend ensures nobody else tries to run dpkg while apt runs dpkg (whether it's the user, another apt, or another dpkg frontend). This is important because apt runs dpkg multiple times, so /var/lib/dpkg/lock is released in between - it locks an individual dpkg run. Before the frontend lock, you could end up losing the lock in the middle of an install/upgrade (between two dpkg runs).

/var/lib/apt/lists/lock is locked when metadata is being downloaded. It is not locked during metadata access, which might cause some race conditions. Wondering if I should try acquire that lock when opening the cache, but it might be hard to do. Also would be nice to add waiting for that one into apt.

/var/cache/apt/archives/lock is locked once we start downloading packages and held while packages are being installed.

(Both are the same lock implementation wise, just for different download areas - package lists vs packages).

You'll mostly get by locking just frontends and the lists lock, but oh well, some other frontends might behave badly and not support the frontend lock themselves, and I'm not sure where it's all backported too. I think most Ubuntu releases, but not 100% sure. In any case, no issue acquiring it always.

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

Thanks! That's very helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants