Wait for apt lock#1034
Conversation
|
Currently, the sleep time is hardcoded, which isn't ideal. I know that we're generally against adding top-level keys, but there's already some precedent for having not-really-documented top-level apt keys. I'd prefer we find a way to document them better, but I'm not sure it would be all that bad to add something like We could also put it under the I'm open to any suggestions. |
| kwargs=subp_kwargs, | ||
| ) | ||
| except subp.ProcessExecutionError as e: | ||
| if e.exit_code == 100 and 'Could not get lock' in e.stderr: |
There was a problem hiding this comment.
Does this pose any internationalization concerns?
There was a problem hiding this comment.
A couple of concerns:
- Agree that hard-coding English strings to scope our logging of the exception reason is ineffective without using gettext.
- I'm not sure this logging warrants us digging into gettext on the apt and apt-utils domain to do translations on this expected string type for just this log.
- This loop looks to sleep and retry for 30 seconds on all CalledProcessErrors. Shouldn't we be raising and not retrying the called processerrors if exit is != 100 or if _is_apt_lock_available is True yet we received a CalledProcessError?
I'm not quite understanding what you are expecting to handle in this except clause:
- Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining- the locks or
- Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?
d5749ab to
d11fe65
Compare
blackboxsw
left a comment
There was a problem hiding this comment.
thanks @TheRealFalcon only significant question I have is whether our _wait_for_apt_install loop needs to raise in unexpected (non-lock related) errors? Exit 100 is raised in instances like invalid apt sources.list which won't be fixed by retries:
E: The repository 'http://archive.ubuntu.com/ubuntu2 xenial Release' does not have a Release file.
| kwargs=subp_kwargs, | ||
| ) | ||
| except subp.ProcessExecutionError as e: | ||
| if e.exit_code == 100 and 'Could not get lock' in e.stderr: |
There was a problem hiding this comment.
A couple of concerns:
- Agree that hard-coding English strings to scope our logging of the exception reason is ineffective without using gettext.
- I'm not sure this logging warrants us digging into gettext on the apt and apt-utils domain to do translations on this expected string type for just this log.
- This loop looks to sleep and retry for 30 seconds on all CalledProcessErrors. Shouldn't we be raising and not retrying the called processerrors if exit is != 100 or if _is_apt_lock_available is True yet we received a CalledProcessError?
I'm not quite understanding what you are expecting to handle in this except clause:
- Are you thinking that our _is_apt_lock_available is not enough of a pre-flight check on obtaining- the locks or
- Are you trying to handle a race where some other process obtained the APT lock between our success on _is_apt_lock_available and our subsequent call of subp?
Yep, I forgot to add the raise. It's added now!
The second option. Since snap is doing things at the same time as us, I don't think it's out of the realm of possibility for us both to attempt to do an update at the same time. I could have removed the lock check entirely and retry on apt failure, but that would produce a lot of log spam and needless apt runs, so I have both here. Do you think checking for a race condition is unnecessary? It would remove the exception text checking, which I'm not exactly sure the best way to approach. Though...I did have fun playing with apt in French, and that particular error message came back in English 😃 |
|
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.) |
Currently any attempt to run an apt command while another process holds an apt lock will fail. We should instead wait to acquire the apt lock. LP: #1944611
276bf2a to
70065ff
Compare
|
The testing branch this was waiting on has merged, so I moved this out of WIP. I think the one remaining question we have left is are we ok checking the text of an apt failure, or does that seem more risky than leaving that extra check out altogether. @blackboxsw or @holmanb Any strong opinions? |
| func=subp.subp, | ||
| kwargs=subp_kwargs, | ||
| ) | ||
| except subp.ProcessExecutionError as e: |
There was a problem hiding this comment.
There are some semantics I had to look up here, so I'll make a note of the logic/control flow (please comment if I'm wrong on any points:
-
On failure apt-get always returns 100, parsing for "Could not get apt lock" on line 214 is probably necessary since there are likely other causes for error. If
ProcessExecutionErroris raised and exit code is not 100, then it's likely not an apt lock issue. The same is true for missing"Could not get apt lock"message. -
This is racy, but better than current state, I think. Steps to race:
self._is_apt_lock_available()-> returns True- other process grabs lock before cloud-init
- this process starts apt-get install/upgrade and fails
|
@TheRealFalcon I made a comment, but I'm not requesting anything there. I think checking text output is a requirement if we're going to log "Could not obtain apt lock". It would be nice if apt-get used useful return codes (i.e. returned EAGAIN when unable to get lock) so we didn't have to resort to parsing text. No objections from me, looks good. |
|
@blackboxsw @holmanb I pushed another commit that might be a good compromise. I added a large comment with it on the commit, but instead of checking the apt failure text or ignoring the race, we could instead just check the apt lock as soon as apt fails. If we can't acquire the lock, that almost certainly means it's because another application raced us. If we can acquire the lock, that likely means apt failed for unrelated reasons. It's not a perfect solution, but given our options, I think it's good enough. |
holmanb
left a comment
There was a problem hiding this comment.
+1 - I like this over the previous iteration.
|
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.) |
blackboxsw
left a comment
There was a problem hiding this comment.
Thanks for the gap handling there. Validated lock handling races are coped with in this case and can see retries amid intermittent apt upgrade call collisions.
| # raced us when we tried to acquire it, so raise the apt | ||
| # error received. If the lock is unavailable, just keep waiting | ||
| if self._apt_lock_available(): | ||
| raise |
There was a problem hiding this comment.
Yes this works for me, thank you for adjusting this. I think it'll catch the majority of races.
|
This change is wrong and breaks parallel apt processes. You need to acquire the frontend-lock before acquiring any other locks, and acquire the locks in the same order as apt. Since 20.04, apt can wait for a lock. The apt(8) command automatically waits for a lock for 120 seconds (non-interactive) or infinitely. The apt-get(8) command can be configured to wait as well by passing the -o DPkg::Lock::Timeout=, where may also be -1 for infinite. This avoids any races you'd get by doing the lock yourself and then invoking apt. |
|
The basic way to lock things is:
Presumably it's enough to just lock the frontend one.
Presumably prepending Without locking frontend, another apt process waiting for that lock will run and then fail, causing bad UX. |
|
Thanks @julian-klode ! We're still supporting 18.04, but it's good to know that once that goes EOL the locking code can be removed. Is there any documentation for this locking behavior? I'm still not entirely sure what each lock is responsible for. |
|
We only have waiting for parallel install though (it only waits for the frontend lock), not update.
(Both are the same lock implementation wise, just for different download areas - package lists vs packages). You'll mostly get by locking just frontends and the lists lock, but oh well, some other frontends might behave badly and not support the frontend lock themselves, and I'm not sure where it's all backported too. I think most Ubuntu releases, but not 100% sure. In any case, no issue acquiring it always. |
|
Thanks! That's very helpful. |
Proposed Commit Message
Additional Context
See https://bugs.launchpad.net/cloud-init/+bug/1944611 . I have also seen this intermittently while testing.
Test Steps
Beyond the unit tests, start an instance with the following cloud-config
Old behavior is to fail immediately. With this PR it will wait 30 seconds and then fail. Decrease the sleep an appropriate amount of time if you'd like to see it wait and then proceed when the lock is cleared.
Checklist: