Try to join the cgroup of the init process of the parent container when apply_cgroup for a tenant container fails due to a "Device or resource busy" error#3347
Conversation
|
Thanks for opening this PR — I ran into the same issue while working on This problem is not about the init process, but about exec processes under cgroup v2 when domain controllers are enabled. Once a controller is turned on, the container’s configured cgroup may no longer be joinable (kernel returns EBUSY / EPERM), and exec is expected to fall back to joining the init process’s cgroup. This behavior is explicitly documented by runc: Note for cgroup v2: in case the process can’t join the top level cgroup, runc exec fallback is to try joining the cgroup of container’s init. Importantly, this fallback is exec-only:
Because this is policy, not cgroup mechanism, runc implements it in the container execution path, not inside the cgroup manager itself. This avoids:
For youki, the correct place to implement, I think, is here: // crates/libcontainer/src/process/container_intermediate_process.rs
fn apply_cgroups<
C: CgroupManager<Error = E> + ?Sized,
E: std::error::Error + Send + Sync + 'static,
>(
cmanager: &C,
resources: Option<&LinuxResources>,
init: bool,
) -> Result<()> { ... }where we know:
Handling this inside libcgroups (or only for systemd) is insufficient and environment-dependent. The expected behavior should be:
Without implementing this retry at the libcontainer level (as runc does), exec under cgroup v2 with domain controllers enabled will continue to fail for cgroupfs users. WDYT? Thanks again. |
| /// The init process PID of the parent container if the container is created as a tenant. | ||
| parent_init_pid: Option<Pid>, |
There was a problem hiding this comment.
ContainerType should have parent_init_pid.
| Err(e) => { | ||
| // If adding the process to the cgroup fails due to a "Device or resource busy" error, | ||
| // manager tries to join the cgroup of the init process of the tenant container. | ||
| if e.to_string().contains("Device or resource busy") |
There was a problem hiding this comment.
How about getting the error(EBUSY) from the debug client instead of parsing the error message?
There was a problem hiding this comment.
I really wanted to, but I couldn't achieve that just by putting the following code here.
impl From<nix::Error> for SystemdClientError {
fn from(err: nix::Error) -> SystemdClientError {
match err {
nix::Error::EBUSY => DbusError::DeviceOrResourceBusy(err.to_string()).into(),
_ => DbusError::ConnectionError(err.to_string()).into(),
}
}
}
Seems like socket::sendmsg in dbus_native::DbusConnection::send_message() doesn't emit nix::error::EBUSY. Rather, it puts out an error message with no error in Result.
Could you give me some advice on what I should do here?
There was a problem hiding this comment.
To clarify, the goal of this PR’s initial implementation is to serve as a conceptual demo, providing a starting point for discussing a more suitable implementation.
Since you mentioned it, I'll stop focusing on the detailed code of this PR for now. I'll review the detailed code once we've clarified and implemented the non-demo aspects.
crates/libcgroups/src/common.rs
Outdated
| // is empty string ("") and the value is the cgroup path the <pid> is in. | ||
| // | ||
| // ref: https://github.com/opencontainers/cgroups/blob/main/utils.go#L171-L219 | ||
| pub fn parse_proc_cgroup_file(path: &str) -> Result<HashMap<String, String>, ParseProcCgroupError> { |
There was a problem hiding this comment.
Could we use the procfs crate? Be careful: if it reads inside the container, please use ProcfsHandle for safety.
|
@utam0k @tommady To clarify, the goal of this PR’s initial implementation is to serve as a conceptual demo, providing a starting point for discussing a more suitable implementation. I’m still not very familiar with youki, or even Rust itself. Also, this may be out of context, but I want to clarify the wording here. I made a table of the wording I imagine.
What confused me here is that the word
WDYT about this? Should I use the name |
|
@tommady
I strongly agree with that. I'll re-implement the logic there. |
2b24f7f to
e1e62b0
Compare
|
Just to clarify, since I've forgotten to put sign-offs on the previous commits and I've pushed a complete re-implementation now, I force-pushed the branch. |
|
Please set it to "ready for review" when you are ready to review the detailed codes after the discussion. |
Thanks for the table — that actually helped me realize part of the confusion is on my side too 😅 I think I’ve been a bit sloppy with naming. Referring to your table, when I said “init process” I meant Container B’s init process (the TenantContainer being exec’d into), not Container A’s init. In your terms, this is the exec case for Container B: if joining the configured cgroup fails under cgroup v2, exec should fall back to B’s init process cgroup, not the parent’s. Sorry about the naming confusion 🤪 that’s on me. I’d really appreciate hearing others’ opinions on whether using runc-style naming. |
|
This isn't a separate “Container B”; it's an exec/tenant process joining the existing container's cgroup. So calling it “parent” is confusing. How about |
|
@logica0419 What is the status of this PR? |
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
…en add_process_to_unit fails Signed-off-by: Takuto Nagami <logica0419@gmail.com>
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
… process.rs Signed-off-by: Takuto Nagami <logica0419@gmail.com>
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
fc1ad10 to
b2c5320
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses nested-container cgroup v2 behavior when joining a tenant container’s cgroup via systemd fails with an EBUSY (“Device or resource busy”) error, by falling back to joining the parent (landlord) init process’s cgroup.
Changes:
- Add an EBUSY-specific fallback in
apply_cgroupsto join the landlord init process cgroup by reading/proc/<pid>/cgroupand writing tocgroup.procs. - Extend tenant container process args to carry
landlord_init_pidand plumb it from the tenant builder. - Teach the systemd dbus layer to classify
System.Error.EBUSYdistinctly and surface it viaSystemdManagerError::is_ebusy().
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/libcontainer/src/process/container_intermediate_process.rs | Adds EBUSY detection and tenant fallback to parent init cgroup during cgroup application. |
| crates/libcontainer/src/process/args.rs | Extends ContainerType::TenantContainer with landlord_init_pid. |
| crates/libcontainer/src/container/tenant_builder.rs | Populates landlord_init_pid from the parent container state. |
| crates/libcgroups/src/systemd/manager.rs | Adds SystemdManagerError::is_ebusy() for EBUSY classification. |
| crates/libcgroups/src/systemd/dbus_native/utils.rs | Introduces a DbusError::DeviceOrResourceBusy variant. |
| crates/libcgroups/src/systemd/dbus_native/proxy.rs | Maps DBus ErrorName=System.Error.EBUSY to the new EBUSY error variant. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, | ||
| // we check the ErrorName header and switch error type | ||
| // see https://github.com/youki-dev/youki/issues/3342 |
There was a problem hiding this comment.
The comment references a specific file path and line number ("process/container_intermediate_process.rs:262"), which will quickly become stale as the code moves. Prefer referencing the issue link (#3342) or describing the call site without hard-coded line numbers/paths.
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, | |
| // we check the ErrorName header and switch error type | |
| // see https://github.com/youki-dev/youki/issues/3342 | |
| // For the relevant intermediate-process call path, detect EBUSY via the | |
| // ErrorName header and map it to the more specific error type. | |
| // See https://github.com/youki-dev/youki/issues/3342 for context. |
| if let Err(err) = cmanager.add_task(pid) { | ||
| if !init && is_ebusy(&err) { | ||
| // If adding the process to the cgroup fails due to a "Device or resource busy" error, | ||
| // manager tries to join the cgroup of the init process of the parent container. | ||
| tracing::debug!( | ||
| "failed to add task to cgroup, trying to join parent's init process cgroup" | ||
| ); |
There was a problem hiding this comment.
This adds a new EBUSY-specific fallback path (joining the parent init process's cgroup via /proc and writing to cgroup.procs), but the unit tests in this module don't exercise it. Add a unit test that simulates an EBUSY add_task error and asserts the fallback behavior is taken (and that non-EBUSY errors still fail).
| ) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
The new is_ebusy() helper is used to control a behavior-changing fallback, but it's not covered by tests. Add a unit test that constructs a SystemdManagerError::SystemdClient(SystemdClientError::DBus(DbusError::DeviceOrResourceBusy(_))) and verifies is_ebusy() returns true, and false for other DBus errors.
| #[cfg(test)] | |
| mod tests { | |
| use super::*; | |
| #[test] | |
| fn test_is_ebusy_true_for_device_or_resource_busy() { | |
| let err = SystemdManagerError::SystemdClient(SystemdClientError::DBus( | |
| DbusError::DeviceOrResourceBusy("busy".into()), | |
| )); | |
| assert!(err.is_ebusy()); | |
| } | |
| #[test] | |
| fn test_is_ebusy_false_for_other_dbus_errors() { | |
| let err = SystemdManagerError::SystemdClient(SystemdClientError::DBus( | |
| DbusError::Failed("not busy".into()), | |
| )); | |
| assert!(!err.is_ebusy()); | |
| } | |
| } |
There was a problem hiding this comment.
Should this function be tested?
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
| let mut ctr = 0; | ||
| let message = String::deserialize(&msg.body, &mut ctr)?; | ||
|
|
||
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, |
There was a problem hiding this comment.
process/container_intermediate_process.rs:262
This is not permanent.
There was a problem hiding this comment.
Yep, that was my mistake 😩
I rewrote them based on the Copilot's suggestion.
| exec_notify_fd: RawFd, | ||
| landlord_init_pid: Option<Pid>, |
There was a problem hiding this comment.
Please provide descriptions for these fields.
There was a problem hiding this comment.
Also, let's update the developer documentation.
https://youki-dev.github.io/youki/developer/introduction.html
There was a problem hiding this comment.
I'll work on the dev doc later as well.
Which pages should be updated specifically?
I’d also appreciate any advice on what kind of information should be included.
| #[cfg(not(feature = "systemd"))] | ||
| { | ||
| false | ||
| } | ||
|
|
||
| #[cfg(feature = "systemd")] |
There was a problem hiding this comment.
Why was it necessary to separate them in this function?
There was a problem hiding this comment.
Yes, since when feature = "systemd" is disabled, e (of type libcgroups::common::AnyManagerError::Systemd) doesn't have .ebusy() method.
By the way, I’m personally not very happy with how information is propagated in manager.rs, or with the complex type casting here.
Could you think of another approach?
| let message = String::deserialize(&msg.body, &mut ctr)?; | ||
|
|
||
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, | ||
| // we check the ErrorName header and switch error type |
There was a problem hiding this comment.
| // we check the ErrorName header and switch error type | |
| // we check the ErrorName header and switch error type. |
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, | ||
| // we check the ErrorName header and switch error type | ||
| // see https://github.com/youki-dev/youki/issues/3342 | ||
| const EBUSY_ERROR_NAME: &str = "System.Error.EBUSY"; |
There was a problem hiding this comment.
Wouldn't it be more appropriate to define it in message.rs?
There was a problem hiding this comment.
I thought so too, so I moved it 👍
| .and_then(|h| match &h.value { | ||
| HeaderValue::String(s) => Some(s.as_str()), | ||
| _ => None, | ||
| }) |
There was a problem hiding this comment.
How about using map? instead of match syntax.
There was a problem hiding this comment.
How would that work?
Since I’m not very familiar with Rust, I’d really appreciate it if you could share an example snippet 😃
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
Signed-off-by: Takuto Nagami <logica0419@gmail.com>
logica0419
left a comment
There was a problem hiding this comment.
Hi, thanks for the review!
First of all, I’m really sorry for leaving this PR unattended for so long.
I’ve been crazily busy over the past three months, and I even got seriously sick last month.
I have more free time now, so I’ll actively work on this until it’s merged.”
【Field naming】
How about
landlord_init_pid(landlord = parent init in your context)?
This naming is good, I love it! I've fixed the naming, as you kindly reviewed.
【About Dbus error】
With GitHub Copilot, I found that I can retrieve the original error information from the ErrorName header.
On Ubuntu 24.04, it was "System.Error.EBUSY", so I guess it's common across Linux distributions (not sure)... Please comment if you have any concerns.
Also, as I wrote in the review comment, I'm not really satisfied with the current approach of how DbusError::DeviceOrResourceBusy is propagated to apply_cgroup().
I’m not very experienced with Rust, so I’d really appreciate any suggestions.”
【Testing】
I think some integration or end-to-end tests are necessary.
I’d appreciate any advice on what those tests should look like, since I’m not very familiar with the project structure yet.
| let mut ctr = 0; | ||
| let message = String::deserialize(&msg.body, &mut ctr)?; | ||
|
|
||
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, |
There was a problem hiding this comment.
Yep, that was my mistake 😩
I rewrote them based on the Copilot's suggestion.
| // to check if error is EBUSY at process/container_intermediate_process.rs:262, | ||
| // we check the ErrorName header and switch error type | ||
| // see https://github.com/youki-dev/youki/issues/3342 | ||
| const EBUSY_ERROR_NAME: &str = "System.Error.EBUSY"; |
There was a problem hiding this comment.
I thought so too, so I moved it 👍
| .and_then(|h| match &h.value { | ||
| HeaderValue::String(s) => Some(s.as_str()), | ||
| _ => None, | ||
| }) |
There was a problem hiding this comment.
How would that work?
Since I’m not very familiar with Rust, I’d really appreciate it if you could share an example snippet 😃
| ) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Should this function be tested?
| exec_notify_fd: RawFd, | ||
| landlord_init_pid: Option<Pid>, |
| #[cfg(not(feature = "systemd"))] | ||
| { | ||
| false | ||
| } | ||
|
|
||
| #[cfg(feature = "systemd")] |
There was a problem hiding this comment.
Yes, since when feature = "systemd" is disabled, e (of type libcgroups::common::AnyManagerError::Systemd) doesn't have .ebusy() method.
By the way, I’m personally not very happy with how information is propagated in manager.rs, or with the complex type casting here.
Could you think of another approach?
| exec_notify_fd: RawFd, | ||
| landlord_init_pid: Option<Pid>, |
There was a problem hiding this comment.
I'll work on the dev doc later as well.
Which pages should be updated specifically?
I’d also appreciate any advice on what kind of information should be included.
| if !error_message.is_empty() { | ||
| let msg = error_message[0]; | ||
| if msg.body.is_empty() { | ||
| // this should rarely be the case | ||
| return Err(DbusError::MethodCallErr("Unknown Dbus Error".into()).into()); | ||
| } else { | ||
| // in error message, first item of the body (if present) is always a string | ||
| // indicating the error | ||
| let mut ctr = 0; | ||
| let msg = String::deserialize(&msg.body, &mut ctr)?; | ||
| return Err(DbusError::MethodCallErr(msg).into()); | ||
| } | ||
|
|
||
| // in error message, first item of the body (if present) is always a string | ||
| // indicating the error | ||
| let mut ctr = 0; | ||
| let message = String::deserialize(&msg.body, &mut ctr)?; | ||
|
|
||
| // To check if error is EBUSY in the intermediate_process, detect EBUSY via the | ||
| // ErrorName header and map it to the more specific error type. | ||
| // See https://github.com/youki-dev/youki/issues/3342 for context. | ||
| if let Some(error_name) = msg | ||
| .headers | ||
| .iter() | ||
| .find(|h| h.kind == HeaderKind::ErrorName) | ||
| .and_then(|h| match &h.value { | ||
| HeaderValue::String(s) => Some(s.as_str()), | ||
| _ => None, | ||
| }) | ||
| && error_name == ERROR_NAME_EBUSY | ||
| { | ||
| return Err(DbusError::DeviceOrResourceBusy(message).into()); | ||
| } | ||
|
|
||
| return Err(DbusError::MethodCallErr(message).into()); | ||
| } |
There was a problem hiding this comment.
from the D-Bus spec (Message Protocol -> Message Types -> Error)
https://dbus.freedesktop.org/doc/dbus-specification.html
An ERROR may have any arguments, but if the first argument is a STRING, it must be an error message. The error message may be logged or shown to the user in some way.
from my understanding that means an error might have 0 arguments (an empty body).
therefore, I’d suggest keeping the else branch, but avoiding an early return when the body is empty. Instead, handle the empty body explicitly, for example:
if !error_message.is_empty() {
let msg = error_message[0];
let message = if msg.body.is_empty() {
"Unknown Dbus Error".into()
} else {
// in error message, first item of the body (if present) is always a string
// indicating the error
let mut ctr = 0;
String::deserialize(&msg.body, &mut ctr)?
};
// To check if error is EBUSY in the intermediate_process, detect EBUSY via the
// ErrorName header and map it to the more specific error type.
// See https://github.com/youki-dev/youki/issues/3342 for context.
if let Some(error_name) = msg
...
}There was a problem hiding this comment.
That's a pretty smart Rust technique, I love that!
| if let ContainerType::TenantContainer { | ||
| exec_notify_fd: _, | ||
| landlord_init_pid, | ||
| } = container_type | ||
| && let Some(landlord_init_pid) = landlord_init_pid | ||
| && let Some(landlord_init_proc_cgroup) = | ||
| ProcessCGroups::from_read(ProcfsHandle::new()?.open( | ||
| ProcfsBase::ProcPid(landlord_init_pid.as_raw() as u32), | ||
| "cgroup", | ||
| OpenFlags::O_RDONLY | OpenFlags::O_CLOEXEC, | ||
| )?)? | ||
| .into_iter() | ||
| .find(|c| c.controllers.is_empty()) | ||
| && let Some(landlord_init_proc_cgroup_path) = | ||
| landlord_init_proc_cgroup.pathname.strip_prefix("/") | ||
| { | ||
| libcgroups::common::write_cgroup_file( | ||
| Path::new(libcgroups::common::DEFAULT_CGROUP_ROOT) | ||
| .join(Path::new(landlord_init_proc_cgroup_path)) | ||
| .join(libcgroups::common::CGROUP_PROCS), | ||
| pid, | ||
| ) | ||
| .map_err(|err| IntermediateProcessError::Cgroup(err.to_string()))?; | ||
| return Ok(()); | ||
| } |
There was a problem hiding this comment.
I think there’s a subtle issue with how the fallback logic is currently written.
right now it relies quite a bit on the ? operator, which means any failure during the fallback immediately bubbles up.
the problem is that this ends up overwriting the original error, which is usually the one I actually care about.
that can make debugging pretty confusing
scenario:
- systemd returns an EBUSY (“Device or resource busy”) error when youki attempts to join the cgroup.
- youki then enters the fallback path and tries to join the landlord’s init process cgroup.
- however, the landlord process exits just before the fallback executes, so /proc/{pid}/cgroup no longer exists (ENOENT).
as a result, the function exits early with the fallback error, the user ends up seeing only:
ERROR youki::process::container_intermediate_process error during container init: Cgroup("Procfs Error: No such file or directory")the original "Device or resource busy" error is completely gone.
the user thinks youki failed because a file was missing, not because of a cgroup conflict.
what I think would feel better:
the fallback is more of a “best effort” thing, so it probably shouldn’t be able to override the original failure.
it would be much easier to understand what’s going on if we kept the original error and just logged what happened during the fallback. something like:
DEBUG youki::process::container_intermediate_process failed to add task to cgroup, trying to join parent's init process cgroup
DEBUG youki::process::container_intermediate_process failed to read landlord cgroup: Procfs Error: No such file or directory
ERROR youki::process::container_intermediate_process failed to add task to cgroup pid=1234 err="Device or resource busy" init=falsethis way:
- we still see the fallback attempt (useful for debugging),
- but we don’t lose the original reason things failed.
WDYT?


Description
Type of Change
Testing
Steps to Reproduceof [Bug]: Nested containers: exec process not added to cgroup v2 #3342 and got the expected resultRelated Issues
Fixes #3342
Additional Context