systemd: add SetUnified for direct cgroupfs writes bypassing dbus #59

Closed
sohankunkerkar wants to merge 1 commit into opencontainers:main from sohankunkerkar:setunified-v0.0.6

Conversation

@sohankunkerkar

Add SetUnified method on UnifiedManager that writes unified cgroup v2 resource values directly to cgroupfs without going through systemd's SetUnitProperties dbus path.

When callers use Set() to update specific unified keys (e.g. memory.min, memory.low), the current implementation bundles those updates with all other resource properties into a single SetUnitProperties dbus call. This can cause systemd to reset unrelated cgroup properties on the unit. SetUnified avoids this by writing only the specified unified values via the existing fs2 manager path.

This is needed by kubelet to clear stale MemoryQoS cgroup protection values during feature rollback without triggering systemd side effects on CPU and memory limit properties.

Context

Copilot AI review requested due to automatic review settings May 7, 2026 03:21
@sohankunkerkar sohankunkerkar requested a review from a team as a code owner May 7, 2026 03:21
@sohankunkerkar
Author

cc @haircommander

Copilot AI left a comment

Pull request overview

This PR adds a targeted API to update cgroup v2 “unified” (cgroupfs file) values for systemd-managed cgroups without using systemd’s SetUnitProperties DBus call, to avoid systemd resetting unrelated cgroup properties during property updates.

Changes:

  • Add (*UnifiedManager).SetUnified(map[string]string) to write selected unified keys directly via the cgroupfs (fs2) path, bypassing DBus.
  • Add an integration test to verify unified keys (e.g., memory.min, memory.low) can be set and later cleared using the new method.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| systemd/v2.go | Adds SetUnified to perform direct cgroupfs unified writes via the fs2 manager. |
| systemd/systemd_test.go | Adds TestSetUnified integration coverage for setting/clearing unified memory protection knobs. |

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
@haircommander
Contributor

hm if we do this is there a chance systemd clobbers the changes? We've had situations in the past where writing directly to cgroups without going through dbus means systemd doesn't become aware of the change and later undoes it.

@sohankunkerkar
Author

> hm if we do this is there a chance systemd clobbers the changes? We've had situations in the past where writing directly to cgroups without going through dbus means systemd doesn't become aware of the change and later undoes it.

Yup, systemd will clobber these writes on the next SetUnitProperties call for the unit. That's actually why this API exists. The kubelet MemoryQoS feature sets memory.min/memory.low on QoS-class cgroups via cm.Update() → SetUnitProperties. When MemoryQoS is disabled, we need to clear those stale values. The problem is that clearing via SetUnitProperties(MemoryLow=0) causes systemd to re-apply all stored properties for the unit, which resets cpu.max/cpu.weight/memory.max to defaults (kubernetes/kubernetes#137886). So we can't go through dbus for the cleanup path.

The caller (clearStaleMemoryQoS in kubelet) handles the clobbering by running after every cm.Update() cycle, not just once at startup. We tried startup-only cleanup first and confirmed it gets undone immediately: UpdateCgroups runs via wait.Until, which fires the function right away, and systemd re-applies the stored MemoryLow on that first SetUnitProperties call. The pattern ends up being: cm.Update() (systemd re-applies stale MemoryLow) → clearStaleMemoryQoS() (direct cgroupfs write overrides it). This repeats every ~1 minute. The direct write wins because it runs last.

So yes, the clobbering is expected and the caller is built around it. The API intentionally leaves that responsibility to the caller rather than trying to solve it here.
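The update-then-clear cycle described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: `fakeCgroup`, `update`, `clearStale`, and the stored value `1048576` are hypothetical stand-ins, not kubelet or systemd code.

```go
package main

import "fmt"

// fakeCgroup is a toy stand-in for one systemd-managed cgroup.
type fakeCgroup struct {
	storedMemoryLow string            // value systemd still remembers (assumed)
	files           map[string]string // current cgroupfs file contents
}

// update models cm.Update(): the dbus call makes systemd re-apply
// its stored MemoryLow to the cgroup.
func (c *fakeCgroup) update() {
	c.files["memory.low"] = c.storedMemoryLow
}

// clearStale models clearStaleMemoryQoS(): a direct cgroupfs write
// that zeroes the memory protection knobs.
func (c *fakeCgroup) clearStale() {
	c.files["memory.min"] = "0"
	c.files["memory.low"] = "0"
}

// runCycle runs one update followed by one cleanup and reports
// memory.low as seen after each step.
func runCycle(c *fakeCgroup) (afterUpdate, afterClear string) {
	c.update()
	afterUpdate = c.files["memory.low"]
	c.clearStale()
	afterClear = c.files["memory.low"]
	return afterUpdate, afterClear
}

func main() {
	c := &fakeCgroup{storedMemoryLow: "1048576", files: map[string]string{}}
	for i := 1; i <= 3; i++ {
		a, b := runCycle(c)
		fmt.Printf("cycle %d: after update=%s, after clear=%s\n", i, a, b)
	}
}
```

In every cycle the stale value reappears after the update and is zero again after the cleanup, which is why the cleanup has to run each time rather than once. The sketch also makes the short window between the two steps visible.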

@haircommander
Contributor

I guess do we understand why setting the memory.low is unsetting cpu.max/cpu.weight/memory.max? I feel like we should address that rather than doing all of this overwriting all the time

@sohankunkerkar
Author

> I guess do we understand why setting the memory.low is unsetting cpu.max/cpu.weight/memory.max? I feel like we should address that rather than doing all of this overwriting all the time

systemd’s SetUnitProperties dbus API reconciles the entire unit state, not just the provided properties. When kubelet calls cm.Update() with only memory Unified keys, zero-valued CPU fields are skipped, so they are omitted from the dbus call. systemd treats omitted properties as reset-to-default and clobbers cpu.max/memory.max. This is expected systemd behavior.

@haircommander
Contributor

so maybe we need to keep track of the state of the cgroup in kubelet so each time we pass the properties we don't clobber our own values? or in this library read them before we write them. if that's too complicated ig this would work but it feels like bad practice to be fighting with systemd so much on this

@sohankunkerkar
Author

> so maybe we need to keep track of the state of the cgroup in kubelet so each time we pass the properties we don't clobber our own values? or in this library read them before we write them. if that's too complicated ig this would work but it feels like bad practice to be fighting with systemd so much on this

When the gate is on, this already works! The problem arises only when the gate is off: setMemoryQoS doesn't run, so cm.Update() sends CPU-only properties and systemd re-applies the stored MemoryLow from when the gate was previously on. We could always send MemoryLow=0 even with the gate off, but that was rejected upstream since it means running MemoryQoS code on every node unconditionally.

The alternative of doing read-modify-write in the cgroups library to snapshot the full unit state before SetUnitProperties is a bigger change with its own race conditions since it's not atomic against systemd. A targeted cgroupfs write after the dbus call is the simplest thing that works here.

@kolyshkin
Contributor

I will take a closer look next week, but for now:

  1. If this is a method that writes to fs (rather than systemd), why is this a systemd driver method, and not a method of fs2?
  2. fs2 manager can already set unified resources, what prevents you from using it?

@sohankunkerkar
Author

> so maybe we need to keep track of the state of the cgroup in kubelet so each time we pass the properties we don't clobber our own values? or in this library read them before we write them. if that's too complicated ig this would work but it feels like bad practice to be fighting with systemd so much on this

Dug deeper and confirmed why the dbus path can't work for this. Ran a test on systemd 257:

  # Set CPU and memory on a slice via dbus
  systemctl set-property --runtime test-mask.slice CPUQuota=80% CPUWeight=500 MemoryMax=1073741824

  # Overwrite values via cgroupfs (simulating what enforceNodeAllocatableCgroups + external process does)
  echo "50000 100000" > test-mask.slice/cpu.max
  echo "200" > test-mask.slice/cpu.weight
  echo "536870912" > test-mask.slice/memory.max

  # Send ONLY MemoryMin=0 via dbus (what kubelet's setMemoryQoS would do)
  systemctl set-property --runtime test-mask.slice MemoryMin=0

  # Result: ALL values reverted to dbus-stored values
  cpu.max: 80000 100000  (reverted from 50000)
  cpu.weight: 500        (reverted from 200)
  memory.max: 1073741824 (reverted from 536870912)
  memory.min: 0          (correctly cleared)

SetUnitProperties triggers full unit realization: systemd rewrites all cgroup properties from its stored context, not just the one you changed. There's no dbus API to update a single cgroup property without triggering this. This is the same dbus method kubelet uses.
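The behavior observed in that test can be reproduced with a small toy model. This is a sketch under stated assumptions, not systemd's implementation: the `unit` type and its methods are hypothetical, and properties are keyed by cgroup file name (for example `cpu.max` instead of `CPUQuota`) to keep the mapping out of the picture.

```go
package main

import "fmt"

// unit is a toy model of a systemd unit (hypothetical, for illustration).
type unit struct {
	stored   map[string]string // properties remembered from set-property calls
	cgroupfs map[string]string // current contents of the cgroup files
}

func newUnit() *unit {
	return &unit{stored: map[string]string{}, cgroupfs: map[string]string{}}
}

// setProperties models a SetUnitProperties call: merge the new
// properties into the stored context, then "realize" the unit by
// rewriting every stored property to cgroupfs.
func (u *unit) setProperties(props map[string]string) {
	for k, v := range props {
		u.stored[k] = v
	}
	for k, v := range u.stored {
		u.cgroupfs[k] = v
	}
}

// writeDirect models a direct write such as `echo value > file`.
// The stored context is not updated, so the unit does not know about it.
func (u *unit) writeDirect(key, value string) {
	u.cgroupfs[key] = value
}

func main() {
	u := newUnit()

	// Set CPU and memory via the modeled dbus path.
	u.setProperties(map[string]string{
		"cpu.max":    "80000 100000",
		"cpu.weight": "500",
		"memory.max": "1073741824",
	})

	// Overwrite the values directly on the modeled cgroupfs.
	u.writeDirect("cpu.max", "50000 100000")
	u.writeDirect("cpu.weight", "200")
	u.writeDirect("memory.max", "536870912")

	// Send only memory.min=0 via the modeled dbus path.
	u.setProperties(map[string]string{"memory.min": "0"})

	for _, k := range []string{"cpu.max", "cpu.weight", "memory.max", "memory.min"} {
		fmt.Printf("%s: %s\n", k, u.cgroupfs[k])
	}
}
```

After the last call the three directly written values are back at what the stored context holds, and `memory.min` is `0`, matching the result listed in the test.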

@sohankunkerkar
Author

> I will take a closer look next week, but for now:
>
>   1. If this is a method that writes to fs (rather than systemd), why is this a systemd driver method, and not a method of fs2?

It's on the systemd driver because the problem is systemd-specific. With the cgroupfs driver, Set() writes unified keys directly to cgroupfs without side effects. With the systemd driver, Set() calls setUnitProperties via dbus first, which triggers full unit realization: systemd rewrites all cgroup properties from its stored context, not just the ones you sent. #59 (comment)

>   2. fs2 manager can already set unified resources, what prevents you from using it?

The caller (kubelet) could technically get the path via unifiedMgr.Path("") and create a standalone fs2.NewManager(config, path) to call Set() with only Unified keys. But that requires the caller to construct a cgroups.Cgroup config, create a second manager, and understand the internal architecture (systemd driver wraps fs2). SetUnified keeps this contained. Happy to restructure if you'd prefer a different approach though.

@sohankunkerkar
Author

Closing this in favor of kubernetes/kubernetes#138903

4 participants