drm/compositor: fall back to primary-plane cursor on map_mut failure #2033
Open
poelzi wants to merge 1 commit into
The cursor-plane fast path called .expect("Lost track of cursor device")
on the result of gbm::BufferObject::map_mut. That panic was written
assuming the only way the outer Result could fail was a vanished gbm
device handle. In practice, on NVIDIA the kernel allocator can return
ENOMEM here — typically after a misbehaving client (e.g. Chromium's
video decoder crashing under SIGILL) leaks GPU memory in nvkms — and
the panic propagated all the way up render_frame, killing the entire
compositor over a transient per-frame allocation failure.
Treat the outer map_mut error the same way the four sibling failure
points immediately above already do (create_buffer, add_framebuffer,
copy_element_to_cursor_bo, missing underlying_storage): log at debug!
and return None, letting the caller composite the cursor on the
primary plane for this frame. Once kernel memory frees up the cursor
plane is re-tried automatically.
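The fallback described above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the actual smithay code: `CursorPlaneState`, `map_cursor_bo`, and `try_cursor_plane` are hypothetical names, the mapping call is simulated, and `eprintln!` stands in for the `debug!` log.

```rust
use std::io;

/// Hypothetical stand-in for a usable cursor-plane configuration; not a smithay type.
#[derive(Debug, PartialEq)]
struct CursorPlaneState {
    width: u32,
    height: u32,
}

/// Hypothetical stand-in for mapping the cursor buffer object.
/// `fail_with` simulates the kernel refusing the mapping with the given errno.
fn map_cursor_bo(fail_with: Option<i32>) -> io::Result<CursorPlaneState> {
    match fail_with {
        Some(errno) => Err(io::Error::from_raw_os_error(errno)),
        None => Ok(CursorPlaneState { width: 64, height: 64 }),
    }
}

/// Returns `Some` when the cursor plane can be used for this frame, and `None`
/// when the caller should composite the cursor on the primary plane instead.
fn try_cursor_plane(fail_with: Option<i32>) -> Option<CursorPlaneState> {
    match map_cursor_bo(fail_with) {
        Ok(state) => Some(state),
        Err(err) => {
            // The real code logs at debug level; eprintln! keeps this sketch self-contained.
            eprintln!("failed to map cursor bo, falling back to primary plane: {err}");
            None
        }
    }
}

fn main() {
    const ENOMEM: i32 = 12; // errno value reported in the panic message

    // Frame 1: the mapping fails, so the cursor is drawn on the primary plane.
    assert_eq!(try_cursor_plane(Some(ENOMEM)), None);

    // Frame 2: memory is available again, so the cursor plane is used.
    assert!(try_cursor_plane(None).is_some());
}
```

Because the failure is reported as `None` per frame instead of being stored anywhere, the next call simply tries the cursor plane again, which is what makes the retry automatic.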
Repro path on the user's machine: chromium SIGILL in the Media thread →
nv_drm_gem_nvkms_map starts returning ENOMEM → niri panics at
mod.rs:3353 with "Lost track of cursor device: Os { code: 12,
kind: OutOfMemory }" → every Wayland client loses its socket.
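For reviewers who want to see why errno 12 surfaces as `OutOfMemory` and why `.expect()` turns it into a panic, here is a small standalone sketch. It assumes a Linux target, where errno 12 is ENOMEM; `enomem` and `expect_panics_on_enomem` are hypothetical helper names and nothing here touches gbm or smithay.

```rust
use std::io;
use std::panic;

/// Hypothetical helper: builds the error the kernel reported (errno 12, ENOMEM on Linux).
fn enomem() -> io::Error {
    io::Error::from_raw_os_error(12)
}

/// Calls `.expect(...)` on a failed mapping result and reports whether it panicked.
/// Requires the default `panic = "unwind"` strategy for `catch_unwind` to observe it.
fn expect_panics_on_enomem() -> bool {
    let outcome = panic::catch_unwind(|| {
        let mapped: io::Result<()> = Err(enomem());
        // Same message as the panic site described above.
        mapped.expect("Lost track of cursor device");
    });
    outcome.is_err()
}

fn main() {
    // The standard library classifies ENOMEM as ErrorKind::OutOfMemory on Linux.
    println!("kind: {:?}", enomem().kind());
    println!("expect panicked: {}", expect_panics_on_enomem());
}
```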
Summary
DrmCompositor::render_cursor_plane calls .expect("Lost track of cursor device") on the outer Result of gbm::BufferObject::map_mut. That panic was written assuming the only failure mode was a vanished gbm device, but on NVIDIA the kernel allocator can return ENOMEM here, typically after a misbehaving client (e.g. Chromium's video decoder hitting a CHECK and dying under SIGILL) leaks GPU memory in nvkms. A transient per-frame allocation failure then takes the entire compositor down through render_frame.
Treat the outer map_mut error the same way the four sibling failure points immediately above already do (create_buffer, add_framebuffer, copy_element_to_cursor_bo, missing underlying_storage): log at debug! and return None. The caller falls back to compositing the cursor on the primary plane for this frame, and the cursor-plane fast path is automatically retried on subsequent frames once kernel memory frees up.
Repro
Observed in the wild on a niri + NVIDIA setup running smithay 0.7.0:
Sequence: chromium Media thread SIGILL → kernel logs
[drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* Failed to map NvKmsKapiMemory → 5s later niri panics at the line above → every Wayland client loses its socket.
Test plan
cargo check --no-default-features --features "backend_drm,backend_gbm,renderer_pixman": clean.
cargo test --lib under the default + relevant features: 71/71 pass, no behavior change.
cargo clippy: no new lints (one pre-existing io_other_error warning in gbm.rs:215 is unrelated).
Notes
The other .unwrap()/.expect() sites in DrmCompositor were audited and are state-machine invariants (HashMap entries the code just inserted, Options the conditional just proved Some); they do not touch the GPU allocator path.