Fiber#13
Open
daandemeyer wants to merge 273 commits into
This generates on-the-fly cpio initrds from 'extra' resources declared in Type #1 entries and installs them via the Linux initrd protocol so that they get passed to the Linux kernel. Replaces: systemd#39286
It'll be used in the next commit.
Verb dispatch is left untouched for now. Co-developed-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixup for 8623980. This didn't cause any problems until the conversion away from getopt_long().
--timeout-signal is now documented (fixup for e209926). Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>
So strv_push_with_size() doesn't have to recalculate the size every time.
…pe 1) (systemd#41863) This implements the "extra" stanza for type 1 entries in systemd-boot, see: uapi-group/specifications@bde167a It comes with a really thorough test suite matching our current level of testing of systemd-boot (read: there is none, I ask you to trust me, Claude, and your review on this one)... Split out of systemd#41543
option_parser_next_arg() is renamed to option_parser_peek_next_arg() to match option_parser_consume_next_arg(). A new helper, option_parser_get_arg(…, n), is added: it is a common pattern to need only a single arg, and getting an array and extracting a single item from it is too verbose.
…to be used by vmspawn/nspawn/pid1 to provide storage volumes in a generic fashion (systemd#41776) BindPath= in unit files and --bind= in nspawn/vmspawn don't really cut it for connecting arbitrary storage infra. Let's do something about it, and implement a simple, light-weight API for acquiring an fd to a storage volume. Benefits:
1. the interface can be implemented by anyone, connecting anything to vmspawn/nspawn/service management
2. very loose coupling: just bind a socket into a well-known dir, done
3. mounting can happen on-demand
This addresses some trivial points made by @keszybz in the PR review.
This is mostly stuff discussed in systemd#41776.
UEFI firmware can report the currently-active keyboard layout via EFI_HII_DATABASE_PROTOCOL.GetKeyboardLayout(). The layout descriptor includes an RFC 4646 / BCP 47 language tag (e.g. "en-US"). Query this from sd-boot/sd-stub and write it to a new LoaderKeyboardLayout EFI variable, advertised through a new EFI_LOADER_FEATURE_KEYBOARD_LAYOUT feature bit.

On the OS side, systemd-vconsole-setup reads the variable as a lowest-priority fallback for the console keymap. To map the BCP 47 tag to a vconsole keymap we extend /usr/share/systemd/kbd-model-map with an optional sixth column listing the comma-separated BCP 47 tags each row covers; a new find_vconsole_keymap_for_bcp47() helper walks the file, preferring an exact tag match and otherwise falling back to the row whose tag matches the input's primary subtag. Credentials, /etc/vconsole.conf, and vconsole.keymap= on the kernel command line continue to take precedence.

bootctl status surfaces the new variable, printing the language tag or "n/a (not reported by firmware)" when sd-boot advertises the feature but the firmware HII database didn't expose a layout (common on QEMU without a USB keyboard, since EDK2's PS/2 driver does not register an HII keyboard layout).
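The exact-match-then-primary-subtag lookup can be sketched in plain C. This is a simplified model: the row table, the helper names, and the one-tag-per-row layout are illustrative, not the real kbd-model-map parsing (which carries comma-separated tags in its sixth column).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Illustrative stand-in for a kbd-model-map row; the real file has
 * more columns and allows several comma-separated BCP 47 tags. */
typedef struct {
        const char *keymap;   /* vconsole keymap, e.g. "us" */
        const char *bcp47;    /* one BCP 47 tag per row, for simplicity */
} KeymapRow;

static const KeymapRow demo_rows[] = {
        { "us", "en-US" },
        { "de", "de-DE" },
        { "fr", "fr"    },
};

/* Does 'row_tag' share the primary subtag (part before the first '-')
 * with 'tag'? e.g. "de-AT" matches a row tagged "de-DE" or "de". */
static bool primary_subtag_matches(const char *tag, const char *row_tag) {
        size_t n = strcspn(tag, "-");          /* length of primary subtag */
        return strncasecmp(tag, row_tag, n) == 0 &&
               (row_tag[n] == '\0' || row_tag[n] == '-');
}

static const char *find_keymap_for_bcp47(const KeymapRow *rows, size_t n_rows, const char *tag) {
        const char *fallback = NULL;

        for (size_t i = 0; i < n_rows; i++) {
                if (strcasecmp(rows[i].bcp47, tag) == 0)
                        return rows[i].keymap;          /* exact match wins */
                if (!fallback && primary_subtag_matches(tag, rows[i].bcp47))
                        fallback = rows[i].keymap;      /* remember first primary-subtag match */
        }

        return fallback;
}
```

With the demo table above, "en-US" resolves exactly, "de-AT" falls back to the "de" row via the primary subtag, and an uncovered tag returns NULL so the caller can move on to the next keymap source.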
The builtin one also makes the clipboard and such work. spicevmc is only required for remote desktop use cases, so let's use the builtin one instead.
…stall.target Many of our services are nowadays implemented via socket activation, and hence require sockets.target to be active to be accessible. One of them is mute-console.socket, which we typically want to use from systemd-firstboot.service, systemd-sysinstall.service and other related services. Hence let's pull in basic.target rather than sysinit.target from system-install.target, since it pulls in sockets.target too. Effectively, this doesn't change much except for pulling in a bunch more sockets, and frankly going for sysinit.target was really a bug to begin with.
Limiting VMs to 2 CPUs was cargo-culting without any actual data that this benefits performance. The host OS has a scheduler, so let's make use of it and give the VM access to all the CPUs. This doesn't mean they become inaccessible to the host; it just means the VM gets as many virtual CPUs as the host has CPU cores (threads). How they get scheduled is still up to the host OS.
This makes sure that whenever we want to show the OS name we can show the fancy name. To that end, this moves the escaping/validation of the fancy name out of hostnamed into generic code, and then makes use of it in sysinstall, firstboot and prompt-util.
This partially reverts 267b16f. We usually make xyz_size() take NULL, e.g. hashmap_size().
…ept zero length entry These will be used in later commits.
In many network protocols, the length-prefixed data format is often used. Let's add a simple parser and builder for the format.
They were dropped by commit 267b16f, but will be used later. Hence, let's reintroduce them.
In many network protocols, e.g. DHCP, the TLV format is used. Let's introduce a simple parser and builder for the data format.
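The TLV shape in question can be illustrated with a minimal, self-contained builder and parser in the DHCP-option style (1-byte type, 1-byte length, then the value). The names here are hypothetical and unrelated to the actual systemd helpers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one TLV entry to buf; returns the new offset, or 0 on overflow.
 * (Illustrative sketch, not the systemd API.) */
static size_t tlv_append(uint8_t *buf, size_t bufsize, size_t off,
                         uint8_t type, const void *data, uint8_t len) {
        if (off + 2 + (size_t) len > bufsize)
                return 0;
        buf[off++] = type;
        buf[off++] = len;
        memcpy(buf + off, data, len);
        return off + len;
}

/* Find the first entry with the given type; returns a pointer to its
 * value and stores the length, or NULL if absent or malformed. */
static const uint8_t *tlv_find(const uint8_t *buf, size_t size,
                               uint8_t type, uint8_t *ret_len) {
        size_t off = 0;

        while (off + 2 <= size) {
                uint8_t t = buf[off], l = buf[off + 1];

                if (off + 2 + (size_t) l > size)
                        return NULL;    /* truncated entry: reject the whole blob */
                if (t == type) {
                        *ret_len = l;
                        return buf + off + 2;
                }
                off += 2 + (size_t) l;
        }

        return NULL;
}
```

The parser validates each length against the remaining buffer before trusting it, which is the part that matters for network-facing formats like DHCP options.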
SUSE uses a different preset, so don't just assert in the test; instead just start the socket in case it is not enabled:

TEST-74-AUX-UTILS.sh[1594]: ++ systemctl is-enabled systemd-report-basic.socket
TEST-74-AUX-UTILS.sh[1540]: + [[ disabled == enabled ]]
TEST-74-AUX-UTILS.sh[120]: + echo 'Subtest /usr/lib/systemd/tests/testdata/units/TEST-74-AUX-UTILS.report.sh failed'

Follow-up for 4409e52
Addition to PR systemd#41181. Plasma-workspace OSD notifications about turning the touchpad on and off are guided by F21. When this match is specified, KDE notifies on this laptop that the on/off switch of the touchpad state is pressed. Fixes dmesg: atkbd serio0: Unknown key pressed (translated set 2, code 0xc1 on isa0060/serio0).
BTRFS_IOC_TREE_SEARCH is only available to root in the initial userns. This means we fail to recursively snapshot even if a subvolume has no nested subvolumes at the moment. Let's fix this by using the newer btrfs ioctls, which work even if we don't have CAP_SYS_ADMIN in the initial userns.
…d TLV data (systemd#41802) These are not used yet, but will be used later for parsing/building network packets such as DHCP messages.
Traditionally, asynchronous programming in systemd has been achieved using sd-event along with the asynchronous interfaces of sd-bus and sd-varlink. This works well when the system is reacting to events and all code triggered by those events can run without blocking. In these scenarios, the global Manager object is passed as userdata to the callback, and the callback can use the stack as usual, declaring local state and ensuring proper cleanup via _cleanup_. Control flow structures, such as loops, work as expected, and everything runs smoothly.

However, challenges arise when the code needs to perform long-running operations within these callbacks. Since the system cannot block execution within the callback, we can't directly invoke a long-running operation and wait for its result without introducing complexities. Instead, we need to initiate the long-running task, register for completion with sd-event, sd-bus, or sd-varlink, and provide a callback to be invoked when the operation completes. This callback, however, only receives a single userdata pointer, which forces us to bundle all local variables into a struct and pass it along as part of the callback.

On top of that, after queuing the asynchronous operation, the caller continues executing. As the caller's stack unwinds when the function exits, the resources and state within the local scope may be prematurely cleaned up. Therefore, the struct must store copies of the local variables or ensure proper reference counting to prevent premature resource cleanup.

When multiple long-running operations need to be initiated within a loop, the complexity grows further. We must introduce additional shared state to track the completion of all operations before we can run any code that depends on their results.
Furthermore, since the daemon may be shut down at any time, we must track the lifecycle of each long-running operation in the global Manager struct, ensuring proper cleanup even when stack unwinding can no longer manage the resources for us.

Fibers, or green threads, provide a more natural way of handling asynchronous operations. By enabling cooperative multitasking within a single thread, fibers allow us to write code that looks like it's running synchronously, but with the ability to yield control at predefined points, such as when waiting for long-running tasks to complete.

With fibers, we can simplify the control flow by running asynchronous operations within a fiber, allowing us to "pause" execution while waiting for the long-running operation to finish and then "resume" the operation once it's complete. This eliminates the need for multiple callback chains, extensive state tracking, and the potential pitfalls of stack unwinding.

This commit introduces the ability to execute long-running operations in a non-blocking manner while maintaining the simplicity and readability of synchronous code. The fiber-based approach will significantly improve the handling of complex workflows, making the code easier to write and maintain.

The implementation is based on ucontext.h's makecontext() (with a fallback to the venerable sigaltstack() approach on musl), sigsetjmp()/siglongjmp() and sd-event. ucontext.h provides us with alternate stacks that we can switch between. We use sigsetjmp()/siglongjmp() instead of swapcontext() because the latter forcibly saves/restores a per-context signal mask every time it is called. Using sigsetjmp()/siglongjmp(), we can avoid the unnecessary syscall and maintain a per-thread signal mask, which makes much more sense than having a per-fiber signal mask. The default stack size is the same as that of a regular thread.
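The underlying ucontext.h mechanism can be demonstrated with a toy, standard-POSIX example. This is not the sd-fiber code: it uses swapcontext() for both directions and omits the sigsetjmp()/siglongjmp() refinement the commit describes, but it shows how makecontext() gives a fiber its own stack and how control transfers cooperatively.

```c
#include <ucontext.h>

/* One "main" context and one fiber context; steps[] records the order
 * in which the two contexts run. */
static ucontext_t main_ctx, fiber_ctx;
static int steps[4], n_steps;

static void fiber_fn(void) {
        steps[n_steps++] = 2;
        swapcontext(&fiber_ctx, &main_ctx);    /* "yield" back to the scheduler */
        steps[n_steps++] = 4;
        /* returning ends the fiber; uc_link resumes main_ctx */
}

static void run_demo(void) {
        static char stack[64 * 1024];          /* the fiber's private stack */

        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = stack;
        fiber_ctx.uc_stack.ss_size = sizeof stack;
        fiber_ctx.uc_link = &main_ctx;         /* where to go when fiber_fn returns */
        makecontext(&fiber_ctx, fiber_fn, 0);

        steps[n_steps++] = 1;
        swapcontext(&main_ctx, &fiber_ctx);    /* start the fiber */
        steps[n_steps++] = 3;
        swapcontext(&main_ctx, &fiber_ctx);    /* resume it until it finishes */
}
```

Execution interleaves 1 (main), 2 (fiber), 3 (main after the yield), 4 (fiber after resume); the real implementation replaces the raw swapcontext() switches with sigsetjmp()/siglongjmp() to avoid the per-switch signal-mask syscalls.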
Because we use mmap() to allocate the stack, the memory won't actually be used until it is paged in by the kernel, so we don't actually use 8MB per fiber.

To integrate fibers with the event loop, each fiber is assigned a deferred event source which resumes the fiber when enabled. The deferred event source is oneshot by default so the fiber will run immediately until it yields or suspends. If it yields, the deferred event source is enabled again (oneshot) immediately. If it suspends, before it suspends, one or more event sources are registered with sd-event that will enable the deferred event source (oneshot) to resume the fiber once the operation it is waiting for completes.

Yielding or suspending the fiber is done by calling sd_fiber_yield() or sd_fiber_suspend() respectively. Both of these return zero on success or any error value from the async operation that caused the fiber to resume. This is also how fiber cancellation is implemented. When a fiber is cancelled, sd_fiber_yield() and sd_fiber_suspend() will return ECANCELED when the fiber is resumed, allowing the fiber to unwind its stack (which allows cleanup to happen automatically) and finish.

Instead of having applications work directly with fibers, we hide them behind a generic futures interface to represent long-running operations, regardless of whether those operations are running on a fiber or not. Aside from fibers, the futures library (sd-future) will for example allow waiting for sd-event sources and doing sd-bus calls in the background as well. Fibers can suspend until a future is ready with sd_fiber_await() or by having the future wake up the fiber explicitly in its callback. A future always defaults to waking up the current fiber.

Each future kind plugs into the library by providing an sd_future_ops vtable (alloc, free, cancel, set_priority). The library treats the impl pointer returned by alloc() as a black box. Future implementations retrieve it via sd_future_get_private().
A future starts in SD_FUTURE_PENDING and transitions exactly once to SD_FUTURE_RESOLVED, carrying an integer result. Consumers can react to that transition either by installing a one-shot callback with sd_future_set_callback() (callback-style code) or by waiting on it from a fiber via sd_fiber_await() (synchronous-looking fiber code). sd_fiber_await() is itself built on a "wait future" that resolves when its target resolves; sd_future_new_wait() exposes the same primitive directly so non-fiber callers can chain futures without involving a fiber.

Cancellation is cooperative: sd_future_cancel() invokes the future impl's cancel callback, which is responsible for tearing down its work and ultimately resolving the promise with -ECANCELED. For fiber futures this is what surfaces as the ECANCELED return from sd_fiber_yield()/sd_fiber_suspend() mentioned above.

Fire-and-forget fibers — created by passing a NULL ret to sd_fiber_new() — take a self-reference on their future so they outlive the caller's scope. The self-ref is dropped when the fiber resolves. This floating mechanism (sd_fiber_set_floating()) is restricted to fiber futures because they uniquely guarantee resolution; allowing it for arbitrary future kinds would risk silent leaks for kinds that may never resolve.

Note that fiber cleanup depends on the runtime operating normally. Each fiber's _cleanup_-style cleanups live on the fiber's own stack and run only when the fiber is resumed and allowed to unwind, which requires a working event loop to drive it to completion. The exit event source registered for top-level fibers ensures unwind on a normal sd_event_exit(), but if the event loop itself terminates abnormally (e.g. an unrecoverable allocation failure mid-dispatch) before all fibers have resolved, their stacks never unwind and any resources they own leak.
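The pending-to-resolved lifecycle might be modeled minimally like this. This is a hypothetical miniature, without the refcounting, vtables, cancellation, and fiber integration of the real sd-future; it only shows the single PENDING→RESOLVED transition and the one-shot callback semantics.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FUTURE_PENDING, FUTURE_RESOLVED } FutureState;

typedef struct Future Future;
typedef void (*future_callback_t)(Future *f, int result, void *userdata);

/* Miniature future: a one-way state machine carrying an int result. */
struct Future {
        FutureState state;
        int result;
        future_callback_t callback;
        void *userdata;
};

static void future_set_callback(Future *f, future_callback_t cb, void *userdata) {
        f->callback = cb;
        f->userdata = userdata;
        if (f->state == FUTURE_RESOLVED && cb)
                cb(f, f->result, userdata);    /* already resolved: fire immediately */
}

static void future_resolve(Future *f, int result) {
        assert(f->state == FUTURE_PENDING);    /* transitions exactly once */
        f->state = FUTURE_RESOLVED;
        f->result = result;
        if (f->callback)
                f->callback(f, result, f->userdata);
}

/* Demo consumer used below. */
static int last_result = -1;
static void note_result(Future *f, int result, void *userdata) {
        (void) f;
        (void) userdata;
        last_result = result;
}
```

The "fire immediately if already resolved" branch in future_set_callback() is what lets callers install the callback before or after resolution without racing the transition, mirroring the one-shot callback behaviour described above.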
The code lives in libsystemd as sd-future (not exported) for the following reasons:
- We may want to make this a public libsystemd API in the future
- The code can't live in src/basic as it makes heavy use of sd-event
- The code can't live in src/shared as sd-bus and sd-event make use of it

The log and log-context headers are updated with functions to allow fibers to have their own log prefix and log context.
Add a family of sd_fiber_*() I/O wrappers that, when called from a
fiber, behave like blocking I/O from the caller's perspective but
yield to the event loop instead of blocking the thread:
sd_fiber_read / sd_fiber_write
sd_fiber_readv / sd_fiber_writev
sd_fiber_recv / sd_fiber_send
sd_fiber_connect
sd_fiber_recvmsg / sd_fiber_sendmsg
sd_fiber_recvfrom / sd_fiber_sendto
sd_fiber_accept
sd_fiber_ppoll
Most of them share a single helper, fiber_io_operation(), which when
invoked outside a fiber falls through to the underlying syscall
directly, preserving the regular blocking behaviour. Inside a fiber
the helper flips the fd to non-blocking (restoring its original mode
on return), tries the syscall once on the fast path, and on EAGAIN/
EWOULDBLOCK creates an sd-event-backed IO future via future_new_io(),
suspends the fiber, and retries the syscall once the event source
fires.
future_new_io() itself is added to sd-event/event-future.{c,h} as a
new IoFuture kind. It wraps sd_event_add_io() into an sd_future:
oneshot enable, EPOLLERR translated via SO_ERROR (suppressed for
non-sockets), and the fd duplicated with F_DUPFD_CLOEXEC to avoid
EEXIST when multiple sources watch the same descriptor.
Together these let fiber-using code write straight-line socket and
pipe I/O without bundling state into callbacks.
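The fast-path/slow-path shape of the shared helper could be sketched like so. All names are hypothetical; in particular the suspend callback here merely stands in for creating an IO future via future_new_io() and suspending the fiber until the event source fires.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

typedef void (*suspend_fn)(int fd, void *userdata);

/* Illustrative shape of fiber_io_operation() for read():
 * flip the fd to non-blocking, try the syscall once, and on
 * EAGAIN/EWOULDBLOCK "suspend" until the fd is ready, then retry. */
static ssize_t fiberish_read(int fd, void *buf, size_t n, suspend_fn suspend, void *userdata) {
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
                return -errno;
        if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
                return -errno;

        ssize_t r;
        for (;;) {
                r = read(fd, buf, n);                  /* fast path: try once */
                if (r >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
                        break;
                suspend(fd, userdata);                 /* slow path: wait for readiness */
        }
        ssize_t ret = r >= 0 ? r : -errno;

        (void) fcntl(fd, F_SETFL, flags);              /* restore the original mode */
        return ret;
}

/* Demo "suspend": make the fd readable, as if the IO future had fired. */
static void demo_fill(int fd, void *userdata) {
        int *write_fd = userdata;
        (void) fd;
        (void) !write(*write_fd, "hi", 2);
}
```

In the real helper the suspend step hands control back to the event loop, so other fibers keep running while this one waits; the demo callback simply satisfies the read synchronously to keep the sketch self-contained.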
Some helpers in src/basic — ppoll_usec_full() (used by fd_wait_for_event()), loop_read(), loop_read_exact(), loop_write_full() and pidref_wait_for_terminate_full() — block the calling thread. That's the right behaviour outside a fiber but not inside one, where blocking the thread also stalls every other fiber running on the same event loop. Rewriting every caller to pick a fiber or non-fiber variant explicitly would be a lot of churn and would split otherwise-shared code paths in two.

Instead, the helpers detect at runtime whether they're running on a fiber and dispatch to a suspending variant when they are. FiberOps in fiber-ops.h holds five function pointers (ppoll, read, write, timeout, cancel_wait_unref); a fiber_ops global constant is populated whenever we enter a fiber with functions that delegate to suspending variants of common syscalls. With this approach, the variants themselves stay in libsystemd, which is required because they make use of sd-event.

- loop_read()/loop_read_exact() take the fiber read hook on a fiber unless the caller asked for a non-blocking attempt (do_poll=false) and the fd is already non-blocking — in that case we fall through to read() to preserve the existing return-EAGAIN-immediately semantics. The hook itself suspends on EAGAIN until data is available, so neither the do_poll knob nor the explicit fd_wait_for_event() retry loop is needed on the fiber path.
- loop_write_full() likewise takes the fiber write hook on a fiber, except when timeout=0 with an already-non-blocking fd (preserving the fast-return-EAGAIN semantics). The fiber path runs inside a FIBER_OPS_TIMEOUT() scope so the caller's timeout is honoured via a deadline future, mirroring SD_FIBER_TIMEOUT() but reachable from src/basic without pulling in sd-future.h.
- pidref_wait_for_terminate_full() polls the pidfd via fd_wait_for_event() before each waitid() when either a finite timeout is set or we're on a fiber, and requires pidref->fd >= 0 in those cases (returning -ENOMEDIUM otherwise — extending the rule that already applied to finite timeouts). The poll suspends the fiber via the ppoll hook above; the subsequent waitid() doesn't block because the pidfd is already signalled.
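The runtime-dispatch idea — a global ops table that is only populated while a fiber is running, with a fall-through to the plain blocking path otherwise — can be illustrated with a stand-alone sketch. The names and the integer return codes are illustrative, not the real FiberOps layout.

```c
#include <stddef.h>

/* Hypothetical single-entry ops table; the real FiberOps carries five
 * function pointers (ppoll, read, write, timeout, cancel_wait_unref). */
typedef struct FiberOps {
        int (*sleep_for)(unsigned usec);       /* suspending variant */
} FiberOps;

/* Set on fiber entry, cleared on fiber exit. */
static const FiberOps *fiber_ops = NULL;

static int blocking_sleep(unsigned usec) {     /* stand-in for the thread-blocking path */
        (void) usec;
        return 1;                              /* 1 = took the blocking path */
}

static int suspending_sleep(unsigned usec) {   /* stand-in for the fiber-suspending path */
        (void) usec;
        return 2;                              /* 2 = took the suspending path */
}

/* The shared helper: callers never know which variant actually ran. */
static int sleep_for(unsigned usec) {
        if (fiber_ops)
                return fiber_ops->sleep_for(usec);
        return blocking_sleep(usec);
}

static const FiberOps demo_ops = { .sleep_for = suspending_sleep };
```

Because the dispatch happens inside the helper, callers in src/basic stay untouched while the suspending variants themselves can live in libsystemd next to sd-event, which is the layering constraint the commit message describes.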
…iber sd_event_run() blocks the calling thread on the event loop's epoll fd until something happens. When the caller is a fiber, that's the wrong behaviour: blocking the thread also stalls every other fiber and the outer event loop driving them. The most common way to hit this is a fiber that creates its own inner event loop (e.g. a server-style fiber that wants to dispatch its own sources independently of whatever loop the test or supervising fiber is running on) — with the existing implementation the inner sd_event_run() would hold the thread while the outer scheduler should be free to advance other fibers.

Add an event_run_suspend() variant in sd-event/event-future.c that performs the same prepare/wait/dispatch dance, but when the fast path finds nothing ready it (a) creates an IO future watching the inner event loop's epoll fd on the *outer* event loop, (b) optionally creates a time future for the timeout, and (c) suspends the fiber. When either future fires the fiber is resumed and the prepare/wait/dispatch sequence runs once more to actually dispatch what's pending.

sd_event_run() checks sd_fiber_is_running() and delegates to this variant when on a fiber; profile_delays accounting is intentionally skipped on that path since the underlying prepare/wait/dispatch primitives already account for themselves.
Two changes to teach sd-bus how to behave when called from a fiber, in
order of increasing depth:
1. sd_bus_call() now redirects to a new bus_call_suspend() helper when
the caller is a fiber whose event loop is the same one the bus is
attached to. The plain bus_poll() path serializes all bus traffic on
the slot's reply (only one method call can be in flight per
sd_bus*), which would defeat the point of running multiple fibers
against one bus. bus_call_suspend() builds on the async sd-bus API:
it wraps the call in a new BusFuture (sd-bus/bus-future.{c,h}) that
resolves when the reply or method-error arrives, lets the fiber
await that future, and surfaces the reply to the caller via
future_get_bus_reply(). Because the futures live on the event loop
rather than a per-bus slot, multiple fibers can drive concurrent
method calls against the same bus.
2. A new private SD_BUS_VTABLE_METHOD_FIBER flag dispatches a vtable
method handler on its own fiber, so handlers are free to use
sd_bus_call() against the same bus, sd_fiber_sleep(), loop_read(),
etc. without stalling the event loop for other connections or
handlers. The flag stays out of sd-bus-vtable.h (its bit value is
reserved there to prevent collisions) — the fiber runtime is a
systemd-internal implementation detail.
Lifecycle of fiber-dispatched handlers is tracked on the bus itself: a
new bus->fiber_futures set holds a ref to each in-flight handler.
bus_enter_closing() cancels every entry and process_closing() returns
with the bus still in CLOSING state until the set drains, so we can be
sure no fiber handler outlives the bus. bus_fiber_resolved() removes
the entry on completion. bus_free()'s assert(set_isempty()) makes the
invariant load-bearing.
Note that plain sd_bus_call() already works correctly on a fiber as it
calls ppoll_usec() which has already been modified to suspend when
running on a fiber.
To exercise these changes the existing thread-based client/server
sd-bus tests (test-bus-chat, test-bus-objects, test-bus-peersockaddr,
test-bus-server, test-bus-watch-bind) are migrated to fibers, and a
new test-bus-fiber is added that covers SD_BUS_VTABLE_METHOD_FIBER —
including handlers that issue nested sd_bus_call() on the same bus, the
cancel-on-close path, and concurrent dispatches across multiple fibers.
Add varlink_server_bind_fiber() and varlink_server_bind_fiber_many()
in varlink-util.{c,h} for registering a method handler that should
run on a dedicated fiber per dispatch. The fiber-bound methods live
in a separate s->fiber_methods map alongside the regular s->methods;
bind_internal()/bind_many_internal() are factored out so the regular
and fiber bind variants share their parsing/insertion code.
Registering the same method in both maps is rejected because the
dispatcher consults the regular map first and would otherwise
silently shadow the fiber binding.
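The dual-map registration rule can be modeled in miniature. The names and the array-backed "maps" are illustrative (the real code uses hashmaps); the point is only the reject-if-present-in-either-map check that prevents the regular map from silently shadowing a fiber binding.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_METHODS 8

/* Toy stand-in for s->methods and s->fiber_methods. */
typedef struct {
        const char *regular[MAX_METHODS];
        const char *fiber[MAX_METHODS];
        size_t n_regular, n_fiber;
} MethodMaps;

static int contains(const char *const *list, size_t n, const char *name) {
        for (size_t i = 0; i < n; i++)
                if (strcmp(list[i], name) == 0)
                        return 1;
        return 0;
}

/* Shared bind path: reject a name already present in either map, since
 * dispatch consults the regular map first and would shadow the fiber one. */
static int bind_method(MethodMaps *m, const char *name, int on_fiber) {
        if (contains(m->regular, m->n_regular, name) ||
            contains(m->fiber, m->n_fiber, name))
                return -EEXIST;

        if (on_fiber) {
                if (m->n_fiber >= MAX_METHODS)
                        return -ENOSPC;
                m->fiber[m->n_fiber++] = name;
        } else {
                if (m->n_regular >= MAX_METHODS)
                        return -ENOSPC;
                m->regular[m->n_regular++] = name;
        }
        return 0;
}
```

Factoring the check into the shared bind path (as bind_internal()/bind_many_internal() do in the commit) means both the regular and fiber variants enforce the rule without duplicating it.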
varlink_dispatch_fiber() builds a VarlinkFiberData (refs to the
connection, parameters, and method name), spawns a fiber via
sd_fiber_new(), and makes the future floating so the fiber
self-manages its lifetime — neither the dispatcher nor the
connection has to track it. The fiber's priority is set to one
below the connection's quit event source so that on graceful
shutdown the fiber's exit handler fires (and runs its cleanup)
before varlink's quit_callback() closes the connection underneath
it; this is what lets a fiber-bound handler reply or flush its
sentinel on a still-open connection during shutdown.
The connection state transitions are reordered so they happen before
the fiber spawn rather than after the synchronous callback returns:
the fiber runs after dispatch has already moved past PROCESSING, which
matches the behaviour expected for a deferred reply (the fiber may
either reply immediately, or stash the connection and reply later, in
which case the post-callback logic treats it as a PENDING_METHOD).
Note that all the synchronous varlink APIs (sd_varlink_call() and friends)
already behave properly when on a fiber because they call json_stream_wait()
which calls ppoll_usec() which we already fixed to suspend when called from
a fiber.
The client/server varlink tests are migrated to fibers (threads → mock
server fibers on the same event loop) to exercise the new paths.
The synchronous qmp_client_call() pumps the event loop until its reply arrives, pinning the parsed reply on c->current so it can hand out borrowed pointers to the caller. That model only fits one in-flight sync call: a second qmp_client_call() on the same client clears c->current before issuing its own send, invalidating the first caller's borrowed pointers. On a single-threaded event loop that was fine, but with fibers two concurrent calls on the same client can interleave through the pump (json_stream_wait() suspends the running fiber) and trample each other.

To fix this, make qmp_client_call() detect when it's running on a fiber whose event loop matches the client and transparently delegate to qmp_client_call_suspend(), which makes use of a new QmpFuture to allow multiple concurrent calls to qmp_client_call().

To make this work concurrently, we also change qmp_client_call() to hand out references and copies of errors so that we don't have to store the borrowed pointers we hand out in the QmpClient struct.
The mock servers used to be driven out-of-band: each test created a
socketpair, forked a child, ran a hand-coded request/response script
against the raw fd, and sent SIGTERM to tear it down. That worked but
required pidref/process-util/signal plumbing in every test, two
distinct execution contexts that couldn't share state, and a JsonStream
attached to the mock side that pretended to be event-loop-driven while
actually being driven manually via blocking reads.
Now that JsonStream suspends when on a fiber, the mocks can live
inside the same process and event loop as the client. Each mock is
rewritten as an sd-fiber that runs alongside the client fiber: the
mock fiber yields on I/O and the event loop schedules the client in the
meantime. Both sides progress cooperatively; no fork/SIGTERM/PID tracking,
no manual phase tracking.
Two cleanups fall out of the rewrite:
- A QMP_TEST(name, mock_fn) { ... } macro encapsulates the per-test
scaffolding (event loop, socketpair, mock fiber spawn, exit-on-idle
shim) and injects an already-connected QmpClient *client into the
test body. Each test now reads as a flat sequence of
qmp_client_call() invocations against that client.
- Repeated mock command/reply scripting is factored into
mock_qmp_expect(), mock_qmp_reply(), mock_qmp_expect_and_reply(),
mock_qmp_handshake(), and mock_qmp_query_status_running(). The
greeting JSON is built with sd_json_buildo() instead of being parsed
from a literal.
The file shrinks from 756 to 494 lines, mostly through deletions.