Skip to content

libstore: Do not mark connections as bad when RemoteStore::narFromPath is called as a coroutine#14998

Merged
edolstra merged 3 commits into
masterfrom
fix-remote-store-nar-from-path
Jan 20, 2026
Merged

libstore: Do not mark connections as bad when RemoteStore::narFromPath is called as a coroutine#14998
edolstra merged 3 commits into
masterfrom
fix-remote-store-nar-from-path

Conversation

@xokdvium
Copy link
Copy Markdown
Contributor

@xokdvium xokdvium commented Jan 14, 2026

Motivation

forced_unwind is thrown by Boost.Context when destroying the coroutine.
This lead to us resetting the remote connection for each narFromPath
with the ssh-ng:// store, so copying was very slow.

Also fixes some interrupt issues with ssh-ng store.

Context

Fixes #6950.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions Bot added the store Issues and pull requests concerning the Nix store label Jan 14, 2026
@xokdvium xokdvium force-pushed the fix-remote-store-nar-from-path branch from b0069ce to 5f5bc27 Compare January 14, 2026 23:37
@xokdvium xokdvium added performance remote build The SSH store, ssh:, ssh-ng:, ... (split from protocol label 2024-07) labels Jan 14, 2026
@edolstra
Copy link
Copy Markdown
Member

Hm, I'm not sure this is sound? If the coroutine is abandoned before the entire NAR has been copied (e.g. because the consumer of the coroutine throws an exception half-way through), won't that leave the connection in an undefined state? We only want to ignore the unwind exception if it occurs after the entire NAR has been copied.

In general, it's annoying that coroutines that produce NARs don't run to completion, since it causes a lot of unexpected issues (e.g. 7ba8443). I wonder if there is a way to fix that generically.

@xokdvium xokdvium force-pushed the fix-remote-store-nar-from-path branch from 5f5bc27 to 07ea89d Compare January 15, 2026 19:38
@xokdvium
Copy link
Copy Markdown
Contributor Author

In general, it's annoying that coroutines that produce NARs don't run to completion

Hm, yeah I think I have a fix for that (look at 446e7f1). The reason is that they don't call read and the coroutine remains suspended, but we can opportunistically call resume the coroutine in the last read if the data is empty (that seems pretty safe, unless some code out there depends on the coroutine not completing). That way it runs to completion and we can set a flag that we've read the whole NAR.

There's also the question of what to do when the coroutine has produced more data than what we expect. Currently we just discard anything extra. I wonder if there are any bugs lurking in there.

Comment thread src/libutil/serialise.cc
Comment on lines +415 to 420
if (hasCoro && *coro) {
(*coro)();
}
if (*coro) {
cur = coro->get();
} else {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These duplicate ifs seem tricky, but first first if doesn't get run on the first iteration because hasCoro is false. We don't want to call the coroutine if it's completed on the last iteration - otherwise it would segfault.

@edolstra
Copy link
Copy Markdown
Member

If the coroutine runs to completion, then it won't get an unwind exception, right (since there's nothing to unwind)? So then withConnection() shouldn't be needed, unless I'm missing something.

@xokdvium
Copy link
Copy Markdown
Contributor Author

Hm this is quite confusing indeed. I'll double-check if just the fix to serialise.cc is enough.

…h is called as a coroutine

forced_unwind is thrown by Boost.Context when destroying the coroutine.
This lead to us resetting the remote connection for each narFromPath
with the ssh-ng:// store, so copying was very slow.
Without this we can abort by throwing an exception in the destructor:

[24/635/2958 copied (3.8/26.0 GiB)] copying path '/nix/store/ncd2iic2nwxwhqsf4gp9sdybkwnwz20b-ruby3.3-mini_portile2-2.8.9' from 'ssh-ng://localhost:22'

Nix crashed. This is a bug. Please report this at https://github.com/NixOS/nix/issues with the following information included:

Exception: nix::Interrupted: error: interrupted by the user
Stack trace:
 0# 0x00000000004AFFE9 in result/bin/nix
 1# 0x00007F946290A1AA in /nix/store/cf1a53iqg6ncnygl698c4v0l8qam5a2q-gcc-14.3.0-lib/lib/libstdc++.so.6
 2# __cxa_call_terminate in /nix/store/cf1a53iqg6ncnygl698c4v0l8qam5a2q-gcc-14.3.0-lib/lib/libstdc++.so.6
 3# __gxx_personality_v0 in /nix/store/cf1a53iqg6ncnygl698c4v0l8qam5a2q-gcc-14.3.0-lib/lib/libstdc++.so.6
 4# 0x00007F946283FA19 in /nix/store/cf1a53iqg6ncnygl698c4v0l8qam5a2q-gcc-14.3.0-lib/lib/libgcc_s.so.1
 5# _Unwind_RaiseException in /nix/store/cf1a53iqg6ncnygl698c4v0l8qam5a2q-gcc-14.3.0-lib/lib/libgcc_s.so.1
 6# __cxa_throw in /nix/store/cf1a53iqg6ncnygl698c4v0l8qam5a2q-gcc-14.3.0-lib/lib/libstdc++.so.6
 7# 0x00007F94635D82D0 in /nix/store/9wrnk0nizdwba4sy9lg3h0xd30pg1x5a-nix-util-2.34.0pre/lib/libnixutil.so.2.34.0
 8# nix::Pid::wait() in /nix/store/9wrnk0nizdwba4sy9lg3h0xd30pg1x5a-nix-util-2.34.0pre/lib/libnixutil.so.2.34.0
 9# nix::Pid::~Pid() in /nix/store/9wrnk0nizdwba4sy9lg3h0xd30pg1x5a-nix-util-2.34.0pre/lib/libnixutil.so.2.34.0
This avoids the wall of text like, because ThreadPool doesn't print interrupts
on shutdowns.

error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
error (ignored): opening a connection to remote store 'ssh-ng://127.0.0.1' previously failed
@xokdvium xokdvium force-pushed the fix-remote-store-nar-from-path branch from 07ea89d to b40b786 Compare January 19, 2026 19:39
@xokdvium
Copy link
Copy Markdown
Contributor Author

So then withConnection() shouldn't be needed

Yup, dropped all those changes. Resuming the coroutine in read is enough. I suppose it means that we could even revert 7ba8443.

@edolstra edolstra added this pull request to the merge queue Jan 20, 2026
Merged via the queue into master with commit 4a267f7 Jan 20, 2026
18 checks passed
@edolstra edolstra deleted the fix-remote-store-nar-from-path branch January 20, 2026 16:52
brittonr pushed a commit to brittonr/nix that referenced this pull request Apr 1, 2026
libstore: Do not mark connections as bad when RemoteStore::narFromPath is called as a coroutine
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance remote build The SSH store, ssh:, ssh-ng:, ... (split from protocol label 2024-07) store Issues and pull requests concerning the Nix store

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nix copy --from ssh-ng:// is very slow, even if the path already exists locally

2 participants