
Conversation


@davepacheco davepacheco commented Jul 31, 2025

Fixes #8504.

This implementation adds:

  • an internal API to begin the quiesce process and an omdb command to call it
  • an internal API to check on the quiesce status and show what's keeping it from finishing (for debugging), plus an omdb command to call it

The implementation goes through a sequence of stages:

  • normal operation
  • no new sagas allowed, existing sagas can keep running and all database activity is allowed
  • no new database connections can be acquired, existing database activity can proceed
  • no sagas and no database activity at all (fully quiesced)
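
To make the progression concrete, the stages above could be pictured as an ordered enum along these lines (illustrative only; these are not the PR's actual type or variant names):

```rust
/// Illustrative sketch of the quiesce stages described above (not the PR's
/// actual types): each stage permits strictly less activity than the last.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum QuiesceStage {
    /// normal operation
    Running,
    /// no new sagas; existing sagas and all database activity may proceed
    DrainingSagas,
    /// no new database claims; existing database activity may proceed
    DrainingDb,
    /// no sagas and no database activity at all
    Quiesced,
}

fn main() {
    // The derived ordering reflects the one-way progression.
    assert!(QuiesceStage::Running < QuiesceStage::DrainingSagas);
    assert!(QuiesceStage::DrainingSagas < QuiesceStage::DrainingDb);
    assert!(QuiesceStage::DrainingDb < QuiesceStage::Quiesced);
}
```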

Much of the diff here relates only to tests. If it's helpful, I could break this PR into smaller pieces, but I think that might actually be more confusing because it would be harder to see how the pieces fit together.

@davepacheco davepacheco requested a review from smklein July 31, 2025 22:06
@davepacheco davepacheco self-assigned this Jul 31, 2025
Comment on lines +57 to +63
// `DataStoreConnection` is used by various packages that we'd like to not
// depend on `nexus-db-queries` (in order to parallelize compilation).
// However, we need to do some datastore-specific work around the lifecycle
// of this object (i.e., when it gets instantiated and when it gets
// dropped). To achieve this, the caller in `nexus-db-queries` provides a
// `releaser` whose sole purpose is to be dropped when this object is
// dropped, allowing it to do the needed cleanup there.
Collaborator Author

For context, this package was created in #7985 to better parallelize Nexus compilation. This consequence feels pretty gross but I'm not sure how better to deal with this.
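
To illustrate the `releaser` pattern described in the quoted comment (a drop guard supplied by the caller), here's a minimal standalone sketch; all names here are made up, not the actual types:

```rust
// Minimal sketch of the "releaser" pattern: the lower-level connection type
// holds an opaque object whose Drop impl does the datastore-side cleanup.
struct DataStoreConnection<R> {
    // ... the wrapped qorb claim would live here ...
    _releaser: R, // dropped when this connection is dropped
}

// The caller in `nexus-db-queries` supplies something like this, keeping the
// datastore-specific bookkeeping out of the lower-level crate.
struct ClaimReleaser {
    id: u64,
}

impl Drop for ClaimReleaser {
    fn drop(&mut self) {
        // e.g., remove this claim from the "claims held" tracking structure
        // so quiesce can observe that it has been released
        println!("claim {} released", self.id);
    }
}

fn main() {
    let conn = DataStoreConnection { _releaser: ClaimReleaser { id: 1 } };
    drop(conn); // prints "claim 1 released"
}
```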

pub async fn claim(&self) -> Result<DataStoreConnection, Error> {
    let id = self.next_id.fetch_add(1, Ordering::SeqCst);
    let held_since = Utc::now();
    let debug = Backtrace::force_capture().to_string();
Collaborator Author

This may be controversial because collecting a backtrace is considered expensive, but it feels like in the context of anything having to do with the database, this is probably a small cost. The upside is that if quiesce gets stuck (or anything else, for that matter), we'll be able to look at stack traces where outstanding database claims are held.

Collaborator

Could we measure this? What's the cost of the backtrace relative to accessing the DB connection?

I think my opinion might change if this was like, < 10% vs 100%+ of the total time to get a claim by itself.

Couple other questions:

  • Is this output meaningful for us, given that we're in an async calling context? (I assume "yes" because the task is running, just wanna make sure these don't all point to the same spot in the tokio runtime)
  • What's the size of these backtraces? Presumably we're going to have an extra one for each live DB claim now. Like latency - maybe this is acceptable! just want to know what we're dealing with

Collaborator
@smklein smklein Aug 1, 2025

Also, this is paranoia on my part, but:

Does this backtrace contain any argument info, or is it purely the callstack? I understand this is mostly for omdb-level debugging, but I don't want it to be a side-channel for exfiltrating user data.

(EDIT: seems like "no" on the arguments)

Collaborator Author

> Could we measure this? What's the cost of the backtrace relative to accessing the DB connection?

We could. To be precise, what I think matters most is how long the backtrace takes relative to the time that the connection is held, not how long it took to acquire the connection. So what I'd probably do is:

  • define some USDT probes before and after creating the backtrace
  • write a D script that traces these probes, plus the qorb claim acquired/released probes, and breaks out the time spent in these two areas
  • deploy somewhere (best would be a racklette) and trace it for, like, 10 minutes

I'm honestly not sure it's worth the effort. A cheaper thing I could do is measure how long it takes to do a few backtraces on my own dev system and make sure it's as quick as I'd hope. I'd expect this to be on the order of tens (maybe hundreds) of microseconds and I'd expect it to be dwarfed by even a basic database query round-trip.
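
For what it's worth, the cheaper dev-box measurement could be as simple as a standalone program like the following (a sketch, not part of this PR; it only times `Backtrace::force_capture()` plus formatting):

```rust
use std::backtrace::Backtrace;
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Warm up once in case the first capture/format pays one-time costs
    // (e.g., loading symbol information).
    black_box(Backtrace::force_capture().to_string());

    let iterations: u32 = 100;
    let start = Instant::now();
    for _ in 0..iterations {
        // Capture and render to a String, matching what claim() does.
        black_box(Backtrace::force_capture().to_string());
    }
    println!("average capture+format: {:?}", start.elapsed() / iterations);
}
```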

> Is this output meaningful for us, given that we're in an async calling context? (I assume "yes" because the task is running, just wanna make sure these don't all point to the same spot in the tokio runtime)

Yup. Here's an example:

    claim 1248 held since 2025-08-01 21:17:03.232936025 UTC (2s 7ms ago)
    acquired by:
           0: {async_fn#0}
                     at ./nexus/db-queries/src/db/pool.rs:197:21
           1: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/mod.rs:379:27
           2: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/mod.rs:368:45
           3: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/deployment.rs:959:59
           4: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/deployment.rs:2294:70
           5: {async_block#0}
                     at ./nexus/src/app/background/tasks/blueprint_load.rs:60:18
           6: poll<alloc::boxed::Box<(dyn core::future::future::Future<Output=serde_json::value::Value> + core::marker::Send), alloc::alloc::Global>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
           7: {async_fn#0}
                     at ./nexus/src/app/background/driver.rs:339:54
           8: {async_fn#0}
                     at ./nexus/src/app/background/driver.rs:292:62
           9: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:365:17
          10: with_mut<tokio::runtime::task::core::Stage<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}>, core::task::poll::Poll<()>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/loom/std/unsafe_cell.rs:16:9
          11: poll<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:354:13
          12: tokio::runtime::task::harness::poll_future::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:535:19
          13: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
          14: do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
          15: __rust_try
          16: try<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
          17: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
          18: tokio::runtime::task::harness::poll_future
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:523:18
          19: tokio::runtime::task::harness::Harness<T,S>::poll_inner
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:210:27
          20: tokio::runtime::task::harness::Harness<T,S>::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:155:15
          21: tokio::runtime::task::raw::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:325:5
          22: tokio::runtime::task::raw::RawTask::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:255:18
          23: tokio::runtime::task::LocalNotified<S>::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/mod.rs:509:9
          24: {closure#0}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:600:13
          25: with_budget<core::result::Result<alloc::boxed::Box<tokio::runtime::scheduler::multi_thread::worker::Core, alloc::alloc::Global>, ()>, tokio::runtime::scheduler::multi_thread::worker::{impl#1}::run_task::{closure_env#0}>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/task/coop/mod.rs:167:5
          26: budget<core::result::Result<alloc::boxed::Box<tokio::runtime::scheduler::multi_thread::worker::Core, alloc::alloc::Global>, ()>, tokio::runtime::scheduler::multi_thread::worker::{impl#1}::run_task::{closure_env#0}>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/task/coop/mod.rs:133:5
          27: run_task
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:591:9
          28: tokio::runtime::scheduler::multi_thread::worker::Context::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:539:24
          29: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:504:21
          30: tokio::runtime::context::scoped::Scoped<T>::set
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context/scoped.rs:40:9
          31: tokio::runtime::context::set_scheduler::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context.rs:176:26
          32: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(), tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure_env#0}>, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/local.rs:315:12
          33: std::thread::local::LocalKey<T>::with
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/local.rs:279:15
          34: tokio::runtime::context::set_scheduler
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context.rs:176:9
          35: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:499:9
          36: tokio::runtime::context::runtime::enter_runtime
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context/runtime.rs:65:16
          37: tokio::runtime::scheduler::multi_thread::worker::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:491:5
          38: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:457:45
          39: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/task.rs:42:21
          40: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:365:17
          41: with_mut<tokio::runtime::task::core::Stage<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>>, core::task::poll::Poll<()>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/loom/std/unsafe_cell.rs:16:9
          42: poll<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:354:13
          43: tokio::runtime::task::harness::poll_future::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:535:19
          44: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
          45: do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
          46: __rust_try
          47: try<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
          48: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
          49: tokio::runtime::task::harness::poll_future
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:523:18
          50: tokio::runtime::task::harness::Harness<T,S>::poll_inner
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:210:27
          51: tokio::runtime::task::harness::Harness<T,S>::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:155:15
          52: tokio::runtime::task::raw::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:325:5
          53: tokio::runtime::task::raw::RawTask::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:255:18
          54: tokio::runtime::task::UnownedTask<S>::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/mod.rs:546:9
          55: tokio::runtime::blocking::pool::Task::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/pool.rs:161:9
          56: run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/pool.rs:516:17
          57: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/pool.rs:474:13
          58: __rust_begin_short_backtrace<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/sys/backtrace.rs:152:18
          59: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/mod.rs:559:17
          60: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
          61: do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
          62: __rust_try
          63: try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
          64: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
          65: {closure#1}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/mod.rs:557:30
          66: core::ops::function::FnOnce::call_once{{vtable.shim}}
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
          67: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
                     at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1966:9
          68: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
                     at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1966:9
          69: std::sys::pal::unix::thread::Thread::new::thread_start
                     at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/pal/unix/thread.rs:97:17
          70: _thrp_setup
          71: <unknown>

> What's the size of these backtraces? Presumably we're going to have an extra one for each live DB claim now. Like latency - maybe this is acceptable! just want to know what we're dealing with

We're saving the serialized form, so to give an example, the above one is about 16 KiB. Let's call it 32 KiB to be conservative. Each connection can have at most one of these at a time. At 16 possible connections per backend, and 5 CockroachDB backends (not sure if I'm remembering those right), that's 2.5 MiB per Nexus process. Seems fine.
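
(Back-of-the-envelope check of those numbers, using the assumptions above rather than measured values:)

```rust
fn main() {
    let bytes_per_backtrace: u64 = 32 * 1024; // conservative 32 KiB estimate
    let connections_per_backend: u64 = 16;
    let cockroach_backends: u64 = 5;
    let total = bytes_per_backtrace * connections_per_backend * cockroach_backends;
    // prints: 2560 KiB, i.e. ~2.5 MiB per Nexus process
    println!(
        "{} KiB, i.e. ~{} MiB per Nexus process",
        total / 1024,
        total as f64 / (1024.0 * 1024.0)
    );
}
```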

        false
    } else {
        q.claims_held
            .insert_unique(HeldDbClaimInfo { id, held_since, debug })
Collaborator

this is probably fine, but to say it explicitly - this is a capture of where the claim started, not where the claim may be "stuck in-usage", right?

Collaborator Author

Correct. omdb reports this as acquired by <stack trace> (i.e., "acquired by this code path").


smklein commented Aug 1, 2025

As a higher-level comment on the semantics we want:

  • For DB connections, this PR 100% makes sense to me
  • For sagas, it seems like we're saying "Quiesce is complete for this Nexus when no sagas are actively running"

For sagas, is that sufficient? E.g., if we have a saga that partially completed, but "this nexus isn't trying to run it" (or it was, but it crashed and rebooted) couldn't we have a system where all Nexuses identify "I'm not running sagas NOW", but there's still on-disk state identifying a partially executed saga?

Comment on lines +1481 to +1483
async fn quiesce_start(
    rqctx: RequestContext<Self::Context>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
Collaborator

Should we be concerned about the case where:

  • Nexus is quiescing (old Nexuses are running, as are new Nexuses)
  • The old nexuses all quiesce, but one of them, for whatever reason, thinks that quiescing isn't done. Maybe it can't hear back from other old nexuses for some reason.
  • The new nexuses boot, perform schema changes, resume their lives
  • The old nexus retries: it reads internal DNS, sees records for all the new nexuses, and tells them to quiesce
  • Now the system has an outage

Should we have an argument here, that's "only start quiescing if you are ________", where "_______" might be "running an old version", or something like that? Just something to make it conditional?

Collaborator Author

This API is really for demo, dev, and testing. My expectation is not that it will ever be invoked in a deployed system. This means the situation you're describing can't happen because the old Nexus would never tell the new Nexus instances to quiesce.

I expect it to work something like this:

  • Nexus decides it's time to quiesce (probably in the Reconfigurator executor) and writes quiesce_started to the database.
  • Nexus invokes Nexus::quiesce_start() (maybe from the same code path).
  • Nexus polls the other Nexus instances to see when they've finished quiescing.
  • When they have all finished, Nexus writes quiesce_done.

I'm not sure that's exactly right but my main point is that Nexus is deciding for itself when to quiesce itself, and that code will be calling this code. It won't be some external thing telling it to do this.

Collaborator

> my main point is that Nexus is deciding for itself when to quiesce itself

ack, makes sense. This is the key piece - we should have enough context when observing our own "expected version" vs the db_metadata table to make this call in the immediate future.

// appears before messages from code paths that saw this change.
info!(&self.log, "starting db pool quiesce");
self.quiesce.send_modify(|q| {
    q.new_claims_allowed = ClaimsAllowed::Disallowed;
Contributor

Do we care if we were already in the Disallowed state?

Collaborator Author

I don't think it matters. With this sort of API you can either make it idempotent (like this: do nothing if it's already done) or have it fail, communicate this case, and let callers implement the idempotence on top if they want. I don't see a reason to do the latter here. (In practice, there is only one caller, and it can only ever call it once in the lifetime of the process. If it did call it more than once, this would work fine.)
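
To illustrate the idempotent style described here (do nothing if it's already done), a minimal standalone sketch using a tokio watch channel; the names are simplified stand-ins, not the PR's actual types:

```rust
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ClaimsAllowed {
    Allowed,
    Disallowed,
}

struct Pool {
    quiesce: watch::Sender<ClaimsAllowed>,
}

impl Pool {
    fn start_quiesce(&self) {
        // send_if_modified only notifies receivers when the closure reports a
        // change, so calling this more than once is a harmless no-op.
        self.quiesce.send_if_modified(|state| {
            if *state == ClaimsAllowed::Disallowed {
                false
            } else {
                *state = ClaimsAllowed::Disallowed;
                true
            }
        });
    }
}

fn main() {
    let (tx, rx) = watch::channel(ClaimsAllowed::Allowed);
    let pool = Pool { quiesce: tx };
    pool.start_quiesce();
    pool.start_quiesce(); // second call changes nothing
    assert_eq!(*rx.borrow(), ClaimsAllowed::Disallowed);
}
```

(The snippet above uses `send_modify`, which is just as idempotent for a plain assignment; `send_if_modified` only makes the "already done" case explicit.)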

Collaborator Author
@davepacheco davepacheco left a comment

tl;dr: I think there are several ways this work is incomplete, ranging from "important and solvable problems" to "not so important and hard to solve problems". I still think this PR is a step forward so I think we should proceed with landing it while also fixing and filing separate issues for the other problems.


Unrelated to the discussions here, I realized that I think this PR does not account for sagas that go through recovery (i.e., those that were not created during the lifetime of this process, but rather were resumed from the database). Those won't show up in sagas_running and so Nexus could think it's quiesced even though there are (recovered) sagas still running. It would even rip the database out from under them, preventing them from finishing. This definitely needs to be fixed. I'm hopeful it will be straightforward (see below) so I'm inclined to do that in a follow-on PR (shortly) because this one's already big, most of the contents here shouldn't change, and nothing depends on quiesce yet so there's no new risk introduced by landing this one. Let me know if you disagree!

My plan here:

  • pass the saga watch::Sender<Quiesce> channel to the saga recovery background task
  • have that background task keep sagas_running up to date for recovered sagas the same way we do it for newly-created sagas (i.e., insert them into sagas_running when we resume them, then remove them when they finish)
  • add a boolean flag to that structure that reflects "saga recovery has completed at least once", then don't consider sagas quiesced until that's true and the existing conditions are true (sagas disallowed and no sagas running).
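
A rough sketch of the bookkeeping this plan describes (field and type names invented here, not the PR's actual types): sagas count as quiesced only once new sagas are disallowed, recovery has completed at least once, and nothing is running.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq)]
enum SagasAllowed {
    Allowed,
    Disallowed,
}

struct SagaQuiesceState {
    new_sagas_allowed: SagasAllowed,
    /// covers both newly-created sagas and sagas resumed by saga recovery
    sagas_running: BTreeMap<u64, String>, // saga id -> saga name (simplified)
    /// set once the saga recovery background task has completed a full pass
    recovery_completed_once: bool,
}

impl SagaQuiesceState {
    fn sagas_quiesced(&self) -> bool {
        self.new_sagas_allowed == SagasAllowed::Disallowed
            && self.recovery_completed_once
            && self.sagas_running.is_empty()
    }
}

fn main() {
    let mut state = SagaQuiesceState {
        new_sagas_allowed: SagasAllowed::Allowed,
        sagas_running: BTreeMap::new(),
        recovery_completed_once: false,
    };
    assert!(!state.sagas_quiesced()); // still accepting sagas, recovery not done

    state.new_sagas_allowed = SagasAllowed::Disallowed;
    state.recovery_completed_once = true;
    assert!(state.sagas_quiesced()); // disallowed, recovered, nothing running
}
```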

That brings me to @smklein's question:

> As a higher-level comment on the semantics we want:
>
> * For DB connections, this PR 100% makes sense to me
>
> * For sagas, it seems like we're saying "Quiesce is complete for this Nexus when no sagas are actively running"
>
> For sagas, is that sufficient? E.g., if we have a saga that partially completed, but "this nexus isn't trying to run it" (or it was, but it crashed and rebooted) couldn't we have a system where all Nexuses identify "I'm not running sagas NOW", but there's still on-disk state identifying a partially executed saga?

Under normal operation, it shouldn't be possible for there to be unfinished sagas that a Nexus is supposed to be running that it doesn't know about. Aside from bugs, there are two cases where this could happen, and both should be transient. The first case is saga recovery after a Nexus restart. Immediately on startup, Nexus will think there are no sagas running, but that's because it hasn't finished recovering them yet. This is the case that I mentioned above, and I think we have a plausible plan for fixing it.

The second case is a little trickier: it's when Nexus re-assigns sagas to itself that were owned by another Nexus that has recently been expunged. Concretely: suppose we have a sled with Nexus N1 running a saga S1. The sled bursts into flames and is expunged. During the next blueprint execution, responding to the expungement, Nexus N2 assigns saga S1 to itself. As part of a concurrent upgrade, N2 decides to quiesce itself. It hasn't actually recovered S1 so it doesn't know about it yet. It may think it's quiesced, then try to recover it. That wouldn't be great! This specific case could probably be eliminated, either by having the quiesce state reflect that there may be unrecovered sagas as a result of this situation or (probably better) disallowing us from assigning ourselves sagas once we've started quiescing. I may try to incorporate that into the follow-on PR.

This case caused me to see a deeper problem with this mechanism. This is extremely unlikely, but suppose:

  • we're doing an upgrade and decide to start quiescing
  • there are sagas running in some Nexus N1, so it's waiting for those to finish
  • there are no sagas running in any other Nexus so they've disabled their database access
  • the sled hosting N1 fails and needs to be expunged

With my suggested fixes above, there'd be no way to expunge the busted sled because the only remaining Nexus instances have shut off their database access. The upgrade would be stuck, since quiescing cannot complete without affirmative confirmation from all in-service Nexus instances. In this situation, support would need to either pause the upgrade before the handoff, get Nexus back up, do the expungement, and then proceed again; or else force the handoff (e.g., by writing quiesce_completed) and allow the saga to be abandoned. This sounds bad in that it seems hard to fix, but it's also extremely unlikely (both sled and disk failure seem quite rare in practice and this would be a pretty small window), and it is recoverable, so I'd be comfortable filing an "important non-blocker" issue on this.

This raises a related problem, which is that if one Nexus takes a lot longer to drain its sagas than the others, then the other ones are going to be pitching 503s for a long time. This suggests we might be better off having an extra step in the middle of quiesce which is "wait for other Nexus instances to report having drained their sagas before disabling database access". That way, all Nexus instances would keep doing most things (just not creating new sagas) until all sagas are done. Then they'd all shut off database access about the same time. This may well be worthwhile but I'd say that's well outside the scope of our initial upgrade MVP.


All of that said, my inclination would be:

  • land this PR as-is
  • do a follow-on PR (a blocker for actually using this work) that deals with recovered sagas and prevents sagas from being re-assigned while we're quiescing
  • file issues for:
    • dealing with needing to expunge a Nexus during quiesce
    • having Nexus instances synchronize before moving through the quiesce stages (to minimize the window of downtime and the exposure to the above problem)

Collaborator
@smklein smklein left a comment

My main holdup with this PR was "dealing with sagas in the quiesce case" - if we're planning on punting that to later, fine with me, let's just make sure we get an issue filed before merging.

@davepacheco
Collaborator Author

> All of that said, my inclination would be:
>
> * land this PR as-is

See below.

> * do a follow-on PR (a blocker for actually _using_ this work) that deals with recovered sagas and prevents sagas from being re-assigned while we're quiescing

This is now up for review: #8794.

> * file issues for:
>
>   * dealing with needing to expunge a Nexus _during_ quiesce
>   * having Nexus instances synchronize before moving through the quiesce stages (to minimize the window of downtime and the exposure to the above problem)

Filed #8795 and #8796.


I had said I wanted to land this PR as-is, but I realized what I really meant was I wanted to get it to the point where it was ready to land so that the follow-up could be reviewed independently of this one. Now that #8794 is up for review, my plan is to get that one ready to land (address any review feedback), then land both of these together. Although there shouldn't be much risk here, given the timing, I figure I'll wait for the R17 gates to open.

@davepacheco davepacheco merged commit f423987 into main Aug 12, 2025
18 checks passed
@davepacheco davepacheco deleted the dap/nexus-quiesce branch August 12, 2025 17:11