
Conversation


@davepacheco davepacheco commented Jul 31, 2025

Fixes #8504.

This implementation adds:

  • an internal API to begin the quiesce process and an omdb command to call it
  • an internal API to check on the quiesce status and show what's keeping it from finishing (for debugging), plus an omdb command to call it

The implementation goes through a sequence of stages:

  • normal operation
  • no new sagas allowed, existing sagas can keep running and all database activity is allowed
  • no new database connections can be acquired, existing database activity can proceed
  • no sagas and no database activity at all (fully quiesced)
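
To make the progression concrete, the stages above could be pictured as an ordered enum along these lines (illustrative only; these are not the PR's actual type or variant names):

```rust
/// Illustrative sketch of the quiesce stages described above (not the PR's
/// actual types): each stage permits strictly less activity than the last.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum QuiesceStage {
    /// normal operation
    Running,
    /// no new sagas; existing sagas and all database activity may proceed
    DrainingSagas,
    /// no new database claims; existing database activity may proceed
    DrainingDb,
    /// no sagas and no database activity at all
    Quiesced,
}

fn main() {
    // The derived ordering reflects the one-way progression.
    assert!(QuiesceStage::Running < QuiesceStage::DrainingSagas);
    assert!(QuiesceStage::DrainingSagas < QuiesceStage::DrainingDb);
    assert!(QuiesceStage::DrainingDb < QuiesceStage::Quiesced);
}
```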

Much of the diff here relates only to tests. If it's helpful, I could break this PR into smaller pieces, but I think that might actually be more confusing because it would be harder to see how the pieces fit together.

@davepacheco davepacheco requested a review from smklein July 31, 2025 22:06
@davepacheco davepacheco self-assigned this Jul 31, 2025
Comment on lines +57 to +63
// `DataStoreConnection` is used by various packages that we'd like to not
// depend on `nexus-db-queries` (in order to parallelize compilation).
// However, we need to do some datastore-specific work around the lifecycle
// of this object (i.e., when it gets instantiated and when it gets
// dropped). To achieve this, the caller in `nexus-db-queries` provides a
// `releaser` whose sole purpose is to be dropped when this object is
// dropped, allowing it to do the needed cleanup there.
Collaborator Author

For context, this package was created in #7985 to better parallelize Nexus compilation. This consequence feels pretty gross but I'm not sure how better to deal with this.
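
To illustrate the `releaser` pattern described in the quoted comment (a drop guard supplied by the caller), here's a minimal standalone sketch; all names here are made up, not the actual types:

```rust
// Minimal sketch of the "releaser" pattern: the lower-level connection type
// holds an opaque object whose Drop impl does the datastore-side cleanup.
struct DataStoreConnection<R> {
    // ... the wrapped qorb claim would live here ...
    _releaser: R, // dropped when this connection is dropped
}

// The caller in `nexus-db-queries` supplies something like this, keeping the
// datastore-specific bookkeeping out of the lower-level crate.
struct ClaimReleaser {
    id: u64,
}

impl Drop for ClaimReleaser {
    fn drop(&mut self) {
        // e.g., remove this claim from the "claims held" tracking structure
        // so quiesce can observe that it has been released
        println!("claim {} released", self.id);
    }
}

fn main() {
    let conn = DataStoreConnection { _releaser: ClaimReleaser { id: 1 } };
    drop(conn); // prints "claim 1 released"
}
```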

pub async fn claim(&self) -> Result<DataStoreConnection, Error> {
    let id = self.next_id.fetch_add(1, Ordering::SeqCst);
    let held_since = Utc::now();
    let debug = Backtrace::force_capture().to_string();
Collaborator Author

This may be controversial because collecting a backtrace is considered expensive, but it feels like in the context of anything having to do with the database, this is probably a small cost. The upside is that if quiesce gets stuck (or anything else, for that matter), we'll be able to look at stack traces where outstanding database claims are held.

Collaborator

Could we measure this? What's the cost of the backtrace relative to accessing the DB connection?

I think my opinion might change if this was like, < 10% vs 100%+ of the total time to get a claim by itself.

Couple other questions:

  • Is this output meaningful for us, given that we're in an async calling context? (I assume "yes" because the task is running, just wanna make sure these don't all point to the same spot in the tokio runtime)
  • What's the size of these backtraces? Presumably we're going to have an extra one for each live DB claim now. Like latency - maybe this is acceptable! just want to know what we're dealing with

Collaborator
@smklein smklein Aug 1, 2025

Also, this is paranoia on my part, but:

Does this backtrace contain any argument info, or is it purely the callstack? I understand this is mostly for omdb-level debugging, but I don't want it to be a side-channel for exfiltrating user data.

(EDIT: seems like "no" on the arguments)

Collaborator Author

> Could we measure this? What's the cost of the backtrace relative to accessing the DB connection?

We could. To be precise, what I think matters most is how long the backtrace takes relative to the time that the connection is held, not how long it took to acquire the connection. So what I'd probably do is:

  • define some USDT probes before and after creating the backtrace
  • write a D script that traces these probes, plus the qorb claim acquired/released probes, and breaks out the time spent in these two areas
  • deploy somewhere (best would be a racklette) and trace it for, like, 10 minutes

I'm honestly not sure it's worth the effort. A cheaper thing I could do is measure how long it takes to do a few backtraces on my own dev system and make sure it's as quick as I'd hope. I'd expect this to be on the order of tens (maybe hundreds) of microseconds and I'd expect it to be dwarfed by even a basic database query round-trip.
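
For what it's worth, the cheaper dev-box measurement could be as simple as a standalone program like the following (a sketch, not part of this PR; it only times `Backtrace::force_capture()` plus formatting):

```rust
use std::backtrace::Backtrace;
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Warm up once in case the first capture/format pays one-time costs
    // (e.g., loading symbol information).
    black_box(Backtrace::force_capture().to_string());

    let iterations: u32 = 100;
    let start = Instant::now();
    for _ in 0..iterations {
        // Capture and render to a String, matching what claim() does.
        black_box(Backtrace::force_capture().to_string());
    }
    println!("average capture+format: {:?}", start.elapsed() / iterations);
}
```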

> Is this output meaningful for us, given that we're in an async calling context? (I assume "yes" because the task is running, just wanna make sure these don't all point to the same spot in the tokio runtime)

Yup. Here's an example:

    claim 1248 held since 2025-08-01 21:17:03.232936025 UTC (2s 7ms ago)
    acquired by:
           0: {async_fn#0}
                     at ./nexus/db-queries/src/db/pool.rs:197:21
           1: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/mod.rs:379:27
           2: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/mod.rs:368:45
           3: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/deployment.rs:959:59
           4: {async_fn#0}
                     at ./nexus/db-queries/src/db/datastore/deployment.rs:2294:70
           5: {async_block#0}
                     at ./nexus/src/app/background/tasks/blueprint_load.rs:60:18
           6: poll<alloc::boxed::Box<(dyn core::future::future::Future<Output=serde_json::value::Value> + core::marker::Send), alloc::alloc::Global>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
           7: {async_fn#0}
                     at ./nexus/src/app/background/driver.rs:339:54
           8: {async_fn#0}
                     at ./nexus/src/app/background/driver.rs:292:62
           9: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:365:17
          10: with_mut<tokio::runtime::task::core::Stage<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}>, core::task::poll::Poll<()>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/loom/std/unsafe_cell.rs:16:9
          11: poll<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:354:13
          12: tokio::runtime::task::harness::poll_future::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:535:19
          13: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
          14: do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
          15: __rust_try
          16: try<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
          17: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<omicron_nexus::app::background::driver::{impl#2}::run::{async_fn_env#0}, alloc::sync::Arc<tokio::runtime::scheduler::multi_thread::handle::Handle, alloc::alloc::Global>>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
          18: tokio::runtime::task::harness::poll_future
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:523:18
          19: tokio::runtime::task::harness::Harness<T,S>::poll_inner
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:210:27
          20: tokio::runtime::task::harness::Harness<T,S>::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:155:15
          21: tokio::runtime::task::raw::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:325:5
          22: tokio::runtime::task::raw::RawTask::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:255:18
          23: tokio::runtime::task::LocalNotified<S>::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/mod.rs:509:9
          24: {closure#0}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:600:13
          25: with_budget<core::result::Result<alloc::boxed::Box<tokio::runtime::scheduler::multi_thread::worker::Core, alloc::alloc::Global>, ()>, tokio::runtime::scheduler::multi_thread::worker::{impl#1}::run_task::{closure_env#0}>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/task/coop/mod.rs:167:5
          26: budget<core::result::Result<alloc::boxed::Box<tokio::runtime::scheduler::multi_thread::worker::Core, alloc::alloc::Global>, ()>, tokio::runtime::scheduler::multi_thread::worker::{impl#1}::run_task::{closure_env#0}>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/task/coop/mod.rs:133:5
          27: run_task
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:591:9
          28: tokio::runtime::scheduler::multi_thread::worker::Context::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:539:24
          29: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:504:21
          30: tokio::runtime::context::scoped::Scoped<T>::set
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context/scoped.rs:40:9
          31: tokio::runtime::context::set_scheduler::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context.rs:176:26
          32: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(), tokio::runtime::scheduler::multi_thread::worker::run::{closure#0}::{closure_env#0}>, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/local.rs:315:12
          33: std::thread::local::LocalKey<T>::with
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/local.rs:279:15
          34: tokio::runtime::context::set_scheduler
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context.rs:176:9
          35: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:499:9
          36: tokio::runtime::context::runtime::enter_runtime
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context/runtime.rs:65:16
          37: tokio::runtime::scheduler::multi_thread::worker::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:491:5
          38: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/multi_thread/worker.rs:457:45
          39: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/task.rs:42:21
          40: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:365:17
          41: with_mut<tokio::runtime::task::core::Stage<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>>, core::task::poll::Poll<()>, tokio::runtime::task::core::{impl#6}::poll::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/loom/std/unsafe_cell.rs:16:9
          42: poll<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/core.rs:354:13
          43: tokio::runtime::task::harness::poll_future::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:535:19
          44: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
          45: do_call<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
          46: __rust_try
          47: try<core::task::poll::Poll<()>, core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
          48: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<tokio::runtime::task::harness::poll_future::{closure_env#0}<tokio::runtime::blocking::task::BlockingTask<tokio::runtime::scheduler::multi_thread::worker::{impl#0}::launch::{closure_env#0}>, tokio::runtime::blocking::schedule::BlockingSchedule>>, core::task::poll::Poll<()>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
          49: tokio::runtime::task::harness::poll_future
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:523:18
          50: tokio::runtime::task::harness::Harness<T,S>::poll_inner
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:210:27
          51: tokio::runtime::task::harness::Harness<T,S>::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/harness.rs:155:15
          52: tokio::runtime::task::raw::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:325:5
          53: tokio::runtime::task::raw::RawTask::poll
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/raw.rs:255:18
          54: tokio::runtime::task::UnownedTask<S>::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/task/mod.rs:546:9
          55: tokio::runtime::blocking::pool::Task::run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/pool.rs:161:9
          56: run
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/pool.rs:516:17
          57: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
                     at /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/blocking/pool.rs:474:13
          58: __rust_begin_short_backtrace<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/sys/backtrace.rs:152:18
          59: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/mod.rs:559:17
          60: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:272:9
          61: do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:589:40
          62: __rust_try
          63: try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panicking.rs:552:19
          64: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>>, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/panic.rs:359:14
          65: {closure#1}<tokio::runtime::blocking::pool::{impl#6}::spawn_thread::{closure_env#0}, ()>
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/std/src/thread/mod.rs:557:30
          66: core::ops::function::FnOnce::call_once{{vtable.shim}}
                     at /home/dap/.rustup/toolchains/1.88.0-x86_64-unknown-illumos/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
          67: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
                     at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1966:9
          68: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
                     at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/alloc/src/boxed.rs:1966:9
          69: std::sys::pal::unix::thread::Thread::new::thread_start
                     at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/sys/pal/unix/thread.rs:97:17
          70: _thrp_setup
          71: <unknown>

> What's the size of these backtraces? Presumably we're going to have an extra one for each live DB claim now. Like latency - maybe this is acceptable! just want to know what we're dealing with

We're saving the serialized form, so to give an example, the above one is about 16 KiB. Let's call it 32 KiB to be conservative. Each connection can have at most one of these at a time. At 16 possible connections per backend, and 5 CockroachDB backends (not sure if I'm remembering those right), that's 2.5 MiB per Nexus process. Seems fine.
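
(Back-of-the-envelope check of those numbers, using the assumptions above rather than measured values:)

```rust
fn main() {
    let bytes_per_backtrace: u64 = 32 * 1024; // conservative 32 KiB estimate
    let connections_per_backend: u64 = 16;
    let cockroach_backends: u64 = 5;
    let total = bytes_per_backtrace * connections_per_backend * cockroach_backends;
    // prints: 2560 KiB, i.e. ~2.5 MiB per Nexus process
    println!(
        "{} KiB, i.e. ~{} MiB per Nexus process",
        total / 1024,
        total as f64 / (1024.0 * 1024.0)
    );
}
```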

        false
    } else {
        q.claims_held
            .insert_unique(HeldDbClaimInfo { id, held_since, debug })
Collaborator

this is probably fine, but to say it explicitly - this is a capture of where the claim started, not where the claim may be "stuck in-usage", right?

Collaborator Author

Correct. omdb reports this as acquired by <stack trace> (i.e., "acquired by this code path").


smklein commented Aug 1, 2025

As a higher-level comment on the semantics we want:

  • For DB connections, this PR 100% makes sense to me
  • For sagas, it seems like we're saying "Quiesce is complete for this Nexus when no sagas are actively running"

For sagas, is that sufficient? E.g., if we have a saga that partially completed, but "this nexus isn't trying to run it" (or it was, but it crashed and rebooted) couldn't we have a system where all Nexuses identify "I'm not running sagas NOW", but there's still on-disk state identifying a partially executed saga?

Comment on lines +1481 to +1483
async fn quiesce_start(
    rqctx: RequestContext<Self::Context>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
Collaborator

Should we be concerned about the case where:

  • Nexus is quiescing (old Nexuses are running, as are new Nexuses)
  • The old nexuses all quiesce, but one of them, for whatever reason, thinks that quiescing isn't done. Maybe it can't hear back from other old nexuses for some reason.
  • The new nexuses boot, perform schema changes, resume their lives
  • The old nexus retries: it reads internal DNS, sees records for all the new nexuses, and tells them to quiesce
  • Now the system has an outage

Should we have an argument here, that's "only start quiescing if you are ________", where "_______" might be "running an old version", or something like that? Just something to make it conditional?

Collaborator Author

This API is really for demo, dev, and testing. My expectation is not that it will ever be invoked in a deployed system. This means the situation you're describing can't happen because the old Nexus would never tell the new Nexus instances to quiesce.

I expect it to work something like this:

  • Nexus decides it's time to quiesce (probably in the Reconfigurator executor) and writes quiesce_started to the database.
  • Nexus invokes Nexus::quiesce_start() (maybe from the same code path).
  • Nexus polls the other Nexus instances to see when they've finished quiescing.
  • When they have all finished, Nexus writes quiesce_done.

I'm not sure that's exactly right but my main point is that Nexus is deciding for itself when to quiesce itself, and that code will be calling this code. It won't be some external thing telling it to do this.

Collaborator

> my main point is that Nexus is deciding for itself when to quiesce itself

ack, makes sense. This is the key piece - we should have enough context when observing our own "expected version" vs the db_metadata table to make this call in the immediate future.

// appears before messages from code paths that saw this change.
info!(&self.log, "starting db pool quiesce");
self.quiesce.send_modify(|q| {
    q.new_claims_allowed = ClaimsAllowed::Disallowed;
Contributor

Do we care if we were already in the Disallowed state?

Collaborator Author

I don't think it matters. With this sort of API you can either make it idempotent (like this: do nothing if it's already done) or have it fail, communicate this case, and let callers implement the idempotence on top if they want. I don't see a reason to do the latter here. (In practice, there is only one caller, and it can only ever call it once in the lifetime of the process. If it did call it more than once, this would work fine.)
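
To illustrate the idempotent style described here (do nothing if it's already done), a minimal standalone sketch using a tokio watch channel; the names are simplified stand-ins, not the PR's actual types:

```rust
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ClaimsAllowed {
    Allowed,
    Disallowed,
}

struct Pool {
    quiesce: watch::Sender<ClaimsAllowed>,
}

impl Pool {
    fn start_quiesce(&self) {
        // send_if_modified only notifies receivers when the closure reports a
        // change, so calling this more than once is a harmless no-op.
        self.quiesce.send_if_modified(|state| {
            if *state == ClaimsAllowed::Disallowed {
                false
            } else {
                *state = ClaimsAllowed::Disallowed;
                true
            }
        });
    }
}

fn main() {
    let (tx, rx) = watch::channel(ClaimsAllowed::Allowed);
    let pool = Pool { quiesce: tx };
    pool.start_quiesce();
    pool.start_quiesce(); // second call changes nothing
    assert_eq!(*rx.borrow(), ClaimsAllowed::Disallowed);
}
```

(The snippet above uses `send_modify`, which is just as idempotent for a plain assignment; `send_if_modified` only makes the "already done" case explicit.)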

Collaborator Author
@davepacheco davepacheco left a comment

tl;dr: I think there are several ways this work is incomplete, ranging from "important and solvable problems" to "not so important and hard to solve problems". I still think this PR is a step forward so I think we should proceed with landing it while also fixing and filing separate issues for the other problems.


Unrelated to the discussions here, I realized that I think this PR does not account for sagas that go through recovery (i.e., those that were not created during the lifetime of this process, but rather were resumed from the database). Those won't show up in sagas_running and so Nexus could think it's quiesced even though there are (recovered) sagas still running. It would even rip the database out from under them, preventing them from finishing. This definitely needs to be fixed. I'm hopeful it will be straightforward (see below) so I'm inclined to do that in a follow-on PR (shortly) because this one's already big, most of the contents here shouldn't change, and nothing depends on quiesce yet so there's no new risk introduced by landing this one. Let me know if you disagree!

My plan here:

  • pass the saga watch::Sender<Quiesce> channel to the saga recovery background task
  • have that background task keep sagas_running up to date for recovered sagas the same way we do it for newly-created sagas (i.e., insert them into sagas_running when we resume them, then remove them when they finish)
  • add a boolean flag to that structure that reflects "saga recovery has completed at least once", then don't consider sagas quiesced until that's true and the existing conditions are true (sagas disallowed and no sagas running).
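
A rough sketch of the bookkeeping this plan describes (field and type names invented here, not the PR's actual types): sagas count as quiesced only once new sagas are disallowed, recovery has completed at least once, and nothing is running.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq)]
enum SagasAllowed {
    Allowed,
    Disallowed,
}

struct SagaQuiesceState {
    new_sagas_allowed: SagasAllowed,
    /// covers both newly-created sagas and sagas resumed by saga recovery
    sagas_running: BTreeMap<u64, String>, // saga id -> saga name (simplified)
    /// set once the saga recovery background task has completed a full pass
    recovery_completed_once: bool,
}

impl SagaQuiesceState {
    fn sagas_quiesced(&self) -> bool {
        self.new_sagas_allowed == SagasAllowed::Disallowed
            && self.recovery_completed_once
            && self.sagas_running.is_empty()
    }
}

fn main() {
    let mut state = SagaQuiesceState {
        new_sagas_allowed: SagasAllowed::Allowed,
        sagas_running: BTreeMap::new(),
        recovery_completed_once: false,
    };
    assert!(!state.sagas_quiesced()); // still accepting sagas, recovery not done

    state.new_sagas_allowed = SagasAllowed::Disallowed;
    state.recovery_completed_once = true;
    assert!(state.sagas_quiesced()); // disallowed, recovered, nothing running
}
```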

That brings me to @smklein's question:

> As a higher-level comment on the semantics we want:
>
> * For DB connections, this PR 100% makes sense to me
>
> * For sagas, it seems like we're saying "Quiesce is complete for this Nexus when no sagas are actively running"
>
> For sagas, is that sufficient? E.g., if we have a saga that partially completed, but "this nexus isn't trying to run it" (or it was, but it crashed and rebooted) couldn't we have a system where all Nexuses identify "I'm not running sagas NOW", but there's still on-disk state identifying a partially executed saga?

Under normal operation, it shouldn't be possible for there to be unfinished sagas that a Nexus is supposed to be running that it doesn't know about. Aside from bugs, there are two cases where this could happen, and both should be transient. The first case is saga recovery after a Nexus restart. Immediately on startup, Nexus will think there are no sagas running, but that's because it hasn't finished recovering them yet. This is the case that I mentioned above, and I think we have a plausible plan for fixing it.

The second case is a little trickier: it's when Nexus re-assigns sagas to itself that were owned by another Nexus that has recently been expunged. Concretely: suppose we have a sled with Nexus N1 running a saga S1. The sled bursts into flames and is expunged. During the next blueprint execution, responding to the expungement, Nexus N2 assigns saga S1 to itself. As part of a concurrent upgrade, N2 decides to quiesce itself. It hasn't actually recovered S1 so it doesn't know about it yet. It may think it's quiesced, then try to recover it. That wouldn't be great! This specific case could probably be eliminated, either by having the quiesce state reflect that there may be unrecovered sagas as a result of this situation or (probably better) disallowing us from assigning ourselves sagas once we've started quiescing. I may try to incorporate that into the follow-on PR.

This case caused me to see a deeper problem with this mechanism. This is extremely unlikely, but suppose:

  • we're doing an upgrade and decide to start quiescing
  • there are sagas running in some Nexus N1, so it's waiting for those to finish
  • there are no sagas running in any other Nexus so they've disabled their database access
  • the sled hosting N1 fails and needs to be expunged

With my suggested fixes above, there'd be no way to expunge the busted sled because the only remaining Nexus instances have shut off their database access. The upgrade would be stuck, since quiescing cannot complete without affirmative confirmation from all in-service Nexus instances. In this situation, support would need to either pause the upgrade before the handoff, get Nexus back up, do the expungement, and then proceed again; or else force the handoff (e.g., by writing quiesce_completed) and allow the saga to be abandoned. This sounds bad in that it seems hard to fix, but it's also extremely unlikely (both sled and disk failure seem quite rare in practice and this would be a pretty small window), and it is recoverable, so I'd be comfortable filing an "important non-blocker" issue on this.

This raises a related problem, which is that if one Nexus takes a lot longer to drain its sagas than the others, then the other ones are going to be pitching 503s for a long time. This suggests we might be better off having an extra step in the middle of quiesce which is "wait for other Nexus instances to report having drained their sagas before disabling database access". That way, all Nexus instances would keep doing most things (just not creating new sagas) until all sagas are done. Then they'd all shut off database access about the same time. This may well be worthwhile but I'd say that's well outside the scope of our initial upgrade MVP.


All of that said, my inclination would be:

  • land this PR as-is
  • do a follow-on PR (a blocker for actually using this work) that deals with recovered sagas and prevents sagas from being re-assigned while we're quiescing
  • file issues for:
    • dealing with needing to expunge a Nexus during quiesce
    • having Nexus instances synchronize before moving through the quiesce stages (to minimize the window of downtime and the exposure to the above problem)

Collaborator
@smklein smklein left a comment

My main holdup with this PR was "dealing with sagas in the quiesce case" - if we're planning on punting that to later, fine with me, let's just make sure we get an issue filed before merging.

@davepacheco
Collaborator Author

> All of that said, my inclination would be:
>
> * land this PR as-is

See below.

> * do a follow-on PR (a blocker for actually _using_ this work) that deals with recovered sagas and prevents sagas from being re-assigned while we're quiescing

This is now up for review: #8794.

> * file issues for:
>
>   * dealing with needing to expunge a Nexus _during_ quiesce
>   * having Nexus instances synchronize before moving through the quiesce stages (to minimize the window of downtime and the exposure to the above problem)

Filed #8795 and #8796.


I had said I wanted to land this PR as-is, but I realized what I really meant was I wanted to get it to the point where it was ready to land so that the follow-up could be reviewed independently of this one. Now that #8794 is up for review, my plan is to get that one ready to land (address any review feedback), then land both of these together. Although there shouldn't be much risk here, given the timing, I figure I'll wait for the R17 gates to open.

@davepacheco davepacheco merged commit f423987 into main Aug 12, 2025
18 checks passed
@davepacheco davepacheco deleted the dap/nexus-quiesce branch August 12, 2025 17:11