@smklein smklein commented Jul 7, 2025

Part of #8501.

What this PR does:

  • Adds quiesce_started and quiesce_completed to db metadata
  • Refuses to boot Nexus if quiescing is in progress
  • Adds tests that Nexus actually refuses to boot if quiescing is in progress
  • Modifies schema migration query to reset quiesce values to false at the last step of finalization
  • Tests that quiesce values get set to false
  • Allows the existing schema-update binary to explicitly ignore the "quiescing" booleans

What this PR does NOT do:

  • Set the "quiesce_started" and "quiesce_completed" booleans. This will be the responsibility of Nexus, decided by the reconfigurator, and coordinated between "old Nexuses" as described in #8501 (schema update: orchestration of the handoff from old to new Nexus).
  • Change any configuration to attempt to auto-format Nexus. When new Nexuses boot, they'll still be using the "UpdateConfiguration::Disabled" option, and will be waiting for the schema-update tool to trigger.

@smklein smklein force-pushed the db_metadata_quiesce branch from 911d2bd to 17ca05f Compare July 7, 2025 22:49
@smklein smklein force-pushed the db_metadata_quiesce branch 4 times, most recently from 679d105 to 44940af Compare July 31, 2025 20:16
@smklein smklein changed the title [wip] Add schema changes for quiescing [nexus] Add schema changes for quiescing Jul 31, 2025
@smklein smklein force-pushed the db_metadata_quiesce branch 2 times, most recently from 010dc30 to 4f2e7fc Compare July 31, 2025 21:29
@smklein smklein marked this pull request as ready for review July 31, 2025 22:18
@smklein smklein force-pushed the db_metadata_quiesce branch from 4f2e7fc to 233203d Compare August 1, 2025 00:39

@davepacheco davepacheco left a comment

I haven't looked at nexus/db-queries/src/db/datastore/db_metadata.rs yet.

@smklein smklein force-pushed the db_metadata_quiesce branch 2 times, most recently from e2c68fb to 47dd307 Compare August 1, 2025 21:47
@davepacheco

Thanks! This is a good step forward. The rest of this comment is about future work.


I feel like we're close to the point where a re-think of the control flow here could result in a clearer, more testable system. I started writing this thinking it would make sense in text and it's gotten so long and specific that maybe a PR would make more sense...sorry. But I'm curious if this direction makes sense to folks.

It looks like right now, Nexus startup blocks on DataStore::new_with_timeout, generally with a None timeout meaning "wait indefinitely" and a None config that means "don't upgrade automatically". So we get stuck early in Nexus startup waiting for the schema to be right. This is convenient because it means we don't go kick off background tasks or open our external listening socket in a state where trying to use the database will fail. But I think this is going to make it tricky to implement #8501 for the reason you mentioned in #8750 (we want the internal API up in order to carry out quiesce). But it's worse than that: in order to implement quiesce, we need to be able to do saga recovery and finish executing sagas, and that requires using the database! So I think we need to allow the datastore to come fully up when it's the expected version and quiesce_started && !quiesce_completed (basically, the current NexusTooOld { handoff: true } case) and notice at a higher level when it's time to quiesce.

Then we have a bunch of consumers with different policies:

  • omdb wants to start no matter what, but warn if the version doesn't match what it expects
  • tests and some dev tools want to start only if the schema exactly matches -- don't apply update, don't wait for it to be updated, etc.
  • schema updater wants to upgrade no matter what

Thinking out loud, I wonder if it'd be cleanest to separate these:

  • fetch the database metadata
  • interpret what it means
  • decide what to do about it

and then split the logic into a couple of decoupled tasks.

To be more concrete, suppose we had:

/// Reports how the schema version deployed in the database compares to what this Nexus expects
/// (does NOT imply what to do about it)
enum SchemaVersionStatus {
    /// includes both (1) database is actually on a newer version and
    /// (2) database is on the same version but `quiesce_completed` is true
    DatabaseIsNewer,
    /// the database is on an older version and quiesce has not completed
    DatabaseIsOlderUnquiesced,
    /// the database is on an older version and quiesce has completed
    DatabaseIsOlderQuiesced,
    /// the database version matches but we should be quiescing
    DatabaseMatchesQuiescing,
    /// the database version matches and we're not quiescing
    DatabaseMatchesReady,
}

impl SchemaVersionStatus {
    /// Interpret how the current db metadata compares to our desired version -- this is a pretty simple table based on `(found_version, desired_version, quiesce_started, quiesce_completed)`
    fn interpret(db_metadata: DbMetadata, desired_version) -> SchemaVersionStatus { ... }
}

/// Describes how to respond to a given `SchemaVersionStatus`
enum SchemaAction {
    /// do not touch the database at all any more
    StopAllDbActivity,
    /// begin quiescing Nexus (database may still be unquiesced if sagas are running)
    Quiesce,
    /// normal operation -- use the database normally, no other actions needed
    Ready,
    /// wait for either the schema to be updated or (if willing to update it ourselves but we want to wait for quiesce to complete) for quiescing to finish
    /// (do not use the database, do not try to upgrade it)
    Wait,
    /// fail immediately (used by tests or other things that don't expect an update is needed and don't want to do one)
    Fail,
    /// start a schema update
    Update,
} 

/// Determines how the consumer wants to behave when the schema doesn't match what they expect
enum ConsumerPolicy {
    /// tests, dev tools that always expect things to match
    FailOnMismatch,
    /// used by omdb to do best-effort
    IgnoreMismatch,
    /// used by Nexus in the automated update world to wait for old Nexus to quiesce
    UpdateGracefully,
    /// used by schema-updater to force an update regardless of quiesce
    UpdateForcefully,
}

impl SchemaAction {
    fn new(
        schema_version_status: SchemaVersionStatus,
        consumer_policy: ConsumerPolicy,
    ) -> SchemaAction {
        // This would be one big `match` that I'm not going to write out but I think is where basically all of the real logic of this whole system would live.  It would be pretty testable!
    }
}

impl DataStore {
    fn check_schema_version(&self, desired_version) -> Result<SchemaVersionStatus, Error> {
        // just fetches the row from `db_metadata` and calls `SchemaVersionStatus::interpret`
    }
    fn ensure_schema_version(&self, desired_version, force: bool) -> Result<(), Error> {
        // largely the same as today, but would not be called unless we're in a consumer that actually wants to update the schema
    }
}
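
For illustration, here is one way the `interpret` table could be filled in, read directly off the variant descriptions above. This is only a sketch under assumptions: the version is modeled as a plain `u32` (the real type is not shown in this thread), the `DbMetadata` field names are hypothetical, and "we should be quiescing" is taken to mean `quiesce_started && !quiesce_completed`.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for the `db_metadata` row; field names are assumptions.
#[derive(Debug, Clone, Copy)]
struct DbMetadata {
    version: u32, // assumption: the real schema version type is richer than a u32
    quiesce_started: bool,
    quiesce_completed: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchemaVersionStatus {
    DatabaseIsNewer,
    DatabaseIsOlderUnquiesced,
    DatabaseIsOlderQuiesced,
    DatabaseMatchesQuiescing,
    DatabaseMatchesReady,
}

impl SchemaVersionStatus {
    /// Pure function of (found_version, desired_version, quiesce_started,
    /// quiesce_completed); it observes and classifies, it does not decide.
    fn interpret(db: DbMetadata, desired_version: u32) -> SchemaVersionStatus {
        use SchemaVersionStatus::*;
        let found_vs_desired = db.version.cmp(&desired_version);
        match (found_vs_desired, db.quiesce_started, db.quiesce_completed) {
            // Database is ahead of us, or our version has finished quiescing.
            (Ordering::Greater, _, _) => DatabaseIsNewer,
            (Ordering::Equal, _, true) => DatabaseIsNewer,
            // Database is behind us; only the quiesce state differs.
            (Ordering::Less, _, true) => DatabaseIsOlderQuiesced,
            (Ordering::Less, _, false) => DatabaseIsOlderUnquiesced,
            // Versions match and quiesce has not completed.
            (Ordering::Equal, true, false) => DatabaseMatchesQuiescing,
            (Ordering::Equal, false, false) => DatabaseMatchesReady,
        }
    }
}
```

Because the function has no I/O, each row of the table can be covered by a one-line unit test, which is the testability argument being made here.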

Then I'm imagining that consumers pass a ConsumerPolicy into DataStore::new (or maybe it's separate constructors, like today), that thing calls check_schema_version(), then SchemaAction::new() with the returned SchemaVersionStatus and ConsumerPolicy. Then based on the action it gets back:

  • StopAllDbActivity: block indefinitely (we don't want the consumer to start up)
  • Quiesce: finish constructing the datastore, but return a signal to the consumer to start quiescing (say, call the nexus.quiesce_start() added in #8740, "implement Nexus quiesce (sagas, db activity) for upgrade", at the end of Nexus::new_with_id())
  • Ready: return
  • Wait: sleep and retry
  • Fail: return an error
  • Update: call ensure_schema_version, then treat it like Ready. On failure, sleep and retry.
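
A rough sketch of that dispatch loop, to show how little logic is left once the action has been computed. Everything here other than the `SchemaAction` variants is hypothetical: `StartupOutcome`, `run_startup`, and the two closures (standing in for "check the schema and compute an action" and for the schema update call) are made up for illustration. To keep the sketch testable, "block indefinitely" is modeled as returning `Blocked`, and the sleep between retries is replaced by a bounded attempt count.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchemaAction {
    StopAllDbActivity,
    Quiesce,
    Ready,
    Wait,
    Fail,
    Update,
}

/// Hypothetical result of datastore startup, as seen by the consumer.
#[derive(Debug, PartialEq, Eq)]
enum StartupOutcome {
    /// Datastore is usable; no further action needed.
    Ready,
    /// Datastore is usable, but the consumer should begin quiescing.
    ReadyMustQuiesce,
    /// Stand-in for "block indefinitely"; the consumer must not start.
    Blocked,
    /// Startup failed immediately.
    Failed,
    /// Sketch-only: retry budget exhausted (a real loop would keep waiting).
    GaveUp,
}

fn run_startup(
    mut next_action: impl FnMut() -> SchemaAction,
    mut ensure_schema_version: impl FnMut() -> Result<(), String>,
    max_attempts: usize,
) -> StartupOutcome {
    for _ in 0..max_attempts {
        match next_action() {
            SchemaAction::StopAllDbActivity => return StartupOutcome::Blocked,
            SchemaAction::Quiesce => return StartupOutcome::ReadyMustQuiesce,
            SchemaAction::Ready => return StartupOutcome::Ready,
            SchemaAction::Fail => return StartupOutcome::Failed,
            // Nothing to do but re-check the schema on the next pass.
            SchemaAction::Wait => {}
            SchemaAction::Update => {
                // On success this is equivalent to Ready; on failure, retry.
                if ensure_schema_version().is_ok() {
                    return StartupOutcome::Ready;
                }
            }
        }
        // A real implementation would sleep here before retrying.
    }
    StartupOutcome::GaveUp
}
```

In a test, `next_action` can replay a scripted sequence such as Wait, then a failing Update, then a successful Update, and assert that the outcome is `Ready` after exactly two update attempts.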

I think what makes me nervous about the current implementation is that the complicated control logic is now spread through a bunch of different functions with diverging call chains that make it hard to be sure it's all correct. The idea here is to separate policy vs. observations vs. decisions, represent those with types, and have simple, testable functions that go from each one to the next. Then we put it all together in just one or a couple of places.

Sorry for the ramble. Maybe we can discuss next week. Again, none of this really impacts this PR. This PR is an important step in the right direction even if we do decide to rework the control flow. It's just that it adds enough complexity to the decision-making that I'm having a harder time working through all the cases and that got me thinking about this restructure.

@smklein smklein force-pushed the db_metadata_quiesce branch from 47dd307 to cc364a8 Compare August 7, 2025 21:11
@smklein

smklein commented Aug 18, 2025

Closing this PR - RFD 588 describes our new plans
