-
Notifications
You must be signed in to change notification settings - Fork 66
[nexus] Add schema changes for quiescing #8533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
911d2bd to
17ca05f
Compare
679d105 to
44940af
Compare
010dc30 to
4f2e7fc
Compare
4f2e7fc to
233203d
Compare
davepacheco
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at nexus/db-queries/src/db/datastore/db_metadata.rs yet.
e2c68fb to
47dd307
Compare
|
Thanks! This is a good step forward. The rest of this comment is about future work. I feel like we're close to the point where a re-think of the control flow here could result in a clearer, more testable system. I started writing this thinking it would make sense in text and it's gotten so long and specific that maybe a PR would make more sense...sorry. But I'm curious if this direction makes sense to folks. It looks like right now, Nexus startup blocks on Then we have a bunch of consumers with different policies:
Thinking out loud, I wonder if it'd be cleanest to separate these:
and then split the logic into a couple of decoupled tasks. To be more concrete, supposed we had: /// Reports how the schema version deployed in the database compares to what this Nexus expects
/// (does NOT imply what to do about it)
enum SchemaVersionStatus {
DatabaseIsNewer, // includes both (1) database is actually on a newer version and (2) database is on the same version but `quiesce_completed` is true
DatabaseIsOlderUnquiesced, // the database is on an older version and quiesce has not completed
DatabaseIsOlderQuiesced, // the database is on an older version and quiesce has completed
DatabaseMatchesQuiescing, // the database version matches but we should be quiescing
DatabaseMatchesReady, // the database version matches and we're not quiescing
}
impl SchemaVersionStatus {
/// Interpret how the current db metadata compares to our desired version -- this is a pretty simple table based on `(found_version, desired_version, quiesce_started, quiesce_completed)`
fn interpret(db_metadata: DbMetadata, desired_version) -> SchemaVersionStatus { ... }
}
/// Describes how to respond to a given `SchemaVersionStatus`
enum SchemaAction {
/// do not touch the database at all any more
StopAllDbActivity,
/// begin quiescing Nexus (database may still be unquiesced if sagas are running)
Quiesce,
/// normal operation -- use the database normally, no other actions needed
Ready,
/// wait for either the schema to be updated or (if willing to update it ourselves but we want to wait for quiesce to complete) for quiescing to finish
/// (do not use the database, do not try to upgrade it)
Wait,
/// fail immediately (used by tests or other things that don't expect an update is needed and don't want to do one)
Fail,
/// start a schema update
Update,
}
/// Determines how the consumer wants to behave when the schema doesn't match what they expect
enum ConsumerPolicy {
/// tests, dev tools that always expect things to match
FailOnMismatch,
/// used by omdb to do best-effort
IgnoreMismatch,
/// used by Nexus in the automated update world to wait for old Nexus to quiesce
UpdateGracefully,
/// used by schema-updater to force an update regardless of quiesce
UpdateForcefully,
}
impl SchemaAction {
fn new(
schema_version_status: SchemaVersionStatus,
consumer_policy: ConsumerPolicy,
) -> SchemaAction {
// This would be one big `match` that I'm not going to write out but I think is where basically all of the real logic of this whole system would live. It would be pretty testable!
}
}
impl DataStore {
fn check_schema_version(&self, desired_version) -> Result<SchemaVersionStatus, Error> {
// just fetches the row from `db_metadata` and calls `SchemaVersionStatus::interpret`
}
fn ensure_schema_version(&self, desired_version, force: bool) -> Result<(), Error> {
// largely the same as today, but would not be called unless we're in a consumer that actually wants to update the schema
}
}Then I'm imagining that consumers pass a
I think what makes me nervous about the current implementation is that the complicated control logic is now spread through a bunch of different functions with diverging call chains that makes it hard to be sure it's all correct. The idea here is to separate policy vs. observations vs. decisions, represent those with types, and have simple, testable functions that go from each one to the next. Then we put it all together in just one or a couple of places. Sorry for the ramble. Maybe we can discuss next week. Again, none of this really impacts this PR. This PR is an important step in the right direction even if we do decide to rework the control flow. It's just that it adds enough complexity to the decision-making that I'm having a harder time working through all the cases and that got me thinking about this restructure. |
47dd307 to
cc364a8
Compare
|
Closing this PR - RFD 588 describes our new plans |
Part of #8501 .
What this PR does:
quiesce_startedandquiesce_completedto db metadataquiescevalues tofalseat the last step of finalizationquiescevalues get set tofalseWhat this PR does NOT do:
schema-updatetool to trigger.