storage: push txn queues for high contention scenarios #13501
spencerkimball merged 1 commit into master
What do you mean by immediately retried? Also, is the pusher notified of the failed push and does it go into a backoff loop, or does the request block and retry its push locally? If the latter, do you also trigger the push when the txn commits? I could probably tell from the code, but I can't really dig in on my phone, and it would be good to belabor this in the description anyway.
|
I haven't looked at the implementation yet, but does the deadlock detection handle cycles that are larger than 2? That is, A -> B -> C -> D -> A. |
|
I'm going to try creating a diagram of the old model and of the new model because there are just layers in layers here. Let me try to verbally answer the question: When a On the retry, the request hits a new conditional execution path in
As additional requests come in to push the txn record, they are immediately enqueued in same manner. So the Any call to The only backoff loop is for checking the status of the pusher's txn. This is the mechanism by which we find out about which txn might in turn be depending on the pusher. This is called at |
6889cab to
92d68c0
|
|
|
What concerns me with this approach is that we'll soon work on the ability to cancel ongoing transactions asynchronously (e.g. some interface to kill long-running queries from the UI) and introducing a queue makes it necessary to scan this queue and remove items in the middle when a txn is cancelled. Hence a couple of questions:
|
|
@knz, you don't need to look in the queue. Simply send a There is no need to explicitly remove from the queue (because pushers remove themselves). It's safe to ignore removing a txn from other txn's dependency lists when aborted. In the worst case, we'd end up aborting another txn unnecessarily in the belief that there is a cycle that doesn't exist involving the already-aborted txn. This is also true in other circumstances, like when a txn commits just before we notice the dependency cycle and abort anyway. We still need clients to do txn retry loops for SERIALIZABLE restarts, so this is just another retry occasion, and likely an exceedingly rare one at that. |
|
Thanks for clarifying! |
|
Reviewed 35 of 35 files at r1.

.gitignore, line 20 at r1 (raw file):
I think we prefer to put things like this in a per-directory

pkg/roachpb/data.go, line 750 at r1 (raw file):
Refer to {Min,Max}UserPriority instead of 0 and MaxInt32.

pkg/sql/txn.go, line 73 at r1 (raw file):
I'm worried about removing the old randomized priority bands. Now, high priority transactions are special, and two contending high-priority transactions will interact badly with each other. With randomized bands, they'd still preempt most lower-priority transactions but interact with each other the same way two normal transactions would. This is going to make it risky to ever use a non-default priority.

pkg/storage/intent_resolver.go, line 86 at r1 (raw file):
Rename

pkg/storage/intent_resolver.go, line 88 at r1 (raw file):
The old code here would hide certain errors from the caller (primarily AmbiguousResultError - an ambiguous result on a push shouldn't make the entire operation ambiguous)

pkg/storage/intent_resolver.go, line 97 at r1 (raw file):
I think we probably want to return errors here instead of swallowing them now (with the same AmbiguousResultError caveat as above). If we hide errors from either push or resolve, we should hide from both.

pkg/storage/push_txn_queue.go, line 66 at r1 (raw file):
This comment appears to be out of date; we write (possibly nil) values to the channel.

pkg/storage/push_txn_queue.go, line 89 at r1 (raw file):
Why is this sync.Locker instead of syncutil.Mutex?

pkg/storage/push_txn_queue.go, line 115 at r1 (raw file):
What's stopping another goroutine from adding a new waiter right after we've cleared the map here?

pkg/storage/push_txn_queue.go, line 158 at r1 (raw file):
We've been bitten before by passing pointers to the same Transaction around to multiple callers. Do we need to be worried about that here?

pkg/storage/push_txn_queue.go, line 187 at r1 (raw file):
This can be

pkg/storage/push_txn_queue.go, line 191 at r1 (raw file):
s/of/or/

pkg/storage/push_txn_queue.go, line 245 at r1 (raw file):
Why is this conditional? (instead of always basing it on

pkg/storage/push_txn_queue.go, line 275 at r1 (raw file):
s/the txn/the pushee txn/

pkg/storage/push_txn_queue.go, line 279 at r1 (raw file):
If we hit this early return

pkg/storage/push_txn_queue.go, line 287 at r1 (raw file):
s/the transaction/the pusher transaction/ Rename

pkg/storage/push_txn_queue.go, line 329 at r1 (raw file):
Is this scenario covered by existing tests?

pkg/storage/replica.go, line 2070 at r1 (raw file):
Why does this happen in tryAddWriteCmd instead of in the loop in Store?

pkg/storage/replica_command.go, line 1374 at r1 (raw file):
This is sketchy - it runs on each replica and produces different results. We don't currently use the replies from different replicas in ways that could cause them to diverge, but I think it's possible for this to return incorrect results if e.g. the lease changed hands while this PushTxn command was in flight. It might be better to introduce a new read-only command that returns waiting transactions instead of adding to the overloaded PushTxn.

pkg/storage/replica_command.go, line 1399 at r1 (raw file):
For the record, this is going to require a stop-the-world migration.

pkg/storage/replica_proposal.go, line 157 at r1 (raw file):
That's not a sentence.

pkg/util/retry/retry.go, line 135 at r1 (raw file):
Add a simple test of this method in retry_test.go

Comments from Reviewable
|
Review status: all files reviewed at latest revision, 22 unresolved discussions, some commit checks failed.

.gitignore, line 20 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/roachpb/data.go, line 750 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/sql/txn.go, line 73 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Two contending high priority transactions don't react badly IMO. They simply cannot abort each other. In this new model, normal transactions react the same way as two high priority transactions. The priorities have been reduced to a random variable for deciding which txns in a dependency cycle to abort to avoid deadlock. We will need to update the docs to clarify this new txn model. Our behavior is now much closer to that of traditional locking. For priorities, we currently only allow {LOW, NORMAL, HIGH} in the SQL API. What we need to highlight here is that LOW is for low-priority background work and will always be aborted by contending non-LOW transactions. HIGH is for high-priority low-latency work that will always abort non-HIGH transactions.

pkg/storage/intent_resolver.go, line 86 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/intent_resolver.go, line 88 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
OK, I'm now swallowing ambiguous result errors.

pkg/storage/intent_resolver.go, line 97 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 66 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 89 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 115 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I think we might want to add

pkg/storage/push_txn_queue.go, line 158 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 187 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 191 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 245 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Because there are multiple waiters, and if you're a waiter which isn't handling the active update, you don't want to go into a busy loop on the expired heartbeat timeout. You also don't want to just not wait with any timeout or you might never wake up to discover the txn has expired.

pkg/storage/push_txn_queue.go, line 275 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 279 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 287 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 329 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
No, I'm adding tests still.

pkg/storage/replica.go, line 2070 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
No real reason. This seemed more appropriate to me. Do you think it should be in

pkg/storage/replica_command.go, line 1374 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yeah, I think we want a different push query type operation. This would benefit from being a read-only op. I'll make that change.

pkg/storage/replica_command.go, line 1399 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Yes. Will add that to the PR description.

pkg/storage/replica_proposal.go, line 157 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/util/retry/retry.go, line 135 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Will do; was just getting this out the door, as @petermattis suggested. Will get tests done this weekend.
d6d79fc to
8448919
|
@bdarnell I've addressed the Also, removed the Tests still to come. |
|
This is a Big Scary Change. Can it be protected by an env var? That seems a bit difficult, but I worry about the beta release in which we enable this if we don't have prior testing on our test clusters.

Review status: 19 of 40 files reviewed at latest revision, 25 unresolved discussions, some commit checks failed.

pkg/roachpb/batch_generated.go, line 1 at r2 (raw file):
You're supposed to edit

pkg/storage/push_txn_queue.go, line 37 at r2 (raw file):
I thought the QUERY pushes were removed in this PR.

pkg/storage/push_txn_queue.go, line 67 at r2 (raw file):
I find the "push" terminology confusing. Not for this PR, but we should think about revisiting it now that we have something that is essentially a lock. In the short term,

.gitignore, line 20 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Or use
|
I don't think putting this behind an env var is going to be feasible. I think we should put it on at least one test cluster before merging it to master, then merge it to master just after a beta release so we'll have a full week to run it on all the test clusters.

Reviewed 22 of 22 files at r2.

pkg/roachpb/data.go, line 751 at r2 (raw file):
These should be

pkg/storage/intent_resolver.go, line 85 at r2 (raw file):
The code that handled WriteIntentErrors specially at the call site for this method (in store.go) is gone; this will cause the entire operation to fail. Instead, we want to assume that the resolution succeeded and allow the retry in store.go to proceed. We could either transform AmbiguousResultError into

pkg/storage/push_txn_queue.go, line 287 at r1 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
To clarify, I meant

pkg/storage/replica.go, line 2070 at r1 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
I think so. It's related to the intentResolver stuff done in that loop, and I think it's a better fit there than here (since

pkg/storage/replica_command.go, line 1399 at r1 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
Actually, I don't think stop-the-world is enough without

pkg/storage/replica_command.go, line 1329 at r2 (raw file):
We need to keep the implementation of
|
Agreed. Would be very difficult and we'd have to leave a lot of stuff in place which is complex (the backoff loops). I'll target getting this in by Thursday to meet the suggested schedule.

Review status: all files reviewed at latest revision, 14 unresolved discussions, some commit checks failed.

pkg/roachpb/batch_generated.go, line 1 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I didn't edit this file. I edited

pkg/roachpb/data.go, line 751 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/intent_resolver.go, line 85 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Ah, good point. I'm moving this error handling to

pkg/storage/push_txn_queue.go, line 287 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/push_txn_queue.go, line 37 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Done.

pkg/storage/push_txn_queue.go, line 67 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The waiter isn't necessarily a transaction, so I'm not in favor of

pkg/storage/replica.go, line 2070 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica_command.go, line 1399 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica_command.go, line 1329 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
4d650cd to
45ec826
|
Looks good aside from the tests that still need to be added.

Reviewed 12 of 12 files at r3.

pkg/storage/replica_command.go, line 1329 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
OK
|
Reviewed 22 of 35 files at r1, 11 of 22 files at r2, 11 of 12 files at r3.

pkg/kv/txn_correctness_test.go, line 380 at r3 (raw file):
what transactions? this function does not take any transactions as arguments.

pkg/roachpb/api.proto, line 472 at r3 (raw file):
should we also reserve this one?

pkg/roachpb/api.proto, line 520 at r3 (raw file):
is this transitional? how is it intended to be used?

pkg/roachpb/batch_generated.go, line 1 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
@spencerkimball

pkg/roachpb/data_test.go, line 490 at r3 (raw file):
how come this changed?

pkg/roachpb/data_test.go, line 493 at r3 (raw file):
nit: move this above the "table" above since it isn't related.

pkg/roachpb/errors.go, line 33 at r3 (raw file):
Can you add "Fixes #5249." to the commit message? That issue was mistakenly left closed in May of last year, but I've now reopened it and assigned to you so that things can be linked properly.

pkg/roachpb/errors.proto, line 267 at r3 (raw file):
don't we want to keep this to avoid reusing it?

pkg/storage/intent_resolver.go, line 76 at r3 (raw file):
why the named return?

pkg/storage/replica.go, line 1969 at r3 (raw file):
nit: random newline in otherwise untouched area

pkg/storage/replica_command.go, line 1401 at r3 (raw file):
looks like we lost some fidelity here; perhaps

pkg/storage/replica_command.go, line 1432 at r3 (raw file):
the comment in the proto is confusing in light of this code - you say that if true (newPriorities) then only minpriority can always be pushed, and only maxpriority can always push. Why do we need to bother checking the other txn's priority?

pkg/storage/replica_command.go, line 1439 at r3 (raw file):

pkg/storage/replica_command.go, line 1463 at r3 (raw file):
s/should/does not/?

pkg/storage/replica_test.go, line 3058 at r3 (raw file):
restore this comment for consistency, you kept it everywhere else.

pkg/storage/store.go, line 2436 at r3 (raw file):
Rather than also, this code currently does not exit the loop if the context is cancelled.

pkg/storage/store.go, line 2500 at r3 (raw file):
can you explain in the comment why this doesn't propagate the original error?

pkg/storage/store.go, line 2548 at r3 (raw file):
don't you need to set

pkg/storage/store.go, line 2557 at r3 (raw file):
debugging code? you're only logging the type.
45ec826 to
f61f143
|
Tests added. PTAL

Review status: all files reviewed at latest revision, 29 unresolved discussions, some commit checks failed.

pkg/kv/txn_correctness_test.go, line 380 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
The comment is referring to the transactions involved in the history. You don't need to know anything specific about the transactions themselves, only how many there are (

pkg/roachpb/api.proto, line 472 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
OK, will restore.

pkg/roachpb/api.proto, line 520 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Not sure we can get rid of it if we always want to be able to upgrade from an older version, no matter how much older. We should probably sunset it at some point regardless. Not sure we have a policy for things like this.

pkg/roachpb/batch_generated.go, line 1 at r2 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Sorry, yes I didn't edit

pkg/roachpb/data_test.go, line 490 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Because specifying 1000 doesn't generate a random priority any longer. It just returns a deterministic max.

pkg/roachpb/data_test.go, line 493 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/roachpb/errors.go, line 33 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/roachpb/errors.proto, line 267 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
OK, added a deprecated warning.

pkg/storage/intent_resolver.go, line 76 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Oops vestigial when we had a defer to intercept.

pkg/storage/replica.go, line 1969 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica_command.go, line 1329 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.

pkg/storage/replica_command.go, line 1401 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
That reason hasn't been useful yet. We still have the right info here. Easier this way.

pkg/storage/replica_command.go, line 1432 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
minpriority can't push minpriority and maxpriority can't push maxpriority. I clarified api.proto a bit.

pkg/storage/replica_command.go, line 1439 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica_command.go, line 1463 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica_test.go, line 3058 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/store.go, line 2436 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Oops on not exiting the loop. Done.

pkg/storage/store.go, line 2500 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/store.go, line 2548 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/store.go, line 2557 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.
|
Haven't reviewed tests yet; will do so tomorrow.

Reviewed 14 of 16 files at r4.

pkg/kv/txn_correctness_test.go, line 380 at r3 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
I see. I think it'd be clearer not to refer to "the transactions" here, but I won't insist.

pkg/roachpb/api.proto, line 520 at r3 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
OK. Can you document that nuance in the comment on this field (or wherever you see fit)?

pkg/storage/replica_command.go, line 1389 at r4 (raw file):
debugging code?

pkg/util/retry/retry.go, line 122 at r4 (raw file):
nit: another random whitespace change

pkg/util/retry/retry_test.go, line 118 at r4 (raw file):
this comment doesn't make sense?

pkg/util/retry/retry_test.go, line 119 at r4 (raw file):
why can't this be You could eliminate
|
Review status: 41 of 43 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed.

pkg/roachpb/api.proto, line 520 at r3 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/replica_command.go, line 1389 at r4 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/util/retry/retry.go, line 122 at r4 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/util/retry/retry_test.go, line 118 at r4 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Removed.

pkg/util/retry/retry_test.go, line 119 at r4 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Because I'm checking the use of
|
Reviewed 16 of 16 files at r4.

pkg/storage/push_txn_queue_test.go, line 164 at r4 (raw file):
Maybe add an

pkg/storage/push_txn_queue_test.go, line 214 at r4 (raw file):
Include the returned value in the error message.

pkg/storage/push_txn_queue_test.go, line 220 at r4 (raw file):
Maybe
f61f143 to
15a57f8
|
Reviewed 2 of 16 files at r4, 6 of 6 files at r5.

pkg/roachpb/api.proto, line 520 at r3 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
heh, not quite the nuance i was hoping for.

pkg/storage/push_txn_queue.go, line 41 at r4 (raw file):
shouldn't this check for TOUCH rather than checking for not ABORT and not TIMESTAMP?

pkg/storage/push_txn_queue.go, line 55 at r4 (raw file):
is this nil check needed? isn't that undefined behaviour for this function?

pkg/storage/push_txn_queue.go, line 77 at r4 (raw file):
what does nil mean?

pkg/storage/push_txn_queue.go, line 87 at r4 (raw file):
(optional) s/, atomically updated//

pkg/storage/push_txn_queue.go, line 88 at r4 (raw file):
(optional) i'd remove this comment, it's exactly repeating the code

pkg/storage/push_txn_queue.go, line 91 at r4 (raw file):
"map of queues" seems wrong? i'm not sure what this comment is saying.

pkg/storage/push_txn_queue.go, line 99 at r4 (raw file):
(optional) consider removing this comment

pkg/storage/push_txn_queue.go, line 128 at r4 (raw file):
you can remove this special case without changing what this function does.

pkg/storage/push_txn_queue.go, line 252 at r4 (raw file):
this checks

pkg/storage/push_txn_queue.go, line 269 at r4 (raw file):
nit: its

pkg/storage/push_txn_queue.go, line 275 at r4 (raw file):
does this need to be accessed under lock? my understanding is that the cfg struct is not modified

pkg/storage/push_txn_queue.go, line 285 at r4 (raw file):
this appears to very nearly duplicate

pkg/storage/push_txn_queue.go, line 287 at r4 (raw file):
does this need to be a util.Timer thing? might be leaking some timers here.

pkg/storage/push_txn_queue.go, line 401 at r4 (raw file):
this comment should be one line lower

pkg/storage/push_txn_queue_test.go, line 140 at r4 (raw file):
consider returning an error (it is otherwise difficult to locate failures).

pkg/storage/push_txn_queue_test.go, line 148 at r4 (raw file):
ditto (don't pass

pkg/storage/push_txn_queue_test.go, line 182 at r4 (raw file):
optional golf (applies throughout): and below

pkg/storage/push_txn_queue_test.go, line 187 at r4 (raw file):
consider closing after sending (throughout)

pkg/storage/push_txn_queue_test.go, line 273 at r4 (raw file):
nit: context.Canceled.Error()

pkg/storage/push_txn_queue_test.go, line 401 at r4 (raw file):
prefer to avoid starting clocks at zero, we've had some fragile tests as a result in the past. I've been using

pkg/storage/push_txn_queue_test.go, line 565 at r4 (raw file):
does this context need to be cancellable?

pkg/storage/replica_command.go, line 1367 at r4 (raw file):
this now duplicates the logic in the function

pkg/util/retry/retry_test.go, line 110 at r4 (raw file):
seems like you could empty out this entire structure, all these values are red herrings

pkg/util/retry/retry_test.go, line 119 at r4 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
Ah, I misunderstood this test. Thanks.
f26830b to
dd14808
|
Reviewed 25 of 25 files at r7.

pkg/storage/push_txn_queue.go, line 91 at r4 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
I don't understand your reply; I wasn't even thinking about that implementation detail, I'm just confused about the double use of "queue" in the first line of this comment - it's both in the name of this structure (which doesn't appear to be a queue itself) and in the description of its members (which are not true queues either). Am I crazy? Does this not seem confusing to you?

pkg/storage/push_txn_queue.go, line 99 at r4 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
not done

pkg/storage/push_txn_queue.go, line 294 at r7 (raw file):
instead of this cast, you can write ...or whatever your preferred wrapped style.
|
Review status: all files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.

pkg/storage/push_txn_queue.go, line 91 at r4 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
This struct enqueues

pkg/storage/push_txn_queue.go, line 99 at r4 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
Done.

pkg/storage/push_txn_queue.go, line 294 at r7 (raw file):
Previously, tamird (Tamir Duberstein) wrote…
I'm not a fan of this substitution.
dd14808 to
a71065b
|
Reviewed 2 of 2 files at r8.
|
Looks like this PR is ready to merge. @bdarnell, will it be safe to merge this tomorrow, and if so, at what time? |
|
We can merge it now - the commit for the beta has already been chosen. But we shouldn't deploy it to clusters other than cyan and cobalt until the beta is released. |
|
@spencerkimball sorry, you'll need to rebase because of the Go 1.8 bump (and regenerate the protos with 1.8). |
This commit replaces the abort-immediately-if-higher-priority txn semantics with a more traditional locking approach. This yields significantly higher performance on high-contention workloads.

Locking is done by disallowing intents to be pushed with two exceptions: 1) the pusher has a maximum txn priority (always pushes), or 2) the pushee has a minimum txn priority (can always be pushed).

When a `PushTxn` request fails because the intent's txn is still live, the pushee txn is added as a blocking txn to a new per-replica `pushTxnQueue`. This queue maintains a list of waiting pushers per blocking txn. Push failures are immediately retried and in the event that the `PushTxn` request's `PusheeTxn` is in the queue, the pusher request is enqueued behind it, waiting for it to be resolved.

While waiting on a txn to be resolved, a pusher periodically (with an exponentially increasing backoff/retry) updates its own status. This informs it whether it has had its priority ratcheted or been concurrently aborted or committed. It also gathers information on which txns are in turn waiting on the pusher in order to build a transitive closure of txn dependencies. This is used to determine whether there is a dependency cycle which would mean a deadlock. In the event of a deadlock, txns with lower priorities are aborted.

Removed support for `PushType=PUSH_QUERY` from the `PushTxn` request and added a new `QueryTxn` read-only command in its place which returns both the transaction record as well as the list of txn IDs for txns which are waiting on the queried txn.

Note that this PR will require a stop-the-world migration.

Fixes #5249
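As a rough illustration of the two push exceptions, the rule can be sketched in Go as below. This is an assumption-laden sketch, not the code in `replica_command.go`: `canPush`, `minTxnPriority`, and `maxTxnPriority` are hypothetical names, and the bounds 0 and `math.MaxInt32` are taken from the review comment about {Min,Max}UserPriority. It also encodes the clarification from the review thread that a min-priority txn cannot push another min-priority txn and a max-priority txn cannot push another max-priority txn.

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical priority bounds, assumed for illustration only.
const (
	minTxnPriority int32 = 0
	maxTxnPriority int32 = math.MaxInt32
)

// canPush reports whether a pusher may push the intent of a still-live
// pushee. Everything else has to wait in the queue.
func canPush(pusherPri, pusheePri int32) bool {
	switch {
	case pusherPri == maxTxnPriority && pusheePri != maxTxnPriority:
		return true // a max-priority pusher always pushes, except against max
	case pusheePri == minTxnPriority && pusherPri != minTxnPriority:
		return true // a min-priority pushee can always be pushed, except by min
	default:
		return false
	}
}

func main() {
	const normal int32 = 1000 // arbitrary mid-range priority
	fmt.Println("max vs normal:", canPush(maxTxnPriority, normal))
	fmt.Println("normal vs min:", canPush(normal, minTxnPriority))
	fmt.Println("normal vs normal:", canPush(normal, normal))
	fmt.Println("max vs max:", canPush(maxTxnPriority, maxTxnPriority))
	fmt.Println("min vs min:", canPush(minTxnPriority, minTxnPriority))
}
```

Under these assumptions only the first two calls succeed; the remaining pairs are the cases that end up waiting on each other.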
a71065b to
1222d5d
A bug was introduced by cockroachdb#13501. Need to track that down.
storage: minor cleanups from #13501
This was deprecated in cockroachdb#13501, almost 2 years ago. Release note: None
This commit replaces the abort-immediately-if-higher-priority txn semantics
with a more traditional locking approach. This yields significantly higher
performance on high-contention workloads.
Locking is done by disallowing intents to be pushed with two exceptions:
1) the pusher has a maximum txn priority (always pushes), or 2) the
pushee has a minimum txn priority (can always be pushed).
When a `PushTxn` request fails because the intent's txn is still live,
the pushee txn is added as a blocking txn to a new per-replica
`pushTxnQueue`. This queue maintains a list of waiting pushers per
blocking txn. Push failures are immediately retried and in the event
that the `PushTxn` request's `PusheeTxn` is in the queue, the pusher
request is enqueued behind it, waiting for it to be resolved.
While waiting on a txn to be resolved, a pusher periodically (with an
exponentially increasing backoff/retry) updates its own status. This
informs it whether it has had its priority ratcheted or been concurrently
aborted or committed. It also gathers information on which txns are in
turn waiting on the pusher in order to build a transitive closure of
txn dependencies. This is used to determine whether there is a
dependency cycle which would mean a deadlock. In the event of a deadlock,
txns with lower priorities are aborted.
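The dependency-cycle check described above can be sketched as a walk over a wait-for graph. The Go code below is an illustrative sketch only: `txn`, `waitsFor`, `findCycle`, and `pickVictim` are hypothetical names, the graph is assumed to be fully known up front (the PR gathers it incrementally while pushers poll their own status), and aborting exactly one lowest-priority txn is an assumption about how "txns with lower priorities are aborted" might be applied. It does show that a walk of this kind handles cycles longer than two, such as A -> B -> C -> D -> A.

```go
package main

import "fmt"

// txn is a hypothetical, pared-down transaction record.
type txn struct {
	id       string
	priority int32
}

// waitsFor maps a pusher's txn ID to the IDs of the txns it is blocked on.
type waitsFor map[string][]string

// findCycle returns the txn IDs on a dependency cycle through start, in
// wait order, or nil if start is not part of any cycle.
func findCycle(g waitsFor, start string) []string {
	visited := map[string]bool{start: true}
	var path []string
	var dfs func(id string) bool
	dfs = func(id string) bool {
		path = append(path, id)
		for _, next := range g[id] {
			if next == start {
				return true // walked back to where we began: deadlock
			}
			if !visited[next] {
				visited[next] = true
				if dfs(next) {
					return true
				}
			}
		}
		path = path[:len(path)-1] // dead end, backtrack
		return false
	}
	if dfs(start) {
		return path
	}
	return nil
}

// pickVictim returns the lowest-priority txn on the cycle; ties go to the
// txn that appears first.
func pickVictim(cycle []string, txns map[string]txn) string {
	victim := cycle[0]
	for _, id := range cycle[1:] {
		if txns[id].priority < txns[victim].priority {
			victim = id
		}
	}
	return victim
}

func main() {
	g := waitsFor{"A": {"B"}, "B": {"C"}, "C": {"D"}, "D": {"A"}}
	txns := map[string]txn{
		"A": {id: "A", priority: 4},
		"B": {id: "B", priority: 2},
		"C": {id: "C", priority: 7},
		"D": {id: "D", priority: 5},
	}
	cycle := findCycle(g, "A")
	fmt.Println("cycle:", cycle)
	fmt.Println("abort:", pickVictim(cycle, txns))
}
```

Because the graph may be stale by the time a cycle is found, a check like this can occasionally abort a txn that was no longer deadlocked, which matches the "just another retry occasion" trade-off discussed earlier in the conversation.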