Catch-up after Federation Outage #8096
Conversation
Confused me for a second there!
synapse/storage/databases/main/schema/delta/58/11recovery_after_outage.sql
2 fails, but I still believe it to be in a reviewable state; I'll sort the 2 failures on Monday, they can't be too bad (he said, knowing nothing at all).

I note that test https://buildkite.com/matrix-dot-org/synapse/builds/11053#98a85fe9-ce0d-4a7b-83dd-09812911bff7 (Python: 3.7 / …) is failing. For now I'll open this to review.

It was noted in #synapse-dev that this is a known flaky test.
anoadragon453 left a comment
I'd still like to see a cursory review of this from either @richvdh or @erikjohnston as they were involved in the initial design phase, but on the whole looks like an excellent stab at it.
Comments are particularly good 👍
```diff
-        self._federation_shard_config = hs.config.federation.federation_shard_config
+        self._federation_shard_config = hs.config.worker.federation_shard_config

         self._should_send_on_this_instance = True
```
What are those arguments?
D'ow, that'll teach me for not writing my full thoughts whilst I still remember what I was on about.
So, vaguely I may have had two wavelengths of thought here:
- We're dropping 50 PDUs, isn't that bad?! Won't that mean we will forget to catch-up some rooms potentially?
If I recall, I appear to have been under the delusion that those PDUs would be lost altogether.
However, now I realise that this will enable catch-up, and those PDUs will be eligible for catch-up if they are the most recent PDUs in their respective rooms (a destination_rooms entry will exist and we won't update our last_successful_stream_ordering on a failure…) — so the remote will still get a chance to hear about them when they next come up. (If they aren't, then the more recent PDUs will be in the main queue so the remote will hear eventually!)
So: all OK here.
- Is there any point in dropping 50 but not just dropping the whole queue?
- Erik mentioned that sometimes, moving on to a different transaction can help a remote recover (i.e. just one particular one causes it to blow up), so 'skipping' is not necessarily bad.
- I wonder if waiting for retries to exceed an hour before sweeping out the queue could be bad for memory usage (of course, no worse than present-day, but… is this a chance to make it better?)
- On the other hand, I don't suppose that /get_missing_events is the most efficient/high-throughput way to get events so forcing catch-up onto destinations too aggressively may be harmful.
Broadly I am tempted to think this is currently fine, but may need tweaking/limiting/aggressifying in the future in case that really does grow too large. I think Erik or Rich will have a more trustworthy sense here, though.
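To make the "those PDUs will be eligible for catch-up" reasoning above concrete, here is a minimal sketch (using an in-memory SQLite database, not Synapse's actual storage layer) of how the two tables discussed in this PR identify the rooms a destination must be caught up on: any room whose latest enqueued `stream_ordering` in `destination_rooms` is newer than the destination's `last_successful_stream_ordering`. The table and column names follow this PR's schema; the query itself is illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE destinations (
        destination TEXT PRIMARY KEY,
        last_successful_stream_ordering INTEGER
    );
    CREATE TABLE destination_rooms (
        destination TEXT NOT NULL,
        room_id TEXT NOT NULL,
        stream_ordering INTEGER NOT NULL,
        PRIMARY KEY (destination, room_id)
    );
    """
)
# remote.example last successfully received stream_ordering 100 ...
db.execute("INSERT INTO destinations VALUES ('remote.example', 100)")
# ... but two rooms have newer PDUs recorded for it; one is already delivered.
db.executemany(
    "INSERT INTO destination_rooms VALUES (?, ?, ?)",
    [
        ("remote.example", "!a:example.org", 150),  # needs catch-up
        ("remote.example", "!b:example.org", 90),   # already delivered
        ("remote.example", "!c:example.org", 120),  # needs catch-up
    ],
)

# Rooms whose latest enqueued PDU is newer than the last success: even if
# intermediate PDUs were dropped from the in-memory queue, these rows ensure
# the destination still hears about the rooms on its next catch-up pass.
rows = db.execute(
    """
    SELECT dr.room_id, dr.stream_ordering
    FROM destination_rooms AS dr
    JOIN destinations AS d ON d.destination = dr.destination
    WHERE dr.stream_ordering > d.last_successful_stream_ordering
    ORDER BY dr.stream_ordering
    """
).fetchall()
print(rows)
```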
I think it's fine for now, though once we have more confidence in the "catching up" behaviour, I think we may as well drop the whole queue immediately.
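A rough sketch of the "drop the whole queue" variant being proposed here (hypothetical helper, not Synapse's real `PerDestinationQueue`): once pending PDUs for a destination exceed a cap, discard them all and flag the destination as catching up, since `destination_rooms` still records the newest PDU per room and nothing is lost for good.

```python
from collections import deque

MAX_PENDING_PDUS = 50  # the cap discussed in the thread


class DestinationQueue:
    """Toy per-destination queue illustrating the drop-and-catch-up idea."""

    def __init__(self) -> None:
        self.pending: deque = deque()
        self.catching_up = False

    def enqueue(self, pdu) -> None:
        self.pending.append(pdu)
        if len(self.pending) > MAX_PENDING_PDUS:
            # Drop everything and rely on catch-up: the database still knows
            # the latest PDU per room, so memory stays bounded without
            # permanently losing anything.
            self.pending.clear()
            self.catching_up = True


q = DestinationQueue()
for i in range(MAX_PENDING_PDUS + 1):
    q.enqueue(i)
print(len(q.pending), q.catching_up)
```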
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
```sql
-- of the latest event to be enqueued for transmission to that destination.
CREATE TABLE IF NOT EXISTS destination_rooms (
    -- the destination in question
    destination TEXT NOT NULL,
```
I'd kinda like to see the logic shuffled so that this can be a foreign key; however at the very least it needs a comment saying why it can't be one.
Sure, would it suffice to INSERT IGNORE a NULL row into the destinations table when upserting a destination_rooms row?
probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a PerDestinationQueue for the destination), but that might be fiddly, and is something we can do later.
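A minimal sketch of the idea being agreed on above: before upserting a `destination_rooms` row, insert an (otherwise empty) stub row into `destinations` so that a foreign key from `destination_rooms` to `destinations` can hold. This is a pure-SQLite illustration with a hypothetical helper name, not the actual Synapse storage code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript(
    """
    CREATE TABLE destinations (
        destination TEXT PRIMARY KEY,
        last_successful_stream_ordering INTEGER
    );
    CREATE TABLE destination_rooms (
        destination TEXT NOT NULL REFERENCES destinations (destination),
        room_id TEXT NOT NULL,
        stream_ordering INTEGER NOT NULL,
        PRIMARY KEY (destination, room_id)
    );
    """
)


def note_destination_room(destination: str, room_id: str, stream_ordering: int) -> None:
    # Stub row first (a no-op if the destination is already known) ...
    db.execute(
        "INSERT OR IGNORE INTO destinations (destination) VALUES (?)",
        (destination,),
    )
    # ... then the upsert this comment thread is about.
    db.execute(
        """
        INSERT INTO destination_rooms (destination, room_id, stream_ordering)
        VALUES (?, ?, ?)
        ON CONFLICT (destination, room_id)
        DO UPDATE SET stream_ordering = excluded.stream_ordering
        """,
        (destination, room_id, stream_ordering),
    )


note_destination_room("remote.example", "!room:example.org", 42)
note_destination_room("remote.example", "!room:example.org", 43)  # FK still satisfied
row = db.execute("SELECT stream_ordering FROM destination_rooms").fetchone()
print(row)
```

Doing the stub insert once per `PerDestinationQueue` creation, as suggested, would avoid the extra statement on every PDU.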
```python
    self._catching_up = False
    return

if self._catch_up_max_stream_order is None:
```
question: why do we need a high-water-mark at all? why not just keep going until get_catch_up_room_event_ids returns an empty list?
(doing so without care will introduce races though...)
I suppose we could do that if you are keen — as you say though, needs more thought.
The advantage of this approach is that it's easy to keep the logic in my head. Was keen not to give a chance for any nasty bugs to crawl in, because it'd be nice to have confidence in this.
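The high-water-mark approach defended here can be sketched as follows (illustrative names, not Synapse's real API): freeze the maximum stream ordering when catch-up starts, and only catch up events at or below that point; anything newer is left to the normal transmission loop, which sidesteps the races mentioned above.

```python
from typing import List, Optional


def catch_up(room_orderings: List[int], current_max_stream_order: int) -> List[int]:
    """Return the stream orderings that would be sent during catch-up."""
    catch_up_max_stream_order: Optional[int] = None
    sent: List[int] = []
    if catch_up_max_stream_order is None:
        # Freeze the high-water mark once, at the start of catch-up.
        catch_up_max_stream_order = current_max_stream_order
    for so in sorted(room_orderings):
        if so > catch_up_max_stream_order:
            # Newer events belong to the main transmission loop, not catch-up.
            break
        sent.append(so)
    return sent


# Events at 5 and 8 are caught up; 12 arrived after catch-up began.
print(catch_up([8, 12, 5], current_max_stream_order=10))
```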
```python
# zip them together with their stream orderings
catch_up_pdus = [
    (event, event.internal_metadata.stream_ordering) for event in events
]
```
as noted elsewhere: I think the order is redundant: it might be easier to get rid of it (in a separate PR).
I'll put it on my list, then :)
```python
if not catch_up_pdus:
    break
```
when does this happen, and why is breaking the loop the correct behaviour?
I suppose this should log an ERROR, since this shouldn't happen.
break isn't the correct behaviour really, since it disables catch-up when we supposedly should have some to do.
But only break will let us make progress and go to the main loop again.
So I will log an error either way, but is it best to break or return?
Is this being unnecessarily paranoid?
There will be a foreign key constraint to link our rows to events, so this should darn well be impossible.
Better to `assert events` or `if not events: raise AssertionError(...)`?
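A tiny sketch of the "log an error either way, then fail loudly" option being weighed here (hypothetical names, not the PR's actual code): given event IDs that the foreign key constraint should guarantee resolve to events, treat an empty result as an internal invariant violation rather than silently breaking the loop.

```python
import logging

logger = logging.getLogger(__name__)


def check_catch_up_pdus(event_ids, events):
    """Raise if event IDs from the DB failed to resolve to any events."""
    if event_ids and not events:
        # The foreign key to the events table should make this impossible,
        # so treat it as a bug rather than quietly disabling catch-up.
        logger.error(
            "get_catch_up_room_event_ids returned %r but no events were found",
            event_ids,
        )
        raise AssertionError("event IDs did not resolve to any events")
    return events


try:
    check_catch_up_pdus(["$ev1"], [])
except AssertionError as e:
    print("caught:", e)
```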
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
(hopefully with a worm as the return value)
@richvdh Thanks for the juicy review — some questions above on how to proceed.

Gah, broken.
```diff
     table: str,
     keyvalues: Dict[str, Any],
-    retcol: Iterable[str],
+    retcol: str,
```
please can you take these fixes to a separate PR?
```sql
-- of the latest event to be enqueued for transmission to that destination.
CREATE TABLE IF NOT EXISTS destination_rooms (
    -- the destination in question
    destination TEXT NOT NULL,
```
probably. It would be nice to avoid doing so on every PDU (for example by doing it when we first create a PerDestinationQueue for the destination), but that might be fiddly, and is something we can do later.
```sql
    stream_ordering INTEGER NOT NULL,
    PRIMARY KEY (destination, room_id),
    FOREIGN KEY (room_id) REFERENCES rooms (room_id)
        ON DELETE CASCADE,
```
ON DELETE CASCADE sounds scary. Why do we do that?
This is being superseded by smaller PRs.
There are a few points marked `XXX REVIEW` which I'd like help with.

This fixes #2528 by:
- using `/get_missing_events` to get other events in rooms
- … (this is what `wake_destination` is about)

Database notes:
- `destination_rooms` stores the reference to the latest PDU of each room that has been enqueued to be sent to a destination.
- `destinations` has sprouted `last_successful_stream_ordering`, which is the `stream_ordering` of the most recent successfully-sent PDU to the destination.

Limitations/Flaws: