Avoid checking the event cache when backfilling events#14164
Conversation
reivilibre
left a comment
There was a problem hiding this comment.
Seems reasonable; agreed that getting them in one call to the DB should be more efficient.
| if event_id in existing_events_map: | ||
| existing_event = existing_events_map[event_id] | ||
|
|
||
| # ...and the event itself was not previously stored as an outlier... | ||
| if not existing_event.event.internal_metadata.is_outlier(): | ||
| # ...then there's no need to persist it. We have it already. | ||
| logger.info( | ||
| "_process_pulled_event: Ignoring received event %s which we " | ||
| "have already seen", | ||
| event.event_id, | ||
| ) | ||
|
|
||
| # While we have seen this event before, it was stored as an outlier. | ||
| # We'll now persist it as a non-outlier. | ||
| logger.info("De-outliering event %s", event_id) |
There was a problem hiding this comment.
This control flow is only doing logging! I think you wanted a continue in the inner if block, right? :-)
There was a problem hiding this comment.
(Maybe this indicates that a test is needed...)
There was a problem hiding this comment.
Serves me right for not manually testing again before putting this up!
I'll write up a unit test for this function, but otherwise I find this behaviour a little hard to test...
There was a problem hiding this comment.
A complement test seems like it would fit #14161 (comment) well
If you're looking for a test that works with _process_pulled_events, I wrote one in #13864, see (still a bit messy in scratch state)
synapse/tests/handlers/test_federation_event.py
Lines 872 to 1320 in 3f8fef2
There was a problem hiding this comment.
The problem with a Complement test is that we'd need to call Synapse's Delete Room Admin API, which would only be specific to Synapse. I could write this test, but I wouldn't be able to publish it anywhere permanent, and the steps in it would be no different than manually testing.
Manually testing again, with the continue backfill doesn't drop any events. Thanks for catching that :)
There was a problem hiding this comment.
I cracked and settled for writing a unit test for backfill, while including a bit that specifically checks that backfill won't rely on the event cache to determine whether to persist an event. The result is a test for the backfill + the specific behaviour that PR removes.
synapse/tests/handlers/test_federation.py
Lines 332 to 426 in 42f1410
reivilibre
left a comment
There was a problem hiding this comment.
OK with the change suggested
| # Check if we already any of these have these events. | ||
| # Note: we currently make a lookup in the database directly here rather than | ||
| # checking the event cache, due to: | ||
| # https://github.com/matrix-org/synapse/issues/13476 |
There was a problem hiding this comment.
We could add an additional comment about potentially adding a cached check back when "something like #13916 comes along and correctly invalidates the event cache."
There was a problem hiding this comment.
I went ahead and added a comment to the referenced bug instead, which should serve the same purpose and tie things together a bit better on GitHub.
And explicitly ensure that it ignores get_event_cache for now.
|
I successfully tested the following on my homeserver on develop (2c63cdc) + #14161 + #14164 (this PR):
Worked as expected 🎉 |
Most of the context is in #14161.
This PR augments another optimisation. In #14161, when we purged and rejoined a room, we found that while the state was now persisted correctly, non-state events were not. These events are not initially received when you rejoin a room, but must be backfilled from the remote homeserver after you join the room.
Backfilling involves another area where we refuse to persist events if they already exist in the event cache:
synapse/synapse/handlers/federation_event.py
Lines 849 to 859 in 7b7478e
It has one more conditional over the
_have_seen_events_dictcase though. If we found an event in the cache, and it was an outlier, we'll still persist it. Of course, most events aren't outliers, including messages. Hence why we see messages get lost when rejoining a room after purging it in #14161 (comment),This PR changes this optimisation such that it skips the event cache entirely, and instead just checks the database directly. Again, as with #14161, there's a potential performance degradation here, but it prevents breakage until something like #13916 comes along and correctly invalidates the event cache.
The main change here is skipping the event cache during backfill when checking to see if we already have an event before persisting it.
Additionally, there's a potential speedup in this PR as well. Before, we were checking one event ID at a time against the cache (or DB in case of a cache miss) in
_process_pulled_event.This PR moves this checking code outside of
_process_pulled_eventand excludes events we've already seen before sorting them and passing them one-by-one to_process_pulled_event.In the case of lots of cache misses, this code should theoretically be faster...