Add indexes for quarantined media lookups#19312
Conversation
| -- we should start at ordering 9305, but there's higher conflict if we steal 9306's ordering, so | ||
| -- we'll steal 9304's ordering instead. 9304 is where the applicable columns were added. |
There was a problem hiding this comment.
Where is the 9306 conflict?
Perhaps we should just separate the insert into background_updates so they each have their own 93/05_xxx.sql/93/06_xxx.sql files.
|
|
||
| -- Note: We *probably* should have an index on quarantined_ts, but we're going | ||
| -- to try to defer that to a future migration after seeing the performance impact. | ||
| -- Note: the index is added in 05_add_quarantined_by_index.sql |
There was a problem hiding this comment.
| -- Note: the index is added in 05_add_quarantined_by_index.sql | |
| -- Note: the index is added in 93/05_add_quarantined_by_index.sql |
| columns=[ | ||
| # We include columns in both the WHERE and ORDER BY clauses to make | ||
| # the resulting query a bit more efficient. | ||
| "quarantined_by", |
There was a problem hiding this comment.
I think the where_clause is sufficient for quarantined_by
| # We include columns in both the WHERE and ORDER BY clauses to make | ||
| # the resulting query a bit more efficient. |
There was a problem hiding this comment.
For my own reference, any good sources? Is this just from looking at the query planner?
This could also use more details. Something like:
| # We include columns in both the WHERE and ORDER BY clauses to make | |
| # the resulting query a bit more efficient. | |
| # We include columns in both the WHERE and ORDER BY clauses to make the | |
| # resulting query a bit more efficient (can allow the database to use a | |
| # single index that covers both the filtering and the sorting). |
| # We include columns in both the WHERE and ORDER BY clauses to make | ||
| # the resulting query a bit more efficient. |
There was a problem hiding this comment.
Should this detail be moved next to the query itself?
| # the resulting query a bit more efficient. | ||
| "quarantined_by", | ||
| "quarantined_ts", | ||
| "media_id", |
There was a problem hiding this comment.
Unclear if media_id is really beneficial here. I assume quarantined_ts gets us most of the way there and Postgres can easily figure out the tie-break without the index on this column.
| if local: | ||
| sql = "SELECT '' as media_origin, media_id FROM local_media_repository WHERE quarantined_by IS NOT NULL ORDER BY quarantined_ts, media_id ASC LIMIT ? OFFSET ?" | ||
| else: | ||
| sql = "SELECT media_origin, media_id FROM remote_media_cache WHERE quarantined_by IS NOT NULL ORDER BY quarantined_ts, media_origin, media_id ASC LIMIT ? OFFSET ?" |
There was a problem hiding this comment.
Why do we sort by media_origin? Is media_id not sufficient enough as a tie-break?
|
closing in light of #19351 |
Fixes #19352 (See issue for history of this feature and previous PRs) > First, a [naive implementation](#19268) of the endpoint was introduced, but it quickly ran into [performance issues on query](#19312) and [long startup times](#19349), leading to its [removal](#19351). It also didn't actually work, and would fail to expose media when it was "unquarantined", so a [partial fix](#19308) was attempted, where the suggested direction is to use a [stream](https://element-hq.github.io/synapse/latest/development/synapse_architecture/streams.html#cheatsheet-for-creating-a-new-stream) instead of a timestamp column. This PR re-introduces the API building on the previous feedback: * Adds a stream which tracks when media becomes (un)quarantined. * Runs a background update to capture already-quarantined media. * Adds a new admin API to return rows from the stream table. We track both quarantine and unquarantine actions in the stream to allow downstream consumers to process the records appropriately. Namely, to allow our Synapse exchange in HMA to remove hashes for unquarantined media (use case further explained in the [issue](#19352)). **Note**: This knowingly does not capture all cases of media being quarantined. Other call sites are lower priority for T&S, and can be addressed in a future PR. ~~An issue will be created after this PR is merged to track those sites.~~ #19672 ### Pull Request Checklist <!-- Please read https://element-hq.github.io/synapse/latest/development/contributing_guide.html before submitting your pull request --> * [x] Pull request is based on the develop branch * [x] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should: - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.". - Use markdown where necessary, mostly for `code blocks`. - End with either a period (.) or an exclamation mark (!). - Start with a capital letter. - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry. * [x] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters)) --------- Co-authored-by: turt2live <1190097+turt2live@users.noreply.github.com> Co-authored-by: Eric Eastwood <madlittlemods@gmail.com> Co-authored-by: Eric Eastwood <erice@element.io>
This is intended to overlap with #19308
Lack of index was introduced in #19268
We require an index on these lookups because early
EXPLAINresults on matrix.org indicate that it'd likely destroy performance if we try to get the first page of results. The cost is far too high (>50k minimum).Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.