Skip to content

Add indexes for quarantined media lookups#19312

Closed
turt2live wants to merge 2 commits into
developfrom
travis/create-quarantined-index
Closed

Add indexes for quarantined media lookups#19312
turt2live wants to merge 2 commits into
developfrom
travis/create-quarantined-index

Conversation

@turt2live
Copy link
Copy Markdown
Member

@turt2live turt2live commented Dec 16, 2025

This is intended to overlap with #19308

Lack of index was introduced in #19268

We require an index on these lookups because early EXPLAIN results on matrix.org indicate that it'd likely destroy performance if we try to get the first page of results. The cost is far too high (>50k minimum).

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

Comment on lines +15 to +16
-- we should start at ordering 9305, but there's higher conflict if we steal 9306's ordering, so
-- we'll steal 9304's ordering instead. 9304 is where the applicable columns were added.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the 9306 conflict?

Perhaps we should just separate the insert into background_updates so they each have their own 93/05_xxx.sql/93/06_xxx.sql files.


-- Note: We *probably* should have an index on quarantined_ts, but we're going
-- to try to defer that to a future migration after seeing the performance impact.
-- Note: the index is added in 05_add_quarantined_by_index.sql
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-- Note: the index is added in 05_add_quarantined_by_index.sql
-- Note: the index is added in 93/05_add_quarantined_by_index.sql

columns=[
# We include columns in both the WHERE and ORDER BY clauses to make
# the resulting query a bit more efficient.
"quarantined_by",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the where_clause is sufficient for quarantined_by

Comment on lines +186 to +187
# We include columns in both the WHERE and ORDER BY clauses to make
# the resulting query a bit more efficient.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own reference, any good sources? Is this just from looking at the query planner?

This could also use more details. Something like:

Suggested change
# We include columns in both the WHERE and ORDER BY clauses to make
# the resulting query a bit more efficient.
# We include columns in both the WHERE and ORDER BY clauses to make the
# resulting query a bit more efficient (can allow the database to use a
# single index that covers both the filtering and the sorting).

Comment on lines +186 to +187
# We include columns in both the WHERE and ORDER BY clauses to make
# the resulting query a bit more efficient.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this detail be moved next to the query itself?

# the resulting query a bit more efficient.
"quarantined_by",
"quarantined_ts",
"media_id",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear if media_id is really beneficial here. I assume quarantined_ts gets us most of the way there and Postgres can easily figure out the tie-break without the index on this column.

if local:
sql = "SELECT '' as media_origin, media_id FROM local_media_repository WHERE quarantined_by IS NOT NULL ORDER BY quarantined_ts, media_id ASC LIMIT ? OFFSET ?"
else:
sql = "SELECT media_origin, media_id FROM remote_media_cache WHERE quarantined_by IS NOT NULL ORDER BY quarantined_ts, media_origin, media_id ASC LIMIT ? OFFSET ?"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we sort by media_origin? Is media_id not sufficient enough as a tie-break?

@turt2live
Copy link
Copy Markdown
Member Author

closing in light of #19351

@turt2live turt2live closed this Jan 6, 2026
turt2live added a commit that referenced this pull request Apr 9, 2026
Fixes #19352

(See issue for history of this feature and previous PRs)

> First, a [naive
implementation](#19268) of the
endpoint was introduced, but it quickly ran into [performance issues on
query](#19312) and [long
startup times](#19349),
leading to its
[removal](#19351). It also
didn't actually work, and would fail to expose media when it was
"unquarantined", so a [partial
fix](#19308) was attempted,
where the suggested direction is to use a
[stream](https://element-hq.github.io/synapse/latest/development/synapse_architecture/streams.html#cheatsheet-for-creating-a-new-stream)
instead of a timestamp column.

This PR re-introduces the API building on the previous feedback:
* Adds a stream which tracks when media becomes (un)quarantined.
* Runs a background update to capture already-quarantined media.
* Adds a new admin API to return rows from the stream table.

We track both quarantine and unquarantine actions in the stream to allow
downstream consumers to process the records appropriately. Namely, to
allow our Synapse exchange in HMA to remove hashes for unquarantined
media (use case further explained in the
[issue](#19352)).

**Note**: This knowingly does not capture all cases of media being
quarantined. Other call sites are lower priority for T&S, and can be
addressed in a future PR. ~~An issue will be created after this PR is
merged to track those sites.~~
#19672

### Pull Request Checklist

<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))

---------

Co-authored-by: turt2live <1190097+turt2live@users.noreply.github.com>
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
Co-authored-by: Eric Eastwood <erice@element.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants