Fix "list quarantined media" API to have semi-stable ordering#19308
turt2live wants to merge 5 commits into
https://github.com/element-hq/synapse/pull/19268/changes#r2614217300 wasn't applied and the migration had copy/paste artifacts. PR: #19268
`from` and `limit` are optional parameters, and default to the first page and `100` respectively. `from` is the `next_batch`
token returned by a previous request and `limit` is the number of rows to return. Note that `next_batch` is not intended
to survive longer than about a minute and may produce inconsistent results if used after that time. Neither `from` or
`limit` is a timestamp, though `from` does encode a timestamp.

If you require a long-lived `from` token, split `next_batch` on `-` and combine the first part with a `0`, separated by
a `-` again. For example: `1234-5678` becomes `1234-0`. Your application will need to deduplicate `media` rows it has
already seen if using this method.
I do accept that this is custom, weird, and not-quite how pagination works. My defence is I have a custom, weird, not-quite-paginating use case 😇
(though, if someone more qualified wants to make this use PaginationHandler instead, I'd happily close this PR in favour of that one)
elif len(start) > 0:
    start_index = int(start)
We probably don't need this backwards compatibility given the PR which introduced the endpoint has only been on develop for a few days as of writing. I've kept it anyway because I honestly just don't want to fight the unit tests too much. I can be convinced to enter battle, however.
We should get rid of the tech debt
c7ec79d to 1128c59
1128c59 to 516b740
If you require a long-lived `from` token, split `next_batch` on `-` and combine the first part with a `0`, separated by
a `-` again. For example: `1234-5678` becomes `1234-0`. Your application will need to deduplicate `media` rows it has
already seen if using this method.
huh, why are we suggesting this? Why is this a use case?
The use case is part of the linked PRs/projects: https://github.com/matrix-org/hma-matrix/blob/4f0b9676beb7b5d72b2d55ae7034609593f5fba3/matrix_exchanges/synapse_quarantined.py requires a time-relative token to work from, so it creates one.
The use case should be explained in the PR description (and also probably in #19268)
token returned by a previous request and `limit` is the number of rows to return. Note that `next_batch` is not intended
to survive longer than about a minute and may produce inconsistent results if used after that time. Neither `from` or
Where is this limitation coming from?
we don't want people to be storing them forever (or if they are, they're using the ts-0 trick) because the token will become invalid upon media being (un)quarantined. On some servers this may be often, but others it could be close to never. We strike a balance here and choose "about a minute" to indicate the estimated volatility.
Alternatively, we could try to express the detail of volatility, but that felt a bit too technical at the time.
to survive longer than about a minute and may produce inconsistent results if used after that time. Neither `from` or
`limit` is a timestamp, though `from` does encode a timestamp.
Superfluous details. The end user can treat them as opaque.
- to survive longer than about a minute and may produce inconsistent results if used after that time. Neither `from` or
- `limit` is a timestamp, though `from` does encode a timestamp.
+ to survive longer than about a minute and may produce inconsistent results if used after that time.
This sort of detail was requested in the prior PR: #19268 (comment)
From the linked thread, it seems like the suggestion was also to treat them as opaque and then magically settled for some other state and I don't even see the change mentioned.
# Batch tokens are structured as `timestamp-index`, where `index` is relative
# to the timestamp. This is done to support pages having many records with
# the same timestamp (like existing servers having a ton of `ts=0` records).
This is different (partial) from the reason in the PR description
The reason from the PR description explains that we're doing all of this because media can be unquarantined and we want to make sure to get a semi-stable order.
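To make the `timestamp-index` token shape from the code comment above concrete, here is a minimal sketch of parsing and formatting such a token. The names `BatchToken`, `parse_batch_token`, and `format_batch_token` are hypothetical and not taken from the PR, and treating a bare integer as an index relative to timestamp `0` is an assumption about how the backwards-compatible branch behaves.

```python
from typing import NamedTuple


class BatchToken(NamedTuple):
    """Hypothetical in-memory form of a `timestamp-index` token."""

    timestamp: int  # lower bound applied to quarantined_ts
    index: int  # offset among the rows at or after that timestamp


def parse_batch_token(start: str) -> BatchToken:
    """Parse a `from` value into a (timestamp, index) pair."""
    if "-" in start:
        ts_part, index_part = start.split("-", 1)
        return BatchToken(int(ts_part), int(index_part))
    if len(start) > 0:
        # Assumed legacy form: a bare integer offset, relative to ts=0.
        return BatchToken(0, int(start))
    # No token supplied: start from the first page.
    return BatchToken(0, 0)


def format_batch_token(token: BatchToken) -> str:
    """Serialise a token back into the `timestamp-index` string form."""
    return f"{token.timestamp}-{token.index}"
```

Under these assumptions, `parse_batch_token("1234-5678")` yields a timestamp of `1234` and an index of `5678`, while an empty string maps to the first page.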
  # known) to ensure the ordering is stable for established servers.
  if local:
-     sql = "SELECT '' as media_origin, media_id FROM local_media_repository WHERE quarantined_by IS NOT NULL ORDER BY quarantined_ts, media_id ASC LIMIT ? OFFSET ?"
+     sql = "SELECT '' as media_origin, media_id, quarantined_ts FROM local_media_repository WHERE quarantined_by IS NOT NULL AND quarantined_ts >= ? ORDER BY quarantined_ts, media_id ASC LIMIT ? OFFSET ?"
Instead of relying on quarantined_ts and index for pagination (which is still flawed), we could add a new column quarantined_stream_id that is filled in when something is quarantined.
Docs: docs/development/synapse_architecture/streams.md#cheatsheet-for-creating-a-new-stream
A stream feels like way more overhead than we need for this. Is there something lighter weight we can use instead? (like just a simple ID generator?)
I think a stream is the Synapse way to solve this. We only need to do the first few steps from the docs since this isn't something that is going to be part of /sync or the StreamToken.
The sync steps are the very last bit, at which point there's already a ton of what feels like excess infrastructure.
The sync stuff is not necessary (mentioned above). This isn't something that's going to be used in /sync.
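For readers weighing the two approaches in this thread, the following is a rough sketch of why a monotonically increasing ID gives a stable cursor where `LIMIT ... OFFSET` does not. It uses a plain in-memory SQLite table, not Synapse's stream infrastructure; the table `quarantine_changes` and the functions `record_change` and `list_changes` are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical append-only log: one row per (un)quarantine action.
conn.execute(
    "CREATE TABLE quarantine_changes ("
    " stream_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " media_id TEXT NOT NULL,"
    " quarantined INTEGER NOT NULL)"
)


def record_change(media_id: str, quarantined: bool) -> None:
    """Append a change; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO quarantine_changes (media_id, quarantined) VALUES (?, ?)",
        (media_id, int(quarantined)),
    )


def list_changes(after: int, limit: int) -> tuple[list[tuple[int, str, int]], int]:
    """Return up to `limit` rows with stream_id > `after`, plus the next cursor."""
    rows = conn.execute(
        "SELECT stream_id, media_id, quarantined FROM quarantine_changes"
        " WHERE stream_id > ? ORDER BY stream_id ASC LIMIT ?",
        (after, limit),
    ).fetchall()
    # The cursor is the last ID seen; it stays valid however old it is.
    next_token = rows[-1][0] if rows else after
    return rows, next_token
```

Because the cursor is a row ID rather than a position, unquarantining media appends a new row instead of shifting the rows a consumer has not read yet, so a stored cursor never skips anything.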
This feature was removed in #19351 and doesn't look like it'll pass review, so closing.
Fixes #19352 (See issue for history of this feature and previous PRs)

> First, a [naive implementation](#19268) of the endpoint was introduced, but it quickly ran into [performance issues on query](#19312) and [long startup times](#19349), leading to its [removal](#19351). It also didn't actually work, and would fail to expose media when it was "unquarantined", so a [partial fix](#19308) was attempted, where the suggested direction is to use a [stream](https://element-hq.github.io/synapse/latest/development/synapse_architecture/streams.html#cheatsheet-for-creating-a-new-stream) instead of a timestamp column.

This PR re-introduces the API building on the previous feedback:

* Adds a stream which tracks when media becomes (un)quarantined.
* Runs a background update to capture already-quarantined media.
* Adds a new admin API to return rows from the stream table.

We track both quarantine and unquarantine actions in the stream to allow downstream consumers to process the records appropriately. Namely, to allow our Synapse exchange in HMA to remove hashes for unquarantined media (use case further explained in the [issue](#19352)).

**Note**: This knowingly does not capture all cases of media being quarantined. Other call sites are lower priority for T&S, and can be addressed in a future PR. ~~An issue will be created after this PR is merged to track those sites.~~ #19672

### Pull Request Checklist

<!-- Please read https://element-hq.github.io/synapse/latest/development/contributing_guide.html before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog). The entry should:
  - Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
  - Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
* [x] [Code style](https://element-hq.github.io/synapse/latest/code_style.html) is correct (run the [linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))

---------

Co-authored-by: turt2live <1190097+turt2live@users.noreply.github.com>
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
Co-authored-by: Eric Eastwood <erice@element.io>

Note: This cleans up previous code introduced by #19268
When first writing the API, "unquarantining" media was not properly considered. If media was unquarantined and an application was treating `from` tokens as long-lived when listing quarantined media, the endpoint could skip rows like so: from=2 1 to 0 1 from=2, gets [D]

To fix this, we invent a pagination token which uses time and a relative index. It's not super stable still because the relative index can still change, but it's likely stable enough for most usage (iterate as fast as possible to the end).

If an application requires a proper time-based stable token, it can generate a timestamp then append `-0` to it to set the relative position to the zeroth row. This may return rows the application has already seen, as described by the admin API docs. This particular method of generating the timestamp manually is not documented because it's not as stable as relying on the last seen `next_batch`'s internal timestamp.

This PR doesn't shift the whole endpoint to timestamp-only tokens because the prior PR populates rows with `0` for a timestamp, which may span thousands (or millions) of rows, breaking the ability to use `limit` properly.
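As a client-side illustration of the long-lived token described in the admin API docs (split `next_batch` on `-` and replace the second part with `0`), here is a small sketch. The helper names `long_lived_token` and `new_rows` are hypothetical, and representing each `media` row as a plain string is an assumption made to keep the example short.

```python
def long_lived_token(next_batch: str) -> str:
    """Keep the timestamp part of `next_batch` and reset the index to 0."""
    timestamp, _, _index = next_batch.partition("-")
    return f"{timestamp}-0"


def new_rows(media: list[str], seen: set[str]) -> list[str]:
    """Drop rows already processed, since a `ts-0` token can replay them.

    `seen` is updated in place with the rows returned.
    """
    fresh = [row for row in media if row not in seen]
    seen.update(fresh)
    return fresh
```

An application would store the result of `long_lived_token` between runs and pass every page of results through `new_rows`, at the cost of keeping the set of seen rows around.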