Move `RestartDelayedEventServlet` to workers by anoadragon453 · Pull Request #19207 · element-hq/synapse

anoadragon453 · 2025-11-20T17:53:51Z

To allow delayed event timeouts to be extended even if the main process is currently restarting (for instance, during a pod resize). Addresses the /restart portion of #19141.

Note that this does not move the older, general POST /org.matrix.msc4140/delayed_events/{delayID} endpoint to workers. That endpoint had the caller specify {restart,send,cancel} as a JSON body parameter. As such, clients will need to ensure they're using /restart to get the benefit of worker request handling.

Part of MSC4140

Pull Request Checklist

Pull request is based on the develop branch
Pull request includes a changelog file. The entry should:
- Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
- Use markdown where necessary, mostly for code blocks.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
Code style is correct (run the linters)

To allow delayed event timeouts to be extended even if the main process is currently restarting (for instance, during a pod resize).

anoadragon453 · 2025-11-21T10:39:26Z

            "^/_matrix/client/(api/v1|r0|v3|unstable)/keys/upload",
            "^/_matrix/client/(api/v1|r0|v3|unstable)/keys/device_signing/upload$",
            "^/_matrix/client/(api/v1|r0|v3|unstable)/keys/signatures/upload$",
+            "^/_matrix/client/unstable/org.matrix.msc4140/delayed_events/.*/restart$",


This routes the endpoint to workers when running the relevant Complement tests.

devonh · 2025-11-21T23:43:53Z

+        # Workers don't need to wait for delayed events queued in the DB to be
+        # sent as that will be handled by the main process.
+        #
+        # If a worker attempts to schedule a timer for the next delayed event
+        # while the main process is pulling delayed events from the DB, the two
+        # will not race, as the main process will mark the delayed events as
+        # processed in the DB immediately before sending them.
+        self._initialized_from_db = defer.succeed(None)


I think the changes here are okay, but I'm mostly taking you at your word that this comment is true.

Would you be able to provide a more thorough explanation / links to the relevant code which validates this?

Ah sorry. See the code just below where self._initialized_from_db is set. I failed to realise that someone reading this from top-to-bottom wouldn't yet know what self._initialized_from_db was.

Essentially _initialized_from_db is a deferred that completes once all pending delayed events in the DB have been scheduled. The main process will manage that. The restart handler will update the timeout in the DB, so is not dependent

It's probably simpler if await make_deferred_yieldable(self._initialized_from_db) below is only called on the main process. I've made that visual change in 6f6227c.

We're effectively letting the restart handler code on workers and _schedule_db_events on the main process run in parallel. I had a think about whether this would result in a race condition.

The restart handler calls restart_delayed_event, which updates a delayed event's timeout as long as it hasn't been marked as processed (is_processed) yet.

is_processed means "sending of this event has begun". Once a delayed event has been successfully sent, it is deleted from the delayed_events table.

I don't believe there's a point between the queries in _schedule_db_events where running UPDATE delayed_events SET send_ts = ? + delay WHERE delay_id = ? AND NOT is_processed would be detrimental.

The only thing that stands out to me is between self._store.unprocess_delayed_events and self._store.process_timeout_delayed_events, when all events marked as "processed" in the DB are set to unprocessed again (because processing may have failed partway during homeserver shutdown). But in that case, the event in question will just have its send_ts successfully extended and self._store.process_timeout_delayed_events won't send it yet, which sounds like what the client intended anyhow.

And then given that, if the restart code can't race with the initialisation code anyhow, we could just remove await make_deferred_yieldable(self._initialized_from_db)fromrestart` altogether, even when running on the main process.

cc @AndrewFerr who originally wrote this code.

The only thing that stands out to me is between self._store.unprocess_delayed_events and self._store.process_timeout_delayed_events, when all events marked as "processed" in the DB are set to unprocessed again (because processing may have failed partway during homeserver shutdown). But in that case, the event in question will just have its send_ts successfully extended and self._store.process_timeout_delayed_events won't send it yet, which sounds like what the client intended anyhow.

There is a slight amount of "destructiveness" here. The startup also handles sending delayed events that should have been sent while the homeserver was offline, i.e. delayed events with an effective send time in the past. Calling restart on such a delayed event during this startup would prevent it from being sent as it ought to be.

Here's an idea: instead of the startup calling unprocess_delayed_events before process_timeout_delayed_events (so that the latter will retry "processed" sends that failed to persist), process_timeout_delayed_events could simply have a flag for whether it should skip over "processed" events or not (which the startup would set to "not skip").

The scenario in my last post is possible only because calling unprocess_delayed_events on startup makes "past" delayed events eligible targets of restart. This issue would go away if the startup were to retry past delayed events without having to alter the "processed" state of delayed events in the DB.

GET is the only method on `/delayed_events$`, so it's fine to put it into the above section. Thanks Ben BZ for the suggestion.

So that we test the GET `/delayed_events$` endpoint on workers as well.

Rework the code to be clearer. There should be no functional difference in this commit.

AndrewFerr · 2025-11-24T14:31:51Z

+        if self._is_master:
+            # Wait for the processing of existing delayed events from the DB to
+            # complete before accepting any modifications.
+            await make_deferred_yieldable(self._initialized_from_db)

        next_send_ts = await self._store.restart_delayed_event(


If you want, a worker handling restart may skip calling _schedule_next_at and instead only update the DB with the new send time of the restarted delayed event. The main process will notice the new send time on its next scheduled _send_on_timeout.

Currently, if a worker schedules calls of _send_on_timeout, those calls will often end up being scheduled at the same time as the main process' calls of it, i.e. the worker & main process will often both try to send the same delayed event. Only one of them will "win" at a time, so the work of actually sending a delayed event won't be repeated between them, but there isn't any explicit load balancing done to try to spread that work evenly. So maybe it's better to keep the main process as the sole manager of scheduled sends.

That certainly sounds easier to reason about. I walked through an example to convince myself that this would still result in events being sent on time:

User says they want to send an event at timestamp 10.

At timestamp 8, the user pushes the send time to 20 on a worker.

The main process still tries to send the event at 10. It's no longer there. Now schedules a timer for 20.

Main process sends the event at timestamp 20.

Since we're always pushing the event forwards, and calling _send_on_timeout without any events to send only results in some small overhead, I see no issue with your proposal. Updated in 66f026e.

The main process won't miss sending the event. When the original timeout is hit, the main process will simply pick up the new timeout and schedule a timer for that moment instead.

@AndrewFerr

per @AndrewFerr's suggestion in #19207 (comment)

devonh

Awesome, this looks a lot better now with easier to understand rationale for why we can rely on certain behaviour. :)

…ayed_events_restart_workers

The team has decided to deprecate and stop publishing python wheels for MacOS. Synapse docker images will continue to work on MacOS, as will building Synapse from source (though note this requires a Rust compiler). Admins using the unstable [MSC2666](matrix-org/matrix-spec-proposals#2666) endpoint (`/_matrix/client/unstable/uk.half-shot.msc2666/user/mutual_rooms`), please check [the relevant section in the upgrade notes](https://github.com/element-hq/synapse/blob/develop/docs/upgrade.md#upgrading-to-v11440) as this release contains changes that disable that endpoint by default. No significant changes since 1.144.0rc1. Admins using the unstable [MSC2666](matrix-org/matrix-spec-proposals#2666) endpoint (`/_matrix/client/unstable/uk.half-shot.msc2666/user/mutual_rooms`), please check [the relevant section in the upgrade notes](https://github.com/element-hq/synapse/blob/develop/docs/upgrade.md#upgrading-to-v11440) as this release contains changes that disable that endpoint by default. - Add experimentatal implememntation of [MSC4380](matrix-org/matrix-spec-proposals#4380) (invite blocking). ([\#19203](#19203)) - Allow restarting delayed event timeouts on workers. ([\#19207](#19207)) - Fix a bug in the database function for fetching state deltas that could result in unnecessarily long query times. ([\#18960](#18960)) - Fix v12 rooms when running with `use_frozen_dicts: True`. ([\#19235](#19235)) - Fix bug where invalid `canonical_alias` content would return 500 instead of 400. ([\#19240](#19240)) - Fix bug where `Duration` was logged incorrectly. ([\#19267](#19267)) - Document in the `--config-path` help how multiple files are merged - by merging them shallowly. ([\#19243](#19243)) - Stop building release wheels for MacOS. ([\#19225](#19225)) - Improve event filtering for Simplified Sliding Sync. ([\#17782](#17782)) - Export `SYNAPSE_SUPPORTED_COMPLEMENT_TEST_PACKAGES` environment variable from `scripts-dev/complement.sh`. ([\#19208](#19208)) - Refactor `scripts-dev/complement.sh` logic to avoid `exit` to facilitate being able to source it from other scripts (composable). ([\#19209](#19209)) - Expire sliding sync connections that are too old or have too much pending data. ([\#19211](#19211)) - Require an experimental feature flag to be enabled in order for the unstable [MSC2666](matrix-org/matrix-spec-proposals#2666) endpoint (`/_matrix/client/unstable/uk.half-shot.msc2666/user/mutual_rooms`) to be available. ([\#19219](#19219)) - Prevent changelog check CI running on @dependabot's PRs even when a human has modified the branch. ([\#19220](#19220)) - Auto-fix trailing spaces in multi-line strings and comments when running the lint script. ([\#19221](#19221)) - Move towards using a dedicated `Duration` type. ([\#19223](#19223), [\#19229](#19229)) - Improve robustness of the SQL schema linting in CI. ([\#19224](#19224)) - Add log to determine whether clients are using `/messages` as expected. ([\#19226](#19226)) - Simplify README and add ESS Getting started section. ([\#19228](#19228), [\#19259](#19259)) - Add a unit test for ensuring associated refresh tokens are erased when a device is deleted. ([\#19230](#19230)) - Prompt user to consider adding future deprecations to the changelog in release script. ([\#19239](#19239)) - Fix check of the Rust compiled code being outdated when using source checkout and `.egg-info`. ([\#19251](#19251)) - Stop building macos wheels in CI pipeline. ([\#19263](#19263)) * Bump Swatinem/rust-cache from 2.8.1 to 2.8.2. ([\#19244](#19244)) * Bump actions/checkout from 5.0.0 to 6.0.0. ([\#19213](#19213)) * Bump actions/setup-go from 6.0.0 to 6.1.0. ([\#19214](#19214)) * Bump actions/setup-python from 6.0.0 to 6.1.0. ([\#19245](#19245)) * Bump attrs from 25.3.0 to 25.4.0. ([\#19215](#19215)) * Bump docker/metadata-action from 5.9.0 to 5.10.0. ([\#19246](#19246)) * Bump http from 1.3.1 to 1.4.0. ([\#19249](#19249)) * Bump pydantic from 2.12.4 to 2.12.5. ([\#19250](#19250)) * Bump pyopenssl from 25.1.0 to 25.3.0. ([\#19248](#19248)) * Bump rpds-py from 0.28.0 to 0.29.0. ([\#19216](#19216)) * Bump rpds-py from 0.29.0 to 0.30.0. ([\#19247](#19247)) * Bump sentry-sdk from 2.44.0 to 2.46.0. ([\#19218](#19218)) * Bump types-bleach from 6.2.0.20250809 to 6.3.0.20251115. ([\#19217](#19217)) * Bump types-jsonschema from 4.25.1.20250822 to 4.25.1.20251009. ([\#19252](#19252))

github-actions Bot deployed to PR Documentation Preview November 20, 2025 17:55 Active

anoadragon453 force-pushed the anoa/delayed_events_restart_workers branch from 9511351 to 5677215 Compare November 21, 2025 10:37

anoadragon453 marked this pull request as ready for review November 21, 2025 10:37

anoadragon453 requested a review from a team as a code owner November 21, 2025 10:37

github-actions Bot deployed to PR Documentation Preview November 21, 2025 10:38 Active

anoadragon453 added 3 commits November 21, 2025 10:38

Move RestartDelayedEventServlet to workers

2b70a45

To allow delayed event timeouts to be extended even if the main process is currently restarting (for instance, during a pod resize).

newsfile

0940968

Add entry to upgrade notes

d3eaadf

anoadragon453 force-pushed the anoa/delayed_events_restart_workers branch from 5677215 to d3eaadf Compare November 21, 2025 10:38

anoadragon453 commented Nov 21, 2025

View reviewed changes

github-actions Bot deployed to PR Documentation Preview November 21, 2025 10:39 Active

anoadragon453 mentioned this pull request Nov 21, 2025

Consider supporting all endpoints for delayed events (MSC4140) across workers #19141

Open

devonh reviewed Nov 21, 2025

View reviewed changes

MadLittleMods added the A-Workers label Nov 22, 2025

benbz reviewed Nov 24, 2025

View reviewed changes

Comment thread docs/workers.md Outdated

benbz mentioned this pull request Nov 24, 2025

Be clearer about routing the delayed_events GET request element-hq/ess-helm#889

Merged

anoadragon453 added 2 commits November 24, 2025 11:49

Move delayed_events GET to "MSC4140" section

baf3cee

GET is the only method on `/delayed_events$`, so it's fine to put it into the above section. Thanks Ben BZ for the suggestion.

Modify Complement config

8975839

So that we test the GET `/delayed_events$` endpoint on workers as well.

github-actions Bot deployed to PR Documentation Preview November 24, 2025 11:51 Active

Rework await-ing initialisation logic to be clearer

6f6227c

Rework the code to be clearer. There should be no functional difference in this commit.

github-actions Bot deployed to PR Documentation Preview November 24, 2025 13:43 Active

AndrewFerr reviewed Nov 24, 2025

View reviewed changes

anoadragon453 added 2 commits November 24, 2025 17:38

Only schedule delayed events on the main process

66f026e

The main process won't miss sending the event. When the original timeout is hit, the main process will simply pick up the new timeout and schedule a timer for that moment instead.

Add reprocess_events to process_timeout_delayed_events

853938d

per @AndrewFerr's suggestion in #19207 (comment)

github-actions Bot deployed to PR Documentation Preview November 24, 2025 17:48 Active

Drop _initialized_from_db await in restart

8ac0725

github-actions Bot deployed to PR Documentation Preview November 24, 2025 17:52 Active

AndrewFerr reviewed Nov 24, 2025

View reviewed changes

Comment thread synapse/handlers/delayed_events.py

re-add note regarding double-sending events

8706876

anoadragon453 requested a review from devonh November 24, 2025 18:06

anoadragon453 requested a review from AndrewFerr November 24, 2025 18:06

github-actions Bot deployed to PR Documentation Preview November 24, 2025 18:07 Active

Fix syntax typo in SQL logic

32a5d2f

github-actions Bot deployed to PR Documentation Preview November 24, 2025 18:28 Active

AndrewFerr approved these changes Nov 24, 2025

View reviewed changes

devonh approved these changes Nov 27, 2025

View reviewed changes

Merge branch 'develop' of github.com:element-hq/synapse into anoa/del…

955e4a2

…ayed_events_restart_workers

anoadragon453 enabled auto-merge (squash) November 27, 2025 16:14

github-actions Bot deployed to PR Documentation Preview November 27, 2025 16:15 Active

anoadragon453 merged commit 566670c into develop Nov 27, 2025
47 checks passed

anoadragon453 deleted the anoa/delayed_events_restart_workers branch November 27, 2025 16:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move `RestartDelayedEventServlet` to workers#19207

Move `RestartDelayedEventServlet` to workers#19207
anoadragon453 merged 12 commits into
developfrom
anoa/delayed_events_restart_workers

anoadragon453 commented Nov 20, 2025 •

edited by MadLittleMods

Loading

Uh oh!

anoadragon453 Nov 21, 2025

Uh oh!

devonh Nov 21, 2025

Uh oh!

anoadragon453 Nov 24, 2025

Uh oh!

AndrewFerr Nov 24, 2025

Uh oh!

AndrewFerr Nov 24, 2025

Uh oh!

Uh oh!

AndrewFerr Nov 24, 2025

Uh oh!

anoadragon453 Nov 24, 2025 •

edited

Loading

Uh oh!

Uh oh!

devonh left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

anoadragon453 commented Nov 20, 2025 • edited by MadLittleMods Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Checklist

Uh oh!

anoadragon453 Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

devonh Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

anoadragon453 Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

AndrewFerr Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

AndrewFerr Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

AndrewFerr Nov 24, 2025

Choose a reason for hiding this comment

Uh oh!

anoadragon453 Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

devonh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

anoadragon453 commented Nov 20, 2025 •

edited by MadLittleMods

Loading

anoadragon453 Nov 24, 2025 •

edited

Loading