Limit to-device EDU sizes#19617
Conversation
If a set of messages exceeds this limit, the messages are split across several EDUs. Fix #17035 (should) There is currently [no official specced limit for EDUs](matrix-org/matrix-spec#807), but the consensus seems to be that it would be useful to have one to avoid this bug by bounding the transaction size. As a side effect it also limits the size of a single to-device message to a bit less than 65536. This should probably be added to the spec similarly to the [message size limit.](https://spec.matrix.org/v1.14/client-server-api/#size-limits) Spec PR: matrix-org/matrix-spec#2340 --------- Co-authored-by: mcalinghee <mcalinghee.dev@gmail.com> Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
This is based on #18416, which got reverted due to it rejecting to-device messages to users with many devices. The main change here is that if a to-device EDU for a single user is too large, then we split it up into multiple EDUs.
It's not that much more costly to just keep re-encoding the values.
There was a problem hiding this comment.
Pull request overview
This PR updates Synapse’s handling of outbound to-device messages so that outgoing m.direct_to_device EDUs are kept to a target size by splitting payloads more granularly (including splitting a single recipient’s devices across multiple EDUs), while also rejecting individual to-device messages that are too large.
Changes:
- Introduce a generic dict-splitting helper (
split_dict_to_fit_to_size) and unit tests for its behavior. - Refactor to-device sending to split remote to-device payloads across multiple EDUs (including per-user device splitting) and queue them separately for federation.
- Add constants for EDU size targets / transaction EDU budgeting and expand tests to cover splitting and size-limit failures.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/util/test_split_dict.py | Adds unit tests for the new dict-splitting utility. |
| tests/rest/client/test_sendtodevice.py | Adds tests for to-device size rejection and EDU/transaction splitting behavior. |
| tests/handlers/test_appservice.py | Updates test setup to use the renamed datastore method for local to-device inbox seeding. |
| synapse/util/init.py | Adds split_dict_to_fit_to_size and helper state/size-calculation logic. |
| synapse/storage/databases/main/deviceinbox.py | Splits the old “add to-device inbox” method into separate local vs remote queueing APIs. |
| synapse/handlers/devicemessage.py | Enforces per-message size checking and implements splitting remote to-device messages into multiple EDUs. |
| synapse/federation/sender/per_destination_queue.py | Uses constants for EDU transaction limits and reserved EDU slots. |
| synapse/api/constants.py | Introduces SOFT_MAX_EDU_SIZE and transaction EDU constants. |
| changelog.d/19617.bugfix | Documents the bugfix around large to-device queues blocking outbound federation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ): | ||
| # The returned subset might be larger than the soft max size if it | ||
| # contains a single entry that is larger than the soft max size. | ||
| if estimated_size <= SOFT_MAX_EDU_SIZE - base_edu_size: |
There was a problem hiding this comment.
estimated_size returned by split_dict_to_fit_to_size(..., wrapping_object_size=base_edu_size) already includes base_edu_size, so comparing it to SOFT_MAX_EDU_SIZE - base_edu_size double-subtracts the header and will force unnecessary splitting (and can incorrectly treat fitting subsets as oversized). Compare estimated_size directly to SOFT_MAX_EDU_SIZE (or adjust what estimated_size represents) to make the fit/overflow decision consistent.
| if estimated_size <= SOFT_MAX_EDU_SIZE - base_edu_size: | |
| if estimated_size <= SOFT_MAX_EDU_SIZE: |
There was a problem hiding this comment.
I suspect you would need a unit test which tries to send a message that's just under SOFT_MAX_EDU_SIZE in order to stress this edge case.
| def split_dict_to_fit_to_size( | ||
| original_dict: dict[str, Any], | ||
| *, | ||
| soft_max_size: int, | ||
| wrapping_object_size: int = 2, | ||
| ) -> Iterator[tuple[dict[str, JsonDict], int]]: | ||
| """Splits a dict up into a list of dicts, each of which is small enough to |
There was a problem hiding this comment.
The return type of split_dict_to_fit_to_size is too specific: it currently declares Iterator[tuple[dict[str, JsonDict], int]], but this helper is used (and tested) with non-dict values (e.g. strings) and should preserve the original value type. Consider making it generic over the value type (TypeVar) and returning dict[str, V] (or dict[str, Any]) to avoid mypy errors and misleading annotations.
| for key, payload in original_dict.items(): | ||
| current_payload.subset[key] = payload | ||
| current_size = _len_with_wrapping_object( | ||
| current_payload.subset, wrapping_object_size | ||
| ) |
There was a problem hiding this comment.
split_dict_to_fit_to_size recomputes encode_canonical_json(current_payload.subset) on every iteration as the subset grows, making the algorithm O(n^2) in the number of entries. For large to-device payloads (many devices/users), this can become a noticeable CPU cost; consider precomputing per-entry sizes once and tracking cumulative size (including commas/quotes/keys) to split without repeatedly re-encoding the whole subset.
There was a problem hiding this comment.
The PR description mentions calling encode_canonical_json on each entry. But Copilot is correct; we are technically calling encode_canonical_json multiple times on each entry in current_payload.subset as we build it up.
I presume you'd have to store the length for each entry in subset to avoid this.
| import logging | ||
| from typing import TypeVar | ||
|
|
||
| from canonicaljson import encode_canonical_json | ||
|
|
||
| from synapse.util import split_dict_to_fit_to_size | ||
|
|
||
| from tests.unittest import TestCase | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| T = TypeVar("T") | ||
|
|
||
|
|
There was a problem hiding this comment.
This new test module has unused symbols (logging, logger, and the TypeVar T). With the repo's ruff/pyflakes configuration, these will raise unused-import/unused-variable violations and fail CI; please remove them or use them.
| import logging | |
| from typing import TypeVar | |
| from canonicaljson import encode_canonical_json | |
| from synapse.util import split_dict_to_fit_to_size | |
| from tests.unittest import TestCase | |
| logger = logging.getLogger(__name__) | |
| T = TypeVar("T") | |
| from canonicaljson import encode_canonical_json | |
| from synapse.util import split_dict_to_fit_to_size | |
| from tests.unittest import TestCase |
There was a problem hiding this comment.
And yet all our linting passed 🤔
| This function takes many to-device messages and fits/splits them into several EDUs | ||
| as necessary. We split the messages up as the overall request can overrun the | ||
| `max_request_body_size` and prevent outbound federation traffic because of the size | ||
| of the transaction (cf. `MAX_EDU_SIZE`). |
There was a problem hiding this comment.
The docstring references MAX_EDU_SIZE, but the constant in use is now SOFT_MAX_EDU_SIZE (and MAX_EDU_SIZE doesn't appear to exist anymore). Updating the reference will avoid confusion for future readers.
| of the transaction (cf. `MAX_EDU_SIZE`). | |
| of the transaction (cf. `SOFT_MAX_EDU_SIZE`). |
anoadragon453
left a comment
There was a problem hiding this comment.
Overall looks great. Just some minor things below.
| # https://github.com/matrix-org/matrix-spec/pull/2340 tracks adding | ||
| # this to the spec | ||
| raise EventSizeError( | ||
| f"To-device message for {user_id}:{device_id} is too large to send", |
There was a problem hiding this comment.
Could we add the computed message size to the log line to make this slightly more useful for debugging?
| original_dict: The dict to split. soft_max_size: The maximum size of | ||
| each dict when encoded as JSON. |
There was a problem hiding this comment.
| original_dict: The dict to split. soft_max_size: The maximum size of | |
| each dict when encoded as JSON. | |
| original_dict: The dict to split. | |
| soft_max_size: The maximum size of each dict when encoded as JSON. |
| def split_dict_to_fit_to_size( | ||
| original_dict: dict[str, Any], | ||
| *, | ||
| soft_max_size: int, | ||
| wrapping_object_size: int = 2, | ||
| ) -> Iterator[tuple[dict[str, JsonDict], int]]: |
There was a problem hiding this comment.
We should note that this is a generator in the docstring.
| for key, payload in original_dict.items(): | ||
| current_payload.subset[key] = payload | ||
| current_size = _len_with_wrapping_object( | ||
| current_payload.subset, wrapping_object_size | ||
| ) |
There was a problem hiding this comment.
The PR description mentions calling encode_canonical_json on each entry. But Copilot is correct; we are technically calling encode_canonical_json multiple times on each entry in current_payload.subset as we build it up.
I presume you'd have to store the length for each entry in subset to avoid this.
| ): | ||
| # The returned subset might be larger than the soft max size if it | ||
| # contains a single entry that is larger than the soft max size. | ||
| if estimated_size <= SOFT_MAX_EDU_SIZE - base_edu_size: |
There was a problem hiding this comment.
I suspect you would need a unit test which tries to send a message that's just under SOFT_MAX_EDU_SIZE in order to stress this edge case.
| import logging | ||
| from typing import TypeVar | ||
|
|
||
| from canonicaljson import encode_canonical_json | ||
|
|
||
| from synapse.util import split_dict_to_fit_to_size | ||
|
|
||
| from tests.unittest import TestCase | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| T = TypeVar("T") | ||
|
|
||
|
|
There was a problem hiding this comment.
And yet all our linting passed 🤔
This is based on #18416, which got reverted (#19614) due to it incorrectly rejecting to-device messages to users with many devices (and thus breaking message sending).
Fix #17035
A to-device message content looks like:
{ "@user:domain": {"device1": {...}, "device2": {...}}, ... }The previous PR would split up into multiple EDUs, each with a subset of the users. However, if one user's entry was too large it would not further split it up and then error out.
The main change in this PR is to allow splitting up a single user into multiple EDUs.
Other changes:
SOFT_MAX_EDU_SIZEto indicate that we sometimes send EDUs with larger size than that, and its more a target than a hard limit.This still means that if a client send a large individual to-device message it will fail, but I don't believe we ever send such large to-device messages (normally they're in the range of a few KB).
I ended up changing the implementation a bunch to make it easy to reuse the code to split up dictionaries. Instead of repeatedly splitting up the EDU until each bit fits into the size, we instead record the size of each entry in the dict and instead split up based on cumulative size. This means we call
encode_canonical_jsonon each entry rather than once on the entire struct, but its not significantly slower to do so.--
cc @MatMaul @MadLittleMods