Skip to content

Limit to-device EDU sizes#19617

Open
erikjohnston wants to merge 7 commits into
developfrom
erikj/limit_to_device
Open

Limit to-device EDU sizes#19617
erikjohnston wants to merge 7 commits into
developfrom
erikj/limit_to_device

Conversation

@erikjohnston
Copy link
Copy Markdown
Member

@erikjohnston erikjohnston commented Mar 27, 2026

This is based on #18416, which got reverted (#19614) due to it incorrectly rejecting to-device messages to users with many devices (and thus breaking message sending).

Fix #17035

A to-device message content looks like:

{
  "@user:domain": {"device1": {...}, "device2": {...}},
  ...
}

The previous PR would split up into multiple EDUs, each with a subset of the users. However, if one user's entry was too large it would not further split it up and then error out.

The main change in this PR is to allow splitting up a single user into multiple EDUs.

Other changes:

  1. Rename to SOFT_MAX_EDU_SIZE to indicate that we sometimes send EDUs with larger size than that, and its more a target than a hard limit.
  2. Check early if any to-device message (to a specific device) is too large to send, even if we're not going to send it over federation. This ensures that we catch issues where clients try to send too large to-device.

This still means that if a client send a large individual to-device message it will fail, but I don't believe we ever send such large to-device messages (normally they're in the range of a few KB).


I ended up changing the implementation a bunch to make it easy to reuse the code to split up dictionaries. Instead of repeatedly splitting up the EDU until each bit fits into the size, we instead record the size of each entry in the dict and instead split up based on cumulative size. This means we call encode_canonical_json on each entry rather than once on the entire struct, but its not significantly slower to do so.

--

cc @MatMaul @MadLittleMods

MatMaul and others added 3 commits March 27, 2026 12:01
If a set of messages exceeds this limit, the messages are split
across several EDUs.

Fix #17035 (should)

There is currently [no official specced limit for
EDUs](matrix-org/matrix-spec#807), but the
consensus seems to be that it would be useful to have one to avoid this
bug by bounding the transaction size.

As a side effect it also limits the size of a single to-device message
to a bit less than 65536.

This should probably be added to the spec similarly to the [message size
limit.](https://spec.matrix.org/v1.14/client-server-api/#size-limits)

Spec PR: matrix-org/matrix-spec#2340

---------

Co-authored-by: mcalinghee <mcalinghee.dev@gmail.com>
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
This is based on #18416, which got reverted due to it rejecting
to-device messages to users with many devices.

The main change here is that if a to-device EDU for a single user is too
large, then we split it up into multiple EDUs.
Comment thread synapse/handlers/devicemessage.py Outdated
Comment thread changelog.d/19617.bugfix
It's not that much more costly to just keep re-encoding the values.
Comment thread synapse/handlers/devicemessage.py Outdated
Comment thread synapse/handlers/devicemessage.py
@erikjohnston erikjohnston marked this pull request as ready for review April 1, 2026 10:55
@erikjohnston erikjohnston requested a review from a team as a code owner April 1, 2026 10:55
@anoadragon453 anoadragon453 requested a review from Copilot April 29, 2026 12:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Synapse’s handling of outbound to-device messages so that outgoing m.direct_to_device EDUs are kept to a target size by splitting payloads more granularly (including splitting a single recipient’s devices across multiple EDUs), while also rejecting individual to-device messages that are too large.

Changes:

  • Introduce a generic dict-splitting helper (split_dict_to_fit_to_size) and unit tests for its behavior.
  • Refactor to-device sending to split remote to-device payloads across multiple EDUs (including per-user device splitting) and queue them separately for federation.
  • Add constants for EDU size targets / transaction EDU budgeting and expand tests to cover splitting and size-limit failures.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/util/test_split_dict.py Adds unit tests for the new dict-splitting utility.
tests/rest/client/test_sendtodevice.py Adds tests for to-device size rejection and EDU/transaction splitting behavior.
tests/handlers/test_appservice.py Updates test setup to use the renamed datastore method for local to-device inbox seeding.
synapse/util/init.py Adds split_dict_to_fit_to_size and helper state/size-calculation logic.
synapse/storage/databases/main/deviceinbox.py Splits the old “add to-device inbox” method into separate local vs remote queueing APIs.
synapse/handlers/devicemessage.py Enforces per-message size checking and implements splitting remote to-device messages into multiple EDUs.
synapse/federation/sender/per_destination_queue.py Uses constants for EDU transaction limits and reserved EDU slots.
synapse/api/constants.py Introduces SOFT_MAX_EDU_SIZE and transaction EDU constants.
changelog.d/19617.bugfix Documents the bugfix around large to-device queues blocking outbound federation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

):
# The returned subset might be larger than the soft max size if it
# contains a single entry that is larger than the soft max size.
if estimated_size <= SOFT_MAX_EDU_SIZE - base_edu_size:
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

estimated_size returned by split_dict_to_fit_to_size(..., wrapping_object_size=base_edu_size) already includes base_edu_size, so comparing it to SOFT_MAX_EDU_SIZE - base_edu_size double-subtracts the header and will force unnecessary splitting (and can incorrectly treat fitting subsets as oversized). Compare estimated_size directly to SOFT_MAX_EDU_SIZE (or adjust what estimated_size represents) to make the fit/overflow decision consistent.

Suggested change
if estimated_size <= SOFT_MAX_EDU_SIZE - base_edu_size:
if estimated_size <= SOFT_MAX_EDU_SIZE:

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you would need a unit test which tries to send a message that's just under SOFT_MAX_EDU_SIZE in order to stress this edge case.

Comment thread synapse/util/__init__.py
Comment on lines +187 to +193
def split_dict_to_fit_to_size(
original_dict: dict[str, Any],
*,
soft_max_size: int,
wrapping_object_size: int = 2,
) -> Iterator[tuple[dict[str, JsonDict], int]]:
"""Splits a dict up into a list of dicts, each of which is small enough to
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type of split_dict_to_fit_to_size is too specific: it currently declares Iterator[tuple[dict[str, JsonDict], int]], but this helper is used (and tested) with non-dict values (e.g. strings) and should preserve the original value type. Consider making it generic over the value type (TypeVar) and returning dict[str, V] (or dict[str, Any]) to avoid mypy errors and misleading annotations.

Copilot uses AI. Check for mistakes.
Comment thread synapse/util/__init__.py
Comment on lines +233 to +237
for key, payload in original_dict.items():
current_payload.subset[key] = payload
current_size = _len_with_wrapping_object(
current_payload.subset, wrapping_object_size
)
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split_dict_to_fit_to_size recomputes encode_canonical_json(current_payload.subset) on every iteration as the subset grows, making the algorithm O(n^2) in the number of entries. For large to-device payloads (many devices/users), this can become a noticeable CPU cost; consider precomputing per-entry sizes once and tracking cumulative size (including commas/quotes/keys) to split without repeatedly re-encoding the whole subset.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions calling encode_canonical_json on each entry. But Copilot is correct; we are technically calling encode_canonical_json multiple times on each entry in current_payload.subset as we build it up.

I presume you'd have to store the length for each entry in subset to avoid this.

Comment on lines +15 to +28
import logging
from typing import TypeVar

from canonicaljson import encode_canonical_json

from synapse.util import split_dict_to_fit_to_size

from tests.unittest import TestCase

logger = logging.getLogger(__name__)

T = TypeVar("T")


Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new test module has unused symbols (logging, logger, and the TypeVar T). With the repo's ruff/pyflakes configuration, these will raise unused-import/unused-variable violations and fail CI; please remove them or use them.

Suggested change
import logging
from typing import TypeVar
from canonicaljson import encode_canonical_json
from synapse.util import split_dict_to_fit_to_size
from tests.unittest import TestCase
logger = logging.getLogger(__name__)
T = TypeVar("T")
from canonicaljson import encode_canonical_json
from synapse.util import split_dict_to_fit_to_size
from tests.unittest import TestCase

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yet all our linting passed 🤔

This function takes many to-device messages and fits/splits them into several EDUs
as necessary. We split the messages up as the overall request can overrun the
`max_request_body_size` and prevent outbound federation traffic because of the size
of the transaction (cf. `MAX_EDU_SIZE`).
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring references MAX_EDU_SIZE, but the constant in use is now SOFT_MAX_EDU_SIZE (and MAX_EDU_SIZE doesn't appear to exist anymore). Updating the reference will avoid confusion for future readers.

Suggested change
of the transaction (cf. `MAX_EDU_SIZE`).
of the transaction (cf. `SOFT_MAX_EDU_SIZE`).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great. Just some minor things below.

# https://github.com/matrix-org/matrix-spec/pull/2340 tracks adding
# this to the spec
raise EventSizeError(
f"To-device message for {user_id}:{device_id} is too large to send",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add the computed message size to the log line to make this slightly more useful for debugging?

Comment thread synapse/util/__init__.py
Comment on lines +207 to +208
original_dict: The dict to split. soft_max_size: The maximum size of
each dict when encoded as JSON.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
original_dict: The dict to split. soft_max_size: The maximum size of
each dict when encoded as JSON.
original_dict: The dict to split.
soft_max_size: The maximum size of each dict when encoded as JSON.

Comment thread synapse/util/__init__.py
Comment on lines +187 to +192
def split_dict_to_fit_to_size(
original_dict: dict[str, Any],
*,
soft_max_size: int,
wrapping_object_size: int = 2,
) -> Iterator[tuple[dict[str, JsonDict], int]]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should note that this is a generator in the docstring.

Comment thread synapse/util/__init__.py
Comment on lines +233 to +237
for key, payload in original_dict.items():
current_payload.subset[key] = payload
current_size = _len_with_wrapping_object(
current_payload.subset, wrapping_object_size
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions calling encode_canonical_json on each entry. But Copilot is correct; we are technically calling encode_canonical_json multiple times on each entry in current_payload.subset as we build it up.

I presume you'd have to store the length for each entry in subset to avoid this.

):
# The returned subset might be larger than the soft max size if it
# contains a single entry that is larger than the soft max size.
if estimated_size <= SOFT_MAX_EDU_SIZE - base_edu_size:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you would need a unit test which tries to send a message that's just under SOFT_MAX_EDU_SIZE in order to stress this edge case.

Comment on lines +15 to +28
import logging
from typing import TypeVar

from canonicaljson import encode_canonical_json

from synapse.util import split_dict_to_fit_to_size

from tests.unittest import TestCase

logger = logging.getLogger(__name__)

T = TypeVar("T")


Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yet all our linting passed 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A long queue of to-device messages can prevent outgoing federation working

5 participants