
Conversation

@mkysel mkysel commented Oct 27, 2025

Shard gateway envelopes and add transactional auto-partitioned inserts with SAVEPOINT retries across V2 schema and APIs

Introduce a V2 sharded gateway envelope schema with partitioned meta/blob tables and a joined view, replace legacy queries with V2 selectors, and add db.InsertGatewayEnvelopeWithChecksTransactional/db.InsertGatewayEnvelopeWithChecksStandalone to auto-create partitions and retry inserts using SAVEPOINTs; update services, workers, indexers, and tests to use V2 params, views, and a configurable publish retry sleep.

📍Where to Start

Start with the insert flow in db.InsertGatewayEnvelopeAndIncrementUnsettledUsage and the new helpers db.InsertGatewayEnvelopeWithChecksTransactional and db.InsertGatewayEnvelopeWithChecksStandalone in gateway_envelope.go, then review the V2 schema and queries in 00021_sharded_gateway_envelopes.up.sql and envelopes_v2.sql.
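The mechanics those helpers implement can be pictured as a SAVEPOINT-guarded insert with a single retry. The Go sketch below is illustrative only: the wrapper name, the callback shape, and matching on the "no partition of relation" message are assumptions, not the actual code in gateway_envelope.go.

```go
// Minimal sketch of a SAVEPOINT-guarded insert with one retry, under assumed
// names; the real helpers live in gateway_envelope.go.
package db

import (
	"context"
	"database/sql"
	"strings"
)

// insertWithPartitionRetry tries the insert once; if the partition for this
// originator/sequence band does not exist yet, it rolls back to a SAVEPOINT,
// creates the partition, and retries exactly once.
func insertWithPartitionRetry(
	ctx context.Context,
	tx *sql.Tx,
	insert func(context.Context, *sql.Tx) error,
	ensureParts func(context.Context, *sql.Tx) error,
) error {
	if _, err := tx.ExecContext(ctx, "SAVEPOINT insert_envelope"); err != nil {
		return err
	}
	err := insert(ctx, tx)
	if err == nil {
		return nil
	}
	// Only the "missing partition" error is retryable here; anything else
	// aborts the surrounding transaction as usual.
	if !strings.Contains(err.Error(), "no partition of relation") {
		return err
	}
	// Roll back to the SAVEPOINT so the transaction stays usable, then
	// create the missing partition and retry once.
	if _, rbErr := tx.ExecContext(ctx, "ROLLBACK TO SAVEPOINT insert_envelope"); rbErr != nil {
		return rbErr
	}
	if err := ensureParts(ctx, tx); err != nil {
		return err
	}
	return insert(ctx, tx)
}
```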

Changes since #1279 opened

  • Cleaned up test implementation by removing debug logging and ignoring unused return values [0091da3]
  • Migrated database schema from V2-suffixed to base names for gateway envelope storage [ce375d2]
  • Updated SQLC query definitions and generated code to use non-V2 database objects [ce375d2]
  • Replaced V2-suffixed query parameter and return types throughout the application code [ce375d2]
  • Updated test helpers and mock implementations to use non-V2 types and query methods [ce375d2]
  • Fixed field initialization in publishWorker struct creation [32cb6c9]
  • Updated SQL table references in test files from 'gateway_envelopes_meta_v2' to 'gateway_envelopes_meta' [93bfb0f]

📊 Macroscope summarized 93bfb0f. 14 files reviewed, 29 issues evaluated, 27 issues filtered, 0 comments posted

🗂️ Filtered Issues

pkg/api/message/publish_worker.go — 0 comments posted, 4 evaluated, 4 filtered
  • line 114: Context cancellation cannot stop the worker once it begins processing a batch because the inner retry loop in publishWorker.start does not observe the context. When p.ctx is canceled during publishing, publishStagedEnvelope returns false (it checks p.ctx.Err() and returns false), causing the outer loop to keep retrying indefinitely. The outer select on p.ctx.Done() is not reached until the inner loop exits, which never happens after cancellation. This results in a stuck worker that cannot terminate gracefully on shutdown. [ Out of scope ]
  • line 117: Potential tight CPU loop on publish retry when sleepOnFailureTime is zero. The change replaces a fixed time.Sleep(time.Second) with time.Sleep(p.sleepOnFailureTime). If sleepOnFailureTime is zero (or very small), the inner retry loop in publishWorker.start will spin with minimal or no delay upon repeated failures, causing excessive CPU usage and preventing backoff under error conditions. This is reachable because sleepOnFailureTime is supplied by callers and is not validated (a guard is sketched after this list). [ Low confidence ]
  • line 141: Permanent validation failures in publishStagedEnvelope (e.g., topic parsing error, malformed payer envelope, signature recovery failure, fee calculation errors) cause the function to return false and the caller to retry indefinitely without changing any state. This leads to an infinite retry loop that permanently blocks processing of subsequent envelopes in the batch. Examples: [ Out of scope ]
  • line 217: Inconsistent handling of context cancellation in publishStagedEnvelope can cause the worker to never exit on cancellation. After the insert step, the function checks p.ctx.Err() and returns false (lines 217–219), which the caller interprets as a failure and retries indefinitely. Later, if the context is cancelled before or during the delete step, the function returns true (lines 229–231), signaling success and allowing progress. [ Out of scope ]
pkg/api/message/service.go — 0 comments posted, 4 evaluated, 4 filtered
  • line 95: NewReplicationAPIService starts the publish worker goroutine before attempting to start the subscribe worker, but if startSubscribeWorker fails, the function returns an error without stopping/cleaning up the already-started publish worker. This leaks the goroutine and any associated subscription resources, leaving background work running with no owner and potentially causing further side effects. To fix, ensure that on any subsequent failure after starting the publish worker, you stop/cancel the publish worker (and any resources it acquired) before returning. [ Previously rejected ]
  • line 362: Supplying both topics and originator_node_ids in message_api.EnvelopesQuery now silently ignores originator_node_ids whenever topics is non-empty. The new logic in Service.fetchEnvelopes prioritizes the topics branch (if len(query.GetTopics()) != 0 { ... return rows, nil }) and returns early, never applying the originator filter. Previously, a single combined SelectGatewayEnvelopes call accepted both filters. This is a contract change: callers that expect both filters to apply will receive envelopes filtered only by topics, which can lead to incorrect results. [ Already posted ]
  • line 372: Possible nil database handle: queries.New(s.store) is called with s.store across all branches of fetchEnvelopes. If s.store is nil at runtime, the resulting Queries will have a nil db and calling QueryContext inside the query methods will panic. There is no guard in fetchEnvelopes ensuring s.store is non-nil. [ Low confidence ]
  • line 387: Unsigned-to-signed cast for originator IDs in fetchEnvelopes: uint32 values from EnvelopesQuery.GetOriginatorNodeIds() are converted to int32 and stored in params.OriginatorNodeIds. If any originator node ID exceeds math.MaxInt32, this will wrap to a negative number and cause incorrect filtering in SelectGatewayEnvelopesByOriginators. [ Low confidence ]
pkg/db/gateway_envelope.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 66: Concurrent use of a single SQL transaction (sql.Tx) from multiple goroutines inside InsertGatewayEnvelopeAndIncrementUnsettledUsage is unsafe and can cause runtime errors or deadlocks. The function launches two goroutines that both call txQueries methods (IncrementUnsettledUsage and IncrementOriginatorCongestion) within the same transaction context. Per Go's database/sql contract, a sql.Tx is not safe for concurrent use across goroutines. This can lead to driver-level errors like "driver: bad connection", serialization failures, or blocked execution due to contention on the single pinned connection. [ Out of scope ]
pkg/db/types.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 29: Potential integer overflow/truncation when converting uint32 node IDs and uint64 sequence IDs from the cursor to signed types used in SQL params. In SetVectorClockByTopics, SetVectorClockByOriginators, and SetVectorClockUnfiltered, nodeID is cast from uint32 to int32 and sequenceID from uint64 to int64. Similarly, in fetchEnvelopes, originator IDs are cast from uint32 to int32. If any nodeID > math.MaxInt32 or sequenceID > math.MaxInt64, these casts will wrap to negative or truncated values, causing incorrect filtering or vector clock behavior in queries (see the conversion guards sketched after this list). [ Previously rejected ]
  • line 55: Unsigned-to-signed cast for sequence IDs in vector clock setters: uint64 sequenceID values from the cursor are cast to int64 in SetVectorClockByTopics (lines 29–31), SetVectorClockByOriginators (lines 42–44), and SetVectorClockUnfiltered (lines 55–56). If a sequence ID exceeds math.MaxInt64, it will be truncated to a negative int64, corrupting the vector clock used in queries. [ Low confidence ]
pkg/indexer/app_chain/contracts/group_message_storer.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 72: StoreLog only validates that the client envelope payload type matches the topic kind via clientEnvelope.TopicMatchesPayload(), but it never verifies that the client envelope’s target topic identifier (the bytes after the kind) matches the on-chain GroupId from the MessageSent event. The code constructs topicStruct from msgSent.GroupId[:] and later stores to that topic, regardless of what topic identifier the client envelope carries. This can lead to storing an envelope under a topic derived from the event even if the envelope’s own target topic identifier differs. To preserve integrity, also check that clientEnvelope.TargetTopic().Bytes() (or identifier) matches topicStruct.Bytes()/msgSent.GroupId before storing; otherwise, reject the log. [ Low confidence ]
pkg/indexer/app_chain/contracts/identity_update_storer.go — 0 comments posted, 4 evaluated, 4 filtered
  • line 111: Misclassification of transient database errors as non-recoverable in StoreLog: errors from querier.GetLatestSequenceId are wrapped with re.NewNonRecoverableError(ErrGetLatestSequenceID, err) (lines 106–112). If the error is a transient database issue, returning a non-recoverable error will prevent retry and may lead to dropped events. Consider classifying database operation errors as recoverable (or propagate raw errors to be wrapped as recoverable at the outer level), consistent with other DB operations in this function. [ Out of scope ]
  • line 144: Misclassification of validation errors as non-recoverable may erroneously mark transient DB errors as non-retryable: StoreLog wraps all errors from validateIdentityUpdate with re.NewNonRecoverableError(ErrValidateIdentityUpdate, err) (lines 136–145). validateIdentityUpdate performs a DB query (SelectGatewayEnvelopesByTopics) and may return errors due to transient database issues. Treating these as non-recoverable will prevent retry, potentially dropping events. Consider distinguishing between validation failures (non-recoverable) and underlying IO/DB errors (recoverable). [ Out of scope ]
  • line 149: Potential nil pointer dereference: the code accesses associationState.StateDiff.NewMembers and associationState.StateDiff.RemovedMembers without checking whether associationState or associationState.StateDiff are non-nil. Since mlsvalidate.AssociationStateResult.StateDiff is a pointer, it can be nil (e.g., if there are no changes or the validator returns a state without a diff). Accessing a field on a nil pointer will panic. Add a guard such as if associationState == nil || associationState.StateDiff == nil { ... } before dereferencing. [ Out of scope ]
  • line 290: Defensive validation gap: validateIdentityUpdate passes identityUpdate.IdentityUpdate to MLSValidationService.GetAssociationStateFromEnvelopes without checking for nil. While the outer type assertion ensures the payload is an identity-update wrapper, the inner IdentityUpdate pointer can still be nil in protobuf-generated types. Passing nil may cause downstream logic to panic or misbehave if the service implementation assumes a non-nil value. Add a check like if identityUpdate.IdentityUpdate == nil { return nil, fmt.Errorf("identity update is nil") }. [ Out of scope ]
pkg/migrator/writer.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 60: Silent integer overflow risk when casting env.OriginatorSequenceID() from uint64 to int64 for multiple DB parameters. OriginatorSequenceID is derived from a uint64 (originator sequence), but the code passes it to the database as int64 for InsertGatewayEnvelopeParams.OriginatorSequenceID (line 60) and again for IncrementUnsettledUsage.SequenceID (line 81) and UpdateMigrationProgress.LastMigratedID (line 89). If the sequence ID exceeds math.MaxInt64, the conversion will wrap to a negative number silently, resulting in incorrect keys/progress and potential data corruption or failed lookups. [ Low confidence ]
  • line 65: Silent integer overflow risk when casting expiry time from uint64 to int64. Expiry is built from env.UnsignedOriginatorEnvelope.Proto().GetExpiryUnixtime() (returns uint64) and cast to int64 (lines 65–67). If expiry exceeds math.MaxInt64, conversion will wrap negative, producing invalid expiry timestamps in the database. [ Previously rejected ]
pkg/mlsvalidate/service.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 110: Potential nil newUpdate passed to GetAssociationState, yielding a gRPC request with a nil element in NewUpdates. In GetAssociationStateFromEnvelopes, newUpdate is forwarded without a nil check (line 110). Callers like IdentityUpdateStorer.validateIdentityUpdate do not verify identityUpdate.IdentityUpdate != nil, so a nil can be passed under realistic conditions if the client envelope has the oneof wrapper set but inner IdentityUpdate is nil. This may cause marshalling errors or runtime panics in gRPC/protobuf when serializing a request containing a nil message. [ Low confidence ]
pkg/server/server.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 388: In startAPIServer, serviceRegistrationFunc creates and starts a CursorUpdater via metadata.NewCursorUpdater before constructing replicationService. If NewReplicationAPIService returns an error, the function returns that error without stopping the CursorUpdater. This leaks the cursor updater goroutine and its resources. Any subsequent failure in the registration function should clean up already-started background components (e.g., call CursorUpdater.Stop() or cancel its context) before returning. [ Previously rejected ]
pkg/sync/envelope_sink.go — 0 comments posted, 5 evaluated, 4 filtered
  • line 128: originatorID := int32(env.OriginatorNodeID()) may overflow if the originator node ID exceeds math.MaxInt32. In that case, the resulting originatorID becomes negative, which will be propagated to unsettled usage/congestion accounting and persisted to the database. Add a bounds check (reject IDs > MaxInt32, or change types to int64/uint32 through to storage). [ Low confidence ]
  • line 141: storeEnvelope converts expiry from uint64 (GetExpiryUnixtime()) to int64 with int64(expiry) and writes it to the database via queries.InsertGatewayEnvelopeParams{ Expiry: int64(expiry) }. If expiry exceeds math.MaxInt64, the conversion silently wraps to a negative int64. This can corrupt stored expiry and lead to incorrect behavior downstream. Add an upper-bound check (e.g., if expiry > math.MaxInt64 then clamp/reject) and decide policy for zero/negative values. [ Previously rejected ]
  • line 141: Behavior change: previously storeEnvelope only persisted an expiry when expiry > 0 (writing a SQL NULL otherwise). The new code always persists an Expiry value, including 0. This changes the external contract/semantics from “no expiry stored” (NULL) to “expiry = 0”, which many schemas or queries treat differently. If consumers differentiate NULL vs 0, this can cause incorrect behavior. To preserve parity, either keep NULL semantics for non-positive expiry or explicitly migrate downstream logic to treat 0 equivalently and document the change. [ Low confidence ]
  • line 146: MinutesSinceEpoch returns an int32, and storeEnvelope passes it through as MinutesSinceEpoch: utils.MinutesSinceEpoch(originatorTime). If originatorTime is far in the future (or far past), the minute count can overflow int32, truncating to an incorrect value. Since OriginatorNs comes from the envelope, a malformed/malicious envelope could trigger this. Add bounds checks or clamp to a safe range, and consider rejecting envelopes with unreasonable timestamps. [ Previously rejected ]
pkg/testutils/store.go — 0 comments posted, 4 evaluated, 3 filtered
  • line 53: Unsafe SQL string concatenation for database identifier. The code builds SQL statements using "CREATE DATABASE " + dbName and "DROP DATABASE " + dbName without quoting or validating dbName. If dbName contains characters that are not valid in unquoted PostgreSQL identifiers (e.g., hyphen, space) or contains SQL metacharacters, this can cause runtime SQL errors or even SQL injection in tests. At minimum, the name should be validated to match allowed identifier characters or wrapped with proper identifier quoting (e.g., using pgx.Identifier.Sanitize/pgx.Identifier or a helper to quote identifiers). This affects both the create and drop statements. [ Low confidence ]
  • line 57: Using raw string concatenation for DROP DATABASE cleanup also lacks IF EXISTS, so if the database was never created or was already dropped (e.g., partial failures or external interference), the cleanup will error and, due to require.NoError in the cleanup, abort remaining cleanups. Consider using DROP DATABASE IF EXISTS to make cleanup idempotent and robust. [ Low confidence ]
  • line 76: In NewDBs, an empty slice is created with zero capacity and appended to in a loop. While not a functional bug, for large count this can cause avoidable reallocations. Preallocating with make([]*sql.DB, 0, count) would prevent repeated allocations. Note: this is a performance micro-optimization and does not affect correctness. [ Code style ]
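Two themes recur across these filtered issues: unchecked unsigned-to-signed casts (node IDs, sequence IDs, expiry timestamps) and a caller-supplied retry sleep that is never validated. A minimal Go sketch of guards for both follows; the helper names (safeInt32, safeInt64, clampSleep) are hypothetical and not part of this PR.

```go
package db

import (
	"fmt"
	"math"
	"time"
)

// safeInt32 rejects uint32 values that would wrap negative when cast to
// int32, instead of silently producing an incorrect filter value.
func safeInt32(v uint32) (int32, error) {
	if v > math.MaxInt32 {
		return 0, fmt.Errorf("value %d overflows int32", v)
	}
	return int32(v), nil
}

// safeInt64 does the same for uint64 sequence IDs and expiry timestamps.
func safeInt64(v uint64) (int64, error) {
	if v > math.MaxInt64 {
		return 0, fmt.Errorf("value %d overflows int64", v)
	}
	return int64(v), nil
}

// clampSleep enforces a floor on a caller-supplied retry sleep so a zero
// value cannot turn the publish retry loop into a busy spin.
func clampSleep(d, floor time.Duration) time.Duration {
	if d < floor {
		return floor
	}
	return d
}
```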


mkysel commented Nov 4, 2025

Keep the Hot Path Fast by Handling Unusual Partition Errors Separately

In normal operation, all necessary partitions already exist, and the vast majority of inserts succeed immediately. Handling the “missing partition” case as a rare exception instead of building defensive partition-creation logic into every insert keeps the hot path as lean as possible:

Avoids extra round-trips and locks.
If we were to check for or create partitions preemptively before every insert, each write would need to touch metadata tables and possibly acquire DDL locks. That adds latency and contention to every single insert, even though missing partitions almost never happen in steady state.

Optimizes for the common case.
The “no partition of relation …” error occurs only when a new node or sequence band is first seen. By treating it as an exceptional path, we let the normal insert remain a single SQL call — the fastest possible path for the 99.99% case.

Isolates the slow, rare logic.
When the rare partition error does occur, we handle it immediately and locally:

  • detect the specific “no partition” message,
  • call EnsureGatewayParts in the same connection,
  • and retry once.
    This ensures forward progress without polluting the fast path with conditional checks or metadata lookups (a sketch of this shape follows below).

Improves scalability under load.
Systems ingesting millions of envelopes per node benefit from minimizing per-insert overhead. Every microsecond saved on the hot path compounds into tangible throughput gains, while the cold path for new partitions happens rarely enough to be negligible.

Maintains correctness and safety.
The retry path is still fully deterministic and idempotent: the partition is created exactly once, then subsequent inserts go straight through.
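Outside an explicit transaction, the same exception-path shape applies without SAVEPOINTs. A hedged sketch of what the standalone variant might look like; the names and the error-string check are assumptions (the real helper is db.InsertGatewayEnvelopeWithChecksStandalone):

```go
package db

import (
	"context"
	"database/sql"
	"strings"
)

// insertStandalone is a hypothetical sketch of the non-transactional path:
// the common case is a single INSERT; only on a missing-partition error do
// we pay for DDL, then retry exactly once.
func insertStandalone(
	ctx context.Context,
	dbConn *sql.DB,
	insert func(context.Context, *sql.DB) error,
	ensureParts func(context.Context, *sql.DB) error,
) error {
	err := insert(ctx, dbConn) // hot path: one SQL call
	if err == nil || !strings.Contains(err.Error(), "no partition of relation") {
		return err
	}
	// Cold path: create the missing partition (idempotent DDL), retry once.
	if err := ensureParts(ctx, dbConn); err != nil {
		return err
	}
	return insert(ctx, dbConn)
}
```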

@mkysel mkysel marked this pull request as ready for review November 4, 2025 01:25
@mkysel mkysel requested a review from a team as a code owner November 4, 2025 01:25
@mkysel mkysel changed the title from Sharded Gateway Envelopes to Partitioned Gateway Envelopes on Nov 4, 2025
mkysel commented Nov 4, 2025

There is no migration path. This assumes a DB wipe. Our testnet-dev DBs are unreadable, and we would never be able to migrate them anyway.

// This function runs inside a managed transaction created by RunInTxWithResult().
//
// Steps:
// 1. Calls InsertGatewayEnvelopeWithChecksTransactional() to insert the envelope,

Contributor commented:

I thought we were trying to avoid mixing DDL operations with DML workflows?

This at least handles the flow quite nicely: it only pays a performance penalty when partitions are missing, and it handles rollbacks cleanly. But still, it has a bit of an ick to it.

  1. Makes performance harder to reason about (some inserts take much longer than others)
  2. Scatters any errors in this flow across the logs of normal writes (maybe we can help address that by emitting a specific metric on these failures; a sketch follows below)

The alternative is to have some worker pre-creating the partitions, which has its own problems and complexities. So IDK
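On point 2, a counter dedicated to the cold path would make on-the-fly partition creation observable without grepping normal write logs. A sketch using Prometheus client_golang; the metric name and labels are assumptions, not existing instrumentation in this PR:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counter: surfaces inserts that hit a missing partition and
// triggered DDL, instead of burying them in the logs of normal writes.
var partitionCreateOnInsert = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "xmtpd_gateway_partition_create_on_insert_total",
		Help: "Inserts that hit a missing partition and triggered DDL.",
	},
	[]string{"outcome"}, // "created" or "failed"
)

// recordPartitionCreate would be called from the cold path after the
// partition-creation attempt.
func recordPartitionCreate(err error) {
	if err != nil {
		partitionCreateOnInsert.WithLabelValues("failed").Inc()
		return
	}
	partitionCreateOnInsert.WithLabelValues("created").Inc()
}
```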

Collaborator Author (mkysel) commented:

yeah. Perfect intuition here.

I don't like the background worker option. The worker then has to listen to the registry. It also has to run frequently enough to pre-fill it. And it's a nightmare for tests, since some of them create random originators. And there are special originators, such as 10-13, which are not even in the registry, and you have to remember they exist.

Collaborator Author (mkysel) commented:

the error flow is not so bad. If it fails with "missing partition", it will create the partition and retry, without surfacing any errors or printing any logs.

We could, of course, at least log the fact that we did create a new partition for a given nodeid/seq-range.

If the DDL fails, we might see the error in rather unexpected places. But the DDL is super simple, with IF NOT EXISTS, so it should be pretty safe (see the sketch below).
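For reference, the idempotent DDL shape being described might look like the sketch below. The partition naming scheme, the two-column range bounds, and the band width are assumptions for illustration; the actual statements live in EnsureGatewayParts.

```go
package db

import (
	"context"
	"database/sql"
	"fmt"
)

// ensureGatewayParts is a hypothetical sketch of the idempotent DDL this
// thread refers to: CREATE TABLE IF NOT EXISTS makes the concurrent retry
// path safe even if two inserts race to create the same partition.
func ensureGatewayParts(ctx context.Context, tx *sql.Tx, nodeID int32, seqID int64) error {
	const bandSize = int64(1_000_000) // assumed sequence band width
	lo := (seqID / bandSize) * bandSize
	hi := lo + bandSize
	ddl := fmt.Sprintf(
		`CREATE TABLE IF NOT EXISTS gateway_envelopes_meta_%d_%d
		 PARTITION OF gateway_envelopes_meta
		 FOR VALUES FROM (%d, %d) TO (%d, %d)`,
		nodeID, lo, nodeID, lo, nodeID, hi,
	)
	_, err := tx.ExecContext(ctx, ddl)
	return err
}
```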

Contributor commented:

I'm alright with saying this is the least-bad option. Agree the worker would have its own ick. At least this should be consistent and reliable.

Collaborator Author (mkysel) commented:

👍

@mkysel mkysel requested a review from neekolas November 4, 2025 21:08

@neekolas (Contributor) left a comment:

This looks great. Let's start wiping things

@mkysel mkysel merged commit a320a21 into main Nov 4, 2025
12 checks passed


Development

Successfully merging this pull request may close these issues.

  • Error querying for DB subscription: ERROR: canceling statement due to statement timeout (SQLSTATE 57014)
  • Enhance SelectGatewayEnvelopes performance
