Skip to content

Conversation

@fbac
Copy link
Collaborator

@fbac fbac commented Oct 24, 2025

Switch node APIs from gRPC to Connect-go and serve Replication/Metadata/Payer handlers over HTTP/2 h2c while clamping QueryEnvelopes limit to max when 0 or above max and sending an initial keepalive in message.Service.SubscribeEnvelopes

Replace gRPC servers and interceptors with Connect-go handlers, request/response types, and HTTP/2 h2c serving; update gateway, server startup, interceptors, metrics, and tests to use Connect; generate Connect clients/handlers for all protobuf APIs; adjust error handling to connect.NewError and change subscribe to send an initial keepalive; clamp QueryEnvelopes limits and move client construction helpers to Connect.

📍Where to Start

Start with the Connect service migration in message.Service at [file:pkg/api/message/service.go], then review server setup and registration changes in api.APIServer at [file:pkg/api/server.go] and server.startAPIServer at [file:pkg/server/server.go].

Changes since #1277 opened

  • Modified stream consumption pattern in TestSubscribeEnvelopesInvalidRequest test [a5d7161]
  • Updated cursor validation approach in TestSubscribeSyncCursorBasic test [a5d7161]
  • Replaced database usage in TestGetNewestEnvelope test [a5d7161]

📊 Macroscope summarized a5d7161. 22 files reviewed, 63 issues evaluated, 55 issues filtered, 0 comments posted

🗂️ Filtered Issues

cmd/xmtpd-cli/commands/generate.go — 0 comments posted, 3 evaluated, 3 filtered
  • line 96: Resource leak: stress.NewEnvelopesGenerator was changed to set cleanup to func() { cancel() } instead of closing the underlying transport connection (previously conn.Close()). As a result, welcomeMessageHandler's deferred generator.Close() (lines 96–101) now only cancels a timeout context and does not close any gRPC/GRPC-Web client connections created inside NewEnvelopesGenerator. This violates single paired cleanup and can leak network resources after the command completes. The handlers continue to call Close() expecting cleanup, but the transport is left open. [ Low confidence ]
  • line 197: Resource leak: stress.NewEnvelopesGenerator now sets cleanup to only cancel a context. groupMessageHandler defers generator.Close() (lines 197–202), but that no longer closes the gRPC client connection created inside NewEnvelopesGenerator. This breaks the single paired cleanup invariant and can leave open transports after command completion. [ Low confidence ]
  • line 292: Resource leak: stress.NewEnvelopesGenerator cleanup now only cancels a context. keyPackageHandler defers generator.Close() (lines 292–297), but this no longer closes the underlying gRPC client connection built inside NewEnvelopesGenerator. The transport remains open, violating single paired cleanup and potentially leaking resources. [ Low confidence ]
pkg/api/message/service.go — 0 comments posted, 8 evaluated, 8 filtered
  • line 164: SubscribeEnvelopes creates a time.Ticker with s.options.SendKeepAliveInterval and also calls ticker.Reset(s.options.SendKeepAliveInterval) on each channel receive. If SendKeepAliveInterval is zero or negative (e.g., misconfigured via XMTPD_API_SEND_KEEP_ALIVE_INTERVAL), time.NewTicker and ticker.Reset will panic with "non-positive interval for NewTicker/Reset". There is no validation enforcing a positive duration. This results in a runtime crash on subscription setup or during runtime when resetting the ticker. To fix, validate s.options.SendKeepAliveInterval > 0 up-front and reject the request or use a sane default before creating or resetting the ticker. [ Out of scope ]
  • line 334: QueryEnvelopes silently caps limit to maxRequestedRows when a larger limit is requested (lines 334-339). Previously, the implementation allowed any non-zero limit from the request (per diff), only defaulting when zero. This is an externally visible contract change: clients requesting more than maxRequestedRows will receive fewer envelopes than requested, with no error or warning, which may cause unexpected behavior. [ Low confidence ]
  • line 437: In fetchEnvelopes, originator node IDs are converted from uint32 to int32 when populating queries.SelectGatewayEnvelopesByOriginatorsParams.OriginatorNodeIds. If a caller supplies a value greater than math.MaxInt32, the conversion will wrap to a negative int32, potentially causing incorrect query semantics or filtering. While it may not crash, it can lead to silently incorrect behavior. Consider validating the range or using a DB parameter type that preserves unsigned semantics. [ Previously rejected ]
  • line 642: GetInboxIds does not limit or validate the number of requests processed. The handler builds addresses from req.Msg.Requests (lines 642-645) and queries the DB with the entire list (line 646) without enforcing a maximum. This can enable oversized requests leading to high memory usage and expensive queries, unlike QueryEnvelopes which enforces maxQueriesPerRequest. [ Low confidence ]
  • line 659: GetInboxIds response selection is potentially nondeterministic when multiple log entries exist per address. The code assigns resp.InboxId to the value from the last matching logEntry encountered (lines 659-664) without ordering by or comparing AssociationSequenceID. If addressLogEntries contains multiple entries per address, the chosen inbox ID depends on DB result order, which may not reflect the most recent or correct association. [ Low confidence ]
  • line 690: GetNewestEnvelope does not validate the number or size of topics before querying the database. There are constants like maxQueriesPerRequest and maxTopicLength used in validateQuery, but no analogous validation is performed here (lines 690-701). A caller can submit a very large list of topics or topics exceeding expected length, potentially causing excessive memory usage, oversized SQL array parameters, or DB errors. [ Low confidence ]
  • line 696: GetNewestEnvelope mishandles duplicate topics inputs. The code builds a single-index map originalSort keyed by string(topic) (lines 696-699) and later assigns a result only to the index returned for that key (lines 719-735). If the request includes the same topic multiple times, originalSort will contain only the last index for that topic, and earlier duplicate indices will remain nil in response.Msg.Results. This violates the externally visible contract where each provided topic should get its newest envelope (or null). [ Already posted ]
  • line 811: validateKeyPackage directly dereferences nested fields payload.UploadKeyPackage.KeyPackage.KeyPackageTlsSerialized without checking for nil. Both UploadKeyPackage and its inner KeyPackage are pointer fields in the protobuf types and can legitimately be nil in incoming requests. If either is nil, this will cause a panic due to a nil pointer dereference. The code should defensively check payload.UploadKeyPackage != nil and payload.UploadKeyPackage.KeyPackage != nil before accessing KeyPackageTlsSerialized, and return a connect error if they are missing. [ Low confidence ]
pkg/api/payer/service.go — 0 comments posted, 6 evaluated, 6 filtered
  • line 91: GetNodes now wraps registry fetch errors as connect.CodeInternal where the previous implementation returned Unavailable. This change in error code semantics can alter client retry logic and error handling behavior. [ Low confidence ]
  • line 99: GetNodes no longer returns an error when there are zero available nodes. Previously, the implementation returned Unavailable on empty results; now it returns a successful response with an empty Nodes map. This is a contract/parity change that can break clients relying on the prior error to trigger retries or fallback. [ Low confidence ]
  • line 136: PublishClientEnvelopes returns a raw error from env.payload.Bytes() instead of a connect.Error, leading to inconsistent error typing and potentially incorrect HTTP/Connect status mapping. The handler otherwise consistently wraps errors with connect.NewError. Returning a plain error here can cause the framework to surface an unknown/incorrect status to clients. [ Low confidence ]
  • line 140: PublishClientEnvelopes uses status.Errorf(codes.InvalidArgument, ...) (gRPC status) instead of connect.NewError when rejecting over-sized messages. This mixes error types in a Connect handler and can result in incorrect status mapping and response encoding for clients. [ Low confidence ]
  • line 378: Invalid input (identifier length) is reported as connect.CodeInternal rather than connect.CodeInvalidArgument. When ParseGroupID(identifier) fails due to an invalid identifier length, the code returns connect.NewError(connect.CodeInternal, ...) at case topic.TopicKindGroupMessagesV1. This misclassifies a client input validation error as an internal server error and breaks the external contract by not signaling the caller that the request is invalid. [ Low confidence ]
  • line 419: Invalid input (identifier length) is reported as connect.CodeInternal rather than connect.CodeInvalidArgument. When ParseInboxID(identifier) fails due to an invalid identifier length, the code returns connect.NewError(connect.CodeInternal, ...) at case topic.TopicKindIdentityUpdatesV1. This misclassifies a client input validation error as an internal server error and breaks the external contract by not signaling the caller that the request is invalid. [ Low confidence ]
pkg/api/server.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 25: APIServerConfig.PromRegistry and WithPrometheusRegistry are accepted/options-exposed but never used in NewAPIServer. In the prior gRPC implementation, the Prometheus registry was used to register gRPC server metrics (grpcprom.NewServerMetrics). After the refactor to HTTP/Connect handlers, cfg.PromRegistry is silently ignored, resulting in a loss of metrics registration and an externally visible contract change. Callers can pass a registry expecting metrics but receive none, with no error or warning. [ Low confidence ]
  • line 166: APIServer.Close can hang indefinitely when shutdown times out. At lines 166–181, Close creates a timeout context and calls svc.httpServer.Shutdown(shutdownCtx). If Shutdown returns an error due to the timeout (e.g., context deadline exceeded), the server may still be serving and the goroutine started in Start may remain blocked in Serve(svc.listener). The code then unconditionally calls svc.wg.Wait(). Without a forced shutdown (e.g., svc.httpServer.Close() or explicitly closing the listener) to unblock Serve, wg.Wait() can block forever, causing the whole shutdown sequence to hang. This is a regression compared to the previous gRPC implementation, which enforced a forced stop after a graceful timeout (GracefulStop then Stop). [ Already posted ]
pkg/gateway/builder.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 147: Build may construct multiple long-lived resources (Redis client via ensureRedis/setupNonceManager, blockchain publisher via setupBlockchainPublisher, node registry via setupNodeRegistry, and a metrics server via setupMetrics) before calling buildGatewayService. If buildGatewayService later fails (e.g., due to an invalid payer private key or API server initialization error), the function returns an error without tearing down these previously created resources. Given the interfaces (IBlockchainPublisher.Close() and NodeRegistry.Stop() exist) and the metrics server likely holds a listener, this results in leaked connections/goroutines. To fix, ensure that on any error after creating these resources, they are properly closed/stopped (e.g., using defers for cleanup that are canceled when the build completes successfully), or structure the build to allocate resources after all preceding steps are guaranteed not to fail. [ Already posted ]
  • line 243: buildGatewayService creates a net.Listener with net.Listen but does not close it on all failure paths. Specifically, if api.NewAPIServer (or its internally invoked RegistrationFunc) returns an error, the function calls cancel() and returns the error without calling listener.Close(). This leaves the port bound and the listener resource leaked. The same occurs if any error is returned after the listener is created but before serving, e.g., in registrationFunc when payer.NewPayerAPIService fails. To fix, explicitly defer listener.Close() immediately after successful creation, and if startup succeeds and ownership is transferred to the API server, cancel the defer (or reassign responsibility) appropriately. Alternatively, close the listener in each error branch after it is created. [ Already posted ]
pkg/gateway/interceptor.go — 0 comments posted, 6 evaluated, 6 filtered
  • line 42: WrapUnary and WrapStreamingHandler rely on req.Peer().Addr and conn.Peer().Addr without verifying presence. If Peer() returns an object with an empty Addr or if the environment doesn't set peer info (e.g., certain in-process transports), downstream IdentityFn implementations that assume a non-empty address could panic or misbehave. There is no fallback or explicit error when peer address is empty. [ Low confidence ]
  • line 42: Nil identityFn is not guarded against. WrapUnary calls i.identityFn(req.Header(), req.Peer().Addr) without checking that i.identityFn is non-nil. If the interceptor is constructed with a nil IdentityFn, the first request will panic due to a nil function call. [ Low confidence ]
  • line 68: Nil authorizer functions are not guarded against. WrapUnary iterates i.authorizers and calls each authorizer(ctx, identity, summary) without checking for nil. If any element in authorizers is nil, the request will panic. [ Low confidence ]
  • line 122: Authorization is not enforced for streaming handlers, leading to contract parity issues with unary path. WrapStreamingHandler sets identity on context and calls next but does not apply the same authorizer checks as WrapUnary. If any streaming publish or sensitive operation exists, this asymmetry can allow unauthorized operations over streaming while being blocked over unary. [ Low confidence ]
  • line 131: Client-visible error message contract changed for rate-limit and other GatewayServiceErrors. Previously, the code returned status.Error(rlError.Code(), rlError.ClientMessage()), ensuring a client-safe message. Now returnRetryAfterError wraps and returns the rlError itself via connect.NewError(rlError.Code(), rlError), potentially exposing internal error details instead of the intended client-facing ClientMessage(). This is a runtime behavior change that can leak implementation details to clients. [ Low confidence ]
  • line 135: Potentially invalid Retry-After value when negative durations are provided. returnRetryAfterError formats the header as int(retryAfter.Seconds()). If a negative duration is returned by RetryAfter(), the emitted Retry-After will be a negative integer string, which is not a valid value per HTTP spec. There is no validation or clamping to ensure non-negative seconds. [ Low confidence ]
pkg/interceptors/client/auth.go — 0 comments posted, 2 evaluated, 2 filtered
  • line 98: In both gRPC interceptors Unary and Stream, the code uses metadata.NewOutgoingContext(ctx, md) to attach the authorization header. NewOutgoingContext replaces any existing outgoing metadata in ctx with md, which can silently discard metadata previously set by callers or other interceptors. This can lead to loss of other headers (e.g., tracing, custom auth, per-RPC settings) and cause unexpected behavior downstream. To safely add the token without clobbering existing metadata, use metadata.AppendToOutgoingContext(ctx, constants.NodeAuthorizationHeaderName, token.SignedString) or merge with existing metadata via metadata.FromOutgoingContext and metadata.NewOutgoingContext with a combined md. [ Low confidence ]
  • line 156: In WrapStreamingClient, when token acquisition fails, the code still calls next(ctx, spec) to obtain a connect.StreamingClientConn and then wraps it with streamingAuthInterceptorFailure. This establishes an underlying streaming connection without an auth header even though all subsequent Send/Receive calls will fail with Unauthenticated. This can cause unintended resource acquisition on both client and server and risks leaking the connection if the caller doesn’t explicitly close it after encountering the immediate send/receive error. A safer approach is to avoid calling next when token generation fails and instead return a synthetic failing connection (or propagate the error immediately if the API allows), ensuring no underlying stream is created when authentication isn’t possible. [ Low confidence ]
pkg/interceptors/server/auth.go — 0 comments posted, 4 evaluated, 4 filtered
  • line 46: Multiple NodeAuthorization header values are no longer explicitly rejected. The prior gRPC interceptor logic (extractToken) returned an error if multiple auth tokens were provided. The new Connect implementation uses req.Header().Get(...)/conn.RequestHeader().Get(...), which silently selects one value when multiple are present. This changes the authentication contract and can lead to ambiguous or unsafe acceptance of requests with multiple tokens instead of failing fast. Explicitly detect and reject multiple header values. [ Low confidence ]
  • line 53: The interceptor assumes i.verifier is non-nil and unconditionally calls i.verifier.Verify(token) in both WrapUnary and WrapStreamingHandler. If NewServerAuthInterceptor is constructed with a nil authn.JWTVerifier, this will panic at runtime. Add a nil check or enforce non-nil verifier during construction. [ Low confidence ]
  • line 113: The interceptor assumes i.logger is non-nil and unconditionally dereferences it in connectLogIncomingAddress via i.logger.Core().Enabled(...). If NewServerAuthInterceptor is called with a nil *zap.Logger (there are no guards in the constructor), both unary and streaming paths will call i.connectLogIncomingAddress(...) and panic on a nil pointer dereference. Add a nil check before using i.logger or enforce a non-nil logger at construction. [ Low confidence ]
  • line 114: Logging behavior changed: on address parsing failure (net.SplitHostPort(addr) returns an error), connectLogIncomingAddress now skips logging entirely. Previously, the gRPC interceptor attempted reverse DNS and, on failure, still logged with dns_name set to "unknown". This silent drop of logging on malformed addresses reduces observability and deviates from prior behavior. Consider logging with dns_name=unknown and the raw address even when parsing fails. [ Code style ]
pkg/interceptors/server/logging.go — 0 comments posted, 3 evaluated, 2 filtered
  • line 88: Logged error code may not match the returned sanitized error code in WrapStreamingHandler, same as the unary interceptor: logging uses connect.CodeOf(err) while the returned error is sanitizeError(err) which maps context errors to specific codes. This can produce logs showing unknown where clients receive deadline_exceeded or canceled. [ Low confidence ]
  • line 156: sanitizeError discards *connect.Error metadata (details, headers, trailers) by constructing a new connect.Error with only finalCode and a new generic error for the message. If handlers attach structured error details or response headers/trailers to a connect.Error, these are lost. This breaks contract parity for error propagation: clients will not receive intended error details, and any server-attached metadata will be silently dropped. [ Low confidence ]
pkg/mlsvalidate/service.go — 0 comments posted, 5 evaluated, 5 filtered
  • line 47: NewMLSValidationService ties the gRPC connection’s lifetime to the incoming ctx (at lines 47-50), which in the visible call chain is cfg.Ctx from NewBaseServer, not the server’s derived svc.ctx. As a result, canceling the server (svc.cancel) will not close the MLS validation gRPC connection; it will remain open until cfg.Ctx is canceled. This creates a lifecycle mismatch and can leak the connection during server shutdown, violating the single paired cleanup and lifecycle contract. The fix is to use the server’s lifecycle context (e.g., svc.ctx) or ensure the passed ctx aligns with the server shutdown semantics. [ Low confidence ]
  • line 62: In GetAssociationState, there is no validation that newUpdates contains at least one update. If an empty slice (or one with nil elements) is passed, the server may reject the request or behave unexpectedly. While this method won’t panic locally, adding an explicit check (e.g., require exactly one new update or at least one, depending on API contract) would avoid sending malformed or semantically invalid requests and provides earlier, clearer errors. [ Code style ]
  • line 163: Potential nil-pointer dereference when building the request for ValidateGroupMessages: the method does not validate groupMessages elements before passing them to makeValidateGroupMessagesRequest. Inside that helper (outside this code object), each element is used via groupMessage.GetV1().Data. If any element in groupMessages is nil or if the oneof Version is not set to V1 (so GetV1() returns nil), dereferencing .Data will panic. Add explicit validation in ValidateGroupMessages to ensure each groupMessages[i] is non-nil and has a non-nil GetV1() before calling the helper, or make the helper handle these cases gracefully. [ Low confidence ]
  • line 173: ValidateGroupMessages does not verify that the number of responses from the server matches the number of input messages. It constructs out sized to len(response.Responses) and returns it directly. If the service returns fewer or more responses than inputs, this will silently produce a result slice whose length does not correspond to the input size, potentially losing alignment between inputs and outputs. Add a sanity check if len(response.Responses) != len(groupMessages) { return nil, fmt.Errorf(...) } to enforce contract parity. [ Low confidence ]
  • line 175: ValidateGroupMessages fails the entire batch on the first per-item error (!response.IsOk), returning an error and discarding any results computed earlier in the loop. If callers expect per-item results (successes alongside failures) for a batch, this behavior changes the externally visible contract by making the operation all-or-nothing. Consider returning a full results array with per-item error details, or at least accumulate errors and include the index to aid diagnosis; otherwise, document clearly that the method fails-fast and returns no partial results. [ Code style ]
pkg/server/server.go — 0 comments posted, 4 evaluated, 3 filtered
  • line 257: Resources started during NewBaseServer are not cleaned up on subsequent failures, causing goroutine and listener leaks. Specifically, the function starts components incrementally and returns early on later errors without stopping those already started: [ Previously rejected ]
  • line 416: startAPIServer builds up authentication interceptors but does not attach them unless both jwtVerifier and authInterceptor are non-nil. While the guard seems okay, there is no fallback or explicit logging when auth is expected but unavailable. More importantly, there is no validation that svc.registrant is present when API is enabled; startAPIServer proceeds without auth if either svc.nodeRegistry or svc.registrant is nil, allowing unauthenticated access. Given upstream logic, svc.registrant should be initialized whenever API is enabled, but if that invariant is broken (e.g., via options), the API would start without auth and with no error. At minimum, enforce that when API is enabled and nodeRegistry is provided, registrant must be non-nil, or explicitly document/guard the unauthenticated mode. [ Already posted ]
  • line 433: TCP listener leak on startAPIServer error path: the function creates a net.Listener (lines 433–436) and then attempts to construct the API server (lines 518–522). If api.NewAPIServer returns an error, the listener is not closed, resulting in a file descriptor/resource leak. Similarly, if registration or subsequent steps fail, there is no Close() on the listener. The listener should be closed on any error after creation unless ownership is transferred to the API server successfully. [ Already posted ]
pkg/stress/envelopes_generator.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 47: Using a timeout context to construct a long-lived HTTP/2 client causes future connection attempts to fail after the timeout elapses, due to the context being captured in the transport dial closure. In NewEnvelopesGenerator, a context with a 60-second timeout is created and then passed into the Connect client builders. Those builders call utils.BuildHTTP2Client(ctx, isTLS), which (on the h2c/plaintext path) sets http2.Transport.DialTLS to a closure that uses net.Dialer.DialContext(ctx, ...). Because that closure retains the ctx from construction time, once the 60-second timeout expires or cleanup calls cancel(), any subsequent RPCs that require a new connection will attempt dials with a canceled context and fail. This can manifest when using ProtocolConnect, ProtocolConnectGRPC, or ProtocolConnectGRPCWeb over plaintext HTTP (h2c). The correct approach is to avoid capturing a cancelable context in the transport's dialer (use a non-cancelable/background context or Dial with timeouts), or build the HTTP client with a context that remains valid for the client's lifetime. Specifically: [ Already posted ]
pkg/testutils/server/server.go — 0 comments posted, 1 evaluated, 1 filtered
  • line 66: Potential panic in test helper NewTestBaseServer due to missing nil check for cfg.PrivateKey. At line 66, hex.EncodeToString(crypto.FromECDSA(cfg.PrivateKey)) dereferences cfg.PrivateKey without validation. If a test passes nil (which is reachable from the type with no guards), crypto.FromECDSA will attempt to read from a nil pointer and panic. Add an explicit nil check and return a clear error (or generate a test key) when cfg.PrivateKey is nil. [ Test / Mock code ]
pkg/utils/api.go — 0 comments posted, 3 evaluated, 3 filtered
  • line 25: AuthorizationHeaderFromContext strips the Bearer scheme using strings.TrimPrefix(auth[0], "Bearer "), which is case-sensitive and only removes exactly "Bearer " with a single space. If the metadata carries authorization: bearer <token> (lowercase), has multiple spaces or different whitespace (e.g., tabs), or omits the space ("Bearer<token>"), the prefix will not be removed and the returned value will include the scheme text (e.g., "bearer <token>"). Downstream consumers expecting just the raw token (e.g., JWT parsers) will fail to parse, causing authentication failures. [ Low confidence ]
  • line 34: AuthorizationTokenFromHeader strips the Bearer scheme using strings.TrimPrefix(token, "Bearer "), which is case-sensitive and only removes exactly "Bearer " with a single space. If the HTTP header is authorization: bearer <token> (lowercase), contains multiple spaces or tabs, or uses "Bearer<token>" (no space), the prefix will not be removed and the returned value will include the scheme text. Downstream JWT parsing will likely fail due to the malformed token string. [ Low confidence ]
  • line 65: ClientIPFromHeaderOrPeer drops the peer address if net.SplitHostPort(peer) fails (returns empty string). This differs from the established fallback in ClientIPFromContext, which attempts a manual split by ':' when SplitHostPort fails. If the peer argument is a bare IP without a port (e.g., "192.0.2.1") or otherwise not in host:port form, this function will return an empty string instead of a usable IP, causing loss of caller IP in legitimate scenarios. [ Low confidence ]
pkg/utils/api_clients.go — 0 comments posted, 11 evaluated, 7 filtered
  • line 51: Passing the constructor-time context into Connect client builders perpetuates the same context lifetime bug across all protocol variants. In NewConnectReplicationAPIClient, NewConnectGRPCReplicationAPIClient, and NewConnectGRPCWebReplicationAPIClient, the ctx parameter is forwarded to BuildHTTP2Client. For plaintext HTTP (h2c), BuildHTTP2Client installs a DialTLS closure that uses DialContext(ctx, ...), causing future dials to fail after the ctx is canceled or times out. Since NewEnvelopesGenerator provides a 60-second timeout context, clients built with any of these functions will stop being able to establish new connections after ~60 seconds when using h2c. Fix by using a non-cancelable context for transport construction or by ensuring the dial closure does not capture a cancelable context. [ Already posted ]
  • line 76: Similarly, NewConnectGRPCReplicationAPIClient forwards ctx to BuildHTTP2Client, causing the same dial-closure context capture issue for classic gRPC over h2c. After the context is canceled or times out, future RPCs requiring new connections will fail on plaintext HTTP/2 targets. [ Already posted ]
  • line 101: NewConnectGRPCWebReplicationAPIClient also forwards ctx into BuildHTTP2Client, introducing the same transport dial closure context capture issue for gRPC-Web. Under plaintext HTTP (h2c), once the context expires or is canceled, subsequent connection attempts will fail. [ Already posted ]
  • line 202: Optional: NewGRPCConn does not block or validate connectivity at creation time (no WithBlock and likely grpc.NewClient is non-blocking), so it may return a *grpc.ClientConn successfully even for invalid/unreachable targets. Callers may expect a failure from NewGRPCConn when the address is bad (as suggested by returned error and example usage). If the intent is to validate upfront, add grpc.WithBlock() and a context with timeout. If non-blocking is desired, consider documenting that the returned error only reflects option/constructor errors, not connectivity. [ Code style ]
  • line 238: BuildHTTP2Client captures the caller-provided ctx inside the h2c (isTLS == false) transport's DialTLS closure. Because http2.Transport.DialTLS has no per-request context parameter, all future dials will use that single ctx. If the ctx has a deadline or is later canceled (e.g., service shutdown or a short-lived setup context), subsequent requests made by the returned http.Client will fail to dial, even if the requests themselves have valid contexts. This can cause intermittent or permanent failures unrelated to the actual request context. Use context.Background() (or a dedicated long-lived context) for the dialer here, or switch to an API that supports per-request contexts. [ Already posted ]
  • line 313: Addresses lacking an explicit scheme (e.g., "localhost:8080" or "example.com") are rejected with "missing host" because url.Parse treats them as paths and sets URL.Hostname() to empty. Although the code defaults scheme to "http" later, it occurs after checking for a non-empty host, so these common inputs are not supported. If scheme-less addresses are intended to be accepted, the function should detect this and prepend "http://" before parsing, or handle the host parsing manually. [ Already posted ]
  • line 361: buildTLSConfig treats a nil certPool from x509.SystemCertPool() as a hard error, even though a nil RootCAs in tls.Config is valid and means “use the host’s verified system roots.” On some platforms or minimal images, SystemCertPool() may return a nil pool (with or without error), and the current code would fail to build a TLS client where the default behavior might otherwise work. Prefer allowing RootCAs to be nil (using the platform default), or only error if SystemCertPool() returns a non-nil error. [ Low confidence ]

@graphite-app
Copy link

graphite-app bot commented Oct 24, 2025

How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • Queue - adds this PR to the back of the merge queue
  • Hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

@fbac fbac force-pushed the 10-23-connectgo branch 2 times, most recently from 38ff8a1 to 1da8c33 Compare October 28, 2025 18:45
@fbac fbac force-pushed the 10-23-connectgo branch 5 times, most recently from 4aaaca1 to 9fc5408 Compare November 4, 2025 11:46
@fbac fbac marked this pull request as ready for review November 4, 2025 15:20
@fbac fbac requested review from a team as code owners November 4, 2025 15:20
Comment on lines 301 to 320
// Consume the keepalive message.
shouldReceive := stream.Receive()
require.True(t, shouldReceive)

// Validate the keepalive message.
msg := stream.Msg()
require.NotNil(t, msg)

// There shouldn't be any more messages.
shouldReceive = stream.Receive()
require.False(t, shouldReceive)

// Connect returns an empty (non-nil) message when the stream is closed.
msg = stream.Msg()
require.NotNil(t, msg)

// The stream should return an invalid argument error when the request is invalid.
err = stream.Err()
require.Error(t, err)
require.Equal(t, connect.CodeInvalidArgument, connect.CodeOf(err))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this handling feels fragile. Why do we need to know exactly what messages are received when, what nils and closing handling and keep alives?

All the test is doing is checking that the request is invalid.

Do we have to call Msg, Receive, Msg before we get an error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, modified to be way more elegant.

@fbac fbac requested a review from mkysel November 5, 2025 14:31
Copy link
Collaborator

@mkysel mkysel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing more to add

@fbac fbac merged commit 4f974e2 into main Nov 6, 2025
13 of 18 checks passed
@fbac fbac deleted the 10-23-connectgo branch November 6, 2025 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants