
Conversation


@mkysel mkysel commented Oct 23, 2025

Add startup diagnostics by logging failures and worker starts in pkg/server/server.go, pkg/api/message/service.go, pkg/api/message/publish_worker.go, and pkg/api/message/subscribe_worker.go.

Add targeted logs across server and message workers to diagnose startup issues, and adjust vector clock handling in subscription bootstrap.

  • Add error logs before returns in NewReplicationServer and startAPIServer in server.go
  • Add error logs before returns in NewReplicationAPIService in service.go
  • Add debug logs after starting startPublishWorker and startSubscribeWorker goroutines in publish_worker.go and subscribe_worker.go
  • Change pollableQuery to return an empty VectorClock on SelectGatewayEnvelopes error, and advance each node's cursor only when the new sequence ID is greater than the currently recorded value in subscribe_worker.go
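The per-node advancement rule in the last bullet can be sketched in Go. Note that `VectorClock` and `advance` below are hypothetical, simplified stand-ins for the types used in subscribe_worker.go, not the actual implementation:

```go
package main

import "fmt"

// VectorClock maps a node ID to the highest sequence ID seen from that node.
// This is an illustrative stand-in, not the real envelope-utils type.
type VectorClock map[uint32]uint64

// advance updates the clock for nodeID only when seqID is greater than the
// currently recorded value, so a stale or duplicate row can never move the
// clock backwards.
func (vc VectorClock) advance(nodeID uint32, seqID uint64) {
	if seqID > vc[nodeID] {
		vc[nodeID] = seqID
	}
}

func main() {
	vc := VectorClock{}
	vc.advance(1, 10)
	vc.advance(1, 5) // stale row: ignored, clock stays at 10
	vc.advance(2, 7)
	fmt.Println(vc[1], vc[2]) // prints "10 7"
}
```

The guard makes the update idempotent under out-of-order delivery, which is the usual invariant for vector-clock-style cursors.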

📍 Where to Start

Start with startSubscribeWorker and the pollableQuery logic in subscribe_worker.go, then review the error logging additions in NewReplicationServer and startAPIServer in server.go.


📊 Macroscope summarized 34bc29e. 4 files reviewed, 12 issues evaluated, 10 issues filtered, 0 comments posted

🗂️ Filtered Issues

pkg/server/server.go — 0 comments posted, 10 evaluated, 10 filtered
  • line 174: Resources started earlier in NewReplicationServer are not cleaned up when a later initialization step fails, causing background processes to keep running even though the constructor returns an error. Examples: when metrics are enabled, metrics.NewMetricsServer likely starts an HTTP server and is never stopped if a later block returns an error; StartIndexer() starts the indexer and will remain running if subsequent initialization (e.g., API, Sync, or payer report) fails; migratorServer.Start() starts migration workers and will keep running on later failures; after startAPIServer successfully starts the API server, any failure in subsequent blocks returns without shutting down the API server. This violates single paired cleanup and no-leak invariants. [ Previously rejected ]
  • line 211: Inconsistent lifecycle control: several components are initialized with cfg.Ctx instead of the server-owned s.ctx created by context.WithCancel(cfg.Ctx). Specifically, mlsvalidate.NewMlsValidationService (line 211), indexer.WithContext(cfg.Ctx) (line 226), and migrator.WithContext(cfg.Ctx) (line 244) receive cfg.Ctx, while other components (e.g., Sync server, payer report workers) use s.ctx. This prevents coordinated shutdown via s.cancel, leading to components that do not stop when the server’s context is canceled and causing potential leaks or split lifecycles. [ Low confidence ]
  • line 234: The indexer is started (StartIndexer) and on any later failure in the constructor a return nil, err occurs without stopping the indexer, leaving it running in the background. This violates single paired cleanup and can lead to duplicate indexers if the caller retries or tests create multiple servers. [ Previously rejected ]
  • line 255: The migrator service is started (migratorServer.Start) and any later error in initialization (API, Sync, payer report workers) returns without stopping the migrator, leaving it running. This breaks the all-or-nothing initialization expectation and leaks resources. [ Previously rejected ]
  • line 268: After successfully starting the API server via startAPIServer, any subsequent failure (e.g., during Sync server initialization at line 299 or payer report workers setup) returns an error without shutting down the API server, leaving it running. This constitutes a leak and violates the all-or-nothing initialization contract for NewReplicationServer. [ Previously rejected ]
  • line 382: The function does not validate that cfg is non-nil before dereferencing it. cfg is used immediately at line 382 (cfg.Options...) and subsequently on many lines (cfg.Logger, cfg.DB, cfg.FeeCalculator, cfg.GRPCListener, cfg.Options.Reflection.Enable, cfg.ServerVersion). If a caller passes a nil *ReplicationServerConfig, the function will panic due to a nil pointer dereference. There is no guard or constructor guarantee in this code that cfg cannot be nil. [ Low confidence ]
  • line 385: The function mutates shared state s.cursorUpdater inside serviceRegistrationFunc at line 385 before validating that later service initialization succeeds. If message.NewReplicationAPIService or metadata.NewMetadataAPIService fails, startAPIServer returns an error but leaves s.cursorUpdater set. This introduces partial mutation without rollback, which can lead to inconsistent state if other parts of the server read or act on s.cursorUpdater assuming the API server initialization succeeded. There is no cleanup or restoration of the prior value on error paths. [ Previously rejected ]
  • line 385: The function does not validate that s is non-nil before dereferencing its fields. s is dereferenced at line 385 (s.ctx to construct metadata.NewCursorUpdater), and repeatedly (s.registrant, s.validationService, s.nodeRegistry, s.ctx again). If a caller passes a nil *ReplicationServer, the function will panic due to a nil pointer dereference. [ Low confidence ]
  • line 399: The function calls methods on cfg.Logger without ensuring it is non-nil. For example, at line 399, cfg.Logger.Error(...) is called, and similar calls occur at lines 414, 434, 459, 404, 419. If cfg.Logger is nil, invoking methods on it will panic (zap Logger methods have pointer receivers and do not support a nil receiver). There is no guard that prevents a nil Logger. [ Low confidence ]
  • line 426: The function conditionally builds JWT auth interceptors only when both s.nodeRegistry and s.registrant are non-nil (line 426). However, s.registrant is passed (unconditionally) into message.NewReplicationAPIService at line 387. If s.registrant is nil, the replication API service may dereference it internally, leading to a runtime panic or undefined behavior. This mismatch between conditional use for auth and unconditional passing to service initialization is a defensive coding gap. [ Low confidence ]
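Several of the filtered issues describe the same missing pattern: components started earlier in NewReplicationServer are not stopped when a later initialization step fails. One common remedy is a deferred cleanup stack that unwinds already-started components in reverse order on error. This is a minimal sketch of that pattern; the component names and `newServer` signature are illustrative, not the actual server code:

```go
package main

import "fmt"

// component is a hypothetical stand-in for anything with a Start/Stop
// lifecycle, such as the metrics server, indexer, migrator, or API server.
type component struct{ name string }

func (c *component) Stop() { fmt.Println("stopped", c.name) }

// newServer starts components in order; failAt selects which step fails
// (use a negative value for full success). The deferred guard runs the
// accumulated cleanups in reverse order only on the error path, giving
// all-or-nothing initialization.
func newServer(failAt int) (err error) {
	var cleanups []func()
	defer func() {
		if err != nil {
			for i := len(cleanups) - 1; i >= 0; i-- {
				cleanups[i]()
			}
		}
	}()

	for i, name := range []string{"metrics", "indexer", "migrator", "api"} {
		if i == failAt {
			return fmt.Errorf("starting %s failed", name)
		}
		c := &component{name: name}
		cleanups = append(cleanups, c.Stop)
	}
	return nil
}

func main() {
	// Failing at the "migrator" step stops the indexer, then metrics,
	// before the error is returned to the caller.
	fmt.Println(newServer(2))
}
```

Registering each cleanup immediately after the corresponding start keeps the pairing local, so adding a new component cannot silently skip its shutdown on a later failure.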

@mkysel mkysel requested a review from a team as a code owner October 23, 2025 15:25

graphite-app bot commented Oct 23, 2025

How to use the Graphite Merge Queue

Add either label to this PR to merge it via the merge queue:

  • Queue - adds this PR to the back of the merge queue
  • Hotfix - for urgent hot fixes, skip the queue and merge this PR next

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

@mkysel mkysel merged commit 09749b5 into main Oct 23, 2025
11 checks passed
@mkysel mkysel deleted the mkysel/logging branch October 23, 2025 21:33