Add logging to diagnose startup issues #1271
Merged
Add startup diagnostics by logging failures and worker starts in `pkg/server/server.go`, `pkg/api/message/service.go`, `pkg/api/message/publish_worker.go`, and `pkg/api/message/subscribe_worker.go`. Add targeted logs across the server and message workers to diagnose startup issues, and adjust vector clock handling in subscription bootstrap:

- `NewReplicationServer` and `startAPIServer` in server.go
- `NewReplicationAPIService` in service.go
- `startPublishWorker` and `startSubscribeWorker` goroutines in publish_worker.go and subscribe_worker.go
- `pollableQuery` to return an empty `VectorClock` on `SelectGatewayEnvelopes` error, and to update per-node advancement only when the new sequence ID is greater than the current value, in subscribe_worker.go

📍 Where to Start

Start with `startSubscribeWorker` and the `pollableQuery` logic in subscribe_worker.go, then review the error logging additions in `NewReplicationServer` and `startAPIServer` in server.go.

📊 Macroscope summarized 34bc29e. 4 files reviewed, 12 issues evaluated, 10 issues filtered, 0 comments posted
🗂️ Filtered Issues
pkg/server/server.go — 0 comments posted, 10 evaluated, 10 filtered
- Components started in `NewReplicationServer` are not cleaned up when a later initialization step fails, causing background processes to keep running even though the constructor returns an error. Examples: when metrics are enabled, `metrics.NewMetricsServer` likely starts an HTTP server and is never stopped if a later block returns an error; `StartIndexer()` starts the indexer, which will remain running if subsequent initialization (e.g., API, Sync, or payer report) fails; `migratorServer.Start()` starts migration workers and will keep running on later failures; after `startAPIServer` successfully starts the API server, any failure in subsequent blocks returns without shutting down the API server. This violates single paired cleanup and no-leak invariants. [ Previously rejected ]
- Several components receive `cfg.Ctx` instead of the server-owned `s.ctx` created by `context.WithCancel(cfg.Ctx)`. Specifically, `mlsvalidate.NewMlsValidationService` (line 211), `indexer.WithContext(cfg.Ctx)` (line 226), and `migrator.WithContext(cfg.Ctx)` (line 244) receive `cfg.Ctx`, while other components (e.g., Sync server, payer report workers) use `s.ctx`. This prevents coordinated shutdown via `s.cancel`, leading to components that do not stop when the server's context is canceled and causing potential leaks or split lifecycles. [ Low confidence ]
- The indexer is started (`StartIndexer`), and on any later failure in the constructor a `return nil, err` occurs without stopping the indexer, leaving it running in the background. This violates single paired cleanup and can lead to duplicate indexers if the caller retries or tests create multiple servers. [ Previously rejected ]
- The migrator is started (`migratorServer.Start`), and any later error in initialization (API, Sync, payer report workers) returns without stopping the migrator, leaving it running. This breaks the all-or-nothing initialization expectation and leaks resources. [ Previously rejected ]
- After `startAPIServer`, any subsequent failure (e.g., during Sync server initialization at line 299 or payer report workers setup) returns an error without shutting down the API server, leaving it running. This constitutes a leak and violates the all-or-nothing initialization contract for `NewReplicationServer`. [ Previously rejected ]
- There is no check that `cfg` is non-nil before dereferencing it. `cfg` is used immediately at line 382 (`cfg.Options...`) and subsequently on many lines (`cfg.Logger`, `cfg.DB`, `cfg.FeeCalculator`, `cfg.GRPCListener`, `cfg.Options.Reflection.Enable`, `cfg.ServerVersion`). If a caller passes a nil `*ReplicationServerConfig`, the function will panic due to a nil pointer dereference. There is no guard or constructor guarantee in this code that `cfg` cannot be nil. [ Low confidence ]
- The function sets `s.cursorUpdater` inside `serviceRegistrationFunc` at line 385 before validating that later service initialization succeeds. If `message.NewReplicationAPIService` or `metadata.NewMetadataAPIService` fails, `startAPIServer` returns an error but leaves `s.cursorUpdater` set. This introduces partial mutation without rollback, which can lead to inconsistent state if other parts of the server read or act on `s.cursorUpdater` assuming the API server initialization succeeded. There is no cleanup or restoration of the prior value on error paths. [ Previously rejected ]
- There is no check that `s` is non-nil before dereferencing its fields. `s` is dereferenced at line 385 (`s.ctx` to construct `metadata.NewCursorUpdater`) and repeatedly afterwards (`s.registrant`, `s.validationService`, `s.nodeRegistry`, `s.ctx` again). If a caller passes a nil `*ReplicationServer`, the function will panic due to a nil pointer dereference. [ Low confidence ]
- Methods are called on `cfg.Logger` without ensuring it is non-nil. For example, at line 399 `cfg.Logger.Error(...)` is called, and similar calls occur at lines 404, 414, 419, 434, and 459. If `cfg.Logger` is nil, invoking methods on it will panic (zap `Logger` methods have pointer receivers and do not support a nil receiver). There is no guard that prevents a nil `Logger`. [ Low confidence ]
- The auth path checks that `s.nodeRegistry` and `s.registrant` are non-nil (line 426), yet `s.registrant` is passed unconditionally into `message.NewReplicationAPIService` at line 387. If `s.registrant` is nil, the replication API service may dereference it internally, leading to a runtime panic or undefined behavior. This mismatch between conditional use for auth and unconditional passing to service initialization is a defensive coding gap. [ Low confidence ]