-
Notifications
You must be signed in to change notification settings - Fork 39
Gateway Alternative Node Selection Strategies #1483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add NodeSelectorStrategy, NodeSelectorPreferredNodes, NodeSelectorCacheExpiry, NodeSelectorConnectTimeout to PayerOptions - Create NodeSelectorConfig struct and NewNodeSelector factory function for clean initialization - Wire configuration through gateway builder to use config-based node selection - Add comprehensive validation for node selector options in config validator - All strategies (stable, manual, ordered, random, closest) now seamlessly integrate with configuration system
…ector factory - Add Payer PayerOptions field to ServerOptions struct to fix validation errors - Add comprehensive tests for NewNodeSelector factory function covering all strategies - Test coverage includes: stable, manual, ordered, random, closest strategies - Verify error handling for invalid strategies and missing required configuration - All 33 node selector tests passing
- Align struct field spacing in NodeSelectorConfig - Align struct field spacing in PayerOptions - Remove extra blank line
- Shorten flag names and env variable names to meet golines 100 char limit - Rename NodeSelectorConnectTimeout to NodeSelectorTimeout - Update all references to renamed field
- Run gofmt to add proper tab-based column alignment - Shorten flag names further to fit 100 char limit with tabs - Remove default tag from NodeSelectorStrategy (handled in code) - All lines now under 100 chars with proper formatting
The ordered and manual node selection strategies require parsing comma-separated node IDs from the XMTPD_PAYER_NODE_NODES environment variable. Added env-delim:"," to NodeSelectorPreferredNodes struct tag to enable proper parsing of comma-separated uint32 values by go-flags library. This fixes the ordered strategy which was failing with exit code 1 due to inability to parse the node list from environment variables. Tested with all 5 strategies (stable, manual, ordered, random, closest) in testnet-dev environment. All strategies working correctly.
…event localhost dialing (#1485) #### Originated from #1483 <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will post its summary as a comment. --> ### Add an empty-hostname guard in `ClosestNodeSelectorAlgorithm.measureLatency` to avoid localhost dialing in [node_selector.go](https://github.com/xmtp/xmtpd/pull/1485/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31) Introduce a guard clause that returns `-1` when the parsed URL has an empty hostname in `ClosestNodeSelectorAlgorithm.measureLatency` within [node_selector.go](https://github.com/xmtp/xmtpd/pull/1485/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31). #### 📍Where to Start Start at `ClosestNodeSelectorAlgorithm.measureLatency` in [node_selector.go](https://github.com/xmtp/xmtpd/pull/1485/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31). ---- <!-- Macroscope's review summary starts here --> <a href="https://app.macroscope.com">Macroscope</a> summarized ac62651. <!-- Macroscope's review summary ends here --> <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
QUALITY IMPROVEMENTS:
- Fixed awkward XMTPD_PAYER_NODE_NODES environment variable naming
* Changed to XMTPD_PAYER_NODE_SELECTOR_PREFERRED_NODES for clarity
* Better semantic meaning: these are preferred nodes for the selector
* Consistent with other node-selector-* prefixed options
- Improved option naming and descriptions:
* Full descriptive names for all flags (node-selector-strategy, etc.)
* Clear descriptions for each configuration option
* Added default value for strategy ('stable')
- Fixed documentation comment:
* HashKey comment now correctly states uint32 instead of uint16
- Enhanced validation:
* Added check to allow empty strategy (defaults to 'stable')
* Improved error messages for clarity
All 33 node selector tests pass.
Code builds without errors or lint issues.
Maintains backward compatibility with existing 'stable' strategy.
FEATURE ENHANCEMENT: - Extended 'closest' strategy to support PreferredNodes filtering - When PreferredNodes is set, measures latency ONLY to those nodes - Picks lowest latency from the preferred set - Falls back to all nodes if none of the preferred are available - Backward compatible: empty PreferredNodes measures all nodes (existing behavior) USE CASES: - Regional preference: 'Use closest node from us-east region' - Tier preference: 'Use closest from premium tier' - Cost optimization: 'Use closest from cheaper node set' - Smart hybrid routing with manual preference list IMPLEMENTATION: - Added preferredNodes field to ClosestNodeSelectorAlgorithm - Modified GetNode() to filter nodes before latency measurement - Updated factory function to pass PreferredNodes to closest strategy - Added 3 comprehensive tests for new behavior All 39 tests passing. Production-ready enhancement for smart regional routing.
…inter warnings in node s (#1486) #### Originated from #1483 <!-- Macroscope's pull request summary starts here --> <!-- Macroscope will only edit the content between these invisible markers, and the markers themselves will not be visible in the GitHub rendered markdown. --> <!-- If you delete either of the start / end markers from your PR's description, Macroscope will post its summary as a comment. --> ### Prevent overwriting `ClosestNodeSelectorAlgorithm.updateLatencyCache` when all probes fail and mark unused `GetNode` parameters to satisfy linters in [node_selector.go](https://github.com/xmtp/xmtpd/pull/1486/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31) Add a non-empty guard to `ClosestNodeSelectorAlgorithm.updateLatencyCache` to retain the previous cache on probe failure and rename unused `GetNode` parameters to `_`. Update payer CLI option tag descriptions in [options.go](https://github.com/xmtp/xmtpd/pull/1486/files#diff-6731fb6f709392ce3e37d3b0c42074cddbce566dad2bab86af24ba7585eeb57c). #### 📍Where to Start Start with `ClosestNodeSelectorAlgorithm.updateLatencyCache` in [node_selector.go](https://github.com/xmtp/xmtpd/pull/1486/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31). ---- <!-- Macroscope's review summary starts here --> <a href="https://app.macroscope.com">Macroscope</a> summarized 7917b40. <!-- Macroscope's review summary ends here --> <!-- macroscope-ui-refresh --> <!-- Macroscope's pull request summary ends here --> --------- Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
| if cacheExpired { | ||
| c.updateLatencyCache(nodesToConsider) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refreshing on the hot path might cause unpredictable delays when a client thread pulls the short straw and has to refresh all nodes.
| func (c *ClosestNodeSelectorAlgorithm) updateLatencyCache(nodes []registry.Node) { | ||
| newCache := make(map[uint32]time.Duration) | ||
|
|
||
| for _, node := range nodes { | ||
| latency, err := c.measureLatency(node.HTTPAddress) | ||
| if err == nil && latency > 0 { | ||
| newCache[node.NodeID] = latency | ||
| } | ||
| } | ||
|
|
||
| // Only update cache if at least one latency measurement was successful | ||
| // This prevents wiping out the previous cache when all probes fail | ||
| if len(newCache) > 0 { | ||
| c.cacheMutex.Lock() | ||
| c.latencyCache = newCache | ||
| c.lastUpdate = time.Now() | ||
| c.cacheMutex.Unlock() | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multiple clients can refresh at the same time, wasting effort
| if options.NodeSelectorStrategy != "" && !validStrategies[options.NodeSelectorStrategy] { | ||
| return fmt.Errorf( | ||
| "invalid node-selector-strategy: %s (must be one of: stable, manual, ordered, random, closest)", | ||
| options.NodeSelectorStrategy, | ||
| ) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty --payer.node-selector-strategy currently passes validation. Consider checking against validStrategies unconditionally so "" is rejected (or document if "" is intentionally treated as default).
- if options.NodeSelectorStrategy != "" && !validStrategies[options.NodeSelectorStrategy] {
+ if !validStrategies[options.NodeSelectorStrategy] {
🚀 Want me to fix this? Reply ex: "fix it for me".
|
Approved. We should probably refresh the latencies in a background thread. Not on the client hot path. Can be done in a separate PR. |
Add configurable node selection strategies to gateway payer API and construct selector in
GatewayServiceBuilder.buildGatewayServicewith default cache expiry 5m and connect timeout 2sIntroduce a
selectors.NodeSelectorAlgorithminterface with strategiesstable,manual,ordered,random, andclosest; wire the selector intopayer.NewPayerAPIService; add config and validation for strategy and parameters; and implement latency-based selection inselectors.ClosestNodeSelectorAlgorithm.📍Where to Start
Start with selector construction in
GatewayServiceBuilder.buildGatewayServicein builder.go, then review the factoryselectors.NewNodeSelectorin interface.go and the closest strategy in closest.go.📊 Macroscope summarized 7310e7b. 10 files reviewed, 6 issues evaluated, 4 issues filtered, 1 comment posted
🗂️ Filtered Issues
pkg/api/payer/selectors/closest.go — 0 comments posted, 3 evaluated, 3 filtered
GetNode, when the cache is expired (line 102-104),updateLatencyCacheis called. However, if all latency measurements fail insideupdateLatencyCache, the cache is not updated (lines 147-152), butlastUpdatealso remains unchanged. The next call toGetNodewill again findcacheExpired == trueand retry measurements, potentially causing repeated network probes on every call when nodes are unreachable. [ Low confidence ]if closestNodeID == 0on line 128 incorrectly treats NodeID 0 as an invalid/unset value. However, 0 is a validuint32NodeID. If a node withNodeID == 0is the closest available node, this function will incorrectly return an error"no available nodes with latency measurements"instead of returning the valid node ID 0. [ Already posted ]updateLatencyCache, latency measurements for nodes are performed sequentially without any concurrency. When there are many nodes, this could causeGetNodeto block for an extended period (up toconnectTimeout * len(nodes)duration) while the caller waits, potentially causing timeouts or poor responsiveness in the calling code. [ Low confidence ]pkg/config/validation.go — 1 comment posted, 2 evaluated, 1 filtered
Payer.Enableistrue,validatePayerOptionsdoes not validate thatPayer.PrivateKeyis set. If the payer functionality requires a private key to sign blockchain transactions (as suggested by its description), enabling the payer without providing a private key could cause a runtime failure when attempting to sign transactions. [ Already posted ]