Skip to content

Conversation

@api-Hypernova
Copy link
Contributor

@api-Hypernova api-Hypernova commented Dec 19, 2025

Add configurable node selection strategies to gateway payer API and construct selector in GatewayServiceBuilder.buildGatewayService with default cache expiry 5m and connect timeout 2s

Introduce a selectors.NodeSelectorAlgorithm interface with strategies stable, manual, ordered, random, and closest; wire the selector into payer.NewPayerAPIService; add config and validation for strategy and parameters; and implement latency-based selection in selectors.ClosestNodeSelectorAlgorithm.

📍Where to Start

Start with selector construction in GatewayServiceBuilder.buildGatewayService in builder.go, then review the factory selectors.NewNodeSelector in interface.go and the closest strategy in closest.go.


📊 Macroscope summarized 7310e7b. 10 files reviewed, 6 issues evaluated, 4 issues filtered, 1 comment posted

🗂️ Filtered Issues

pkg/api/payer/selectors/closest.go — 0 comments posted, 3 evaluated, 3 filtered
  • line 103: In GetNode, when the cache is expired (line 102-104), updateLatencyCache is called. However, if all latency measurements fail inside updateLatencyCache, the cache is not updated (lines 147-152), but lastUpdate also remains unchanged. The next call to GetNode will again find cacheExpired == true and retry measurements, potentially causing repeated network probes on every call when nodes are unreachable. [ Low confidence ]
  • line 128: The check if closestNodeID == 0 on line 128 incorrectly treats NodeID 0 as an invalid/unset value. However, 0 is a valid uint32 NodeID. If a node with NodeID == 0 is the closest available node, this function will incorrectly return an error "no available nodes with latency measurements" instead of returning the valid node ID 0. [ Already posted ]
  • line 138: In updateLatencyCache, latency measurements for nodes are performed sequentially without any concurrency. When there are many nodes, this could cause GetNode to block for an extended period (up to connectTimeout * len(nodes) duration) while the caller waits, potentially causing timeouts or poor responsiveness in the calling code. [ Low confidence ]
pkg/config/validation.go — 1 comment posted, 2 evaluated, 1 filtered
  • line 66: When Payer.Enable is true, validatePayerOptions does not validate that Payer.PrivateKey is set. If the payer functionality requires a private key to sign blockchain transactions (as suggested by its description), enabling the payer without providing a private key could cause a runtime failure when attempting to sign transactions. [ Already posted ]

@api-Hypernova api-Hypernova changed the title Hypernova/1475 Introduce Alternative Node Selection Strategies Dec 19, 2025
@api-Hypernova api-Hypernova self-assigned this Dec 20, 2025
- Add NodeSelectorStrategy, NodeSelectorPreferredNodes, NodeSelectorCacheExpiry, NodeSelectorConnectTimeout to PayerOptions
- Create NodeSelectorConfig struct and NewNodeSelector factory function for clean initialization
- Wire configuration through gateway builder to use config-based node selection
- Add comprehensive validation for node selector options in config validator
- All strategies (stable, manual, ordered, random, closest) now seamlessly integrate with configuration system
Ubuntu and others added 3 commits December 21, 2025 17:43
…ector factory

- Add Payer PayerOptions field to ServerOptions struct to fix validation errors
- Add comprehensive tests for NewNodeSelector factory function covering all strategies
- Test coverage includes: stable, manual, ordered, random, closest strategies
- Verify error handling for invalid strategies and missing required configuration
- All 33 node selector tests passing
- Align struct field spacing in NodeSelectorConfig
- Align struct field spacing in PayerOptions
- Remove extra blank line
- Shorten flag names and env variable names to meet golines 100 char limit
- Rename NodeSelectorConnectTimeout to NodeSelectorTimeout
- Update all references to renamed field
Isaac Hartford added 2 commits December 21, 2025 18:28
- Run gofmt to add proper tab-based column alignment
- Shorten flag names further to fit 100 char limit with tabs
- Remove default tag from NodeSelectorStrategy (handled in code)
- All lines now under 100 chars with proper formatting
@api-Hypernova api-Hypernova changed the title Introduce Alternative Node Selection Strategies Gateway Alternative Node Selection Strategies Dec 22, 2025
The ordered and manual node selection strategies require parsing comma-separated
node IDs from the XMTPD_PAYER_NODE_NODES environment variable.

Added env-delim:"," to NodeSelectorPreferredNodes struct tag to enable proper
parsing of comma-separated uint32 values by go-flags library.

This fixes the ordered strategy which was failing with exit code 1 due to
inability to parse the node list from environment variables.

Tested with all 5 strategies (stable, manual, ordered, random, closest) in
testnet-dev environment. All strategies working correctly.
macroscopeapp bot and others added 3 commits December 24, 2025 18:45
…event localhost dialing (#1485)

#### Originated from #1483

<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will post its summary as a comment. -->
### Add an empty-hostname guard in
`ClosestNodeSelectorAlgorithm.measureLatency` to avoid localhost dialing
in
[node_selector.go](https://github.com/xmtp/xmtpd/pull/1485/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31)
Introduce a guard clause that returns `-1` when the parsed URL has an
empty hostname in `ClosestNodeSelectorAlgorithm.measureLatency` within
[node_selector.go](https://github.com/xmtp/xmtpd/pull/1485/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31).

#### 📍Where to Start
Start at `ClosestNodeSelectorAlgorithm.measureLatency` in
[node_selector.go](https://github.com/xmtp/xmtpd/pull/1485/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31).

----
<!-- Macroscope's review summary starts here -->

<a href="https://app.macroscope.com">Macroscope</a> summarized ac62651.
<!-- Macroscope's review summary ends here -->

<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
QUALITY IMPROVEMENTS:
- Fixed awkward XMTPD_PAYER_NODE_NODES environment variable naming
  * Changed to XMTPD_PAYER_NODE_SELECTOR_PREFERRED_NODES for clarity
  * Better semantic meaning: these are preferred nodes for the selector
  * Consistent with other node-selector-* prefixed options

- Improved option naming and descriptions:
  * Full descriptive names for all flags (node-selector-strategy, etc.)
  * Clear descriptions for each configuration option
  * Added default value for strategy ('stable')

- Fixed documentation comment:
  * HashKey comment now correctly states uint32 instead of uint16

- Enhanced validation:
  * Added check to allow empty strategy (defaults to 'stable')
  * Improved error messages for clarity

All 33 node selector tests pass.
Code builds without errors or lint issues.
Maintains backward compatibility with existing 'stable' strategy.
FEATURE ENHANCEMENT:
- Extended 'closest' strategy to support PreferredNodes filtering
- When PreferredNodes is set, measures latency ONLY to those nodes
- Picks lowest latency from the preferred set
- Falls back to all nodes if none of the preferred are available
- Backward compatible: empty PreferredNodes measures all nodes (existing behavior)

USE CASES:
- Regional preference: 'Use closest node from us-east region'
- Tier preference: 'Use closest from premium tier'
- Cost optimization: 'Use closest from cheaper node set'
- Smart hybrid routing with manual preference list

IMPLEMENTATION:
- Added preferredNodes field to ClosestNodeSelectorAlgorithm
- Modified GetNode() to filter nodes before latency measurement
- Updated factory function to pass PreferredNodes to closest strategy
- Added 3 comprehensive tests for new behavior

All 39 tests passing.
Production-ready enhancement for smart regional routing.
…inter warnings in node s (#1486)

#### Originated from #1483

<!-- Macroscope's pull request summary starts here -->
<!-- Macroscope will only edit the content between these invisible
markers, and the markers themselves will not be visible in the GitHub
rendered markdown. -->
<!-- If you delete either of the start / end markers from your PR's
description, Macroscope will post its summary as a comment. -->
### Prevent overwriting
`ClosestNodeSelectorAlgorithm.updateLatencyCache` when all probes fail
and mark unused `GetNode` parameters to satisfy linters in
[node_selector.go](https://github.com/xmtp/xmtpd/pull/1486/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31)
Add a non-empty guard to
`ClosestNodeSelectorAlgorithm.updateLatencyCache` to retain the previous
cache on probe failure and rename unused `GetNode` parameters to `_`.
Update payer CLI option tag descriptions in
[options.go](https://github.com/xmtp/xmtpd/pull/1486/files#diff-6731fb6f709392ce3e37d3b0c42074cddbce566dad2bab86af24ba7585eeb57c).

#### 📍Where to Start
Start with `ClosestNodeSelectorAlgorithm.updateLatencyCache` in
[node_selector.go](https://github.com/xmtp/xmtpd/pull/1486/files#diff-a0d98f17faa5188d8ba6083374857ecc702fd64dccf92facabbe4da3afc13a31).

----
<!-- Macroscope's review summary starts here -->

<a href="https://app.macroscope.com">Macroscope</a> summarized 7917b40.
<!-- Macroscope's review summary ends here -->

<!-- macroscope-ui-refresh -->
<!-- Macroscope's pull request summary ends here -->

---------

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
Comment on lines +102 to +104
if cacheExpired {
c.updateLatencyCache(nodesToConsider)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refreshing on the hot path might cause unpredictable delays when a client thread pulls the short straw and has to refresh all nodes.

Comment on lines +135 to +153
func (c *ClosestNodeSelectorAlgorithm) updateLatencyCache(nodes []registry.Node) {
newCache := make(map[uint32]time.Duration)

for _, node := range nodes {
latency, err := c.measureLatency(node.HTTPAddress)
if err == nil && latency > 0 {
newCache[node.NodeID] = latency
}
}

// Only update cache if at least one latency measurement was successful
// This prevents wiping out the previous cache when all probes fail
if len(newCache) > 0 {
c.cacheMutex.Lock()
c.latencyCache = newCache
c.lastUpdate = time.Now()
c.cacheMutex.Unlock()
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiple clients can refresh at the same time, wasting effort

Comment on lines +643 to +648
if options.NodeSelectorStrategy != "" && !validStrategies[options.NodeSelectorStrategy] {
return fmt.Errorf(
"invalid node-selector-strategy: %s (must be one of: stable, manual, ordered, random, closest)",
options.NodeSelectorStrategy,
)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty --payer.node-selector-strategy currently passes validation. Consider checking against validStrategies unconditionally so "" is rejected (or document if "" is intentionally treated as default).

-    if options.NodeSelectorStrategy != "" && !validStrategies[options.NodeSelectorStrategy] {
+    if !validStrategies[options.NodeSelectorStrategy] {

🚀 Want me to fix this? Reply ex: "fix it for me".

@mkysel mkysel enabled auto-merge (squash) December 26, 2025 15:20
@mkysel
Copy link
Collaborator

mkysel commented Dec 26, 2025

Approved. We should probably refresh the latencies in a background thread. Not on the client hot path. Can be done in a separate PR.

@mkysel mkysel merged commit 4dda604 into main Dec 26, 2025
12 checks passed
@mkysel mkysel deleted the hypernova/1475 branch December 26, 2025 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants