fix(discov): prevent unbounded memory growth on duplicate etcd PUT events #5580

Open
kevwan wants to merge 1 commit into
zeromicro:master from
kevwan:fix/discov-duplicate-put-memory-leak

@kevwan kevwan commented May 13, 2026

Problem

Two related bugs in core/discov cause unbounded memory growth in long-running services using zRPC with etcd service discovery.

Bug 1 — handleWatchEvents fires OnAdd on every PUT unconditionally

When etcd re-emits a PUT event for a key that already exists with the same value (lease refresh, watch reconnect), handleWatchEvents called l.OnAdd() without checking whether the value had actually changed, so every duplicate event triggered addKv.

Bug 2 — addKv appends key to slice without deduplication

container.values maps a server address (string) to a slice of etcd keys ([]string). addKv unconditionally appended the key on every OnAdd call — so a single etcd key could appear thousands of times in the slice after repeated lease refreshes, consuming unbounded memory.

Additionally, when a key's value changed (the etcd key moved to a different server address), the stale c.values[oldVal] slice entry was never cleaned up and OnDelete(old) was never fired, leaving listeners with a stale view.
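To make the failure mode concrete, here is a minimal sketch of the pre-fix behavior. It is not the actual go-zero code: `buggyContainer`, its `mapping` field, and the sample key and addresses are assumptions made for illustration; only `values` and `addKv` are names taken from the description above.

```go
package main

import "fmt"

// buggyContainer is a hypothetical, simplified stand-in for the pre-fix
// container. Only the shape of `values` comes from the PR description.
type buggyContainer struct {
	values  map[string][]string // server address -> etcd keys
	mapping map[string]string   // etcd key -> server address (assumed field)
}

func newBuggyContainer() *buggyContainer {
	return &buggyContainer{
		values:  make(map[string][]string),
		mapping: make(map[string]string),
	}
}

// addKv appends unconditionally, with no duplicate check and no cleanup
// of the entry for a previous value, as described for the old behavior.
func (c *buggyContainer) addKv(key, value string) {
	c.values[value] = append(c.values[value], key)
	c.mapping[key] = value
}

func main() {
	c := newBuggyContainer()

	// Simulate 1000 duplicate PUT events for the same key and value.
	for i := 0; i < 1000; i++ {
		c.addKv("svc/1", "10.0.0.1:8080")
	}
	fmt.Println(len(c.values["10.0.0.1:8080"])) // prints 1000

	// The key moves to another address; the old slice is never cleaned up.
	c.addKv("svc/1", "10.0.0.2:8080")
	fmt.Println(len(c.values["10.0.0.1:8080"])) // prints 1000 (stale entries)
}
```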

Fix

core/discov/internal/registry.go - handleWatchEvents

  • Duplicate PUT (same key, same value) → continue; no listener notification.
  • Value change (same key, different value) → fire OnDelete(old) then OnAdd(new) to keep listeners consistent.
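The two rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `event` and `listener` types, the `recorder` helper, and `handlePuts` are hypothetical and simplified, and the real listener interface in core/discov may have a different signature.

```go
package main

import "fmt"

// event is a hypothetical, simplified PUT event (not the etcd client type).
type event struct{ key, value string }

// listener is a hypothetical interface used only for this sketch.
type listener interface {
	OnAdd(key, value string)
	OnDelete(key, value string)
}

// recorder logs listener calls in order so the sequence can be inspected.
type recorder struct{ calls []string }

func (r *recorder) OnAdd(key, value string) {
	r.calls = append(r.calls, "add "+key+"="+value)
}

func (r *recorder) OnDelete(key, value string) {
	r.calls = append(r.calls, "delete "+key+"="+value)
}

// handlePuts applies PUT events to the known key -> value state.
func handlePuts(known map[string]string, events []event, l listener) {
	for _, ev := range events {
		old, exists := known[ev.key]
		if exists && old == ev.value {
			continue // duplicate PUT: same key, same value, no notification
		}
		if exists {
			l.OnDelete(ev.key, old) // value changed: retire the old value first
		}
		known[ev.key] = ev.value
		l.OnAdd(ev.key, ev.value)
	}
}

func main() {
	r := &recorder{}
	handlePuts(map[string]string{}, []event{
		{"svc/1", "10.0.0.1:8080"},
		{"svc/1", "10.0.0.1:8080"}, // duplicate, skipped
		{"svc/1", "10.0.0.2:8080"}, // value change
	}, r)
	for _, c := range r.calls {
		fmt.Println(c)
	}
	// prints:
	// add svc/1=10.0.0.1:8080
	// delete svc/1=10.0.0.1:8080
	// add svc/1=10.0.0.2:8080
}
```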

core/discov/subscriber.go - addKv

  • Duplicate add (key already maps to same value) → early return, no slice append, no dirty-flag churn.
  • Value change (key moves to new server) → call doRemoveKey first to clean up the stale c.values[oldVal] entry before inserting the new one.
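A rough sketch of the guarded add path is shown below. It assumes a key-to-value `mapping` field and a boolean `dirty` flag, and it uses a simplified `doRemoveKey`; the real container in subscriber.go has more fields and locking, so treat this as an approximation rather than the patched code.

```go
package main

import "fmt"

// container is a simplified, hypothetical version of the discov container.
type container struct {
	values  map[string][]string // server address -> etcd keys
	mapping map[string]string   // etcd key -> server address (assumed field)
	dirty   bool                // assumed stand-in for the dirty flag
}

func newContainer() *container {
	return &container{
		values:  make(map[string][]string),
		mapping: make(map[string]string),
	}
}

// addKv skips duplicate adds and cleans up stale state on a value change.
func (c *container) addKv(key, value string) {
	if old, ok := c.mapping[key]; ok {
		if old == value {
			return // duplicate add: no append, no dirty-flag churn
		}
		c.doRemoveKey(key) // key moved: drop it from c.values[old] first
	}
	c.dirty = true
	c.values[value] = append(c.values[value], key)
	c.mapping[key] = value
}

// doRemoveKey removes key from the slice of the server it currently maps to.
func (c *container) doRemoveKey(key string) {
	server, ok := c.mapping[key]
	if !ok {
		return
	}
	delete(c.mapping, key)
	remain := c.values[server][:0] // filter in place, reusing the backing array
	for _, k := range c.values[server] {
		if k != key {
			remain = append(remain, k)
		}
	}
	if len(remain) > 0 {
		c.values[server] = remain
	} else {
		delete(c.values, server)
	}
}

func main() {
	c := newContainer()
	for i := 0; i < 100; i++ {
		c.addKv("svc/1", "10.0.0.1:8080")
	}
	fmt.Println(len(c.values["10.0.0.1:8080"])) // prints 1

	c.addKv("svc/1", "10.0.0.2:8080")
	_, stale := c.values["10.0.0.1:8080"]
	fmt.Println(stale, len(c.values["10.0.0.2:8080"])) // prints false 1
}
```

Keeping the duplicate check ahead of the dirty-flag write means a repeated lease refresh costs one map lookup and nothing else.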

Tests

Four regression tests added:

| Test | File | What it verifies |
| --- | --- | --- |
| TestCluster_handleWatchEvents_DuplicatePut | registry_test.go | OnAdd called exactly once for two identical PUT events |
| TestCluster_handleWatchEvents_ValueChange | registry_test.go | OnDelete(old) fired before OnAdd(new) when value changes |
| TestContainer_DuplicateAdd | subscriber_test.go | Internal slice length stays at 1 after 100 duplicate adds |
| TestContainer_KeyValueChange | subscriber_test.go | Key moving to new value leaves correct state after delete |

Related issues

…ents

When etcd re-emits a PUT for a key that already exists with the same
value (lease refresh, watch reconnect), handleWatchEvents previously
called l.OnAdd unconditionally, which caused addKv to append the same
etcd key to c.values[value] on every duplicate event. Over time this
grew the slice without bound.

Two fixes:

1. core/discov/internal/registry.go - handleWatchEvents:
   - Skip OnAdd/OnDelete notifications for duplicate PUTs (same key,
     same value) to prevent cascading growth.
   - When a key's value changes (key moves to a different server),
     fire OnDelete(old) before OnAdd(new) so listeners stay consistent.

2. core/discov/subscriber.go - addKv:
   - Return early when the key already maps to the same value, avoiding
     the redundant append and dirty-flag churn.
   - Call doRemoveKey before re-inserting when a key changes value,
     cleaning up the stale c.values[oldVal] entry that was previously
     orphaned.

Regression tests added for both duplicate-PUT and value-change cases.
Copilot AI review requested due to automatic review settings May 13, 2026 15:03

Copilot AI left a comment

Pull request overview

Fixes two related bugs in core/discov that caused unbounded memory growth in long-running zRPC services using etcd service discovery. When etcd re-emits a PUT event for an unchanged key (lease refresh, watch reconnect), the old code unconditionally appended the key to an internal slice and fired OnAdd listeners, growing memory without bound. Additionally, when a key's value changed, the stale entry was leaked and OnDelete was never emitted for the old value.

Changes:

  • handleWatchEvents now skips duplicate same-value PUTs and emits OnDelete(old) before OnAdd(new) when a key's value changes.
  • container.addKv short-circuits duplicate adds and calls doRemoveKey to clean up stale state when a key is reassigned to a new value.
  • Four regression tests added (two in registry_test.go, two in subscriber_test.go).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| core/discov/internal/registry.go | Dedup duplicate PUTs and emit OnDelete on value change in handleWatchEvents; trailing whitespace cleanup. |
| core/discov/internal/registry_test.go | New tests for duplicate-PUT and value-change behavior; trailing whitespace cleanup. |
| core/discov/subscriber.go | addKv now skips duplicates and cleans up stale mapping on value change. |
| core/discov/subscriber_test.go | New tests verifying internal slice doesn't grow on duplicates and cleanup on value change. |

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
