
feat(zero-cache)!: automatic replication-manager discovery / routing#4335

Merged: darkgnotic merged 5 commits into main from darkgnotic/rep-mgr-discovery on May 6, 2025

@darkgnotic
Contributor

⚠️ BREAKING CHANGE

Fargate / most multi-node configurations (e.g. host / awsvpc networking)

To update a multi-node configuration without disruption:

  1. Roll out the new replication-manager.
  2. Roll out the new view-syncer, replacing the ZERO_CHANGE_STREAMER_URI=http://{host} option with ZERO_CHANGE_STREAMER_MODE=discover.
  3. When the view-syncers are rollback-safe, remove the internal load balancer that was previously used for view-syncer to replication-manager routing.
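
Step 2 amounts to a small change in the view-syncer's environment. A minimal sketch of that change follows. The helper `migrateViewSyncerEnv`, the example hostname, and `OTHER_OPTION` are hypothetical and not part of zero-cache; only the two `ZERO_CHANGE_STREAMER_*` option names come from this PR.

```typescript
// Hypothetical helper (not part of zero-cache): rewrites a view-syncer
// environment from the explicit-URI setup to the discovery setup.
type Env = {[name: string]: string};

function migrateViewSyncerEnv(env: Env): Env {
  const result: Env = {};
  for (const key of Object.keys(env)) {
    // Drop the old explicit URI option; keep everything else unchanged.
    if (key !== 'ZERO_CHANGE_STREAMER_URI') {
      result[key] = env[key];
    }
  }
  // Ask the view-syncer to discover the replication-manager instead.
  result['ZERO_CHANGE_STREAMER_MODE'] = 'discover';
  return result;
}

// Example values are made up for illustration.
const oldEnv: Env = {
  ZERO_CHANGE_STREAMER_URI: 'http://internal-repmgr.example.com',
  OTHER_OPTION: 'unchanged',
};
const newEnv = migrateViewSyncerEnv(oldEnv);
console.log(newEnv);
```

The input object is left untouched, so the old environment remains available if a rollback is needed before step 3.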

Single-node configurations

Single-node configurations are unaffected.

Uncommon multi-node configurations

For container setups in which the process does not have access to the externally visible IP address or port (e.g. Docker a.k.a. "bridge" mode networking), an external routing or proxying mechanism is still needed. In such configurations, add the ZERO_CHANGE_STREAMER_ADDRESS={host} option to the replication-manager, where {host} is the hostname that was formerly part of ZERO_CHANGE_STREAMER_URI. For example:

Before:

  • view-syncer: ZERO_CHANGE_STREAMER_URI=http://internal-prod-repmgr-125468.us-east-1.elb.amazonaws.com

After:

  • view-syncer: ZERO_CHANGE_STREAMER_MODE=discover
  • replication-manager: ZERO_CHANGE_STREAMER_ADDRESS=internal-prod-repmgr-125468.us-east-1.elb.amazonaws.com
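
The before/after mapping above can be sketched as a function of the old URI. This is only an illustration: `splitStreamerConfig` is a hypothetical name, and it assumes the address is exactly the host portion of the old `ZERO_CHANGE_STREAMER_URI`, as described above.

```typescript
// Hypothetical helper (not part of zero-cache): derives the new per-service
// options from the old view-syncer ZERO_CHANGE_STREAMER_URI value.
interface StreamerConfig {
  viewSyncer: {ZERO_CHANGE_STREAMER_MODE: string};
  replicationManager: {ZERO_CHANGE_STREAMER_ADDRESS: string};
}

function splitStreamerConfig(oldUri: string): StreamerConfig {
  // URL.host is the hostname, plus ":port" when a non-default port is given.
  const host = new URL(oldUri).host;
  return {
    viewSyncer: {ZERO_CHANGE_STREAMER_MODE: 'discover'},
    replicationManager: {ZERO_CHANGE_STREAMER_ADDRESS: host},
  };
}

const config = splitStreamerConfig(
  'http://internal-prod-repmgr-125468.us-east-1.elb.amazonaws.com',
);
console.log(config);
```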

Note: For configurations that continue to use an explicit load-balancing mechanism, the replication-manager health check should be configured to use the /keepalive path, and not the root / path.

Feature

The discovery and routing of the replication-manager is now facilitated by the Postgres Change DB, using the same row-level locking mechanism used to enforce single-writer access to the change log.

This obviates the need for an external addressing or proxying mechanism such as Service Discovery, Service Connect, or an Internal Load Balancer.
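
To make the data flow concrete, here is a rough in-memory sketch of the idea. It is not the PR's implementation: the real mechanism uses a Postgres row and row-level locking, whereas this models only who writes and who reads the address. The `ownerAddress` name appears in this PR's diff; the `owner` field, the class and function names, and the example addresses are assumptions.

```typescript
// In-memory stand-in for the change DB's replication state. In the real
// system this is a Postgres row guarded by a row-level lock; a plain object
// is used here to show only the data flow.
interface ReplicationState {
  owner: string | null; // assumed field: identifies the current writer
  ownerAddress: string | null; // where that writer can be reached
}

class FakeChangeDb {
  private state: ReplicationState = {owner: null, ownerAddress: null};

  // replication-manager side: claim the single-writer role and publish the
  // address at which this instance can be reached, in a single update.
  takeOwnership(owner: string, ownerAddress: string): void {
    this.state = {owner, ownerAddress};
  }

  // view-syncer side: read the address of the current owner, if any.
  discoverOwnerAddress(): string | null {
    return this.state.ownerAddress;
  }
}

// Builds the URI a view-syncer would connect to, or null if no
// replication-manager has taken ownership yet.
function changeStreamerUri(db: FakeChangeDb): string | null {
  const address = db.discoverOwnerAddress();
  return address === null ? null : `http://${address}`;
}

const db = new FakeChangeDb();
const beforeStartup = changeStreamerUri(db);
db.takeOwnership('repmgr-a', '10.0.1.12:4849'); // made-up address
const firstOwner = changeStreamerUri(db);
db.takeOwnership('repmgr-b', '10.0.2.34:4849'); // handoff to a new instance
const afterHandoff = changeStreamerUri(db);
console.log({beforeStartup, firstOwner, afterHandoff});
```

Because the address travels with the ownership record, a handoff to a new replication-manager changes what view-syncers discover without any external DNS or load-balancer update.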

[Image: Screenshot 2025-04-30 at 17 57 06]

@darkgnotic darkgnotic requested a review from arv May 6, 2025 15:41
@github-actions

github-actions Bot commented May 6, 2025

🐰 Bencher Report

Branch: darkgnotic/rep-mgr-discovery
Testbed: Linux

| Benchmark | File Size Result, KB (Result Δ%) | Baseline | Upper Boundary, KB (Limit %) |
| --- | --- | --- | --- |
| zero-package.tgz | 1,158.75 KB (+0.09%) | 1,157.73 KB | 1,180.88 KB (98.13%) |
| zero.js | 194.79 KB (0.00%) | 194.79 KB | 198.68 KB (98.04%) |
| zero.js.br | 54.60 KB (0.00%) | 54.60 KB | 55.69 KB (98.04%) |

@github-actions

github-actions Bot commented May 6, 2025

🐰 Bencher Report

Branch: darkgnotic/rep-mgr-discovery
Testbed: Linux

| Benchmark | Throughput Result, ops/s (Result Δ%) | Baseline | Lower Boundary, ops/s (Limit %) |
| --- | --- | --- | --- |
| src/client/custom.bench.ts > big schema | 339,168.11 ops/s (+345.28%) | 76,169.69 ops/s | -110,558.45 ops/s (-32.60%) |
| src/client/zero.bench.ts > basics > All 1000 rows x 10 columns (numbers) | 1,582.73 ops/s (+325.08%) | 372.34 ops/s | -443.29 ops/s (-28.01%) |
| src/client/zero.bench.ts > pk compare > pk = N | 31,774.00 ops/s (+198.54%) | 10,643.00 ops/s | -3,271.53 ops/s (-10.30%) |
| src/client/zero.bench.ts > with filter > Lower rows 500 x 10 columns (numbers) | 2,495.00 ops/s (+339.48%) | 567.72 ops/s | -727.57 ops/s (-29.16%) |

@darkgnotic darkgnotic merged commit f8f4768 into main May 6, 2025
11 checks passed
@darkgnotic darkgnotic deleted the darkgnotic/rep-mgr-discovery branch May 6, 2025 15:51
Contributor

@arv arv left a comment

LGTM

Very nice

}

export function getPreferredIp(
  interfaces: NodeJS.Dict<NetworkInterfaceInfo[]>,
Contributor

NodeJS.Dict seems a bit odd to use here but 🤷🏼
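
For context, only the signature of `getPreferredIp` is visible in this thread, so the following is a guess at the kind of selection such a function might perform, not the PR's actual logic. The name `pickPreferredIp`, the local `IfaceInfo` type, and the "first non-internal IPv4, else first non-internal IPv6" heuristic are all assumptions. The index-signature type used here is structurally the same shape as `NodeJS.Dict<T>`, which maps string keys to `T | undefined`.

```typescript
// Minimal local model of the entries returned by os.networkInterfaces().
interface IfaceInfo {
  address: string;
  family: 'IPv4' | 'IPv6';
  internal: boolean; // true for loopback and similar interfaces
}

type Interfaces = {[name: string]: IfaceInfo[] | undefined};

// Assumed heuristic: return the first non-internal IPv4 address, falling
// back to the first non-internal IPv6 address, or undefined if none exist.
function pickPreferredIp(interfaces: Interfaces): string | undefined {
  let firstIPv6: string | undefined;
  for (const name of Object.keys(interfaces)) {
    for (const info of interfaces[name] ?? []) {
      if (info.internal) {
        continue; // skip loopback addresses
      }
      if (info.family === 'IPv4') {
        return info.address;
      }
      if (firstIPv6 === undefined) {
        firstIPv6 = info.address;
      }
    }
  }
  return firstIPv6;
}

// Made-up sample data for illustration.
const sample: Interfaces = {
  lo: [{address: '127.0.0.1', family: 'IPv4', internal: true}],
  eth0: [
    {address: 'fe80::1', family: 'IPv6', internal: false},
    {address: '10.0.1.12', family: 'IPv4', internal: false},
  ],
};
const preferred = pickPreferredIp(sample);
console.log(preferred);
```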

Comment on lines +78 to +81
// Check if start() was already called.
if (this.#fastify.addresses().length === 0) {
  await this.start();
}
Contributor

If start was already called, this will stop? Is that the intended behavior?

): Promise<string | null> {
  const result = await sql<{ownerAddress: string | null}[]>/*sql*/ `
    SELECT "ownerAddress" FROM ${sql(cdcSchema(shard))}."replicationState"`;
  return result[0].ownerAddress;
Contributor

At some point it would make sense to start using values() more.

await db`
  UPDATE ${db(schema)}."replicationConfig"
await sql`
  UPDATE ${sql(schema)}."replicationConfig"
Contributor

OOC, how do you get VSCode to syntax highlight these?

Contributor Author

Sorry I missed this question!

I'm using this (on Matt's recommendation):

https://marketplace.visualstudio.com/items?itemName=frigus02.vscode-sql-tagged-template-literals-syntax-only

tjenkinson added a commit to tjenkinson/mono that referenced this pull request Jun 30, 2025
Adds a `changeStreamer.protocol` (`ZERO_CHANGE_STREAMER_PROTOCOL`) option, which can be set to `https`.

Before rocicorp#4335 we were able to use an https URL.
github-merge-queue Bot pushed a commit that referenced this pull request Dec 2, 2025
…er task registration (#5250)

Restore the original `replication-manager` behavior of delaying the
replication stream takeover to allow the task to be registered as a
healthy target by the load balancer (i.e. after a minimum number of
health checks). This fixes the temporary unreachability of the
replication-manager when the handoff happens before the load-balancer
has recognized the new replication-manager as healthy.

This original functionality was simplified away with the introduction of
auto-discovery (#4335), since that replaced the DNS and proxying
component, but it was never restored when proxy-based routing was
reintroduced in #4584 (which is now the recommended configuration).

This new implementation is more compartmentalized than the original
implementation, encapsulating all of the logic in the
ChangeStreamerHttpService, so that the ChangeStreamerService itself is
agnostic to the details of health checks and startup delays.
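
As an illustration of the delayed takeover described in this commit message, a gate that opens only after a minimum number of health checks could look roughly like the sketch below. This is not the code in ChangeStreamerHttpService: `createHealthCheckGate`, its methods, and the threshold of 3 are hypothetical.

```typescript
// Hypothetical gate: reports ready only after a minimum number of health
// checks have been observed, so that a takeover can be deferred until the
// load balancer has had a chance to register the task as healthy.
function createHealthCheckGate(minChecks: number) {
  let seen = 0;
  let ready = minChecks <= 0; // nothing to wait for if no checks are required
  return {
    // Call this from the health check handler (e.g. the /keepalive route).
    onHealthCheck(): void {
      seen++;
      if (seen >= minChecks) {
        ready = true;
      }
    },
    // The takeover logic would consult this before taking the stream.
    isReady(): boolean {
      return ready;
    },
  };
}

const gate = createHealthCheckGate(3); // 3 is an arbitrary example threshold
const readiness: boolean[] = [];
for (let i = 0; i < 3; i++) {
  gate.onHealthCheck();
  readiness.push(gate.isReady());
}
console.log(readiness); // readiness is [false, false, true]
```

Keeping the counter behind a small interface like this mirrors the stated goal of leaving the change-streaming logic itself unaware of health checks and startup delays.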
darkgnotic added a commit that referenced this pull request Dec 2, 2025
…er task registration (#5250)