Add support for delayed replica in PostgreSQL Operator

### Community Note

* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
* Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
* If you are interested in working on this issue or have submitted a pull request, please leave a comment



**Tell us about the feature**
Add the ability to designate one or more replica instance sets as delayed standbys — replicas that apply WAL changes with a configurable time lag. A transaction committed at time X becomes visible on a delayed replica only at time X + delay.

This provides a fast, always-ready safety net for accidental data changes (DROP TABLE, mass DELETE, logical corruption) without requiring a full PITR restore from backup.

**Which product(s) is this request for?**
Operators, PostgreSQL

**Tell us about the problem**
Currently, recovering from accidental data changes in a Percona PG Operator cluster requires a full Point-in-Time Recovery (PITR) from a pgBackRest backup, which involves:
- Restoring a full base backup (potentially hundreds of GB)
- Replaying WAL segments up to the target time
- End-to-end recovery time: minutes to hours, depending on database size and WAL volume

A delayed replica solves this differently — it is a live standby that is perpetually behind the primary by a fixed interval (e.g., 1 hour). If a destructive change is detected within that window:

1. Promote the delayed replica, or extract the affected data from it while it is still a standby.
2. Recovery then takes seconds to minutes, regardless of database size.
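
The timing rule above (a commit at time X is replayed on the delayed replica at X + delay) can be sketched in a few lines. This is only an illustration of the arithmetic; the function names and timestamps are hypothetical and not part of the operator or Patroni.

```python
from datetime import datetime, timedelta, timezone


def visible_on_replica_at(committed_at: datetime, delay: timedelta) -> datetime:
    """Earliest time a commit made at `committed_at` is replayed on the delayed replica."""
    return committed_at + delay


def still_recoverable(committed_at: datetime, now: datetime, delay: timedelta) -> bool:
    """True while the destructive commit has not yet been replayed on the delayed replica."""
    return now < visible_on_replica_at(committed_at, delay)


# Hypothetical incident: DROP TABLE committed at 12:00 UTC, replica delayed by 1 hour.
delay = timedelta(hours=1)
drop_table_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

print(visible_on_replica_at(drop_table_at, delay))
print(still_recoverable(drop_table_at, drop_table_at + timedelta(minutes=40), delay))  # True
print(still_recoverable(drop_table_at, drop_table_at + timedelta(minutes=75), delay))  # False
```

The second check shows why delay sizing matters: once the delay has elapsed, the change is on the delayed replica too and only PITR remains.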

**How Delayed Replication Works in PostgreSQL + Patroni**
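PostgreSQL's `recovery_min_apply_delay` configuration parameter (GUC) makes a standby wait before applying WAL: the standby keeps receiving WAL and writing it to disk, but replays a transaction's commit only once the specified interval has passed since it was committed on the primary. It is: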
- Set in postgresql.conf (or via Patroni)
- Ignored on the primary — only meaningful on standbys
- Expressed as a duration with a time unit, e.g. '30min', '1h', '2h' (a value without a unit is taken as milliseconds)
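
A minimal sketch of how a delay value could be validated before it is passed to PostgreSQL. This is not PostgreSQL's own parser and not the operator's code: `parse_apply_delay_ms` is a hypothetical helper, and for simplicity it accepts only whole numbers with the units `ms`, `s`, `min`, `h` and `d` (PostgreSQL itself also accepts `us`).

```python
import re

# Milliseconds per supported unit; PostgreSQL time units are case-sensitive.
_UNIT_TO_MS = {"ms": 1, "s": 1_000, "min": 60_000, "h": 3_600_000, "d": 86_400_000}
_PATTERN = re.compile(r"\s*([0-9]+)\s*(ms|s|min|h|d)?\s*")
_MAX_MS = 2**31 - 1  # the parameter is a 32-bit integer number of milliseconds


def parse_apply_delay_ms(value: str) -> int:
    """Convert a delay such as '30min' or '1h' to milliseconds, or raise ValueError."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid delay {value!r}: expected e.g. '30min', '1h', '2h'")
    number = int(match.group(1))
    unit = match.group(2) or "ms"  # no unit means milliseconds
    delay_ms = number * _UNIT_TO_MS[unit]
    if delay_ms > _MAX_MS:
        raise ValueError(f"delay {value!r} exceeds the maximum of {_MAX_MS} ms")
    return delay_ms


for candidate in ("30min", "1h", "2h", "1 hour"):
    try:
        print(candidate, "->", parse_apply_delay_ms(candidate), "ms")
    except ValueError as err:
        print(candidate, "-> rejected:", err)
```

Rejecting values such as `'1 hour'` up front gives the user a clear validation error instead of a PostgreSQL startup failure on the delayed pod.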

Patroni supports per-node PostgreSQL parameter overrides via each node's local configuration file. A delayed node must also be tagged:

- `tags.nofailover: true` — must never be automatically promoted (it holds stale data by design)
- `tags.noloadbalance: true` — should not receive regular read traffic
- `tags.nosync: true` — must not be used as a synchronous standby
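
As a rough illustration, the per-node overrides described above could be assembled as follows. The `tags` and `postgresql.parameters` sections are Patroni's local configuration keys; the function name `delayed_node_overrides` is hypothetical, and how the operator would merge this fragment into a pod's Patroni configuration is left open here.

```python
import json


def delayed_node_overrides(delay: str) -> dict:
    """Build the Patroni local-config fragment for a delayed standby (sketch only)."""
    return {
        "tags": {
            "nofailover": True,     # never promote automatically: data is stale by design
            "noloadbalance": True,  # keep regular read traffic away
            "nosync": True,         # never pick as a synchronous standby
        },
        "postgresql": {
            "parameters": {"recovery_min_apply_delay": delay},
        },
    }


# JSON is also valid YAML, so this output can be read as a Patroni config fragment.
print(json.dumps(delayed_node_overrides("1h"), indent=2))
```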

**Acceptance Criteria**

- [ ]  replicationDelay field accepted in PGInstanceSetSpec
- [ ]  Pods in a delayed instance set have recovery_min_apply_delay set correctly in their Patroni local config
- [ ]  Delayed pods are tagged nofailover: true, noloadbalance: true, nosync: true in Patroni
- [ ]  Delayed replica is excluded from automatic failover (verified with Patroni API)
- [ ]  walApplyLag visible in cluster status
- [ ]  Validation error for invalid delay values (for example, an unknown time unit)
- [ ]  Promoting a delayed replica to primary works (manual switchover via kubectl/operator command)
- [ ]  Normal replica instance sets are not affected when a delayed set is present
- [ ]  E2E test: simulate DROP TABLE → verify table still exists on delayed replica within delay window
- [ ]  E2E test: promote delayed replica → cluster healthy
- [ ]  Documentation: use cases, delay sizing guidance, promotion procedure

[Jira Link](https://perconadev.atlassian.net/browse/K8SPG-606)
