
@jmg-duarte
Contributor

@jmg-duarte jmg-duarte commented Dec 1, 2025

Description

We're getting autopilot errors because ANALYZE is being run against our read replica. This PR disables it for the read replica only.

2025-12-01T03:02:50.723Z ERROR database_metrics: autopilot::database: failed to update large tables stats err=Database(PgDatabaseError { severity: Error, code: "25006", message: "cannot execute ANALYZE during recovery", detail: None, hint: None, position: None, where: None, schema: None, table: None, column: None, data_type: None, constraint: None, file: Some("utility.c"), line: Some(455), routine: Some("PreventCommandDuringRecovery") })
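
For illustration, the failure in the log above can be recognised by its SQLSTATE. PostgreSQL documents code `25006` as `read_only_sql_transaction`. The sketch below is not the code from this PR; `StatsFailure` and `classify_stats_failure` are hypothetical names, and the function takes the code as a plain string so that it does not depend on any particular database client.

```rust
/// SQLSTATE reported in the log above; PostgreSQL names it
/// `read_only_sql_transaction`.
const READ_ONLY_SQL_TRANSACTION: &str = "25006";

/// Outcome of classifying a failed statistics update (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum StatsFailure {
    /// The server refused the command because it is read-only or in recovery.
    ReadOnlyServer,
    /// Any other database error, or an error without a SQLSTATE.
    Other,
}

/// Classifies a database error by its optional SQLSTATE code.
pub fn classify_stats_failure(sqlstate: Option<&str>) -> StatsFailure {
    match sqlstate {
        Some(READ_ONLY_SQL_TRANSACTION) => StatsFailure::ReadOnlyServer,
        _ => StatsFailure::Other,
    }
}
```

A caller could use this to downgrade the log level for `ReadOnlyServer`, although not issuing the command at all on a replica avoids the error in the first place.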

https://aws-es.cow.fi/_dashboards/app/discover#/context/86e4a5a0-4e4b-11ef-85c5-3946a99ed1a7/Lhrc15oBNcYyVCDI7-L5?_g=(filters:!())&_a=(columns:!(timestamp,log,log_level,kubernetes.pod_name),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'86e4a5a0-4e4b-11ef-85c5-3946a99ed1a7',key:kubernetes.container_name,negate:!f,params:(query:polygon-autopilot-prod),type:phrase),query:(match_phrase:(kubernetes.container_name:polygon-autopilot-prod))),('$state':(store:appState),meta:(alias:!n,disabled:!t,index:'86e4a5a0-4e4b-11ef-85c5-3946a99ed1a7',key:log_level,negate:!f,params:!(ERROR,FATAL),type:phrases,value:'ERROR,%20FATAL'),query:(bool:(minimum_should_match:1,should:!((match_phrase:(log_level:ERROR)),(match_phrase:(log_level:FATAL))))))))

The following types of administration commands are not accepted during recovery mode:

Data Definition Language (DDL): e.g., CREATE INDEX

Privilege and Ownership: GRANT, REVOKE, REASSIGN

Maintenance commands: ANALYZE, VACUUM, CLUSTER, REINDEX

Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary.

As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby.

https://www.postgresql.org/docs/current/hot-standby.html
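
As a rough illustration of the list quoted above, the sketch below checks whether a statement starts with one of the commands named in that excerpt. It is a simple prefix check, not a SQL parser, and it covers only the commands the excerpt names; `rejected_during_recovery` is a hypothetical helper that is not part of this PR.

```rust
/// Returns true if the statement starts with one of the commands that the
/// quoted documentation names as rejected during recovery.
/// Simplified prefix check: only the commands named in the excerpt are listed.
pub fn rejected_during_recovery(statement: &str) -> bool {
    const REJECTED_PREFIXES: [&str; 8] = [
        "CREATE INDEX",
        "GRANT",
        "REVOKE",
        "REASSIGN",
        "ANALYZE",
        "VACUUM",
        "CLUSTER",
        "REINDEX",
    ];

    // Drop a trailing semicolon, collapse whitespace runs, and upper-case so
    // that "  analyze   orders;" compares equal to "ANALYZE ORDERS".
    let normalized = statement
        .trim()
        .trim_end_matches(';')
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_ascii_uppercase();

    // Match either the bare command or the command followed by a space, so
    // that e.g. "GRANTED" is not mistaken for "GRANT".
    REJECTED_PREFIXES
        .iter()
        .any(|prefix| normalized == *prefix || normalized.starts_with(&format!("{prefix} ")))
}
```

Because the check looks only at the start of the statement, `EXPLAIN ANALYZE SELECT ...` is not flagged: it begins with `EXPLAIN`, not `ANALYZE`.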

Changes

  • Check if the autopilot is connected to a read replica and do not issue ANALYZE commands
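
The change can be sketched roughly as follows. This is not the actual diff: the `StatsDb` trait, `update_large_tables_stats` as written here, and `FakeDb` are hypothetical stand-ins, and how the autopilot really detects a read replica is not shown in this conversation. One option with PostgreSQL would be `SELECT pg_is_in_recovery()`, which returns true on a standby.

```rust
/// Minimal abstraction over the database connection (hypothetical).
pub trait StatsDb {
    /// Whether the connected server is a read replica. With PostgreSQL this
    /// could be backed by `SELECT pg_is_in_recovery()` (assumption).
    fn is_read_replica(&self) -> bool;
    /// Issues `ANALYZE <table>` against the connected server.
    fn analyze(&mut self, table: &str) -> Result<(), String>;
}

/// Runs ANALYZE for each table unless connected to a read replica.
/// Returns how many ANALYZE commands were issued.
pub fn update_large_tables_stats<D: StatsDb>(
    db: &mut D,
    tables: &[&str],
) -> Result<usize, String> {
    if db.is_read_replica() {
        // Replicas reject ANALYZE, so issue nothing at all.
        return Ok(0);
    }
    for table in tables {
        db.analyze(table)?;
    }
    Ok(tables.len())
}

/// In-memory stand-in used to exercise the sketch without a real database.
pub struct FakeDb {
    pub replica: bool,
    pub analyzed: Vec<String>,
}

impl StatsDb for FakeDb {
    fn is_read_replica(&self) -> bool {
        self.replica
    }

    fn analyze(&mut self, table: &str) -> Result<(), String> {
        if self.replica {
            // Mirrors the message from the production log.
            return Err("cannot execute ANALYZE during recovery".to_string());
        }
        self.analyzed.push(table.to_string());
        Ok(())
    }
}
```

With `FakeDb { replica: true, .. }` the function returns `Ok(0)` and `analyzed` stays empty; with `replica: false` every table passed in is analyzed. Any table names used to exercise it are placeholders.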

How to test

Tested in staging. The change was issued around 10:40. Since the task first runs ANALYZE and then sleeps, the absence of errors after that time means no command was issued.


@jmg-duarte jmg-duarte marked this pull request as ready for review December 3, 2025 11:21
@jmg-duarte jmg-duarte requested a review from a team as a code owner December 3, 2025 11:21
Contributor

@MartinquaXD MartinquaXD left a comment


As a bit of context, this logic is only used to make table sizes show up in Grafana. Since then, we have introduced a Postgres metrics exporter. Maybe the better solution would be to drop this logic from the services altogether and solve the issue in infra instead. This should only happen if it's very easy to do, as these metrics are low priority overall.

@squadgazzz
Contributor

squadgazzz commented Dec 4, 2025

IIRC, we had this error with the primary DB as well, and the devops team fixed it somehow, but that was a long time ago. I would rather do the opposite: switch those queries to the read replica (if connected) and not touch the primary DB at all.

@jmg-duarte
Contributor Author

jmg-duarte commented Dec 5, 2025

switch those queries to the read replica(if connected) and don't touch the primary DB at all.

That's the current behavior, and it is unsupported.

@squadgazzz
Contributor

That's the current and unsupported behavior

Ah, true, sorry 😅
From the PR description, I didn't get why we can't use ANALYZE with the read replica. Isn't that just a DB config issue?

@jmg-duarte
Contributor Author

From the PR description, I didn't get why we can't use ANALYZE with the read replica. Isn't that just a DB config issue?

The read-only replica runs in recovery mode, under which you can't issue the commands stated in the PR description.

As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby.

@m-sz
Contributor

m-sz commented Dec 8, 2025

Please make sure to notify the team when the DB_READ_URL env var is removed from the vault.

@jmg-duarte jmg-duarte enabled auto-merge December 8, 2025 11:45
@jmg-duarte jmg-duarte added this pull request to the merge queue Dec 8, 2025
@MartinquaXD MartinquaXD removed this pull request from the merge queue due to a manual request Dec 8, 2025
@jmg-duarte jmg-duarte added this pull request to the merge queue Dec 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 8, 2025
@jmg-duarte jmg-duarte added this pull request to the merge queue Dec 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 8, 2025
@jmg-duarte jmg-duarte added this pull request to the merge queue Dec 8, 2025
Merged via the queue into main with commit f28cb49 Dec 8, 2025
18 checks passed
@jmg-duarte jmg-duarte deleted the jmgd/analyze branch December 8, 2025 15:15
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2025

6 participants