
Conversation

@kaze-cow
Contributor

@kaze-cow kaze-cow commented Dec 1, 2025

Description

Currently the run_forever loop in autopilot keeps looping without waiting as long as there is no auction to process and the block phase is still early enough. This is of course not efficient, and it makes it difficult to read debug logs in the playground because they are spammed every time a block is mined. Technically, an event listener is also likely to process events faster than the current 50ms database-scanning loop.
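
To make the problem concrete, here is a simplified sketch of the kind of busy loop described above; it is not the actual run_forever code, and the names auction_ready, block_phase_is_early and run_auction_iteration are hypothetical stand-ins for the real checks:

while !control.should_shutdown() {
    if !auction_ready() && block_phase_is_early() {
        // Nothing to do yet, but the loop restarts immediately instead of
        // waiting for a new block or order, so every mined block produces
        // a burst of debug log lines in the playground.
        continue;
    }
    run_auction_iteration().await;
}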

Changes

  • Add a PostgreSQL notification that fires when a new order is inserted into the orders table
  • Listen for that notification and trigger the main loop when it fires (see the sketch after this list)
  • Update the playground logging verbosity configuration to make it easier to track the loop status. Now that the logs are no longer spammed by the infinite looping, these changes make sense!
  • Verify in the playground that the loop is executed on new blocks and on new orders
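
Below is a minimal sketch of how these pieces fit together, assuming a Postgres trigger that issues NOTIFY on inserts into the orders table. The channel name new_order and the retry timing are hypothetical; sqlx::postgres::PgListener and tokio's Notify are the building blocks discussed in the review below:

use std::{sync::Arc, time::Duration};
use tokio::sync::Notify;

/// Forward Postgres NOTIFY events for new orders to the main loop's wake-up
/// Notify. Channel name and error handling are illustrative only.
async fn forward_order_notifications(pool: sqlx::PgPool, wake_notify: Arc<Notify>) {
    loop {
        let mut listener = match sqlx::postgres::PgListener::connect_with(&pool).await {
            Ok(listener) => listener,
            Err(err) => {
                tracing::error!(?err, "failed to create PostgreSQL listener");
                tokio::time::sleep(Duration::from_secs(5)).await;
                continue;
            }
        };
        if let Err(err) = listener.listen("new_order").await {
            tracing::error!(?err, "failed to LISTEN on the new_order channel");
            tokio::time::sleep(Duration::from_secs(5)).await;
            continue;
        }
        // Each notification wakes the main loop; if the connection drops we
        // fall out of the inner loop and reconnect.
        while listener.recv().await.is_ok() {
            wake_notify.notify_one();
        }
    }
}

The main loop side then only needs to await wake_notify.notified() at the top of each iteration, which is the line visible in the diff excerpt further down.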

How to test

  1. Run the playground as normal: docker compose -f playground/docker-compose.fork.yml up --build
  • You may have to delete and recreate your existing playground with docker compose -f playground/docker-compose.fork.yml down --volumes
  2. When all services stabilize, run playground/test_playground.sh
  3. Observe logs indicating that the order event was received

@kaze-cow kaze-cow self-assigned this Dec 1, 2025
@kaze-cow kaze-cow requested a review from a team as a code owner December 1, 2025 08:36
@github-actions

github-actions bot commented Dec 1, 2025

Reminder: Please update the DB Readme and comment whether migrations are reversible (include rollback scripts if applicable).

  • If creating new tables, update the tables list.
  • When adding a new index, consider using CREATE INDEX CONCURRENTLY for tables involved in the critical execution path.
  • For breaking changes, remember that during rollout k8s starts the new autopilot, runs the Flyway migration, and only then shuts down the old pod. That overlap means the previous version can still be processing requests on the migrated schema, so make it compatible first and ship the breaking DB change in the following release.

Contributor


Putting everything in one huge nested statement makes it quite hard to read. Instead you could just use smaller match statements with early returns, like this:

let mut listener = match sqlx::postgres::PgListener::connect_with(&pool).await {
    Ok(listener) => listener,
    Err(err) => {
        tracing::error!(?err, "failed to create PostgreSQL listener");
        tokio::time::sleep(Duration::from_secs(5)).await;
        continue;
    }
};

Contributor Author

@kaze-cow kaze-cow Dec 2, 2025


It looks a bit better with the early return as you suggest, but it's still not that great.

I'm torn. Do you think it would be best to create a nested function for the inner loop and use ? to resolve the Results? But then we don't get specific error messages for each different place an error could happen. I wonder if there is some other way to make this nice for loops in Rust...
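
For reference, one way to keep step-specific error messages while still using ? is to attach context to each call with anyhow::Context. A hedged sketch of that alternative (the function and channel names are illustrative, not the code in this PR):

use anyhow::Context;

/// One pass of the listen loop; each ? carries its own message via
/// .context(...), so the caller can log a single error and retry.
async fn listen_once(pool: &sqlx::PgPool, wake_notify: &tokio::sync::Notify) -> anyhow::Result<()> {
    let mut listener = sqlx::postgres::PgListener::connect_with(pool)
        .await
        .context("failed to create PostgreSQL listener")?;
    listener
        .listen("new_order")
        .await
        .context("failed to LISTEN on the new_order channel")?;
    loop {
        listener
            .recv()
            .await
            .context("notification connection dropped")?;
        wake_notify.notify_one();
    }
}

The outer loop then keeps the retry and backoff logic in one place: if listen_once returns an error, log it and sleep before trying again.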

* use Notify from tokio instead of channel
* move the notification function into the persistence module
* update the trigger to return the newly created order ID so that it can be immediately accessed/updated into a cache
and try to make the notify loop more beautiful with an early return
@kaze-cow kaze-cow requested a review from MartinquaXD December 3, 2025 07:36
Contributor

@jmg-duarte jmg-duarte left a comment


LGTM


while !control.should_shutdown() {
// Wait for a new block or order before proceeding
self_arc.wake_notify.notified().await;
Contributor

@fafk fafk Dec 3, 2025


I would place it after line 175, after

            // caches are warmed up, we're ready to do leader work
            if let Some(startup) = self_arc.probes.startup.as_ref() {
                startup.store(true, Ordering::Release);
            }

so it doesn't wait on startup before warming up the caches, which takes a lot of time and should be done right away. LGTM otherwise, 👌

Contributor Author


Ooh, good idea. But then wouldn't the caches potentially be stale by the time the auction starts?

Contributor


For networks which are not mainnet it doesn't really matter, because they have super short block times and little traffic. And for mainnet you are right, but the query on startup takes 10+ seconds, so I think we're actually increasing the likelihood of an order getting into the auction earlier. 🤔 Under normal circumstances it doesn't matter, because an autopilot instance which is on standby will become leader and it will have its caches warmed up. tl;dr my suggestion is actually a nit and not very relevant in the grand scheme of things.

Contributor Author


OK, so on all networks except mainnet it is not important, and on mainnet it's unclear whether it would help or not... your explanation makes sense though, especially considering auctions take more than one block to get going anyway. I will go ahead and move it to after the cache warm-up as you suggest.
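
For clarity, a self-contained sketch of the ordering agreed on here: warm the caches first, then flip the startup probe, and only then block on notifications each iteration. All names other than Notify and Ordering::Release are stand-ins, not the actual autopilot fields:

use std::sync::{
    Arc,
    atomic::{AtomicBool, Ordering},
};
use tokio::sync::Notify;

async fn leader_loop(startup: Arc<AtomicBool>, wake_notify: Arc<Notify>) {
    warm_up_caches().await;                 // the slow part, 10+ seconds on mainnet
    startup.store(true, Ordering::Release); // caches are warmed up, we're ready

    loop {
        // Wait for a new block or order before proceeding.
        wake_notify.notified().await;
        run_auction_iteration().await;
    }
}

async fn warm_up_caches() { /* stand-in for the real cache warm-up */ }
async fn run_auction_iteration() { /* stand-in for one auction cut */ }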

@squadgazzz
Contributor

squadgazzz commented Dec 5, 2025

Should we also account for ethflow orders then? ethflow_orders table.

@kaze-cow
Contributor Author

kaze-cow commented Dec 5, 2025

Should we also account for ethflow orders then? ethflow_orders table.

There was a previous comment about this, and it was discovered that all orders end up in the same table, including ethflow/onchain orders. I just double-checked, and the specific table you mentioned appears to be updated in the same loop as the orders table inserts, in append_custom_order_info_to_db: https://github.com/cowprotocol/services/blob/main/crates/autopilot/src/database/onchain_order_events/mod.rs#L368

@kaze-cow kaze-cow enabled auto-merge December 5, 2025 13:35
Contributor

@squadgazzz squadgazzz left a comment


LGTM

@kaze-cow
Copy link
Contributor Author

kaze-cow commented Dec 8, 2025

The test was failing because, with the recent change to put the loop wait after the cache generation, a new block is only minted once before the test verifies the condition (in this case, an order being cancelled after being placed), and the auction order cache does not get repopulated. For now I have added additional block minting to the wait_for_condition loop.

(btw I wasn't able to replicate this test failure on my local machine, so it seems it could be considered flaky)
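
For anyone hitting the same flake, a hypothetical sketch of the workaround described above: a wait-for-condition helper that mints an extra block on every polling attempt so the notification-driven loop keeps getting woken up. The names are illustrative and not the actual e2e helpers:

use std::time::Duration;

async fn wait_for_condition_with_blocks<Cond, Fut>(
    timeout: Duration,
    mut mint_block: impl FnMut(),
    mut condition: Cond,
) -> bool
where
    Cond: FnMut() -> Fut,
    Fut: std::future::Future<Output = bool>,
{
    let deadline = tokio::time::Instant::now() + timeout;
    while tokio::time::Instant::now() < deadline {
        if condition().await {
            return true;
        }
        // Mint a block so the awaited state change (here, an order being
        // cancelled after being placed) can be picked up by a fresh auction.
        mint_block();
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    false
}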

@kaze-cow kaze-cow added this pull request to the merge queue Dec 8, 2025
Merged via the queue into main with commit 182ae20 Dec 8, 2025
18 checks passed
@kaze-cow kaze-cow deleted the fix/autopilot-proper-wait branch December 8, 2025 07:20
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2025
