
Conversation

@kaze-cow
Contributor

@kaze-cow kaze-cow commented Dec 1, 2025

Description

Currently the run_forever loop in autopilot keeps looping without waiting as long as there is no auction to process and the block phase is still early enough. This is of course not efficient, and it makes it difficult to read debug logs in the playground because they are spammed every time a block is mined. Technically, an event listener is also likely to process events faster than the current 50ms database-scanning loop.
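
To make the problem concrete, here is a simplified sketch of the kind of busy loop described above; it is not the actual run_forever code, and the names auction_ready, block_phase_is_early and run_auction_iteration are hypothetical stand-ins for the real checks:

while !control.should_shutdown() {
    if !auction_ready() && block_phase_is_early() {
        // Nothing to do yet, but the loop restarts immediately instead of
        // waiting for a new block or order, so every mined block produces
        // a burst of debug log lines in the playground.
        continue;
    }
    run_auction_iteration().await;
}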

Changes

  • Add a PostgreSQL notification that fires when a new order is inserted into the orders table
  • Listen for that notification and trigger the main loop when it fires (see the sketch after this list)
  • Update the playground logging verbosity configuration to make it easier to track the loop status. Now that the logs are no longer spammed by the infinite looping, these changes make sense!
  • Verify in the playground that the loop is executed on new blocks and on new orders
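
Below is a minimal sketch of how these pieces fit together, assuming a Postgres trigger that issues NOTIFY on inserts into the orders table. The channel name new_order and the retry timing are hypothetical; sqlx::postgres::PgListener and tokio's Notify are the building blocks discussed in the review below:

use std::{sync::Arc, time::Duration};
use tokio::sync::Notify;

/// Forward Postgres NOTIFY events for new orders to the main loop's wake-up
/// Notify. Channel name and error handling are illustrative only.
async fn forward_order_notifications(pool: sqlx::PgPool, wake_notify: Arc<Notify>) {
    loop {
        let mut listener = match sqlx::postgres::PgListener::connect_with(&pool).await {
            Ok(listener) => listener,
            Err(err) => {
                tracing::error!(?err, "failed to create PostgreSQL listener");
                tokio::time::sleep(Duration::from_secs(5)).await;
                continue;
            }
        };
        if let Err(err) = listener.listen("new_order").await {
            tracing::error!(?err, "failed to LISTEN on the new_order channel");
            tokio::time::sleep(Duration::from_secs(5)).await;
            continue;
        }
        // Each notification wakes the main loop; if the connection drops we
        // fall out of the inner loop and reconnect.
        while listener.recv().await.is_ok() {
            wake_notify.notify_one();
        }
    }
}

The main loop side then only needs to await wake_notify.notified() at the top of each iteration, which is the line visible in the diff excerpt further down.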

How to test

  1. Run the playground as normal: docker compose -f playground/docker-compose.fork.yml up --build
  • You may have to delete and recreate your existing playground with docker compose -f playground/docker-compose.fork.yml down --volumes
  2. When all services stabilize, run playground/test_playground.sh
  3. Observe logs indicating that the order event was received

@kaze-cow kaze-cow self-assigned this Dec 1, 2025
@kaze-cow kaze-cow requested a review from a team as a code owner December 1, 2025 08:36
@github-actions

github-actions bot commented Dec 1, 2025

Reminder: Please update the DB Readme and comment whether migrations are reversible (include rollback scripts if applicable).

  • If creating new tables, update the tables list.
  • When adding a new index, consider using CREATE INDEX CONCURRENTLY for tables involved in the critical execution path.
  • For breaking changes, remember that during rollout k8s starts the new autopilot, runs the Flyway migration, and only then shuts down the old pod. That overlap means the previous version can still be processing requests on the migrated schema, so make it compatible first and ship the breaking DB change in the following release.

Contributor


Putting everything in one huge nested statement makes it quite hard to read. Instead you could just use smaller match statements with early returns, like this:

let mut listener = match sqlx::postgres::PgListener::connect_with(&pool).await {
    Ok(listener) => listener,
    Err(err) => {
        tracing::error!(?err, "failed to create PostgreSQL listener");
        tokio::time::sleep(Duration::from_secs(5)).await;
        continue;
    }
};

Contributor Author

@kaze-cow kaze-cow Dec 2, 2025


It looks a bit better with the early return as you suggest, but it's still not that great.

I'm torn. Do you think it would be best to create a nested function for the inner loop and use ? to resolve the Results? But then we don't get specific error messages for each different place an error could happen. I wonder if there is some other way to make this nice for loops in Rust...
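
For reference, one way to keep step-specific error messages while still using ? is to attach context to each call with anyhow::Context. A hedged sketch of that alternative (the function and channel names are illustrative, not the code in this PR):

use anyhow::Context;

/// One pass of the listen loop; each ? carries its own message via
/// .context(...), so the caller can log a single error and retry.
async fn listen_once(pool: &sqlx::PgPool, wake_notify: &tokio::sync::Notify) -> anyhow::Result<()> {
    let mut listener = sqlx::postgres::PgListener::connect_with(pool)
        .await
        .context("failed to create PostgreSQL listener")?;
    listener
        .listen("new_order")
        .await
        .context("failed to LISTEN on the new_order channel")?;
    loop {
        listener
            .recv()
            .await
            .context("notification connection dropped")?;
        wake_notify.notify_one();
    }
}

The outer loop then keeps the retry and backoff logic in one place: if listen_once returns an error, log it and sleep before trying again.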

* use Notify from tokio instead of channel
* move the notification function into the persistence module
* update the trigger to return the newly created order ID so that it can be immediately accessed/updated into a cache
and try to make the notify loop more beautiful with an early return
@kaze-cow kaze-cow requested a review from MartinquaXD December 3, 2025 07:36
Contributor

@jmg-duarte jmg-duarte left a comment


LGTM


while !control.should_shutdown() {
// Wait for a new block or order before proceeding
self_arc.wake_notify.notified().await;
Contributor

@fafk fafk Dec 3, 2025


I would place it after line 175, after

            // caches are warmed up, we're ready to do leader work
            if let Some(startup) = self_arc.probes.startup.as_ref() {
                startup.store(true, Ordering::Release);
            }

so it doesn't wait on startup before warming up the caches, which takes a lot of time and should be done right away. LGTM otherwise, 👌

Contributor Author


Ooh, good idea. But then wouldn't the caches potentially be stale by the time the auction starts?

Contributor


For networks which are not mainnet it doesn't really matter, because they have super short block times and little traffic. And for mainnet you are right, but the query on startup takes 10+ seconds, so I think we're actually increasing the likelihood of an order getting into the auction earlier. 🤔 Under normal circumstances it doesn't matter, because an autopilot instance which is on standby will become leader and it will have its caches warmed up. tl;dr my suggestion is actually a nit and not very relevant in the grand scheme of things.

Contributor Author


OK, so on all networks except mainnet it is not important, and on mainnet it's unclear whether it would help or not... your explanation makes sense though, especially considering auctions take more than one block to get going anyway. I will go ahead and move it to after the cache warm-up as you suggest.
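
For clarity, a self-contained sketch of the ordering agreed on here: warm the caches first, then flip the startup probe, and only then block on notifications each iteration. All names other than Notify and Ordering::Release are stand-ins, not the actual autopilot fields:

use std::sync::{
    Arc,
    atomic::{AtomicBool, Ordering},
};
use tokio::sync::Notify;

async fn leader_loop(startup: Arc<AtomicBool>, wake_notify: Arc<Notify>) {
    warm_up_caches().await;                 // the slow part, 10+ seconds on mainnet
    startup.store(true, Ordering::Release); // caches are warmed up, we're ready

    loop {
        // Wait for a new block or order before proceeding.
        wake_notify.notified().await;
        run_auction_iteration().await;
    }
}

async fn warm_up_caches() { /* stand-in for the real cache warm-up */ }
async fn run_auction_iteration() { /* stand-in for one auction cut */ }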

@squadgazzz
Contributor

squadgazzz commented Dec 5, 2025

Should we also account for ethflow orders then? ethflow_orders table.

@kaze-cow
Contributor Author

kaze-cow commented Dec 5, 2025

Should we also account for ethflow orders then? ethflow_orders table.

There was a previous comment about this, and it was discovered that all orders end up in the same table, including ethflow/onchain orders. I just double-checked, and the specific table you mentioned appears to be updated in the same loop as the orders table inserts, in append_custom_order_info_to_db: https://github.com/cowprotocol/services/blob/main/crates/autopilot/src/database/onchain_order_events/mod.rs#L368

@kaze-cow kaze-cow enabled auto-merge December 5, 2025 13:35
Contributor

@squadgazzz squadgazzz left a comment


LGTM

@kaze-cow
Copy link
Contributor Author

kaze-cow commented Dec 8, 2025

The test was failing because, with the recent change to put the loop wait after the cache generation, a new block is only minted once before the test verifies the condition (in this case, an order being cancelled after being placed), and the auction order cache does not get repopulated. For now I have added additional block minting to the wait_for_condition loop.

(btw I wasn't able to replicate this test failure on my local machine, so it seems it could be considered flaky)
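
For anyone hitting the same flake, a hypothetical sketch of the workaround described above: a wait-for-condition helper that mints an extra block on every polling attempt so the notification-driven loop keeps getting woken up. The names are illustrative and not the actual e2e helpers:

use std::time::Duration;

async fn wait_for_condition_with_blocks<Cond, Fut>(
    timeout: Duration,
    mut mint_block: impl FnMut(),
    mut condition: Cond,
) -> bool
where
    Cond: FnMut() -> Fut,
    Fut: std::future::Future<Output = bool>,
{
    let deadline = tokio::time::Instant::now() + timeout;
    while tokio::time::Instant::now() < deadline {
        if condition().await {
            return true;
        }
        // Mint a block so the awaited state change (here, an order being
        // cancelled after being placed) can be picked up by a fresh auction.
        mint_block();
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    false
}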

@kaze-cow kaze-cow added this pull request to the merge queue Dec 8, 2025
Merged via the queue into main with commit 182ae20 Dec 8, 2025
18 checks passed
@kaze-cow kaze-cow deleted the fix/autopilot-proper-wait branch December 8, 2025 07:20
@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2025
