
Mempool improvements#3337

Merged
jcnelson merged 33 commits into develop from feat/mempool-improvements
Oct 19, 2022

Conversation

@obycode
Contributor

@obycode obycode commented Oct 12, 2022

This is the cleaned up version of simple-iterator discussed in #3326 and #3313.

To-do before checking in:

  • verify that a snapshot chainstate creates a DB with the right query plan
    • can do this as part of bringing up the mock miner from a snapshot, then check the query plan

These changes take lessons learned from experiments by @gregorycoppola
and myself, as well as feedback on various related PRs. They use the
following techniques to improve the speed and scalability of the mempool
walk:
* Uses rusqlite's `Rows` iterator to read one row at a time
* Caches the nonces in memory to avoid repeated lookups
* Restarts search from the highest fee-rate transactions after every
  executed transaction
  * Caches potential transactions in memory to retry on next pass

With this implementation, miners can reliably fill a block in <30s,
regardless of how large the mempool gets.
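A minimal sketch of the bounded nonce cache described above (the `NonceCache` name matches the PR's concept, but this simplified shape and its `get`/`bump` API are illustrative assumptions, not the PR's actual types):

```rust
use std::collections::HashMap;

/// Illustrative bounded nonce cache: once full, new addresses are simply
/// not cached, so their nonces are looked up on every visit (the simple
/// size cap this PR uses, rather than an eviction policy).
struct NonceCache {
    cache: HashMap<String, u64>,
    max_size: usize,
}

impl NonceCache {
    fn new(max_size: usize) -> Self {
        NonceCache { cache: HashMap::new(), max_size }
    }

    /// Return the cached nonce, or fall back to `load` (standing in for
    /// a MARF read) and cache the result if there is room.
    fn get<F: Fn(&str) -> u64>(&mut self, addr: &str, load: F) -> u64 {
        if let Some(&nonce) = self.cache.get(addr) {
            return nonce;
        }
        let nonce = load(addr);
        if self.cache.len() < self.max_size {
            self.cache.insert(addr.to_string(), nonce);
        }
        nonce
    }

    /// Advance the expected nonce after a transaction is included.
    fn bump(&mut self, addr: &str) {
        if let Some(nonce) = self.cache.get_mut(addr) {
            *nonce += 1;
        }
    }
}
```

A cached hit avoids the repeated per-candidate lookup the description above refers to; only addresses encountered after the cache fills pay the full lookup cost each time.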
@codecov

codecov bot commented Oct 12, 2022

Codecov Report

Merging #3337 (36690d0) into develop (5123caa) will decrease coverage by 0.07%.
The diff coverage is 21.53%.

@@             Coverage Diff             @@
##           develop    #3337      +/-   ##
===========================================
- Coverage    32.02%   31.95%   -0.08%     
===========================================
  Files          261      261              
  Lines       208687   209709    +1022     
===========================================
+ Hits         66830    67008     +178     
- Misses      141857   142701     +844     
Impacted Files Coverage Δ
src/core/tests/mod.rs 0.00% <0.00%> (ø)
src/chainstate/stacks/miner.rs 13.00% <1.21%> (-0.51%) ⬇️
testnet/stacks-node/src/config.rs 48.64% <40.00%> (-0.08%) ⬇️
src/core/mempool.rs 71.39% <88.64%> (+0.10%) ⬆️
src/chainstate/stacks/db/accounts.rs 28.58% <100.00%> (+0.23%) ⬆️
stacks-common/src/deps_common/bitcoin/util/hash.rs 36.86% <0.00%> (-2.19%) ⬇️
src/net/dns.rs 16.90% <0.00%> (-1.72%) ⬇️
src/burnchains/bitcoin/mod.rs 36.61% <0.00%> (-1.41%) ⬇️
...-node/src/burnchains/bitcoin_regtest_controller.rs 86.22% <0.00%> (-0.20%) ⬇️
... and 27 more


@obycode
Contributor Author

obycode commented Oct 12, 2022

Re-running my benchmarks today, this PR now compares favorably to #3326.

I was previously seeing better numbers for the other implementation, so it might be good for someone else to try benchmarking these and see if it is reproducible.

The better numbers from before were when the caching was applied on top of the changes in #3326.

| Version | Commit | Num Tx | %-Full | Fees Collected (uSTX) | Time Spent (ms) |
| --- | --- | --- | --- | --- | --- |
| master | 378fc1b | 105 | 4.97% | 25175605 | 33983.1 |
| #3326 | 6052447 | 1034 | 99.97% | 144993355 | 16822.8 |
| #3337 | 9a62096 | 952 | 99.96% | 155684975 | 14121.9 |

This was a typo introduced when refactoring, and it caused worse ordering
for the transactions.
@obycode
Contributor Author

obycode commented Oct 12, 2022

I made a mistake when refactoring this code and the retry list was being processed in the wrong order. The updated benchmarking numbers are here:

| Version | Commit | Num Tx | %-Full | Fees Collected (uSTX) | Time Spent (ms) |
| --- | --- | --- | --- | --- | --- |
| master | 378fc1b | 105 | 4.97% | 25175605 | 33983.1 |
| #3326 | 6052447 | 1034 | 99.97% | 144993355 | 16822.8 |
| #3337 (with bug) | 9a62096 | 952 | 99.96% | 155684975 | 14121.9 |
| #3337 | 9190093 | 972 | 99.95% | 131784486 | 15563.8 |

It's surprising that the fees collected get worse when this bug is fixed. That must indicate that the cost estimate (and thus the fee rate) is off.

@obycode
Contributor Author

obycode commented Oct 12, 2022

Ah, I found another bug in this refactoring. Will fix and rerun the experiment.

@obycode
Contributor Author

obycode commented Oct 12, 2022

Ok, after the bug fixes, the benchmark results look better:

| Version | Commit | Num Tx | %-Full | Fees Collected (uSTX) | Time Spent (ms) |
| --- | --- | --- | --- | --- | --- |
| master | 378fc1b | 105 | 4.97% | 25175605 | 33983.1 |
| #3326 | 6052447 | 1034 | 99.97% | 144993355 | 16822.8 |
| #3337 (with bugs) | 9a62096 | 952 | 99.96% | 155684975 | 14121.9 |
| #3337 | 1586e86 | 1097 | 100% | 189810290 | 15624.7 |

Now we get a 100% full block (read count is 15,000), and we get 44.82 STX more fees than the alternate implementation.

@obycode
Contributor Author

obycode commented Oct 12, 2022

The failing unit test core::tests::mempool_walk_over_fork relies on behavior internal to the old design: the "last known nonces" in the mempool table. This design does not use those, so the test fails. I will look into whether it is worth rewriting the test, or if it should just be skipped.

@obycode
Contributor Author

obycode commented Oct 12, 2022

Updated table with new numbers from the latest version from #3326:

| Version | Commit | Num Tx | %-Full | Fees Collected (uSTX) | Time Spent (ms) |
| --- | --- | --- | --- | --- | --- |
| master | 378fc1b | 105 | 4.97% | 25175605 | 33983.1 |
| #3326 | faffe42 | 1034 | 99.97% | 144993355 | 16604.9 |
| #3337 | 1586e86 | 1097 | 100% | 189810290 | 15624.7 |

// Simple size cap to the cache -- once it's full, all nonces
// will be looked up every time. This is bad for performance
// but is unlikely to occur due to the typical number of
// transactions processed before filling a block.
Member

How often is this cache cleared? Is it once per block?

Also, it is knowable how many addresses can be loaded per block -- we could, in theory, calculate the maximum number of transactions that could be mined in a block, and use that to derive a maximum number of addresses that block could touch.

Member

Given how this cache gets used, it might make sense to just use a simple LRU strategy for now. This cache is meant to help minimize the number of times we have to read the MARF to load up a nonce, so eliminating the most-common cases would be a good first attempt.

One day, subsequent refinement might consider the number of nodes that must be visited in the MARF to load the nonce. If the address's nonce was recently changed, then there are fewer tries to visit. It would make sense then to cache a nonce with probability proportional to how long ago it was last changed. It doesn't have to be in this PR, but I'm flagging it here for consideration.
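A minimal std-only sketch of the LRU strategy suggested here, assuming a simple address-to-nonce map (a production version would more likely use a purpose-built structure such as the `lru` crate):

```rust
use std::collections::{HashMap, VecDeque};

/// Sketch of an LRU nonce cache: when full, the least-recently-used
/// address is evicted to make room, which keeps the hottest senders
/// cached and bounds memory use.
struct LruNonceCache {
    map: HashMap<String, u64>,
    order: VecDeque<String>, // front = most recently used
    capacity: usize,
}

impl LruNonceCache {
    fn new(capacity: usize) -> Self {
        Self { map: HashMap::new(), order: VecDeque::new(), capacity }
    }

    /// Move `addr` to the most-recently-used position.
    fn touch(&mut self, addr: &str) {
        if let Some(pos) = self.order.iter().position(|a| a == addr) {
            let a = self.order.remove(pos).unwrap();
            self.order.push_front(a);
        }
    }

    fn insert(&mut self, addr: String, nonce: u64) {
        if self.map.contains_key(&addr) {
            self.map.insert(addr.clone(), nonce);
            self.touch(&addr);
            return;
        }
        if self.map.len() == self.capacity {
            // Evict the least recently used address.
            if let Some(old) = self.order.pop_back() {
                self.map.remove(&old);
            }
        }
        self.order.push_front(addr.clone());
        self.map.insert(addr, nonce);
    }

    fn get(&mut self, addr: &str) -> Option<u64> {
        let hit = self.map.get(addr).copied();
        if hit.is_some() {
            self.touch(addr);
        }
        hit
    }
}
```

The linear scan in `touch` is fine for a sketch; a real cache would pair the map with an intrusive list or use an off-the-shelf LRU to make every operation O(1).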

Contributor

Should we separate out the cache changes from the non-caching part of this PR?

Member

@gregorycoppola I believe that because this cache is key to mempool iteration performance, we should get it checked in with this PR.

Member

@jcnelson jcnelson Oct 15, 2022

@obycode On cache miss, could you instead write the last-known nonce to the mempool DB so we don't hit the MARF more than once per candidate on a call to iterate_candidates()? Then, on a subsequent cache miss, you could first check the mempool DB for the nonce, and then check the MARF if it's not there. You'd probably want to clear all last_known_nonces at the beginning of iterate_candidates().
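The proposed two-tier lookup might look roughly like this, with in-memory maps standing in for the mempool DB table and the MARF read (all names here are illustrative, not the codebase's actual API):

```rust
use std::collections::HashMap;

/// Two-tier nonce lookup: in-memory cache first, then the mempool DB,
/// then the MARF. MARF results are written through to the DB so each
/// address costs at most one MARF read per iteration pass.
struct TieredNonces {
    cache: HashMap<String, u64>, // hot in-memory cache
    db: HashMap<String, u64>,    // stand-in for the mempool DB table
    marf_reads: usize,           // counts simulated MARF reads
}

impl TieredNonces {
    fn new() -> Self {
        Self { cache: HashMap::new(), db: HashMap::new(), marf_reads: 0 }
    }

    /// Drop only the in-memory tier (e.g. on size-cap pressure); the DB
    /// tier still spares us a repeat MARF read.
    fn clear_cache(&mut self) {
        self.cache.clear();
    }

    fn get<F: Fn(&str) -> u64>(&mut self, addr: &str, marf_read: F) -> u64 {
        if let Some(&n) = self.cache.get(addr) {
            return n;
        }
        if let Some(&n) = self.db.get(addr) {
            self.cache.insert(addr.to_string(), n);
            return n;
        }
        self.marf_reads += 1;
        let n = marf_read(addr);
        self.db.insert(addr.to_string(), n); // write-through to the DB tier
        self.cache.insert(addr.to_string(), n);
        n
    }
}
```

Per the suggestion above, the persisted tier would be cleared at the start of each call to iterate_candidates() so stale nonces never survive across passes.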

Contributor Author

On cache miss, could you instead write the last-known nonce to the mempool DB so we don't hit the MARF more than once per candidate on a call to iterate_candidates()?

Yes, we can do this. I'll benchmark again to see what kind of effect this has on performance in the normal case.

Contributor

LRU or no, I think you can bound transactions at around 256k using:

pub const MEMPOOL_MAX_TRANSACTION_AGE: u64 = 256;

And a calculation that around 1000 transactions fill a block.

So, the largest number of transactions you would "expect" to crawl through before hitting a full block would be 256 * 1000 = 256k.

Contributor Author

That accounts for transactions that have already been mined, but not pending transactions that can't yet be mined because of their nonces. There's no way to cap that number.

Contributor

I was assuming that, on average, 1/256 transactions would have the right nonce. I think that in practice the "proportion ready to mine" would actually go up and down a bit.

Contributor

Also, the height that's used for eviction is independent of chaining; MemPoolDB::tx_submit just goes by 256 blocks since the height the tx was submitted, as far as I understand.

either way, this is just a model.

Member

@jcnelson jcnelson left a comment

Thanks for this PR @obycode!

One thing that I think we'll want to see before merging is some test coverage that verifies the following:

  • All transactions are visited at least once on iteration, assuming no time-outs and no out-of-space events occur.

  • We need to see what happens when there are more mempool transactions than the caches have space for (maybe the cache sizes could be configurable?). In particular, the test should verify that the caches reduce the number of I/O operations predictably, given their size. The cache could track this information in some internal accounting state.

  • We'll want to know what a good default cache size is for when the chain is under load. I think this could be obtained with live-testing with the mock miner, but it would be ideal if there was a unit test that could show us how to deduce this (or possibly a way for mempool to figure out for us what a good size would be).

  • We'll need to verify in a unit test that the RAM usage does not increase beyond a configured constant. I don't think this is currently happening in the code -- I think you have at least one instance of unbounded memory usage. The mempool can have an unbounded number of unmineable transactions, so we'll want to make sure the cache doesn't accidentally eat up all the RAM in the process of iteration.

Member

@jcnelson jcnelson left a comment

So, one thing about this PR that still gives me pause is that once the caches are full, there's no eviction strategy. This means that it's possible for the miner to exhibit a pathological behavior where the first NonceCache::MAX_SIZE transactions are considered but are unmineable. Then, once we find the first mineable transaction, we'll always be encountering a cache miss on NonceCache::get(), which incurs a MARF read or (if you agree with my comment above) a database read.

As an easy-to-implement stop-gap to avoid thrashing, could you make the cache sizes configurable, and then plumb through that configuration from the node's config file? Then, at least miners could set higher MAX_SIZE values if they had enough RAM for it.
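Plumbing a configurable cache size through the node's configuration could look roughly like this sketch, where the `nonce_cache_size` key, its default value, and the minimal `key = value` parser are all assumptions for illustration, not the node's actual config format:

```rust
/// Sketch of a miner config section carrying a tunable cache size.
#[derive(Debug, Clone)]
struct MinerConfig {
    nonce_cache_size: usize,
}

impl Default for MinerConfig {
    fn default() -> Self {
        // Illustrative default; operators with more RAM could raise it
        // to avoid the thrashing scenario described above.
        MinerConfig { nonce_cache_size: 1024 * 1024 }
    }
}

impl MinerConfig {
    /// Parse a minimal `key = value` config fragment, keeping the
    /// default when the key is absent or malformed.
    fn from_str(s: &str) -> Self {
        let mut cfg = MinerConfig::default();
        for line in s.lines() {
            if let Some((key, value)) = line.split_once('=') {
                if key.trim() == "nonce_cache_size" {
                    if let Ok(n) = value.trim().parse() {
                        cfg.nonce_cache_size = n;
                    }
                }
            }
        }
        cfg
    }
}
```

The point is only that the limit becomes an operator decision rather than a hard-coded constant; the actual wiring would go through the node's existing config structs.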

@gregorycoppola
Contributor

I think this PR needs more tests.

I can take it over and write the tests that people want.

I opened a discussion #3345 that depicts the mempool walk as a pipeline of transformations as follows. These are some levels at which we can add tests, in whichever style.
[Image: dag-representation diagram]

Clarified the "best effort", based on Greg's feedback.
@jcnelson
Member

@obycode Should the nonces get bumped when the result is Skipped or Problematic?

I don't think so. If a transaction from origin address A at nonce N can't be mined, then neither can any transaction from A with nonce N+k.

@obycode
Contributor Author

obycode commented Oct 19, 2022

I don't think so. If a transaction from origin address A at nonce N can't be mined, then neither can any transaction from A with nonce N+k.

Thanks, that was my thought as well. What about ProcessingError? Does that indicate that the transaction was included in the block with an error or something else? The comment says "It may succeed later depending on the error" which makes me unsure without digging into the code.

@obycode
Contributor Author

obycode commented Oct 19, 2022

Looks like we should only bump the nonces on Success.
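The bump-only-on-success rule can be sketched as follows (the `TxEvent` enum here is a simplified stand-in for the node's actual `TransactionEvent` type):

```rust
/// Simplified outcome of attempting a transaction during the mempool walk.
enum TxEvent {
    Success,
    Skipped,
    Problematic,
    ProcessingError,
}

/// Only a successfully included transaction advances the expected nonce;
/// any other outcome leaves it unchanged, so higher-nonce transactions
/// from the same origin are not considered prematurely.
fn next_expected_nonce(current: u64, event: &TxEvent) -> u64 {
    match event {
        TxEvent::Success => current + 1,
        _ => current,
    }
}
```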

obycode and others added 2 commits October 19, 2022 10:27
On `TransactionEvent`s other than `Success`, the nonces should not be
bumped because they indicate that the transaction is not included in the
block.

#[cfg(test)]
{
assert!(self.cache.len() <= self.max_cache_size + 1);
Contributor

Does this really need to be +1? I might have just put that in when hacking.

Contributor Author

No, that should not need to be +1. I'll remove it in my next commit.

let deadline = get_epoch_time_ms() + (self.settings.max_miner_time_ms as u128);
let mut block_limit_hit = BlockLimitFunction::NO_LIMIT_HIT;

mem_pool.reset_last_known_nonces()?;
Member

We're not using any of the methods that manipulate the last_nonces columns, right? If so, then can you delete those as well, and add a comment to the schema that added these columns that they are no longer used?

Contributor Author

Sure. Just to be clear, you want to remove those columns from the DB in the latest schema in addition to deleting all of the related code?

Member

If it's not too much trouble -- i.e. if it can be done with a DROP COLUMN. Sqlite has a bunch of constraints on when you can and cannot do this, however, so don't worry about it if sqlite is preventing you.

Contributor Author

I'm seeing:

thread 'main' panicked at 'Failed to open mempool db: SqliteError(SqliteFailure(Error { code: Unknown, extended_code: 1 }, Some("near \"DROP\": syntax error")))', src/main.rs:739:14

But strangely, I can run the same query from the command line and it works.

ALTER TABLE mempool DROP COLUMN last_known_origin_nonce;

Contributor Author

Maybe sqlite allows it but rusqlite does not?

Member

@jcnelson jcnelson left a comment

Overall this LGTM! Thank you for seeing this through @obycode @gregorycoppola!

Do you have new benchmark numbers? Can you instantiate this on a mock miner and see how it does?

Contributor

@gregorycoppola gregorycoppola left a comment

Assuming the mock miner and benchmarks are as expected, I approve.

thanks everyone!

@obycode
Contributor Author

obycode commented Oct 19, 2022

Do you have new benchmark numbers?

I've run them informally, but will collect new numbers now.

Can you instantiate this on a mock miner and see how it does?

Yes, it is running now. Last block:

INFO [1666196520.482998] [src/chainstate/stacks/miner.rs:2413] [relayer] Miner: mined anchored block, block_hash: bdb4cae9809c7021c4bf1b6b2b7ce498543a2018e0724dea5bb24c4e9c14caf3, height: 80211, tx_count: 72, parent_stacks_block_hash: f7d4afe1744f540899486d1c8811a3e42f22aace754dab19b59664e29722444e, parent_stacks_microblock: 051d171d080744f083318cde80ab998774756632ee68da6d91cc86604374d3ca, parent_stacks_microblock_seq: 0, block_size: 21224, execution_consumed: {"runtime": 64230390, "write_len": 32388, "write_cnt": 626, "read_len": 12103538, "read_cnt": 4363}, %-full: 29, assembly_time_ms: 2175, tx_fees_microstacks: 1054503

@obycode
Copy link
Copy Markdown
Contributor Author

obycode commented Oct 19, 2022

Latest benchmarking numbers:

| Version | Commit | Num Tx | %-Full | Fees Collected (uSTX) | Time Spent (ms) |
| --- | --- | --- | --- | --- | --- |
| #3337 | 35b38cb | 1097 | 100% | 189810290 | 10736.7 |

@obycode
Contributor Author

obycode commented Oct 19, 2022

@jcnelson those failing unit tests depended on the old iterate_candidates incrementing the nonces on a Skipped result. I will update the tests, but I want to make sure that the new behavior is correct. It seems correct to me.

These tests depended on the old implementation's incrementing of nonces
when a transaction was skipped. The correct behavior is to only
increment the nonces when the transaction is successfully included.
Therefore these tests need to simulate a success event in order to
exercise the expected behavior.
@obycode
Contributor Author

obycode commented Oct 19, 2022

Tests updated in 1a65f0f.

@obycode
Contributor Author

obycode commented Oct 19, 2022

I am adding some unit tests to test the behavior with a skipped or problematic transaction.

The old implementation incremented nonces when there was an error,
problematic, or skipped transaction, which would cause it to incorrectly
consider later nonces from the same addresses. Three new unit tests are
added to check for these cases.
@obycode
Contributor Author

obycode commented Oct 19, 2022

New unit tests added in 36690d0. master fails these unit tests, which I believe would have caused poor behavior: repeatedly selecting bad transactions as candidates.


#[test]
/// This test verifies that when a transaction is skipped, other transactions
/// from the same address with higher nonces are not included in a block.
Member

Specifically, we want it to be the case that these higher-nonce transactions aren't even considered.

Member

No action needed on this, btw. I'm happy to take it once I merge this to #3335

Contributor Author

That is what the test is checking, but you're right, the comment should be more clear. Thanks for handling that!

@blockstack-devops
Contributor

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@stacks-network stacks-network locked as resolved and limited conversation to collaborators Nov 16, 2024
@wileyj wileyj deleted the feat/mempool-improvements branch March 11, 2025 21:30

Labels

2.05.0.5.0 · L1 Working Group (Issue or PR related to improving L1) · locked


4 participants