diff --git a/docs/.vuepress/components/IRCWRAPPER.vue b/docs/.vuepress/components/IRCWRAPPER.vue
new file mode 100644
index 0000000000..ab7ccd76e7
--- /dev/null
+++ b/docs/.vuepress/components/IRCWRAPPER.vue
@@ -0,0 +1,125 @@
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/.vuepress/components/ReviewClubHeader.vue b/docs/.vuepress/components/ReviewClubHeader.vue
new file mode 100644
index 0000000000..4b2748a5a7
--- /dev/null
+++ b/docs/.vuepress/components/ReviewClubHeader.vue
@@ -0,0 +1,58 @@
+
+ {{ date| formatDate }} The PR branch HEAD was {{ commit }} at the time of this review club meeting.
+12:59 < afig> sup yall +13:00 < jb55> yo +13:00 < jnewbery> hi +13:00 < pinheadmz> hi +13:00 < Lightlike> hello +13:00 < merehap> hi +13:00 < amiti> hi +13:00 < b10c> hi +13:01 < jonatack> Hi +13:01 < jnewbery> Hi folks! Thanks for joining PR review club today. +13:01 < jnewbery> Before we start talking about #15834, I wanted to poll for feedback on these meetings. Is there any way we could make them more useful? +13:02 < jnewbery> I'm also taking suggestions for PRs to cover in future weeks. We don't have anything on the agenda yet. +13:02 < JulioBarros> I think the chat interface is interesting but wonder if an audio or video format would be helpful. +13:03 < peevsie> hi +13:03 < Lightlike> just a small thing, could you update the old gist page (with the old agenda) to link to the new page, for the people who have bookmarked that. +13:03 < aj> heyo +13:04 < jnewbery> The reason I chose IRC is that it makes it easier for people who don't have somewhere to take a video/audio call to join, and for people who want to just dip in and out +13:04 < merehap> I definitely prefer IRC over audio/video. Easier to reference the code for a few minutes without falling too far behind in the conversation. +13:04 < jnewbery> also means we have a text log for people who can't join in real time +13:05 < jnewbery> ok, I don't want to take too long on this. If you think of anything, please raise it here or just message me directly +13:05 < jnewbery> OK, onto #15834 - Fix NOTFOUND bug and expire getdata requests for transactions. As usual, I'll give a quick summary before we open up for questions. +13:05 < jb55> my only suggestion is that it might be neat to have a site where people can propose reviews and have people signup to them, with a calendar and everything +13:05 < jb55> perhaps when this gets more popular... +13:06 < jnewbery> jb55: until then, feel free to propose reviews in this channel +13:06 < jnewbery> This PR is for changes to the "Net Processing" layer. That layer sits between "Net" (which manages connections to peers and is responsible for reading/writing bytes off the wire) and "Validation" (which is responsible for our node's view of the state of the blockchain and mempool). +13:06 < jnewbery> "Net Processing" receives application level messages from peers, which can either be data messages (eg tx and block inventory) or control messages (eg gossiping addresses of peers or establishing new connections). The developer reference here: https://btcinformation.org/en/developer-reference#p2p-network is probably the best place for documentation on p2p messages. +13:07 < jnewbery> Data messages are parsed and then passed up to validation for consideration for us to update mempool/chainstate. +13:07 < jnewbery> This change is in the handling of the INV/GETDATA/TX message exchange, and is an attempt to fix a bug introduced in #14897 where our node could potentially stop asking peers to send us new transactions. +13:08 < jnewbery> As it says in the PR description, this is a smaller alternative to https://github.com/bitcoin/bitcoin/pull/15776 , who's author is in this chat +13:08 < jnewbery> aj was actually the first person to report the bug a couple of months ago +13:09 < jnewbery> as a result of that, we reverted #14897 in the 0.18 release, but kept it in master with the aim of fixing the bug while still keeping the good things about 14897 +13:09 < jnewbery> I'll shut up now. Any questions? +13:09 < iiiiiiiiiii> >a bug introduced in #14897 https://github.com/bitcoin/bitcoin/pull/14897 +13:09 < jb55> how would you find a bug like this? is there tracing diagnostics you can turn on to see these messages? +13:10 < jnewbery> aj: do you want to comment on how you found the bug? +13:10 < aj> so what happened is i ran "getmempoolinfo" and noticed that it was empty -- figured something weird was going on and mentioned it on irc +13:10 < aj> at which point people said "no, nothing weird is going on, your node is just having problems" or something similar +13:11 < aj> so i tried working out what was going on at that point, because my mempool had just like 12 tx's and wasn't getting any more +13:11 < aj> eventually figured out how to turn net message debugging on to the live system, bitcoin-cli logging '["net"]' or similar +13:12 < jnewbery> It's a bit scary that this was only discovered by chance. I think there's more testing we could be doing to increase our chances of catching things like this earlier on. +13:12 < aj> that then showed that i was getting lots of INV messages from peers, telling me that new tx's were around, but i was never actually sending the GETDATA back to get the details of those transactions +13:12 < jnewbery> that's a good tip. We have different categories of logging that can be enabled and disabled at run time. Use the `logging` RPC. +13:12 < aj> i then tried staring at code to figure out what was wrong, and made guesses. also tried attaching gdb to the running process, but that caused it to crash, and when i restarted everything was fine +13:13 < aj> talking about it on irc, people were wondering if i was connected to hostile peers that were deliberately attacking me; so i checked against gmaxwell's banlist, but didn't get any matches, and noted down the peers i was connected to +13:13 < aj> (i run my node behind NAT and not on tor, so i've only got 8 outgoing peers, no incoming peers) +13:14 < aj> once it restarted and was working fine, i manually reconnected to the peers that didn't work before, to check they did work now, and they did so i took that to mean it wasn't a case of hostile peers, but a bug on my end +13:14 < jonatack> aj, did the custom logging you mentioned in your PR shine light too? https://github.com/bitcoin/bitcoin/pull/15776#issuecomment-483916240 +13:14 < Lightlike> In 15776, there is a rather special situation, with the parent of the tx already expired from the relay set, leading to NOTFOUND msg being sent. But wouldn't you need to have 100 different children tx, each with expired parents to reach MAX_PEER_IN_FLIGHT, multiplied by the number of peers (8) to be cut off completely. Is that not extremely unlikely? +13:14 < aj> in the log of #15776 i think you can see some of my guesses and corrections to that +13:15 < aj> Lightlike: it's cumulative, so you use up 1 slot with one tx, then it happens against and you use up another slot +13:16 < aj> Lightlike: and it's not that the parent expires from the relay set, it's that you receive a child you don't have a parent for (because you've just restarted your node after being down for a while, eg), and then you ask the peer for the parent even though they never told you they had the parent, so it's not in the collection of tx's they'll relay to you even though it's in the mempool +13:16 < merehap> Beyond more testing, is there something like using better concurrency primitives or better code structuring to reduce the likelihood of concurrency problems like these? I feel like I've seen several different concurrency bugs file against core. Since concurrency bugs are so difficult to discover and debug, it seems scary if there isn't a more systematic approach to preventing them. +13:17 < merehap> filed* against +13:17 < aj> Lightlike: (also, if the parent isn't in the mempool, but is in your utxo set, but the first two outputs from the transaction have already been spent, then you'll redundantly ask for the tx and get a notfound back) +13:18 < jnewbery> Good context to know is that if a peer asks you for a tx with GETDATA, you'll only send that tx if it's in a data structure called MapRelay. You won't fish the tx out of the mempool if it's not in MapRelay +13:18 < aj> jnewbery: (except you will if you're working with an SPV peer doing bloom filters) +13:19 < jnewbery> ah, yes. Thanks! +13:19 < aj> jnewbery: (there's always an exception. except when there isn't...) +13:19 < jnewbery> merehap: I wouldn't really categorise this as a concurrency bug. This is more like a resource leak (where the resource is txs in flight from a peer) +13:20 < MarcoFalke> Yeah, it is a logic bug. The code was written under the wrong assumptions (missing some "edge" cases) +13:21 < pinheadmz> so the story is: INV(tx)--> <--GETDATA(tx) (tx is dropped from relayset) NOTFOUND(tx)--> ? +13:21 < jnewbery> pinheadmz: yes +13:21 < pinheadmz> all three, the same tx - not GETDATA(parent of tx) +13:21 < jnewbery> (although of course we can't rely on the peer sending a NOTFOUND) +13:22 < aj> INV(tx) GETDATA(tx) TX(txdata) is normal +13:22 < pinheadmz> so is the tx not really in the maprealy from thebeginning? or is it actually dropped in the time between INV and GETDATA ? +13:23 < aj> INV(tx) GETDATA(tx) TX(txdata) GETDATA(tx.parent) NOTFOUND is common though, because our side thinks "if they have tx, they must have the parent" and their side, running the same software, thinks "i didn't INV tx.parent, so i won't send it" +13:23 < jnewbery> INV(tx_child) ---> +13:23 < jnewbery> <--- GETDATA(tx_child) +13:23 < jnewbery> TX(tx_child) ---> +13:23 < jnewbery> I don't have this child's parents! +13:23 < jnewbery> INV(tx_parent) ---> +13:23 < jnewbery> <--- GETDATA(tx_parent) +13:23 < jnewbery> TX(tx_parent) ---> +13:23 < aj> jnewbery: the special logging mostly proved the guesses i was making was wrong, fwiw +13:24 < jb55> I found reading the struct TxDownloadState comment helpful for anyone new to this part of the codebase +13:24 < jnewbery> oh sorry, that second INV shouldn't be there +13:24 < jnewbery> correction: +13:24 < jnewbery> INV(tx_child) ---> +13:24 < jnewbery> <--- GETDATA(tx_child) +13:24 < jnewbery> TX(tx_child) ---> +13:24 < jnewbery> I don't have this child's parents! +13:24 < jnewbery> <--- GETDATA(tx_parent) +13:24 < jnewbery> TX(tx_parent) ---> +13:25 < jnewbery> that last TX should actually be a NOTFOUND +13:25 < pinheadmz> thanks! visual aids are so helpful :-) +13:25 < pinheadmz> and why wouldnt the first node have the parent? bc it jsut expired? time-wise? +13:25 < jnewbery> expired from the MapRelay but still in the mempool, or cofirmed in a block +13:26 < Lightlike> but if it was confirmed in a block, wouldn't we know about this and not request it? +13:27 < aj> so one thing i wonder is about sdaftuar's question, "switch the uses of GetTimeMicros() in the logic to GetTime() so that we can mock it"; it doesn't seem to make a lot of sense to measure timeouts in the seconds-minutes range with microseconds; but using non-mocktime for "peer has been idle" in general (see #9606) certainly makes sense i think +13:27 < jnewbery> Lightlike: The peer might have it confirmed in a block, but we don't have the block yet? Not sure +13:28 < aj> Lightlike: there's a bug in the handling of missing parents. when actually testing the transaction the logic is done right (in AcceptToMempool), but when that fails, it doesn't say what tx's exactly are missing +13:29 < b10c> jnewbery: and the "expired from the MapRelay but still in the mempool" was introduced in #14897? +13:29 < aj> Lightlike: so when the net_processing code works that out to try to ask for the parents (which generally won't work anyway) it tests to see if the parent is in the mempool, or txid:0 or txid:1 are in the UTXO set +13:29 < aj> Lightlike: but if the output we're looking for is txid:3 and txid:0 and txid:1 are spent, we'll get a false positive +13:30 < jnewbery> b10c: no, I believe that's been normal behaviour since earlier than that +13:30 < aj> i think the tx's in maprelay are kept around so even if the tx expires from the mempool (by being mined, or because the mempool is full) it's still able to be relayed for the 15m's since being INV announced +13:31 < jnewbery> the problem introduced by 14897 is adding txs to the node's m_tx_in_flight and then there being edge-cases where they're never removed +13:32 < b10c> jnewbery: ah I see +13:33 < aj> so previously, it would've just been "we sent a GETDATA and got a NOTFOUND? not big deal, moving on". but now it means a data structure doesn't get cleared, and eventually a limit gets hit, and behaviour changes +13:34 < jb55> p2p code is scary +13:34 < jnewbery> jb55: yes! +13:34 < jb55> so much state +13:34 < jonatack> can you expand on the missing testing and mocktime? +13:35 < jnewbery> we talk a lot about consensus code being subtle and difficult, but problems in p2p code can be just as dangerous +13:35 * jb55 considering spending more time looking at p2p PRs +13:36 < jnewbery> luckily, I think there are more parallels between p2p code and other software domains (eg networking, telecoms), so for people with experience in those areas then p2p might be somewhere you want to look at more. +13:37 < jonatack> * jb55: there are todos in https://github.com/bitcoin/bitcoin/pull/9606/files +13:37 < jnewbery> jonatack: there are various calls for getting the current time used by bitcoind. GetTime() can be overridden by setmocktime but GetTimeMicros() always uses your system time +13:38 < jnewbery> I *think* that's just historic. I can't think of any good reason for it +13:38 < jnewbery> but it does mean that code that uses GetTimeMicros() can't be tested using mocktime +13:39 < aj> jnewbery: i think if you used mocktime for everything you might get a bug where "setup test, connect peers. mocktime += 1 hour. oh all the peers have been non-responsive for an hour, mass disconnect. tests fail" +13:39 < aj> jnewbery: GetTimeMicros() is used in bunches of places for "here's how long this piece of code took to run" logs, which is pretty reasonable too +13:39 < jonatack> thanks! wdyt about sdaftuar's last comment/question +13:40 < jonatack> could this be tested in compacted mocktime +13:41 < Lightlike> A question concerning the fix in #15834: Now we clear m_tx_inflight_it after our peer first sends an INV(tx), but is totally unresponsive after GETDATA(tx) (not even NOTFOUND). At the same time we accept other transactions from this peer. is there a legit reason for our peer to behave like this besides censoring certain transactions (so that we don't disconnect instead)? +13:42 < aj> i think it'd make sense to switch it to mockable GetTime,interested in what other people think, or if anyone sees any problems with that +13:42 < jnewbery> on the face of it, it seems reasonable. I can't immediately think of any reason that it'd be a problem +13:44 < aj> Lightlike: legit reasons: could be buggy, or maybe there's bugs in our logic where we're taking much longer to send the GETDATA than we think and it's actually totally reasonable for it to have expired, could be others +13:44 < jnewbery> Lightlike: unreliable connection? +13:45 < aj> jnewbery: unreliable with the bounds TCP/tor provides? +13:45 < jnewbery> In general, I think we want to be careful about introducing behaviour that could turn a relatively small bug into something that causes a bunch of peers to disconnect +13:45 < aj> yeah, that^^^ +13:47 < Lightlike> yes, that makes sense :-) +13:49 < jnewbery> aj: yeah TCP/tor should give us reliable messages +13:49 < jnewbery> jonatack: I hope it can be tested with mocktime, or in some way that doesn't require very long-running tests. The problem with long-running tests is that no-one ever runs them +13:50 < jnewbery> (and we explicitly disable them from our Travis CI because they'd cause the run to time out) +13:50 < jonatack> jnewbery: this +13:51 < jnewbery> A few other random thoughts I had about this PR: I liked aj's update to the getpeerinfo RPC here: https://github.com/bitcoin/bitcoin/pull/15776/files#diff-618d3a4f4fc0dc851ec29218c445c148R109 +13:52 < jnewbery> Because I like monitoring in general +13:53 < jnewbery> I'd also love to see more work on stuff like https://forkmonitor.info/nodes/btc to provide live monitoring of the network +13:53 < jnewbery> I know that james o'beirne is planning to do some work on that, so if it's something that interests anyone here, please reach out to him +13:54 < jnewbery> Another thing that I'd like to see is more stress testing. Resource leaks like this are really difficult to catch in deterministic, short-running tests +13:55 < jnewbery> Any final questions? +13:55 < aj> https://github.com/ajtowns/bitcoin/commits/201904-reduce-notfounds -- is a branch i was playing with to reduce the underlying cause for getting notfounds btw. i haven't opened a PR for it yet, but it might be interesting to look at +13:55 < jnewbery> aj: cool, thanks! +13:56 < jnewbery> ok, seems like there aren't any more questions +13:56 < jnewbery> This was quite a step up from the previous two weeks in terms of complexity - I'd be interested in what people thought about reviewing. +13:57 < jnewbery> Net Processing is a tricky area. Probably second only to Validation in terms of subtlety and complexity. It'd be great if we had more eyes on it. +13:58 < jnewbery> ok, thanks everyone, and thanks especially to aj! +13:58 < jonatack> Good choice of PR -- I'm learning with these forays into new (for me) places in the codebase +13:58 < Lightlike> I found this very interesting (the p2p code was new to me), but also definitely harder to understand than the one last week. Thanks! +13:58 < merehap> jnewbery, aj: Thanks! +13:58 < jonatack> Thanks aj and jnewbery! +13:59 < aj> the p2p code was new to me too when trying to debug this :) +13:59 < jnewbery> let's wrap it up there. As always, feel free to ping me if you have any thoughts about these. +13:59 < pinheadmz> jnewbery: thank you! this is great, informative, fun +13:59 < pinheadmz> to the moon! ++