Pre-Submission Checklist
LND Version
v0.20.1-beta
LND Configuration
(Minimal config - nothing health check related)
Backend Version
Bitcoin Core
Backend Configuration
(Minimal config - nothing relevant)
OS/Distribution
Debian x86
Bug Details & Steps to Reproduce
ca3b36c switched the bitcoind health check from getblockchaininfo to uptime to avoid bitcoind's cs_main lock, which can cause very long response times (several minutes) even though bitcoind is in fact healthy.
df9f148 then tacked on a call to getpeerinfo in the same function to close #8487. This means the getpeerinfo call is now in practice part of the health check and covered by the same timeout logic as the uptime call. In other words, if getpeerinfo takes a very long time, the health check fails.
It seems getpeerinfo does sometimes take a very long time. I've seen it happen more than five times in less than a month with bitcoind 30.2 running on a perhaps slightly underpowered but basically idle system, causing lnd to shut down due to [CRT] SRVR: Health check: chain backend failed after 3 calls. I started capturing packets when I first noticed this, so I have full packet captures for most of these occasions, and they confirm that getpeerinfo, not uptime, is what triggers the health check failure. I have in fact not seen a single instance of uptime taking a long time, even at times when various other RPC calls are still hung waiting for a reply.
My understanding of the bitcoind source and C++ is too limited to know why this happens, given that cs_main does not seem to be used for getpeerinfo (there is no LOCK(cs_main) call in getpeerinfo() in bitcoin/src/rpc/net.cpp). Still, it seems to be one of the worst calls when it comes to response times, even compared to calls like getblock, gettxout and getblockhash.
A second problem is that getpeerinfo seems to have automatic re-request logic (maybe because it does not use RawRequest like the uptime call does), so as long as there is no reply, additional requests with the same JSON-RPC request ID keep stacking up, one every minute, giving bitcoind even more to handle for no real gain.
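One possible way to keep requests from stacking up is to refuse to send a new one while the previous one is still unanswered. The following is only a rough sketch of that idea and is not taken from lnd or its RPC client; inFlightGuard, errInFlight and do are hypothetical names, and the RPC call itself is replaced by a plain function.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errInFlight is a hypothetical error returned when a call is skipped.
var errInFlight = errors.New("previous request still pending")

// inFlightGuard (hypothetical) allows at most one outstanding call.
type inFlightGuard struct {
	busy atomic.Bool
}

// do runs call unless an earlier call has not returned yet, in which case it
// returns errInFlight without issuing another request.
func (g *inFlightGuard) do(call func() error) error {
	if !g.busy.CompareAndSwap(false, true) {
		return errInFlight
	}
	defer g.busy.Store(false)
	return call()
}

func main() {
	var g inFlightGuard
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan error)

	// Simulate a getpeerinfo request that hangs until release is closed.
	go func() {
		done <- g.do(func() error {
			close(started)
			<-release
			return nil
		})
	}()
	<-started

	// A retry issued while the first request hangs is skipped.
	fmt.Println("retry while pending:", g.do(func() error { return nil }))

	// Once the first request returns, new requests are allowed again.
	close(release)
	<-done
	fmt.Println("retry after reply:", g.do(func() error { return nil }))
}
```

Whether skipping is preferable to cancelling the old request and sending a fresh one depends on how the client handles timeouts, which I have not verified.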
My suggestion is to go back to using only uptime for the health check and to do the getpeerinfo check separately.
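To illustrate the separation, here is a minimal sketch, not lnd's actual code. The chainBackend interface, its method names, fakeBackend and the two check functions are all hypothetical; the fake backend simply sleeps to simulate a fast uptime and a slow getpeerinfo. Only the uptime call decides whether the backend counts as healthy, while a slow getpeerinfo produces a warning at most.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// chainBackend is a hypothetical stand-in for the bitcoind RPC client.
type chainBackend interface {
	Uptime(ctx context.Context) error
	PeerInfo(ctx context.Context) error
}

// livenessCheck is the only call whose failure marks the backend unhealthy.
func livenessCheck(b chainBackend, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return b.Uptime(ctx)
}

// peerCheck runs on its own; a slow or failed getpeerinfo is reported as a
// warning string (empty when fine) and never fails the health check.
func peerCheck(b chainBackend, timeout time.Duration) string {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err := b.PeerInfo(ctx)
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "getpeerinfo timed out"
	case err != nil:
		return "getpeerinfo failed: " + err.Error()
	}
	return ""
}

// fakeBackend simulates RPC latency so the sketch runs without bitcoind.
type fakeBackend struct {
	uptimeDelay   time.Duration
	peerInfoDelay time.Duration
}

// wait blocks for d or until ctx is done, whichever comes first.
func wait(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (f fakeBackend) Uptime(ctx context.Context) error   { return wait(ctx, f.uptimeDelay) }
func (f fakeBackend) PeerInfo(ctx context.Context) error { return wait(ctx, f.peerInfoDelay) }

func main() {
	// uptime answers quickly, getpeerinfo hangs past the timeout.
	b := fakeBackend{uptimeDelay: time.Millisecond, peerInfoDelay: 200 * time.Millisecond}
	timeout := 50 * time.Millisecond

	fmt.Println("liveness error:", livenessCheck(b, timeout))
	fmt.Println("peer warning:", peerCheck(b, timeout))
}
```

With this split, the situation seen in the packet captures (uptime fast, getpeerinfo hung) would log a warning instead of shutting lnd down.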
As for what's causing the long response times in my case: the blockchain is stored on an external USB HDD, which may at times be too slow for bitcoind.
Expected Behavior
lnd should not shut itself down when bitcoind is in fact healthy.
Debug Information
No response
Environment
No response