[bug]: bitcoind (chainbackend) health check needlessly failing #10641

@jogc

Pre-Submission Checklist

  • I have searched the existing issues and believe this is a new bug.
  • I am not asking a question about how to use lnd, but reporting a bug (otherwise open a discussion).

LND Version

v0.20.1-beta

LND Configuration

(Minimal config - nothing health check related)

Backend Version

Bitcoin Core

Backend Configuration

(Minimal config - nothing relevant)

OS/Distribution

Debian x86

Bug Details & Steps to Reproduce

ca3b36c changed the bitcoind health check from getblockchaininfo to uptime in order to avoid bitcoind's cs_main lock, which can cause very long response times (several minutes) even though bitcoind is in fact healthy.

df9f148 then tacked on a call to getpeerinfo in the same function to close #8487. This means the getpeerinfo call is now, in practice, part of the health check and covered by the same timeout logic as the uptime call. In other words, if getpeerinfo takes a very long time, the health check fails. It seems getpeerinfo does sometimes take a very long time. I've seen it happen more than five times in less than a month with bitcoind 30.2 running on a perhaps slightly underpowered but basically idle system, causing lnd to shut down due to [CRT] SRVR: Health check: chain backend failed after 3 calls. I started capturing packets when I first noticed this, so I have full packet captures for most of these occasions, and they confirm that getpeerinfo, not uptime, is what triggers the health check failure. I have in fact not seen a single instance of uptime taking a long time, even at times when various other RPC calls are still hung waiting for a reply.

My understanding of the bitcoind source and C++ is too limited to know why this happens, since cs_main does not appear to be used for getpeerinfo (there is no LOCK(cs_main) call in getpeerinfo() in bitcoin/src/rpc/net.cpp). But it actually seems to be one of the worst calls when it comes to response times, even compared to ones like getblock, gettxout and getblockhash.

A second problem is that getpeerinfo seems to have automatic re-request logic (maybe because it does not use RawRequest like the uptime call does). As long as there is no reply, additional requests with the same jsonrpc request ID keep stacking up, one every minute, giving bitcoind even more work to handle for no real gain.

My suggestion is to change back to just using uptime for the health check and do the getpeerinfo check separately.

As for what's causing the long response times in my case: the blockchain is stored on an external USB HDD. Maybe it gets too slow for bitcoind at times.

Expected Behavior

lnd should not stop itself when bitcoind is in fact healthy.

Debug Information

No response

Environment

No response

Labels

backend (related to the node backend software/interface, e.g. btcd, bitcoin-core), bug (unintended code behaviour), chain handling
