Skip to content

itest: tests randomly failed with neutrino backend #4376

@yyforyongyu

Description

@yyforyongyu

Background

With neutrino as backend, when running integration tests, a single run will pass(make itest backend=neutrino icase=XXX), while running all at once(make itest backend=neutrino) will fail.

Your environment

Running tests on the master branch(40c9d62c6fc0) inside a docker.

Linux 40c9d62c6fc0 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 GNU/Linux

Steps to reproduce

The tests randomly failed, while outputting a similar error. For instance, testNodeAnnouncement, in which it creates a new node, Dave, and tries to open a channel between Bob and Dave. When running alone, make itest backend=neutrino icase=node_announcement, everything was fine. The problem came when running everything together(make itest backend=neutrino), which gave me the following,

--- FAIL: TestLightningNetworkDaemon/node_announcement (19.53s)
    lnd_test.go:109: Failed: (node announcement): exited with error:
        *errors.errorString unable to assert channel existence: channel e6a17994ccea3dfc11fb2cf0c63263fded62ac48c3b8e9685237ac6a8ae23661:0 inactive
        /go/src/github.com/lnd/lntest/itest/lnd_test.go:273 (0xdb9027)
            openChannelAndAssert: t.Fatalf("unable to assert channel existence: %v", err)
        /go/src/github.com/lnd/lntest/itest/lnd_test.go:9757 (0xdec811)
            testNodeAnnouncement: chanPoint := openChannelAndAssert(
        /go/src/github.com/lnd/lntest/itest/lnd_test.go:132 (0xdb834a)
            (*harnessTest).RunTestCase: testCase.test(h.lndHarness, h)
        /go/src/github.com/lnd/lntest/itest/lnd_test.go:14802 (0xe2593c)
            TestLightningNetworkDaemon.func4: ht.RunTestCase(testCase)
        /usr/local/go/src/testing/testing.go:909 (0x508cf9)
            tRunner: fn(t)
        /usr/local/go/src/runtime/asm_amd64.s:1357 (0x461091)
            goexit: BYTE    $0x90   // NOP

The channel became inactive, and here's Bob's log, it was unable to read peer message at 05:21:34.599, it looks like Dave is down.

STARTING ============ node announcement ============
...
2020-06-11 05:21:34.599 [INF] PEER: unable to read message from 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231@127.0.0.1:19841: EOF
2020-06-11 05:21:34.600 [INF] PEER: disconnecting 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231@127.0.0.1:19841, reason: read handler closed
2020-06-11 05:21:34.701 [DBG] SRVR: Peer 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231@127.0.0.1:19841 has been disconnected
2020-06-11 05:21:34.707 [DBG] FNDG: Cancelling all reservations for peer 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231
2020-06-11 05:21:34.707 [DBG] FNDG: No active reservations for node: 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231
2020-06-11 05:21:34.708 [INF] DISC: Removing GossipSyncer for peer=0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231
2020-06-11 05:21:34.708 [DBG] SRVR: Notifying that peer 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231@127.0.0.1:19841 is offline
2020-06-11 05:21:34.709 [DBG] SRVR: removing peer 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231@127.0.0.1:19841
2020-06-11 05:21:34.709 [DBG] PRNF: PeerNotifier notifying peer: 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231 offline
2020-06-11 05:21:34.709 [DBG] DISC: Requesting online notification for peer=0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231
2020-06-11 05:21:34.710 [DBG] SRVR: Scheduling connection re-establishment to persistent peer 0307dd0d32767d3c3dfb177846a47a3a62297be19e2f2f5084297696758991e231@127.0.0.1:19841 in 1s
2020-06-11 05:21:34.710 [DBG] CHFT: Channel e6a17994ccea3dfc11fb2cf0c63263fded62ac48c3b8e9685237ac6a8ae23661:0 recording event: peer_offline
2020-06-11 05:21:34.839 [INF] LTND: Received shutdown request.
2020-06-11 05:21:34.840 [INF] LTND: Shutting down...
...

And here's Dave's log, which says,
@05:21:15.565: connected to Bob
@05:21:15.571: start syncing the chain
@05:21:18.647: attempting to open the channel with Bob
@05:21:34.555: got kicked out

...
2020-06-11 05:21:15.565 [DBG] BTCN: Peer connected, starting cfHandler.
...
2020-06-11 05:21:15.571 [DBG] BTCN: Received headers (num 1246) from 127.0.0.1:19580 (outbound)
...
2020-06-11 05:21:18.645 [INF] UTXN: Attempting to graduate height=483: num_kids=0, num_babies=0
2020-06-11 05:21:18.646 [DBG] PEER: Received MsgOpenChannel(temp_chan_id=f69ab5ea3b5d9e0aa805b09c91da2e99f3a8f2ed6bb4bd398152fd9dd2719428, chain=0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206, csv=4, amt=0.01 BTC, push_amt=0 mSAT, reserve=0.0001 BTC, flags=1) from 0376fc568f48368203f4794d448d3aa228a0978b053668adec60fa750e6dad604e@127.0.0.1:60024
2020-06-11 05:21:18.647 [INF] FNDG: Recv'd fundingRequest(amt=0.01 BTC, push=0 mSAT, delay=4, pendingId=f69ab5ea3b5d9e0aa805b09c91da2e99f3a8f2ed6bb4bd398152fd9dd2719428) from peer(0376fc568f48368203f4794d448d3aa228a0978b053668adec60fa750e6dad604e)
...
2020-06-11 05:21:34.541 [DBG] SWPR: New block: height=1052, sha=3c3f47bf34afe9d1fff9e11d8b49b28103dfe5033d353ba9395d3a23ab91e8c4
2020-06-11 05:21:34.541 [INF] UTXN: Attempting to graduate height=1052: num_kids=0, num_babies=0
2020-06-11 05:21:34.555 [INF] LTND: Received shutdown request.
2020-06-11 05:21:34.555 [INF] LTND: Shutting down...
2020-06-11 05:21:34.556 [INF] LTND: Gracefully shutting down.
...

Eventually, this boils down to this line, it timed out after 15 seconds, called t.Fatalf, which called the defer shutdownAndAssert(net, t, dave) that shut Dave down, and failed the test. It seems that the block syncing is slower when not using btcd as the backend.

Expected behavior

For a single test case, make itest backend=neutrino should behave the same as if running make itest backend=neutrino icase=XXX.

Diagnostics

Based on the logs, when Dave entered the game, it was started with the task of syncing 1246 blocks informed by node Bob. He never finished it though, as it was shut down by the timeout settings.

To solve this, either we give Dave more time to sync the blocks, or we decrease the number of blocks to be synced.

  1. Increase timeout
    I dramatically increased timeout values in timeouts_darwin.go, and timeout in AssertChannelExists. Also in the testing flag, go test timeout was set to -test.timeout=60m. It did pass, while giving me this output,
PASS
ok      github.com/lightningnetwork/lnd/lntest/itest    3778.268s
lntest/itest/log_check_errors.sh

<time> [ERR] FNDG: Unable to register for confirmation of ChannelPoint(<chan_point>): chain notifier shutting down
<time> [ERR] HSWC: ChannelLink(<chan>): failing link: unable to update commitment: link shutting down with error: internal error
<time> [ERR] NTFN: Unable to update rescan filter: rescan exited

In the itest logs, the log line (patterns) above were detected.
[ERR] lines are generally reserved for internal errors.
Resolve the issue by either changing the log level or adding an 
exception to log_error_whitelist.txt

make: *** [Makefile:146: itest-only] Error 1
  1. Reduce blocks
    According to the logs, most of the time was spent syncing the blockchain. If the number of blocks could be reduced, it could save some time.
    For every case, before any test can happen, 448 blocks need to be generated,

Alice and Bob were initialized during the setup phase. When running a new test case, the 448-block-syncing process is done only once. While it does save time for them, the downside is that the block height will be advanced by other tests. By the time Dave connects to Bob, it now needs to sync over 1000 blocks instead of 448. Furthermore, because test cases are intertwined, they cannot be run in parallel.

This is probably a bit off-topic but now I'm wondering how the integration tests could be speeded up?

Solutions

Increasing timeout values does help, temporarily. With more tests pumping in, it would take more time to complete the tasks. I suggest two alternatives on how to improve the situation.

Initialize Carol and Dave before testing

Just like what's been done to Alice and Bob, so cases not testing blockchain syncing would no longer wait for syncing blocks. The downside is that it increases the complexity of coding. Since all the tests are affecting the same node states, the tests need to strictly refactored into,

- catch entry states
- perform actions
- catch exit states
- check changes in states are as expected

where determining the state values will be a tedious job.

Initialize nodes per test case

That is, for every t.Run(), a new miner and chain backend are created. I'm not sure whether there are intentional dependencies among the test cases or not. If not, this creates the benefit of running tests in parallel. The downside is that the 448 overhead being placed on every node.

Metadata

Metadata

Labels

P2should be fixed if one has timeitestsIssues related to integration tests.neutrinoLightweight neutrino backend-typetest flake

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions