Background
The htlcswitch package uses sync.WaitGroup in a manner that allows parallel Add(1) and Wait() calls. Specifically:
- GetAttemptResult calls wg.Add(1) before launching a goroutine,
- Stop() calls wg.Wait().
Nothing orders these two calls relative to each other. This is explicitly prohibited by the WaitGroup documentation, which states:
"Note that calls with a positive delta that occur when the counter is zero must happen before a Wait."
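The pattern described above can be sketched in isolation. This is a minimal illustration, not the lnd code: miniSwitch, getAttemptResult and stop are hypothetical stand-ins that model only the WaitGroup handling.

```go
package main

import (
	"fmt"
	"sync"
)

// miniSwitch is a hypothetical stand-in for the Switch. Only the
// WaitGroup handling is modeled here.
type miniSwitch struct {
	wg   sync.WaitGroup
	quit chan struct{}
}

// getAttemptResult mirrors the reported pattern: Add(1) is called with no
// synchronization against a concurrent stop().
func (s *miniSwitch) getAttemptResult() <-chan string {
	resultChan := make(chan string, 1)

	s.wg.Add(1) // May run while another goroutine is inside wg.Wait().
	go func() {
		defer s.wg.Done()

		select {
		case <-s.quit:
			resultChan <- "shutting down"
		default:
			resultChan <- "result"
		}
	}()

	return resultChan
}

// stop signals shutdown and waits for outstanding goroutines.
func (s *miniSwitch) stop() {
	close(s.quit)
	s.wg.Wait()
}

func main() {
	s := &miniSwitch{quit: make(chan struct{})}

	// Sequential use is fine: Add happens before Wait.
	fmt.Println(<-s.getAttemptResult())
	s.stop()

	// Calling getAttemptResult from one goroutine while stop runs in
	// another is the unsynchronized Add/Wait overlap reported here.
}
```

The example deliberately calls the two methods sequentially. The problem only appears when they run in different goroutines with nothing ordering Add before Wait.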
Your environment
- version of lnd: master (fbeab72)
- which operating system (uname -a on *Nix): Linux host 6.1.75-1.qubes.fc32.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 5 06:49:25 CET 2024 x86_64 GNU/Linux
- version of btcd, bitcoind, or other backend: not used
- any other relevant environment details
Steps to reproduce
I reproduced this in the branch reproduce-race in my fork. I added a test, TestSwitchGetAttemptResultStress, which runs many GetAttemptResult() calls sequentially in one goroutine and a Stop() call in parallel in another.
To run it, check out the branch and run the following commands:
$ cd htlcswitch
$ go test -race -run TestSwitchGetAttemptResultStress
Expected behaviour
I expect TestSwitchGetAttemptResultStress not to crash, i.e. that there is no race condition between the GetAttemptResult and Stop methods.
Actual behaviour
The stress test failed with a data race reported by the race detector:
htlcswitch$ go test -race -run TestSwitchGetAttemptResultStress
==================
WARNING: DATA RACE
Read at 0x00c0001d4118 by goroutine 21:
runtime.raceread()
<autogenerated>:1 +0x1e
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).GetAttemptResult()
/home/user/lnd/htlcswitch/switch.go:496 +0x1c4
github.com/lightningnetwork/lnd/htlcswitch.TestSwitchGetAttemptResultStress.func1()
/home/user/lnd/htlcswitch/switch_test.go:3211 +0x168
Previous write at 0x00c0001d4118 by goroutine 22:
runtime.racewrite()
<autogenerated>:1 +0x1e
github.com/lightningnetwork/lnd/htlcswitch.(*Switch).Stop()
/home/user/lnd/htlcswitch/switch.go:1995 +0x1e9
github.com/lightningnetwork/lnd/htlcswitch.TestSwitchGetAttemptResultStress.func2()
/home/user/lnd/htlcswitch/switch_test.go:3232 +0xae
Goroutine 21 (running) created at:
github.com/lightningnetwork/lnd/htlcswitch.TestSwitchGetAttemptResultStress()
/home/user/lnd/htlcswitch/switch_test.go:3203 +0x356
testing.tRunner()
/home/user/.goroot/src/testing/testing.go:1690 +0x226
testing.(*T).Run.gowrap1()
/home/user/.goroot/src/testing/testing.go:1743 +0x44
Goroutine 22 (finished) created at:
github.com/lightningnetwork/lnd/htlcswitch.TestSwitchGetAttemptResultStress()
/home/user/lnd/htlcswitch/switch_test.go:3222 +0x45c
testing.tRunner()
/home/user/.goroot/src/testing/testing.go:1690 +0x226
testing.(*T).Run.gowrap1()
/home/user/.goroot/src/testing/testing.go:1743 +0x44
==================
--- FAIL: TestSwitchGetAttemptResultStress (0.08s)
testing.go:1399: race detected during execution of test
FAIL
exit status 1
FAIL github.com/lightningnetwork/lnd/htlcswitch 0.380s