We are seeing some tests hanging randomly in CI on Windows.
I was able to catch a few hangs (~10-15) on my machine as well. Every time the process was hung, it was in the middle of loading MklImports, which was calling LoadLibraryA("libittnotify"), and on a different thread MklImports was also on the stack. I am unable to get debugging symbols for MKL, but the two stacks look like this:
Thread 1, calling into MKL from ML.NET:
0:024> kn
# Child-SP RetAddr Call Site
00 0000004f`828fb448 00007ffc`2af8c199 ntdll!NtWaitForSingleObject+0x14
01 0000004f`828fb450 00007ffc`2af80e3e ntdll!LdrpDrainWorkQueue+0x15d
02 0000004f`828fb490 00007ffc`2af84387 ntdll!LdrpLoadDllInternal+0xc2
03 0000004f`828fb510 00007ffc`2af8a724 ntdll!LdrpLoadDll+0x10b
04 0000004f`828fb6c0 00007ffc`27352a7b ntdll!LdrLoadDll+0xa4
05 0000004f`828fb7c0 00007ffc`273814f1 KERNELBASE!LoadLibraryExW+0x17b [minkernel\kernelbase\module.c @ 1207]
06 0000004f`828fb830 00007ffc`273814a9 KERNELBASE!LoadLibraryExA+0x31 [minkernel\kernelbase\module.c @ 1249]
*** ERROR: Symbol file could not be found. Defaulted to export symbols for MklImports.DLL -
07 0000004f`828fb870 00007ffb`ba4b8854 KERNELBASE!LoadLibraryA+0x39 [minkernel\kernelbase\module.c @ 878]
08 0000004f`828fb8a0 00007ffb`ba4b908b MklImports+0x8854
09 0000004f`828fb8e0 00007ffb`ba4c64f1 MklImports+0x908b
0a 0000004f`828fb9b0 00007ffb`ba4be343 MklImports!vslDeleteStream+0x4d91
0b 0000004f`828fbbc0 00007ffb`823903d3 MklImports!cblas_sgemm+0x213
0c 0000004f`828fbc60 00007ffb`8238f446 0x00007ffb`823903d3
0d 0000004f`828fbd80 00007ffb`8238d91a 0x00007ffb`8238f446
0e 0000004f`828fbf90 00007ffb`8238d330 0x00007ffb`8238d91a
0f 0000004f`828fc0a0 00007ffb`8238d0b3 0x00007ffb`8238d330
10 0000004f`828fc160 00007ffb`81e10837 0x00007ffb`8238d0b3
11 0000004f`828fc1e0 00007ffb`81e0f88e 0x00007ffb`81e10837
12 0000004f`828fc250 00007ffb`82389736 0x00007ffb`81e0f88e
13 0000004f`828fc520 00007ffb`e0063ba3 0x00007ffb`82389736
14 0000004f`828fc760 00007ffb`dff4f4fe coreclr!CallDescrWorkerInternal+0x83 [E:\A\_work\164\s\src\vm\amd64\CallDescrWorkerAMD64.asm @ 101]
15 0000004f`828fc7a0 00007ffb`dff4f9fb coreclr!CallDescrWorkerReflectionWrapper+0x1a [e:\a\_work\164\s\src\vm\reflectioninvocation.cpp @ 740]
16 0000004f`828fc7f0 00007ffb`d94ee791 coreclr!RuntimeMethodHandle::InvokeMethod+0x4bb [e:\a\_work\164\s\src\vm\reflectioninvocation.cpp @ 1355]
17 0000004f`828fce30 00007ffb`816c173b System_Private_CoreLib!System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)$##60034D3+0xb1
18 0000004f`828fcea0 00007ffb`816c1000 0x00007ffb`816c173b
19 0000004f`828fcee0 00007ffb`816c012f 0x00007ffb`816c1000
1a 0000004f`828fcf70 00007ffb`816bfdd7 0x00007ffb`816c012f
1b 0000004f`828fcff0 00007ffb`816bf9e4 0x00007ffb`816bfdd7
1c 0000004f`828fd060 00007ffb`816bf4f9 0x00007ffb`816bf9e4
1d 0000004f`828fd0d0 00007ffb`816bf47d 0x00007ffb`816bf4f9
1e 0000004f`828fd140 00007ffb`816bec15 0x00007ffb`816bf47d
1f 0000004f`828fd1b0 00007ffb`816be679 0x00007ffb`816bec15
20 0000004f`828fd210 00007ffb`816be31d 0x00007ffb`816be679
21 0000004f`828fd280 00007ffb`816bde59 0x00007ffb`816be31d
22 0000004f`828fd2e0 00007ffb`816bd01f 0x00007ffb`816bde59
23 0000004f`828fd3d0 00007ffb`816bccfc 0x00007ffb`816bd01f
24 0000004f`828fd450 00007ffb`816ba722 0x00007ffb`816bccfc
25 0000004f`828fd4d0 00007ffb`816ba2af 0x00007ffb`816ba722
26 0000004f`828fd5d0 00007ffb`816ba1b7 0x00007ffb`816ba2af
27 0000004f`828fd650 00007ffb`816b9f18 0x00007ffb`816ba1b7
28 0000004f`828fd6d0 00007ffb`816b9da9 0x00007ffb`816b9f18
29 0000004f`828fd780 00007ffb`816b9d2d 0x00007ffb`816b9da9
2a 0000004f`828fd7f0 00007ffb`816b91f6 0x00007ffb`816b9d2d
2b 0000004f`828fd850 00007ffb`816b8ff9 0x00007ffb`816b91f6
2c 0000004f`828fd8e0 00007ffb`816b8f7d 0x00007ffb`816b8ff9
2d 0000004f`828fd950 00007ffb`816b8cde 0x00007ffb`816b8f7d
2e 0000004f`828fd9c0 00007ffb`816b89bf 0x00007ffb`816b8cde
2f 0000004f`828fda30 00007ffb`816b88a9 0x00007ffb`816b89bf
30 0000004f`828fdab0 00007ffb`816b83d7 0x00007ffb`816b88a9
31 0000004f`828fdb30 00007ffb`816b78ef 0x00007ffb`816b83d7
32 0000004f`828fdc20 00007ffb`816b77e7 0x00007ffb`816b78ef
33 0000004f`828fdca0 00007ffb`816b629f 0x00007ffb`816b77e7
34 0000004f`828fdd20 00007ffb`816b5eff 0x00007ffb`816b629f
35 0000004f`828fdde0 00007ffb`816b5df7 0x00007ffb`816b5eff
36 0000004f`828fde60 00007ffb`816b58b5 0x00007ffb`816b5df7
37 0000004f`828fdee0 00007ffb`816b5518 0x00007ffb`816b58b5
38 0000004f`828fdf60 00007ffb`816b4eef 0x00007ffb`816b5518
39 0000004f`828fdfd0 00007ffb`816b4de7 0x00007ffb`816b4eef
3a 0000004f`828fe050 00007ffb`816b4879 0x00007ffb`816b4de7
3b 0000004f`828fe0d0 00007ffb`816b459f 0x00007ffb`816b4879
3c 0000004f`828fe180 00007ffb`816b4497 0x00007ffb`816b459f
3d 0000004f`828fe200 00007ffb`816b2917 0x00007ffb`816b4497
3e 0000004f`828fe270 00007ffb`816b211f 0x00007ffb`816b2917
3f 0000004f`828fe300 00007ffb`816b2017 0x00007ffb`816b211f
40 0000004f`828fe380 00007ffb`816b0b29 0x00007ffb`816b2017
41 0000004f`828fe410 00007ffb`816b039f 0x00007ffb`816b0b29
42 0000004f`828fe4c0 00007ffb`816b0297 0x00007ffb`816b039f
43 0000004f`828fe540 00007ffb`816afa66 0x00007ffb`816b0297
44 0000004f`828fe5c0 00007ffb`816af6ef 0x00007ffb`816afa66
45 0000004f`828fe650 00007ffb`816af5e7 0x00007ffb`816af6ef
46 0000004f`828fe6d0 00007ffb`816ade30 0x00007ffb`816af5e7
47 0000004f`828fe750 00007ffb`816ad61f 0x00007ffb`816ade30
48 0000004f`828fe800 00007ffb`816ad427 0x00007ffb`816ad61f
49 0000004f`828fe880 00007ffb`816acb85 0x00007ffb`816ad427
4a 0000004f`828fe900 00007ffb`d9549e3c 0x00007ffb`816acb85
4b 0000004f`828fe930 00007ffb`d945cda9 System_Private_CoreLib!System.Threading.Tasks.Task`1[System.__Canon].InnerInvoke()$##600286D+0x4c
4c 0000004f`828fe980 00007ffb`d950f8e7 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
4d 0000004f`828fea00 00007ffb`d950f674 System_Private_CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)$##600294B+0x197
4e 0000004f`828feaa0 00007ffb`d957c4cb System_Private_CoreLib!System.Threading.Tasks.Task.ExecuteEntry()$##6002948+0x84
4f 0000004f`828feae0 00007ffb`816aca97 System_Private_CoreLib!System.Threading.Tasks.SynchronizationContextTaskScheduler+<>c.<.cctor>b__8_0(System.Object)$##6002AB3+0x2b
50 0000004f`828feb10 00007ffb`d945cda9 0x00007ffb`816aca97
51 0000004f`828feb60 00007ffb`816aabe7 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
52 0000004f`828febe0 00007ffb`8169bc34 0x00007ffb`816aabe7
53 0000004f`828fec30 00007ffb`d945cda9 0x00007ffb`8169bc34
54 0000004f`828fec80 00007ffb`d950f8e7 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
55 0000004f`828fed00 00007ffb`d945cda9 System_Private_CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)$##600294B+0x197
56 0000004f`828feda0 00007ffb`e0063ba3 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
57 0000004f`828fee20 00007ffb`dff29839 coreclr!CallDescrWorkerInternal+0x83 [E:\A\_work\164\s\src\vm\amd64\CallDescrWorkerAMD64.asm @ 101]
58 (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x53 [e:\a\_work\164\s\src\vm\callhelpers.cpp @ 78]
59 0000004f`828fee60 00007ffb`e004fe03 coreclr!MethodDescCallSite::CallTargetWorker+0x2b5 [e:\a\_work\164\s\src\vm\callhelpers.cpp @ 628]
5a (Inline Function) --------`-------- coreclr!StressLog::InlinedStressLogOn+0x5 [e:\a\_work\164\s\src\utilcode\stresslog.cpp @ 594]
5b (Inline Function) --------`-------- coreclr!StressLog::LogOn+0x5 [e:\a\_work\164\s\src\utilcode\stresslog.cpp @ 628]
5c 0000004f`828fefb0 00007ffb`dff293d1 coreclr!ThreadNative::KickOffThread_Worker+0x143 [e:\a\_work\164\s\src\vm\comsynchronizable.cpp @ 260]
5d (Inline Function) --------`-------- coreclr!ManagedThreadBase_DispatchInner+0xf0 [e:\a\_work\164\s\src\vm\threads.cpp @ 8850]
5e 0000004f`828ff120 00007ffb`dff291c3 coreclr!ManagedThreadBase_DispatchMiddle+0x179 [e:\a\_work\164\s\src\vm\threads.cpp @ 8901]
5f 0000004f`828ff250 00007ffb`e005cbb7 coreclr!ManagedThreadBase_DispatchOuter+0xaf [e:\a\_work\164\s\src\vm\threads.cpp @ 9140]
60 0000004f`828ff2f0 00007ffb`dff8c64a coreclr!ManagedThreadBase_FullTransitionWithAD+0x2f [e:\a\_work\164\s\src\vm\threads.cpp @ 9200]
61 (Inline Function) --------`-------- coreclr!ManagedThreadBase::KickOff+0x20 [e:\a\_work\164\s\src\vm\threads.cpp @ 9234]
62 0000004f`828ff350 00007ffb`dff8c516 coreclr!ThreadNative::KickOffThread+0x10a [e:\a\_work\164\s\src\vm\comsynchronizable.cpp @ 380]
63 0000004f`828ff430 00007ffc`2acd3034 coreclr!Thread::intermediateThreadProc+0x86 [e:\a\_work\164\s\src\vm\threads.cpp @ 2255]
64 0000004f`828ffdf0 00007ffc`2afc1461 kernel32!BaseThreadInitThunk+0x14
65 0000004f`828ffe20 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Thread 2, which looks like a background thread running a DllMain (ntdll!LdrpCallInitRoutine is what invokes DllMain):
0:076> kn
# Child-SP RetAddr Call Site
00 0000004f`8487f268 00007ffc`27372f1d ntdll!NtYieldExecution+0x14
01 0000004f`8487f270 00007ffb`ba4b882d KERNELBASE!SwitchToThread+0x1d [minkernel\kernelbase\thread.c @ 3327]
02 0000004f`8487f2a0 00007ffb`ba4b69a8 MklImports+0x882d
03 0000004f`8487f2e0 00007ffb`ba4b1132 MklImports+0x69a8
04 0000004f`8487f7c0 00007ffb`be8679d4 MklImports+0x1132
05 0000004f`8487f7f0 00007ffc`2af84053 MklImports!vslDeleteStream+0x43a6274
06 0000004f`8487f850 00007ffc`2af8167f ntdll!LdrpCallInitRoutine+0x6b
07 0000004f`8487f8c0 00007ffc`2afc14ce ntdll!LdrShutdownThread+0x16f
08 0000004f`8487f9c0 00007ffc`2acd303c ntdll!RtlExitUserThread+0x3e
09 0000004f`8487fa00 00007ffc`2afc1461 kernel32!BaseThreadInitThunk+0x1c
0a 0000004f`8487fa30 00000000`00000000 ntdll!RtlUserThreadStart+0x21
Note that Thread 1 is calling MklImports!cblas_sgemm, which eventually leads MklImports to call LoadLibraryA. Printing the variables at the LoadLibraryA frame:
0:024> dv
lpLibFileName = 0x00007ffb`bea0eb2c "libittnotify.dll"
StrLength = <value unavailable>
pszBuffer = 0x00007ffb`bee40148 ""
hMod = 0x00000000`00000000
There are other threads running at this point (and some of those threads are spawning new threads as well). My guess is a loader-lock race between MklImports initialization and thread exit: Thread 1 is blocked inside the loader (ntdll!LdrpDrainWorkQueue) while loading libittnotify.dll, and at the same time Thread 2 is exiting and running MklImports' thread-detach callback under the loader lock, spinning in SwitchToThread - presumably waiting on MKL initialization work that Thread 1 can no longer complete.
To work around this, I am loading MklImports very early in the tests: in the base test class static initializer, I call into MklImports to ensure it is fully loaded before any test threads start. This appears to fix the issue - I've run the tests 30 times on my machine without a hang.
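The workaround described above can be sketched roughly as follows. This is an illustration, not the actual patch: the `BaseTestClass` name comes from the description, while `NativeLibrary.Load` (available in .NET Core 3.0+) is one assumed way to force the load - P/Invoking any cheap MklImports export would serve the same purpose.

```csharp
using System.Runtime.InteropServices;

public abstract class BaseTestClass
{
    // Runs once, before any derived test class executes and before the test
    // runner spins up worker threads, so the OS loader can finish loading
    // MklImports.dll (and its libittnotify.dll dependency) on a single thread.
    static BaseTestClass()
    {
        // Assumption: NativeLibrary.Load resolves "MklImports" relative to
        // the test assembly. Calling into any MKL export via P/Invoke here
        // would equally complete the load eagerly.
        NativeLibrary.Load("MklImports",
                           typeof(BaseTestClass).Assembly,
                           searchPath: null);
    }
}
```

Because the load completes while only one thread exists, later calls such as cblas_sgemm never trigger LoadLibraryA concurrently with thread creation or exit, which sidesteps the loader-lock race.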
I also have a few .dmp files; if anyone wants to investigate themselves, ping me.
Repro steps
- Ensure your Windows machine can build ML.NET - https://github.com/dotnet/machinelearning/blob/master/docs/building/windows-instructions.md
- git clone https://github.com/dotnet/machinelearning.git
- cd machinelearning
- git checkout 70b3c3b
- .\build.cmd
- cd test\Microsoft.ML.Tests
- ..\..\Tools\dotnetcli\dotnet.exe test
a. The tests take a little over a minute, but the hang doesn't repro every time, so you need to run them a few times. If a test run hangs for over 2 minutes, you have hit the deadlock; attach a debugger to investigate.
- Or, if using PowerShell:
a. for ($i=0; $i -lt 20;$i++) {..\..\Tools\dotnetcli\dotnet.exe test}