ARROW-12220: [C++][CI] Thread sanitizer failure #9941
Conversation
I was able to reproduce the original error pretty reliably. The tricky part was that it would only be triggered when the CPU executor was cleaned up, and so I had to do the repeat outside of gtest (i.e. no --gtest_repeat) so that the process kept closing/opening. This combined with CPU stress triggered the issue pretty reliably. I used this to verify the fix.
lidavidm left a comment
(warning: long tangent)
I think longer term, we should make sure to pass down stop tokens everywhere. If we had a generator backed ultimately by a Flight stream, for instance, we may be waiting an arbitrarily long time if there's no explicit cancellation signal.
Also I think this is why 'modern' libraries (e.g. Trio) have a nursery or some other scope that, when exited/dropped, waits for all work to complete or be cancelled - that avoids problems like this, and the one Antoine helped with in #9656, where tasks outlive their context. ARROW-12208 (the 'current thread executor') is working towards that design: you explicitly create an executor at the top level which is used to run your work, and when that's done you clean up everything before returning.
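A minimal sketch of the cancellation idea above, using only standard C++ (std::stop_token / std::jthread) rather than Arrow's actual API; the Flight stand-in and all names here are illustrative assumptions:

```cpp
#include <chrono>
#include <optional>
#include <stop_token>
#include <thread>

std::optional<int> ReadNext() {
  // Stand-in for a potentially slow network read (e.g. a Flight stream).
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  return 42;
}

void ProducerLoop(std::stop_token stop) {
  // Checking the stop token between reads means a consumer that gives up can
  // cancel promptly instead of waiting on a stream that may never deliver.
  while (!stop.stop_requested()) {
    auto item = ReadNext();
    if (!item) break;
    // ... hand the item to the consumer ...
  }
}

int main() {
  // std::jthread requests stop and joins in its destructor, which is the
  // scope-exit ("nursery") behaviour described above.
  std::jthread producer(ProducerLoop);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
}  // ~jthread: request_stop() then join()
```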
cpp/src/arrow/util/async_generator.h
Outdated
I think we need only state->running since that also covers the case that the background task is not currently running, but has run in the past (which this check misses).
I don't think running works. Consider: cleanup happens while the background thread is about to call waiting_future.MarkFinished(next) and the queue has just filled up, so running is false. If we don't wait and just return, then it is possible the callback accesses deleted state.
Hmm, you're right, I got myself mixed up before. I guess the concern is actually - what if (a) the background task is currently running and (b) it's not the first time it's run - it looks like task_finished will have been completed the first time we ran the background task, so this won't actually wait for the background task to end. Maybe task_finished needs to be reinitialized when the task starts?
task_finished should only be marked complete on the last execution of Task (e.g. Task should never be relaunched once state->finished is true).
I think you're right though that it's not properly taking into account what happens if cleanup runs and task isn't running.
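A hedged sketch of the invariant being discussed (the names mirror the conversation, but this is not the actual async_generator.h code): task_finished is signalled only by the task's final execution, and cleanup waits on it, so a callback can never touch freed state.

```cpp
#include <future>
#include <memory>
#include <mutex>

struct State {
  std::mutex mutex;
  bool finished = false;             // No more items will ever be produced.
  std::promise<void> task_finished;  // Completed only by the task's final run.
};

void RunBackgroundTask(const std::shared_ptr<State>& state) {
  bool last_run = false;
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    // ... pull items and fill the queue ...
    last_run = state->finished;  // The task is never relaunched once finished.
  }
  if (last_run) {
    // Only the final execution signals task_finished, so anyone waiting on it
    // knows the task will not run (or touch the shared state) again.
    state->task_finished.set_value();
  }
}

void Cleanup(const std::shared_ptr<State>& state) {
  // Assumes finished has been set, so a final run will eventually signal.
  state->task_finished.get_future().wait();
}
```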
On a side note, when you mentioned rxcpp, it seems the lead maintainer moved on to libunifex, which is an implementation of a C++23 proposal (Unified Executors); I wonder if there are any ideas to borrow from there.
First, I was not familiar with libunifex. Second, yes, I had the same thought about the stop token; I'll add a note to ARROW-11875. Third, I hadn't thought of the serial executor as a nursery, but you may be on to something. For example, off the top of my head you could do something like... I'll make a JIRA for it. It's not exactly a nursery, but similar.
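A purely hypothetical sketch of the "serial executor as nursery" idea; none of these names are Arrow's real API. All work is funnelled through one executor created at the top of the call, and the call does not return until the executor has drained:

```cpp
#include <functional>
#include <queue>

class ScopedSerialExecutor {
 public:
  void Spawn(std::function<void()> task) { tasks_.push(std::move(task)); }

  // Run everything to completion, including work spawned by running tasks.
  void Drain() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }
  }

 private:
  std::queue<std::function<void()>> tasks_;
};

// Usage: nothing spawned here can outlive the scope that created it, because
// RunWorkload drains the executor before returning.
int RunWorkload() {
  ScopedSerialExecutor executor;
  int result = 0;
  executor.Spawn([&] { result = 42; });
  executor.Drain();
  return result;
}
```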
It seems there is some corner case that leads to deadlock. More work is needed. Also, I created ARROW-12285 for discussion on async state management.
cpp/src/arrow/util/async_generator.h
Outdated
This is where complexity starts being difficult to reason about. What's the difference between running_at_all and finished? What are the possible combinations?
You may want to define an enum to describe the current state rather than having three different booleans...
finished means that the background thread is done. Any future reads that would require starting the background generator back up should return an end token.
running_at_all means that not only is it finished, but the background thread is also done delivering the last item (we mark finished and "reserve a spot" while holding the mutex, but we can't deliver the item while holding the mutex) and is not going to check whether it needs to mark the final future complete.
I will consider an enum and see if the 8 possible states collapse down to some reasonable subset.
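For what it's worth, one way the three booleans might collapse is sketched below; this is speculative and not necessarily what the code ended up with:

```cpp
// Speculative illustration of collapsing the boolean flags into one state.
enum class GeneratorState {
  kRunning,    // Background thread is actively producing items.
  kIdle,       // Queue filled up; the thread has parked and can be restarted.
  kFinishing,  // finished is set, but the last item / final future has not
               // been delivered yet.
  kDone        // Background thread fully stopped; further reads return an
               // end token.
};
```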
Force-pushed from 8f25ca5 to 6ffc35f.
Ok. I got lost trying to simplify things with an enum and, in the process, found a bug where a really fast consumer could try and restart the queue multiple times. So I have reworked it, and I think it is a bit easier to reason about now. Every background task gets a future that it marks complete when it finishes. The background task cannot be restarted (and the cleanup cannot finish) until that future has been marked complete. There is still a "running" flag so that, when we add an item to the queue, we know whether we have to restart the background task (because it is no longer running and is merely cleaning up) or not (because it is still running and will pick it up on the next loop).
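A rough sketch of the reworked protocol described above (illustrative only, not the actual async_generator.h implementation): each launch installs a fresh completion future, and both restart and cleanup wait on it before proceeding.

```cpp
#include <functional>
#include <future>
#include <memory>
#include <mutex>

struct GenState {
  std::mutex mutex;
  bool running = false;                // A background task is currently live.
  std::shared_future<void> task_done;  // Completed when that task exits.
};

void LaunchTask(const std::shared_ptr<GenState>& state,
                const std::function<void()>& body) {
  std::promise<void> done;
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    state->running = true;
    state->task_done = done.get_future().share();
  }
  body();  // Normally submitted to a thread pool; run inline here for brevity.
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    state->running = false;
  }
  // Signalled last: once this fires, the task will never touch state again,
  // so a restart or cleanup that waited on task_done may proceed safely.
  done.set_value();
}

// Restart and cleanup both call this: nothing proceeds until the previous
// task execution has fully finished.
void WaitForTask(const std::shared_ptr<GenState>& state) {
  std::shared_future<void> done;
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    done = state->task_done;
  }
  if (done.valid()) done.wait();
}
```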
lidavidm left a comment
This is rather tricky code overall now…
cpp/src/arrow/util/async_generator.h
Outdated
This enum seems to be unused in favor of the booleans still?
Ah, whoops. Missed removing that when I was abandoning the enum approach. I've removed it.
I'm not too sure what to do about the remaining complexity. I could change the background thread so it blocks instead of releasing the thread, although I think that would introduce the need for a proper condition variable. I could change the polling thread so that a restart blocked by a cleaning-up background thread blocks instead of using callbacks, but then we would be blocking in an async context, which is ideally avoided. Or we could simply go back to the old method of one thread task per request; at the end of the day the threads were only adding a small overhead to the cached I/O case.
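For reference, a rough sketch (illustrative only) of the first alternative, where the producer blocks on a condition variable when the queue is full instead of exiting and being restarted; it removes the restart bookkeeping at the cost of holding the thread:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

struct BlockingQueue {
  std::mutex mutex;
  std::condition_variable space_available;
  std::queue<int> items;
  std::size_t max_size = 32;
  bool finished = false;

  // Producer side: park here while the consumer catches up.
  void Push(int item) {
    std::unique_lock<std::mutex> lock(mutex);
    space_available.wait(lock,
                         [&] { return items.size() < max_size || finished; });
    if (!finished) items.push(item);
  }

  // Consumer side: after popping an item (not shown), wake the producer.
  void NotifySpace() { space_available.notify_one(); }
};
```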
cpp/src/arrow/util/async_generator.h
Outdated
Readability suggestion: can this just be a static method on the generator?
Done.
…en after the downstream had given up on it. Other than the obvious memory / resource usage problems this also meant that callback handlers could reference deleted state downstream. Now we block the destructor until the background thread is finished and we stop the background thread early if all consumer references are lost.
…could lead to deadlock.
… could fail to cleanup the background thread quickly enough for the unit test's satisfaction
…and reduce race conditions
Force-pushed from df22d51 to 926f799.
It looks like the R 3.5 tests got stuck when running on x64, timing out after 6 hours. The 1 hour timeout didn't apply. Unfortunately there seems to be practically nothing in the logs other than that.
That seems new. The R 3.5 64-bit builds have been pretty stable (compared to the 32-bit builds). Unfortunately, it's not reliably reproducible: the R builds passed locally for me. Later today I'll run a few more local runs on the laptop and see if I can reproduce it.
I'm retrying the R builds. We shouldn't bother if there's a sporadic issue with soon-to-be-obsolete RTools 3.5.
All 5 R jobs appear to be hopelessly hung.
They're just waiting to start AFAICT.
@github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer
@github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer
@nealrichardson's crossbow job appears to have failed to launch (hence the thumbs down). So I gave it another shot.
Your second attempt also failed :/ (https://github.com/apache/arrow/runs/2341429654). I manually submitted it in the meantime: https://github.com/lidavidm/crossbow/runs/2343121554?check_suite_focus=true. Not sure what's going on.
ThreadSanitizer still found issues :/ in arrow-threading-utility-test, and also in arrow-dataset-file-csv-test (too much to paste here).
Ok, I think this PR fixes the originally reported issue with BackgroundGenerator. But now ThreadSanitizer is finding an issue with SerialExecutor. The nightlies report similar issues: https://github.com/ursacomputing/crossbow/runs/2340248774. So I think we should merge this and start fixing the new issue that's cropped up.
I'm debugging the TSan failure FWIW.
But I agree we can merge this PR already.
… trigger-bot Some earlier attempts at invoking Crossbow failed and the reason wasn't clear. This logs full tracebacks for unexpected exceptions to make it easier to tell what happened. Example where we ran into a mysterious error: #9941 (comment) Closes #10037 from lidavidm/arrow-12376 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
The background generator kept reading from the source even after the downstream had given up on it. Other than the obvious memory / resource usage problems this also meant that callback handlers could reference deleted state downstream. Now we block the destructor until the background thread is finished and we stop the background thread early if all consumer references are lost.
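A sketch of the "stop the background thread early if all consumer references are lost" part (illustrative names, not the actual implementation): the producer holds only a weak_ptr to the consumer-facing state, so once every consumer handle is gone it notices and shuts down instead of reading forever.

```cpp
#include <memory>

struct ConsumerState {
  bool please_stop = false;
};

class Producer {
 public:
  explicit Producer(std::weak_ptr<ConsumerState> consumer)
      : consumer_(std::move(consumer)) {}

  // Checked between reads: lock() fails once the last consumer reference has
  // been dropped, and the read loop exits instead of continuing forever.
  bool ShouldContinue() const {
    auto consumer = consumer_.lock();
    return consumer != nullptr && !consumer->please_stop;
  }

 private:
  std::weak_ptr<ConsumerState> consumer_;
};
```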