Skip to content

Conversation

@westonpace
Copy link
Member

The background generator kept reading from the source even after the downstream had given up on it. Other than the obvious memory / resource usage problems this also meant that callback handlers could reference deleted state downstream. Now we block the destructor until the background thread is finished and we stop the background thread early if all consumer references are lost.

@westonpace
Copy link
Member Author

I was able to reproduce the original error pretty reliably. The tricky part was that it would only be triggered when the CPU executor was cleaned up and so I had to do the repeat outside of gtest (i.e. no --gtest_repeat) so that the process kept closing/opening. This combined with CPU stress triggered the issue pretty reliably. I used this to verify the fix.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(warning: long tangent)

I think longer term, we should make sure to pass down stop tokens everywhere. If we had a generator backed ultimately by a Flight stream, for instance, we may be waiting an arbitrarily long time if there's no explicit cancellation signal.

Also I think this is why 'modern' libraries (e.g. Trio) have a nursery or some other scope that when exited/dropped, waits for all work to complete/be cancelled - that avoids problems like this, and the one Antoine helped with in #9656, where tasks outlive their context. ARROW-12208 (the 'current thread executor') is working towards that design: you explicitly create an executor at the top level which is used to run your work, and when that's done you clean up everything before returning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need only state->running since that also covers the case that the background task is not currently running, but has run in the past (which this check misses).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think running works. Consider the cleanup happens while the background thread is about to call waiting_future.MarkFinished(next); and the queue has just filled up so running is false. If we don't wait and just return then it is possible the callback accesses deleted state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, you're right, I got myself mixed up before. I guess the concern is actually - what if (a) the background task is currently running and (b) it's not the first time it's run - it looks like task_finished will have been completed the first time we ran the background task, so this won't actually wait for the background task to end. Maybe task_finished needs to be reinitialized when the task starts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task_finished should only be marked complete on the last execution of Task (e.g. Task should never be relaunched once state->finished is true).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right though that it's not properly taking into account what happens if cleanup runs and task isn't running.

@lidavidm
Copy link
Member

lidavidm commented Apr 7, 2021

On a side note, when you mentioned rxcpp, it seems the lead maintainer moved on to libunifex which is an implementation of a C++23 proposal (Unified Executors), wonder if there's any ideas to borrow from there.

@github-actions
Copy link

github-actions bot commented Apr 7, 2021

@westonpace
Copy link
Member Author

First, I was not familiar with libunifex, just scanning through it looks pretty cool. Their AsyncStream looks to be pretty compatible with AsyncGenerator so maybe we can borrow some ideas. I'll have to read through in more detail later.

Second, yes, I had the same thought about the stop token. I'll add a note to ARROW-11875.

Third, I hadn't thought of the serial executor as a nursery but you may be on to something. For example, off the top of my head you could do something like...

Future<T> AsyncBind(std::function<Future<T>(S*)>, std::shared_ptr<S>)

I'll make a JIRA for it.

...it's not exactly a nursery but similar. There is also the oox_var concept mentioned in ARROW-7001 's JIRA comments which is more of a classic nursery (but a bit more complicated).

@westonpace
Copy link
Member Author

It seems there is some corner case that leads to deadlock. More work is needed. Also, I created ARROW-12285 for discussion on async state management.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where complexity starts being difficult to reason about. What's the difference between running_at_all and finished? What are the possible combinations?
You may want to define a enum to describe the current state rather than having three different booleans...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finished means that the background thread is done. Any future reads that would require starting the background generator back up should return an end token.

running_at_all means that not only is it finished but the background thread is done delivering the last item (we mark finished and "reserve a spot" in the mutex but can't deliver the item in the mutex) and is not going to be checking to see if it needs to mark the final future complete.

I will consider an enum and see if the 8 possible states collapse down to some reasonable subset.

@westonpace westonpace force-pushed the bugfix/arrow-12220 branch from 8f25ca5 to 6ffc35f Compare April 9, 2021 03:22
@westonpace
Copy link
Member Author

Ok. I got lost trying to simplify things with an enum and, in the process, found a bug where a really fast consumer could try and restart the queue multiple times.

@pitrou @lidavidm

So, I have reworked it and I think it is a bit easier to reason with now (a bit).

Every background task gets a future that it marks complete when it finishes.

The background task cannot be restarted (and the cleanup cannot finish) until that future has been marked complete.

There is still a "running" flag so that we know when we add an item to the queue if we have to restart the background task (because it is no longer running and merely cleaning up) or not (because it is still running and will pick it up on the next loop).

@westonpace westonpace requested a review from pitrou April 9, 2021 03:31
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is rather tricky code overall now…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This enum seems to be unused in favor of the booleans still?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, whoops. Missed removing that when I was abandoning the enum approach. I've removed it.

@westonpace
Copy link
Member Author

I'm not too sure what to do about the remaining complexity.

I could change the background thread so it blocks instead of releasing the thread although I think that would introduce the need for a proper condition variable.

I could change the polling thread so that a restart that is blocked by a cleaning up background thread blocks instead of using callbacks. But then this is blocking in an async context which is ideally avoided.

Or we could simply go back to the old method of thread task per request. At the end of the day the threads were only adding a small overhead to the cached I/O case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability suggestion: can this just be a static method on the generator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

…en after the downstream had given up on it. Other than the obvious memory / resource usage problems this also meant that callback handlers could reference deleted state downstream. Now we block the destructor until the background thread is finished and we stop the background thread early if all consumer references are lost.
… could fail to cleanup the background thread quickly enough for the unit test's satisfaction
@lidavidm
Copy link
Member

It looks like the R 3.5 tests got stuck when running on x64, timing out after 6 hours. The 1 hour timeout didn't apply. Unfortunately there seems to be practically nothing in the logs other than that.

2021-04-13T03:25:44.2272201Z * checking tests ...
2021-04-13T03:25:44.2272689Z 
2021-04-13T03:25:44.4682128Z ** running tests for arch 'i386' ...
2021-04-13T03:25:44.4683020Z 
2021-04-13T03:26:12.7283883Z   Running 'testthat.R' [28s]
2021-04-13T03:26:12.7285110Z 
2021-04-13T03:26:12.7402341Z  OK
2021-04-13T03:26:12.7403034Z 
2021-04-13T03:26:12.9895003Z ** running tests for arch 'x64' ...
2021-04-13T03:26:12.9896008Z 
2021-04-13T09:00:00.6350799Z 
2021-04-13T09:00:00.6360365Z ##[error]Error: 'col_ed' is not an exported object from 'namespace:cli'
2021-04-13T09:00:00.6405966Z Execution halted
2021-04-13T09:00:05.5571291Z ##[error]The operation was canceled.

@westonpace
Copy link
Member Author

That seems new. The R 3.5 64 bit builds have been pretty stable (compared to the 32 bit builds). Unfortunately, it's not reliable. The R builds passed locally for me. Later today I'll run a few more local runs on the laptop and see if I can reproduce it.

@pitrou
Copy link
Member

pitrou commented Apr 13, 2021

I'm retrying the R builds. We shouldn't bother if there's a sporadic issue with soon-to-be-obsolete RTools 3.5.

@nealrichardson
Copy link
Member

All 5 R jobs appear to be hopelessly hung.

@pitrou
Copy link
Member

pitrou commented Apr 13, 2021

They're just waiting to start AFAICT.

@nealrichardson
Copy link
Member

@github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer

1 similar comment
@westonpace
Copy link
Member Author

@github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer

@westonpace
Copy link
Member Author

@nealrichardson 's crossbow job appears to have failed to launch (hence the thumbs down). So I gave it another shot.

Run archery trigger-bot \
fatal: not a git repository (or any of the parent directories): .git
Cloning into '/tmp/tmpjly6bfbo/arrow'...
From https://github.com/apache/arrow
 * [new ref]             refs/pull/9941/head -> bugfix/arrow-12220
Switched to branch 'bugfix/arrow-12220'
Cloning into '/tmp/tmpjly6bfbo/crossbow'...
ERROR:archery:'arch_alias' is undefined

@lidavidm
Copy link
Member

Your second attempt also failed :/ https://github.com/apache/arrow/runs/2341429654

I manually submitted it in the meantime: https://github.com/lidavidm/crossbow/runs/2343121554?check_suite_focus=true

Not sure what's going on as crossbow submit --dry-run locally prints out the job fine, and there were some changes to Archery that affected arch_alias in the templates, but that second attempt ran with that change. (I wish Archery logged the full traceback as it's hard to tell what's different, and trigger-bot isn't easy to replicate locally…)

@lidavidm
Copy link
Member

ThreadSanitizer found issues still :/

arrow-threading-utility-test
[ RUN      ] TestRunSynchronously/TestRunSynchronously.ContinueAfterExternal/0
==================
WARNING: ThreadSanitizer: data race (pid=5400)
  Write of size 8 at 0x7b2c00000498 by main thread:
    #0 pthread_cond_destroy <null> (arrow-threading-utility-test+0x71ae5)
    #1 arrow::internal::SerialExecutor::State::~State() /arrow/cpp/src/arrow/util/thread_pool.cc:48:24 (libarrow.so.400+0x102b639)
    #2 std::default_delete<arrow::internal::SerialExecutor::State>::operator()(arrow::internal::SerialExecutor::State*) const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/unique_ptr.h:81:2 (libarrow.so.400+0x102b466)
    #3 std::unique_ptr<arrow::internal::SerialExecutor::State, std::default_delete<arrow::internal::SerialExecutor::State> >::~unique_ptr() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/unique_ptr.h:292:4 (libarrow.so.400+0x1028823)
    #4 arrow::internal::SerialExecutor::~SerialExecutor() /arrow/cpp/src/arrow/util/thread_pool.cc:56:36 (libarrow.so.400+0x1020aa5)
    #5 arrow::Result<arrow::detail::Empty> arrow::internal::SerialExecutor::RunInSerialExecutor<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool.h:220:5 (libarrow.so.400+0x1030965)
    #6 arrow::Result<arrow::detail::Empty> arrow::internal::RunSynchronously<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>, bool) /arrow/cpp/src/arrow/util/thread_pool.h:337:12 (libarrow.so.400+0x10295aa)
    #7 arrow::internal::RunSynchronouslyVoid(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>, bool) /arrow/cpp/src/arrow/util/thread_pool.cc:420:10 (libarrow.so.400+0x10237fb)
    #8 arrow::internal::TestRunSynchronously::RunVoid(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool_test.cc:136:12 (arrow-threading-utility-test+0x2506df)
    #9 arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody() /arrow/cpp/src/arrow/util/thread_pool_test.cc:229:3 (arrow-threading-utility-test+0x22d9ec)
    #10 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2433:10 (libgtestd.so+0xabc73)

  Previous read of size 8 at 0x7b2c00000498 by thread T5:
    #0 pthread_cond_signal <null> (arrow-threading-utility-test+0x717c2)
    #1 std::condition_variable::notify_one() <null> (libstdc++.so.6+0xd0e6c)
    #2 arrow::Status arrow::internal::Executor::Spawn<arrow::Future<arrow::detail::Empty> arrow::internal::Executor::Transfer<arrow::detail::Empty>(arrow::Future<arrow::detail::Empty>)::'lambda'(arrow::Result<arrow::detail::Empty> const&)::operator()(arrow::Result<arrow::detail::Empty> const&)::'lambda'()>(arrow::detail::Empty&&, arrow::StopToken) /arrow/cpp/src/arrow/util/thread_pool.h:86:12 (arrow-threading-utility-test+0x2763d0)
    #3 arrow::Future<arrow::detail::Empty> arrow::internal::Executor::Transfer<arrow::detail::Empty>(arrow::Future<arrow::detail::Empty>)::'lambda'(arrow::Result<arrow::detail::Empty> const&)::operator()(arrow::Result<arrow::detail::Empty> const&) /arrow/cpp/src/arrow/util/thread_pool.h:109:27 (arrow-threading-utility-test+0x276138)
    #4 arrow::Future<arrow::detail::Empty>::Callback<arrow::Future<arrow::detail::Empty> arrow::internal::Executor::Transfer<arrow::detail::Empty>(arrow::Future<arrow::detail::Empty>)::'lambda'(arrow::Result<arrow::detail::Empty> const&)>::operator()() && /arrow/cpp/src/arrow/util/future.h:526:7 (arrow-threading-utility-test+0x275ffb)
    #5 arrow::internal::FnOnce<void ()>::FnImpl<arrow::Future<arrow::detail::Empty>::Callback<arrow::Future<arrow::detail::Empty> arrow::internal::Executor::Transfer<arrow::detail::Empty>(arrow::Future<arrow::detail::Empty>)::'lambda'(arrow::Result<arrow::detail::Empty> const&)> >::invoke() /arrow/cpp/src/arrow/util/functional.h:122:42 (arrow-threading-utility-test+0x275f67)
    #6 arrow::internal::FnOnce<void ()>::operator()() && /arrow/cpp/src/arrow/util/functional.h:110:17 (libarrow.so.400+0xf9cb80)
    #7 arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) /arrow/cpp/src/arrow/util/future.cc:276:7 (libarrow.so.400+0xf9c86c)
    #8 arrow::ConcreteFutureImpl::DoMarkFinished() /arrow/cpp/src/arrow/util/future.cc:230:27 (libarrow.so.400+0xf98edd)
    #9 arrow::FutureImpl::MarkFinished() /arrow/cpp/src/arrow/util/future.cc:333:60 (libarrow.so.400+0xf952fa)
    #10 arrow::Future<arrow::detail::Empty>::DoMarkFinished(arrow::Result<arrow::detail::Empty>) /arrow/cpp/src/arrow/util/future.h:546:14 (arrow-threading-utility-test+0x186475)
    #11 void arrow::Future<arrow::detail::Empty>::MarkFinished<arrow::detail::Empty, void>(arrow::Status) /arrow/cpp/src/arrow/util/future.h:336:12 (arrow-threading-utility-test+0x16f4ad)
    #12 std::enable_if<(!(std::is_void<arrow::Status>::value)) && (!(is_future<arrow::Status>::value)), void>::type arrow::detail::ContinueFuture::operator()<arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&, arrow::Status, arrow::Future<arrow::detail::Empty> >(arrow::Future<arrow::detail::Empty>, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&) const /arrow/cpp/src/arrow/util/future.h:84:10 (arrow-threading-utility-test+0x24c6f6)
    #13 void std::__invoke_impl<void, arrow::detail::ContinueFuture&, arrow::Future<arrow::detail::Empty>&, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&>(std::__invoke_other, arrow::detail::ContinueFuture&, arrow::Future<arrow::detail::Empty>&, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60:14 (arrow-threading-utility-test+0x24c619)
    #14 std::__invoke_result<arrow::detail::ContinueFuture&, arrow::Future<arrow::detail::Empty>&, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&>::type std::__invoke<arrow::detail::ContinueFuture&, arrow::Future<arrow::detail::Empty>&, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&>(arrow::detail::ContinueFuture&, arrow::Future<arrow::detail::Empty>&, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95:14 (arrow-threading-utility-test+0x24c49a)
    #15 void std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::detail::Empty>, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'())>::__call<void, 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/functional:400:11 (arrow-threading-utility-test+0x24c403)
    #16 void std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::detail::Empty>, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'())>::operator()<void>() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/functional:482:17 (arrow-threading-utility-test+0x24c359)
    #17 arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::detail::Empty>, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'())> >::invoke() /arrow/cpp/src/arrow/util/functional.h:122:42 (arrow-threading-utility-test+0x24c2f7)
    #18 arrow::internal::FnOnce<void ()>::operator()() && /arrow/cpp/src/arrow/util/functional.h:110:17 (libarrow.so.400+0xf9cb80)
    #19 arrow::internal::WorkerLoop(std::shared_ptr<arrow::internal::ThreadPool::State>, std::_List_iterator<std::thread>) /arrow/cpp/src/arrow/util/thread_pool.cc:167:11 (libarrow.so.400+0x1026cfb)
    #20 arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2::operator()() const /arrow/cpp/src/arrow/util/thread_pool.cc:321:37 (libarrow.so.400+0x1026924)
    #21 void std::__invoke_impl<void, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2>(std::__invoke_other, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:60:14 (libarrow.so.400+0x1026890)
    #22 std::__invoke_result<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2>::type std::__invoke<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2>(arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/invoke.h:95:14 (libarrow.so.400+0x10267a0)
    #23 void std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:244:13 (libarrow.so.400+0x1026748)
    #24 std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:251:11 (libarrow.so.400+0x10266e8)
    #25 std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_2> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/thread:195:13 (libarrow.so.400+0x102646f)
    #26 <null> <null> (libstdc++.so.6+0xd6d83)

  Location is heap block of size 176 at 0x7b2c00000420 allocated by main thread:
    #0 operator new(unsigned long) <null> (arrow-threading-utility-test+0xff3a7)
    #1 arrow::internal::SerialExecutor::SerialExecutor() /arrow/cpp/src/arrow/util/thread_pool.cc:55:43 (libarrow.so.400+0x10209a7)
    #2 arrow::Result<arrow::detail::Empty> arrow::internal::SerialExecutor::RunInSerialExecutor<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool.h:220:12 (libarrow.so.400+0x1030906)
    #3 arrow::Result<arrow::detail::Empty> arrow::internal::RunSynchronously<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>, bool) /arrow/cpp/src/arrow/util/thread_pool.h:337:12 (libarrow.so.400+0x10295aa)
    #4 arrow::internal::RunSynchronouslyVoid(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>, bool) /arrow/cpp/src/arrow/util/thread_pool.cc:420:10 (libarrow.so.400+0x10237fb)
    #5 arrow::internal::TestRunSynchronously::RunVoid(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool_test.cc:136:12 (arrow-threading-utility-test+0x2506df)
    #6 arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody() /arrow/cpp/src/arrow/util/thread_pool_test.cc:229:3 (arrow-threading-utility-test+0x22d9ec)
    #7 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2433:10 (libgtestd.so+0xabc73)

  Thread T5 (tid=7374, running) created by main thread at:
    #0 pthread_create <null> (arrow-threading-utility-test+0x707cb)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xd7048)
    #2 arrow::internal::ThreadPool::LaunchWorkersUnlocked(int) /arrow/cpp/src/arrow/util/thread_pool.cc:321:11 (libarrow.so.400+0x1022311)
    #3 arrow::internal::ThreadPool::SpawnReal(arrow::internal::TaskHints, arrow::internal::FnOnce<void ()>, arrow::StopToken, arrow::internal::FnOnce<void (arrow::Status const&)>&&) /arrow/cpp/src/arrow/util/thread_pool.cc:338:7 (libarrow.so.400+0x1022ae4)
    #4 arrow::Result<arrow::Future<arrow::detail::Empty> > arrow::internal::Executor::Submit<arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'(), arrow::Future<arrow::detail::Empty> >(arrow::internal::TaskHints, arrow::StopToken, arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&&) /arrow/cpp/src/arrow/util/thread_pool.h:149:5 (arrow-threading-utility-test+0x24b483)
    #5 arrow::Result<arrow::Future<arrow::detail::Empty> > arrow::internal::Executor::Submit<arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'(), arrow::Future<arrow::detail::Empty> >(arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const::'lambda'()&&) /arrow/cpp/src/arrow/util/thread_pool.h:175:12 (arrow-threading-utility-test+0x24b173)
    #6 arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6::operator()(arrow::internal::Executor*) const /arrow/cpp/src/arrow/util/thread_pool_test.cc:223:45 (arrow-threading-utility-test+0x24af53)
    #7 arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>::FnImpl<arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody()::$_6>::invoke(arrow::internal::Executor*&&) /arrow/cpp/src/arrow/util/functional.h:122:42 (arrow-threading-utility-test+0x24ae8b)
    #8 arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>::operator()(arrow::internal::Executor*) && /arrow/cpp/src/arrow/util/functional.h:110:17 (libarrow.so.400+0x1030818)
    #9 arrow::Result<arrow::detail::Empty> arrow::internal::SerialExecutor::Run<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool.h:232:22 (libarrow.so.400+0x1030b8a)
    #10 arrow::Result<arrow::detail::Empty> arrow::internal::SerialExecutor::RunInSerialExecutor<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool.h:220:29 (libarrow.so.400+0x1030938)
    #11 arrow::Result<arrow::detail::Empty> arrow::internal::RunSynchronously<arrow::detail::Empty>(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>, bool) /arrow/cpp/src/arrow/util/thread_pool.h:337:12 (libarrow.so.400+0x10295aa)
    #12 arrow::internal::RunSynchronouslyVoid(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>, bool) /arrow/cpp/src/arrow/util/thread_pool.cc:420:10 (libarrow.so.400+0x10237fb)
    #13 arrow::internal::TestRunSynchronously::RunVoid(arrow::internal::FnOnce<arrow::Future<arrow::detail::Empty> (arrow::internal::Executor*)>) /arrow/cpp/src/arrow/util/thread_pool_test.cc:136:12 (arrow-threading-utility-test+0x2506df)
    #14 arrow::internal::TestRunSynchronously_ContinueAfterExternal_Test::TestBody() /arrow/cpp/src/arrow/util/thread_pool_test.cc:229:3 (arrow-threading-utility-test+0x22d9ec)
    #15 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest.cc:2433:10 (libgtestd.so+0xabc73)

SUMMARY: ThreadSanitizer: data race (/build/cpp/debug/arrow-threading-utility-test+0x71ae5) in pthread_cond_destroy
==================
[       OK ] TestRunSynchronously/TestRunSynchronously.ContinueAfterExternal/0 (763 ms)

also in arrow-dataset-file-csv-test (too much to paste here)

@lidavidm
Copy link
Member

Ok, I think this PR fixes the originally reported issue with BackgroundGenerator. But now ThreadSanitizer is finding an issue with SerialExecutor. The nightlies report similar issues: https://github.com/ursacomputing/crossbow/runs/2340248774

So I think we should merge this and start fixing the new issue that's cropped up.

@pitrou
Copy link
Member

pitrou commented Apr 14, 2021

I'm debugging the TSan failure FWIW.

@pitrou
Copy link
Member

pitrou commented Apr 14, 2021

But I agree we can merge this PR already.

@lidavidm lidavidm closed this in e99e70d Apr 14, 2021
kou pushed a commit that referenced this pull request Apr 15, 2021
… trigger-bot

Some earlier attempts at invoking Crossbow failed and the reason wasn't clear. This logs full tracebacks for unexpected exceptions to make it easier to tell what happened.

Example where we ran into a mysterious error: #9941 (comment)

Closes #10037 from lidavidm/arrow-12376

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
… trigger-bot

Some earlier attempts at invoking Crossbow failed and the reason wasn't clear. This logs full tracebacks for unexpected exceptions to make it easier to tell what happened.

Example where we ran into a mysterious error: apache/arrow#9941 (comment)

Closes #10037 from lidavidm/arrow-12376

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@westonpace westonpace deleted the bugfix/arrow-12220 branch January 6, 2022 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants