ARROW-12220: [C++][CI] Thread sanitizer failure #9941
Conversation
I was able to reproduce the original error pretty reliably. The tricky part was that it would only be triggered when the CPU executor was cleaned up, and so I had to do the repeat outside of gtest (i.e. no --gtest_repeat) so that the process kept closing/opening. This combined with CPU stress triggered the issue pretty reliably. I used this to verify the fix.
lidavidm left a comment
(warning: long tangent)
I think longer term, we should make sure to pass down stop tokens everywhere. If we had a generator backed ultimately by a Flight stream, for instance, we may be waiting an arbitrarily long time if there's no explicit cancellation signal.
Also I think this is why 'modern' libraries (e.g. Trio) have a nursery or some other scope that, when exited/dropped, waits for all work to complete or be cancelled - that avoids problems like this, and the one Antoine helped with in #9656, where tasks outlive their context. ARROW-12208 (the 'current thread executor') is working towards that design: you explicitly create an executor at the top level which is used to run your work, and when that's done you clean up everything before returning.
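A minimal sketch of the cancellation idea above, using only standard C++ (std::stop_token / std::jthread) rather than Arrow's actual API; the Flight stand-in and all names here are illustrative assumptions:

```cpp
#include <chrono>
#include <optional>
#include <stop_token>
#include <thread>

std::optional<int> ReadNext() {
  // Stand-in for a potentially slow network read (e.g. a Flight stream).
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  return 42;
}

void ProducerLoop(std::stop_token stop) {
  // Checking the stop token between reads means a consumer that gives up can
  // cancel promptly instead of waiting on a stream that may never deliver.
  while (!stop.stop_requested()) {
    auto item = ReadNext();
    if (!item) break;
    // ... hand the item to the consumer ...
  }
}

int main() {
  // std::jthread requests stop and joins in its destructor, which is the
  // scope-exit ("nursery") behaviour described above.
  std::jthread producer(ProducerLoop);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
}  // ~jthread: request_stop() then join()
```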
cpp/src/arrow/util/async_generator.h
Outdated
I think we need only state->running since that also covers the case that the background task is not currently running, but has run in the past (which this check misses).
I don't think running works. Consider: cleanup happens while the background thread is about to call waiting_future.MarkFinished(next) and the queue has just filled up, so running is false. If we don't wait and just return, then it is possible the callback accesses deleted state.
Hmm, you're right, I got myself mixed up before. I guess the concern is actually - what if (a) the background task is currently running and (b) it's not the first time it's run - it looks like task_finished will have been completed the first time we ran the background task, so this won't actually wait for the background task to end. Maybe task_finished needs to be reinitialized when the task starts?
task_finished should only be marked complete on the last execution of Task (e.g. Task should never be relaunched once state->finished is true).
I think you're right though that it's not properly taking into account what happens if cleanup runs and task isn't running.
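A hedged sketch of the invariant being discussed (the names mirror the conversation, but this is not the actual async_generator.h code): task_finished is signalled only by the task's final execution, and cleanup waits on it, so a callback can never touch freed state.

```cpp
#include <future>
#include <memory>
#include <mutex>

struct State {
  std::mutex mutex;
  bool finished = false;             // No more items will ever be produced.
  std::promise<void> task_finished;  // Completed only by the task's final run.
};

void RunBackgroundTask(const std::shared_ptr<State>& state) {
  bool last_run = false;
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    // ... pull items and fill the queue ...
    last_run = state->finished;  // The task is never relaunched once finished.
  }
  if (last_run) {
    // Only the final execution signals task_finished, so anyone waiting on it
    // knows the task will not run (or touch the shared state) again.
    state->task_finished.set_value();
  }
}

void Cleanup(const std::shared_ptr<State>& state) {
  // Assumes finished has been set, so a final run will eventually signal.
  state->task_finished.get_future().wait();
}
```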
On a side note, when you mentioned rxcpp, it seems the lead maintainer moved on to libunifex, which is an implementation of a C++23 proposal (Unified Executors); I wonder if there are any ideas to borrow from there.
First, I was not familiar with libunifex. Second, yes, I had the same thought about the stop token; I'll add a note to ARROW-11875. Third, I hadn't thought of the serial executor as a nursery, but you may be on to something. For example, off the top of my head you could do something like... I'll make a JIRA for it. It's not exactly a nursery, but similar.
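A purely hypothetical sketch of the "serial executor as nursery" idea; none of these names are Arrow's real API. All work is funnelled through one executor created at the top of the call, and the call does not return until the executor has drained:

```cpp
#include <functional>
#include <queue>

class ScopedSerialExecutor {
 public:
  void Spawn(std::function<void()> task) { tasks_.push(std::move(task)); }

  // Run everything to completion, including work spawned by running tasks.
  void Drain() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }
  }

 private:
  std::queue<std::function<void()>> tasks_;
};

// Usage: nothing spawned here can outlive the scope that created it, because
// RunWorkload drains the executor before returning.
int RunWorkload() {
  ScopedSerialExecutor executor;
  int result = 0;
  executor.Spawn([&] { result = 42; });
  executor.Drain();
  return result;
}
```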
It seems there is some corner case that leads to deadlock. More work is needed. Also, I created ARROW-12285 for discussion on async state management.
cpp/src/arrow/util/async_generator.h
Outdated
This is where complexity starts being difficult to reason about. What's the difference between running_at_all and finished? What are the possible combinations?
You may want to define an enum to describe the current state rather than having three different booleans...
finished means that the background thread is done. Any future reads that would require starting the background generator back up should return an end token.
running_at_all means that not only is it finished, but the background thread is also done delivering the last item (we mark finished and "reserve a spot" while holding the mutex, but we can't deliver the item while holding the mutex) and is not going to check whether it needs to mark the final future complete.
I will consider an enum and see if the 8 possible states collapse down to some reasonable subset.
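For what it's worth, one way the three booleans might collapse is sketched below; this is speculative and not necessarily what the code ended up with:

```cpp
// Speculative illustration of collapsing the boolean flags into one state.
enum class GeneratorState {
  kRunning,    // Background thread is actively producing items.
  kIdle,       // Queue filled up; the thread has parked and can be restarted.
  kFinishing,  // finished is set, but the last item / final future has not
               // been delivered yet.
  kDone        // Background thread fully stopped; further reads return an
               // end token.
};
```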
Force-pushed from 8f25ca5 to 6ffc35f.
Ok. I got lost trying to simplify things with an enum and, in the process, found a bug where a really fast consumer could try and restart the queue multiple times. So I have reworked it, and I think it is a bit easier to reason about now. Every background task gets a future that it marks complete when it finishes. The background task cannot be restarted (and the cleanup cannot finish) until that future has been marked complete. There is still a "running" flag so that, when we add an item to the queue, we know whether we have to restart the background task (because it is no longer running and is merely cleaning up) or not (because it is still running and will pick it up on the next loop).
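A rough sketch of the reworked protocol described above (illustrative only, not the actual async_generator.h implementation): each launch installs a fresh completion future, and both restart and cleanup wait on it before proceeding.

```cpp
#include <functional>
#include <future>
#include <memory>
#include <mutex>

struct GenState {
  std::mutex mutex;
  bool running = false;                // A background task is currently live.
  std::shared_future<void> task_done;  // Completed when that task exits.
};

void LaunchTask(const std::shared_ptr<GenState>& state,
                const std::function<void()>& body) {
  std::promise<void> done;
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    state->running = true;
    state->task_done = done.get_future().share();
  }
  body();  // Normally submitted to a thread pool; run inline here for brevity.
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    state->running = false;
  }
  // Signalled last: once this fires, the task will never touch state again,
  // so a restart or cleanup that waited on task_done may proceed safely.
  done.set_value();
}

// Restart and cleanup both call this: nothing proceeds until the previous
// task execution has fully finished.
void WaitForTask(const std::shared_ptr<GenState>& state) {
  std::shared_future<void> done;
  {
    std::lock_guard<std::mutex> lock(state->mutex);
    done = state->task_done;
  }
  if (done.valid()) done.wait();
}
```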
lidavidm left a comment
This is rather tricky code overall now…
cpp/src/arrow/util/async_generator.h
Outdated
This enum seems to be unused in favor of the booleans still?
Ah, whoops. Missed removing that when I was abandoning the enum approach. I've removed it.
I'm not too sure what to do about the remaining complexity. I could change the background thread so it blocks instead of releasing the thread, although I think that would introduce the need for a proper condition variable. I could change the polling thread so that a restart blocked by a cleaning-up background thread blocks instead of using callbacks, but then we would be blocking in an async context, which is ideally avoided. Or we could simply go back to the old method of one thread task per request; at the end of the day the threads were only adding a small overhead to the cached I/O case.
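For reference, a rough sketch (illustrative only) of the first alternative, where the producer blocks on a condition variable when the queue is full instead of exiting and being restarted; it removes the restart bookkeeping at the cost of holding the thread:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

struct BlockingQueue {
  std::mutex mutex;
  std::condition_variable space_available;
  std::queue<int> items;
  std::size_t max_size = 32;
  bool finished = false;

  // Producer side: park here while the consumer catches up.
  void Push(int item) {
    std::unique_lock<std::mutex> lock(mutex);
    space_available.wait(lock,
                         [&] { return items.size() < max_size || finished; });
    if (!finished) items.push(item);
  }

  // Consumer side: after popping an item (not shown), wake the producer.
  void NotifySpace() { space_available.notify_one(); }
};
```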
cpp/src/arrow/util/async_generator.h
Outdated
Readability suggestion: can this just be a static method on the generator?
Done.
…en after the downstream had given up on it. Other than the obvious memory / resource usage problems this also meant that callback handlers could reference deleted state downstream. Now we block the destructor until the background thread is finished and we stop the background thread early if all consumer references are lost.
…could lead to deadlock.
… could fail to cleanup the background thread quickly enough for the unit test's satisfaction
…and reduce race conditions
Force-pushed from df22d51 to 926f799.
It looks like the R 3.5 tests got stuck when running on x64, timing out after 6 hours. The 1 hour timeout didn't apply. Unfortunately there seems to be practically nothing in the logs other than that.
That seems new. The R 3.5 64-bit builds have been pretty stable (compared to the 32-bit builds). Unfortunately, it's not reliably reproducible: the R builds passed locally for me. Later today I'll run a few more local runs on the laptop and see if I can reproduce it.
I'm retrying the R builds. We shouldn't bother if there's a sporadic issue with soon-to-be-obsolete RTools 3.5.
All 5 R jobs appear to be hopelessly hung.
They're just waiting to start AFAICT.
@github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer
@github-actions crossbow submit test-ubuntu-20.04-cpp-thread-sanitizer
@nealrichardson's crossbow job appears to have failed to launch (hence the thumbs down). So I gave it another shot.
Your second attempt also failed :/ (https://github.com/apache/arrow/runs/2341429654). I manually submitted it in the meantime: https://github.com/lidavidm/crossbow/runs/2343121554?check_suite_focus=true. Not sure what's going on.
ThreadSanitizer still found issues :/ in arrow-threading-utility-test, and also in arrow-dataset-file-csv-test (too much to paste here).
Ok, I think this PR fixes the originally reported issue with BackgroundGenerator. But now ThreadSanitizer is finding an issue with SerialExecutor. The nightlies report similar issues: https://github.com/ursacomputing/crossbow/runs/2340248774. So I think we should merge this and start fixing the new issue that's cropped up.
I'm debugging the TSan failure FWIW.
But I agree we can merge this PR already.
… trigger-bot Some earlier attempts at invoking Crossbow failed and the reason wasn't clear. This logs full tracebacks for unexpected exceptions to make it easier to tell what happened. Example where we ran into a mysterious error: #9941 (comment) Closes #10037 from lidavidm/arrow-12376 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
The background generator kept reading from the source even after the downstream had given up on it. Other than the obvious memory / resource usage problems this also meant that callback handlers could reference deleted state downstream. Now we block the destructor until the background thread is finished and we stop the background thread early if all consumer references are lost.
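A sketch of the "stop the background thread early if all consumer references are lost" part (illustrative names, not the actual implementation): the producer holds only a weak_ptr to the consumer-facing state, so once every consumer handle is gone it notices and shuts down instead of reading forever.

```cpp
#include <memory>

struct ConsumerState {
  bool please_stop = false;
};

class Producer {
 public:
  explicit Producer(std::weak_ptr<ConsumerState> consumer)
      : consumer_(std::move(consumer)) {}

  // Checked between reads: lock() fails once the last consumer reference has
  // been dropped, and the read loop exits instead of continuing forever.
  bool ShouldContinue() const {
    auto consumer = consumer_.lock();
    return consumer != nullptr && !consumer->please_stop;
  }

 private:
  std::weak_ptr<ConsumerState> consumer_;
};
```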