GH-34391: [C++] Future as-of-join-node hangs on distant times #34392
Conversation
Marked as a draft because the current version appears to infrequently hit a race condition. For example, it occurred at iteration 63 (which can change, even when the seed is fixed) of the above test case only, after several runs of the tester had passed.
cc @westonpace @icexelloss, in case you have an idea about the apparent race condition.
Coming back to this, I see the same race condition after merging main. One experiment I made is related to this comment:
I fixed the race condition, at least locally, though I now see this macOS CI failure. In a debug session, I observed that … Following this fix, I did not locally observe the race condition in several runs of the tester, whereas before I observed it after one or two runs. There is still the macOS failure; however, I don't have access to a macOS machine to test on locally.
For reference, here are the relevant log lines from the macOS CI job failure:
The recent commit allowed the previously failing macOS CI job to succeed this time (the single job failure this time is irrelevant). Of course, this does not prove the as-of-join code is now free of race conditions, but the explanation below may help in reasoning about what's going on. The main idea of this recent fix is to ensure …
Note that I believe the as-of-join node's process-thread is not responsible for the race condition; it simply processes batches in the order they are received, regardless of whether this is done in a separate thread. I believe the race condition is due to the non-deterministic order of arrival of batches at the as-of-join node, and that there is an order of arrival that drives the code (before the recent fix) to access an invalid …
Thanks @rtpsw, I don't have time to look at this closely today but I will try to take a look soon.
A quick look and a couple of questions:
Why would the code later be unable to detect that the current time is invalid? (I assume this is invalid because it hasn't received any input?)
This is surprising - why would this happen in serial execution? And what evidence makes you believe this is happening?
Detecting is not good enough.
I don't (yet?) have evidence to directly support this. What I noted is evidence against the race condition being due to the process thread, and so I suspect the order of input batches. I suspect that when a node has at least two inputs, the order of batches arriving at that node (i.e., across its inputs) may still be non-deterministic even with serial execution. Perhaps @westonpace can shed some light here, or we could investigate.
Oh I see. Yeah, I can see that happening - there are no guarantees of total ordering if there are multiple sources. There are only guarantees of ordering of batches within a single source table.
If the …
That might be doable, but I doubt it would be simple. The calls to …
westonpace
left a comment
A few minor suggestions.
```cpp
void UpdateTime(OnType ts) {
  OnType prev_time = current_time_;
  while (prev_time < ts && current_time_.compare_exchange_weak(prev_time, ts)) {
```
Why use a while loop here instead of a single call to compare_exchange_strong?
This is because compare_exchange_weak (see doc) may not find the expected value of prev_time from line 260 by the time the exchange is attempted, due to a race condition (expected to be rare). In this case false is returned and prev_time is updated to the current value of current_time_; then another iteration is tried. This is normal CAS-loop logic.
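For context, here is a minimal, self-contained sketch of the CAS-loop idiom being discussed; the names are illustrative and not the PR's actual code:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative CAS loop that advances an atomic time monotonically. If the
// compare-exchange fails because another thread updated the value first,
// `prev` is refreshed with the current value and the loop retries; once
// `prev` is already >= ts, there is nothing left to do.
void AdvanceTimeMonotonically(std::atomic<int64_t>& current_time, int64_t ts) {
  int64_t prev = current_time.load();
  while (prev < ts && !current_time.compare_exchange_weak(prev, ts)) {
  }
}
```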
I agree that order between sources is not guaranteed. In other words, the first batch might be … Another potential source of contention / race conditions is the interplay between …
Indeed, I'm aware. This scenario doesn't cause trouble because the …
In the recent commit, I also included another invocation of …
I see a failure on macOS and Windows with the same symptom as seen before for the race condition. I need to debug on at least one of these platforms.
pitrou
left a comment
Thanks for the update @rtpsw ! I can't vouch for the correctness but this PR looks formally fine to me.
Benchmark runs are scheduled for baseline = f45a9e5 and contender = dcdeab7. dcdeab7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
icexelloss
left a comment
@rtpsw Unfortunately this was merged before I had the chance to properly review it. Can you take a look at the comments and create a follow-up PR if needed?
Can you please add comments in the code to explain this? It looks like something that could be tricky to understand just from the code.
```diff
  // when entries with a time less than T are removed, the current time is updated to the
  // time of the next (by-time) and now-current entry or to T if no such entry exists.
- OnType current_time_;
+ std::atomic<OnType> current_time_;
```
Can you explain why we changed this to atomic?
This is related to your other question here. There, we need to set the current time on the MemoStore instance when a new batch arrives, and this is done from a different thread (the one handling the incoming input batch) than the one processing the batch (the one running ProcessThread), both using the same MemoStore instance. This means we need to synchronize MemoStore.current_time_ between these threads, and so it is made atomic here.
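As a minimal, self-contained sketch of this two-thread scenario (hypothetical names, not the PR's classes): one thread advances the shared time as input arrives, while another reads it during processing; std::atomic makes the concurrent access well-defined.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical sketch: the input-handling thread advances a shared current
// time while the process thread reads it; std::atomic provides the needed
// synchronization between the two threads.
struct MemoStoreSketch {
  std::atomic<int64_t> current_time{0};

  void UpdateTime(int64_t ts) {
    // Same CAS idiom as sketched earlier: advance monotonically, retrying
    // if another thread changed the value in between.
    int64_t prev = current_time.load();
    while (prev < ts && !current_time.compare_exchange_weak(prev, ts)) {
    }
  }
};

int main() {
  MemoStoreSketch memo;
  std::thread input_thread([&] {
    for (int64_t ts = 1; ts <= 1000; ++ts) memo.UpdateTime(ts);  // input side
  });
  std::thread process_thread([&] {
    while (memo.current_time.load() < 1000) {  // safely observe progress
    }
  });
  input_thread.join();
  process_thread.join();
  return 0;
}
```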
```cpp
if (rb->num_rows() > 0) {
  queue_.Push(rb);
  key_hasher_->Invalidate();  // batch changed - invalidate key hasher's cache
  memo_.UpdateTime(GetTime(rb.get(), 0));  // time changed - update in MemoStore
```
Can you explain why we add the UpdateTime here? How was the MemoStore time updated before?
See this post. UpdateTime is synchronized; before this change, we didn't need this synchronization.
```cpp
bool has_entry = opt_entry.has_value();
OnType entry_time = has_entry ? (*opt_entry)->time : TolType::kMinValue;
row_index_t entry_row = has_entry ? (*opt_entry)->row : 0;
bool accepted = has_entry && tolerance.Accepts(lhs_latest_time, entry_time);
```
What does "accepted" mean here?
Here, accepted is in the same sense as tolerance.Accepts, meaning that the entry's time must exist and be within the time interval defined by the tolerance for lhs_latest_time.
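As a rough illustration only (a hypothetical type, likely simpler than the PR's TolType), a backward-tolerance check of this kind could look like:

```cpp
#include <cstdint>

// Illustrative backward-tolerance sketch: an entry time is "accepted" for
// lhs_latest_time when it is not newer than lhs_latest_time and not older
// than lhs_latest_time by more than the tolerance.
struct BackwardTolerance {
  int64_t tolerance;  // non-negative, in the units of the "on" key

  bool Accepts(int64_t lhs_latest_time, int64_t entry_time) const {
    return entry_time <= lhs_latest_time &&
           lhs_latest_time - entry_time <= tolerance;
  }
};
```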
Sorry for merging early. It would be nice if @icexelloss's comments could be addressed in a standalone PR, to ease reviewing.
Will do. In the meantime, I'll note that we do not remove entries earlier than a time ts when it is not in the past of latest_time, which is the case when latest_time >= ts is false; otherwise, we would be removing entries before processing them.
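For illustration only (hypothetical names, not the PR's actual code), the guard described above could be sketched as follows: entries older than ts are dropped only when ts is already in the past of latest_time.

```cpp
#include <cstdint>
#include <map>

// Hypothetical sketch of the guard described above: entries with a time
// earlier than `ts` are removed only when `ts` is not ahead of the latest
// time, so entries are never dropped before they are processed.
struct MemoSketch {
  std::map<int64_t, int /*row index*/> entries_by_time;
  int64_t latest_time = INT64_MIN;

  void MaybeRemoveEntriesWithLesserTime(int64_t ts) {
    if (latest_time >= ts) {  // ts is in the past of latest_time
      entries_by_time.erase(entries_by_time.begin(),
                            entries_by_time.lower_bound(ts));
    }
  }
};
```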
No worries! I can open a follow-up PR to address the remaining issues.
westonpace
left a comment
This works but I find it a little confusing. It seems strange to me that the delaying node would be placed after the asof join node. I was thinking it would be placed on the slow input to ensure that it did not deliver any batches.
However, if you were to do that, the entire plan would hang. This is because the delaying node synchronously blocks, which takes the CPU thread out of use. When use_threads=false (as it is when we use the asof join node), that means the only worker thread is blocked.
The reason it works today is because the delay happens after the asof join node and so it is actually blocking the processing thread and not a CPU thread. It was surprising to me that the processing thread was the thread that called InputReceived. This means, if you have something like...
Source -> AsofJoinNode -> Project -> Aggregation
Then the processing thread will be the one calling Project and Aggregation, which is odd. However, this issue isn't very relevant to the backpressure problem at hand.
So it seems this works because there is enough data initially delivered to release at least one batch. That batch gets caught in the delaying node, which hangs the processing thread. Since the processing thread is hung, enough data accumulates in the inputs that they are eventually paused.
If we want to proceed with this design then I think that is ok. For the sake of completeness I am including an example of what I had in my mind. A "GatedNode" would not block but it would not emit any batches (they would just queue up in the scheduler) until the gate is unlocked.
So, with this idea, the gated node would be placed over the slow source. You could get rid of the entire idea of "delayed" sources (and all the accompanying sleeps). The plan would be started and we wait to ensure backpressure is hit. Once backpressure is hit we release the gate and then confirm that the plan resumes and finishes.
Feel free to take my idea or continue with this one.
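To make that idea concrete, here is a hypothetical sketch of such a gate (not Arrow's actual API): while the gate is closed, incoming batches are queued rather than forwarded, and the delivering thread is never blocked; releasing the gate flushes the queued batches downstream.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical "gated" forwarder, illustrating the GatedNode idea above.
template <typename Batch>
class GatedForwarder {
 public:
  using Sink = std::function<void(Batch)>;
  explicit GatedForwarder(Sink sink) : sink_(std::move(sink)) {}

  void InputReceived(Batch batch) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!released_) {
        held_.push_back(std::move(batch));  // gate closed: hold, don't block
        return;
      }
    }
    sink_(std::move(batch));  // gate open: forward immediately
  }

  void ReleaseGate() {
    std::vector<Batch> to_flush;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      released_ = true;
      to_flush.swap(held_);
    }
    for (auto& b : to_flush) sink_(std::move(b));  // flush held batches
  }

 private:
  Sink sink_;
  std::mutex mutex_;
  bool released_ = false;
  std::vector<Batch> held_;
};
```

In the test described above, the gate would sit over the slow source; the test would start the plan, wait for backpressure to be reported, then call ReleaseGate() and confirm the plan resumes and finishes.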
Ignore that; I put this review on the wrong PR.
See #34391
Note that the TestBasic7Forward test case included in the PR reproduces the hang in the pre-PR code.