Skip to content

Conversation

@wjones127
Copy link
Member

No description provided.

@github-actions
Copy link

@wjones127
Copy link
Member Author

When I run with a fix in place for hash join paths, I sometimes get /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/util.cc:329: Check failed: (thread_index) < (Capacity()) thread index 9 is out of range [0, 9), but most of the time it succeeds.

@wjones127
Copy link
Member Author

Comment on lines +899 to +909
ThreadLocalState& GetLocalState(size_t thread_index) {
if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
size_t old_size = local_states_.size();
local_states_.resize(thread_index + 1);
for (size_t i = old_size; i < local_states_.size(); ++i) {
local_states_[i].is_initialized = false;
local_states_[i].is_has_match_initialized = false;
}
}
return local_states_[thread_index];
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here's my suggestion for an alternative approach: Instead of trusting that the exact correct number of threads has been created (since that seems hard), gracefully resize the local state vectors as needed. I think there's a few more places I'd need to add this logic (we might even need this in the indexer; see my earlier comment about the occasional failure).

What do you think @westonpace?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this thread-unsafe? You're resizing a vector while it could be accessed by other threads concurrently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that needs to be fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curios, given that during the Init the vector is set to the size equal to the number of threads, when does it happen that GetLocalState is invoked with a thread index outside of the already allocated ones? I thought we were using threadpools and thus the amount of threads was stable. Are we recycling them or something like that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thread pools don't include the main thread, for instance. Also it's sized to the CPU thread pool, but something might 'leak' from the IO thread pool.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IO threads and CPU threads are separate pools, and if we don't pass an executor the source node runs the downstream nodes on the IO thread:

outputs_[0]->InputReceived(this, std::move(batch));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And by the way, the user is allowed to change thread pool capacity at runtime, so static sizing will never be correct even in the simple case of a single thread pool.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sizing is recomputed for each new exec plan so the failure would only occur on plans that were running when the thread pool was resized. I am planning on taking a look at a better implementation for use_threads=FALSE on Friday using a serial executor which will ensure that an exec plan always has an executor and exec plan steps are always run on a thread belonging to that executor. This will solve all but the issue Antoine mentioned.

That being said, I think your solution is reasonable. I'll have to ping @bkietz and @michalursa as they were the original proponents of the statically sized thread states. I don't know if that was based on speculation, existing literature, or actual benchmark measurements however.

@pitrou
Copy link
Member

pitrou commented Feb 17, 2022

By the way, ARROW-14908 is closed. Should you update this PR to point to a different JIRA id?

@wjones127
Copy link
Member Author

By the way, ARROW-14908 is closed. Should you update this PR to point to a different JIRA id?

Yes, I can create a follow-up Jira.

@wjones127
Copy link
Member Author

Closing this in favor of #12468

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants