Atomic merge buffer acquisition for groupBys #3939

Merged
himanshug merged 11 commits into apache:master from jihoonson:deadlock
Feb 22, 2017
Contributor

@jihoonson jihoonson commented Feb 15, 2017

This patch fixes #3819.
After this patch, when a group-by query is submitted, all needed merge buffers are first acquired atomically before query execution.


@gianm gianm added this to the 0.10.0 milestone Feb 16, 2017
@gianm gianm added the Bug label Feb 16, 2017
{
final int requiredMergeBufferNum;
if (strategySelector.useStrategyV2(query)) {
final int groupByLayerNum = countGroupByLayers(query, 1);
Contributor

I think this check could be simplified to just see if the top level query has a table datasource or inner query datasource, without needing to check the number of layers, since the buffer requirement is always 2 if layers > 1

Contributor

+1

Contributor Author

Since brokers require merge buffers for processing the groupBy layers beyond the innermost one, a nested groupBy (groupBy -> groupBy -> table) requires only a single merge buffer.
It can still be simplified by exiting the count recursion (or loop) early once the number of groupBy layers found reaches 2, but I wonder how worthwhile that is, because most groupBys are very shallow.

Contributor

@jihoonson even then we don't need to count beyond 3 layers
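
The counting rule discussed in this thread can be sketched as follows. This is only an illustration built on the `Math.min(2, groupByLayerNum - 1)` expression quoted from the diff; `LayeredQuery` and `MergeBufferCount` are hypothetical stand-ins, not Druid classes, and the loop stops once three layers have been seen, as suggested above.

```java
// Hypothetical stand-in for a groupBy query whose data source is either
// a table (inner == null) or another groupBy query.
final class LayeredQuery {
    private final LayeredQuery inner;

    LayeredQuery(LayeredQuery inner) {
        this.inner = inner;
    }

    static LayeredQuery onTable() {
        return new LayeredQuery(null);
    }

    LayeredQuery innerQuery() {
        return inner;
    }
}

final class MergeBufferCount {
    static final int MAX_MERGE_BUFFER_NUM = 2;

    // Counts groupBy layers, but never beyond MAX_MERGE_BUFFER_NUM + 1,
    // because deeper nesting cannot change the result of the min() below.
    static int countLayers(LayeredQuery query) {
        int layers = 1;
        LayeredQuery current = query.innerQuery();
        while (current != null && layers <= MAX_MERGE_BUFFER_NUM) {
            layers++;
            current = current.innerQuery();
        }
        return layers;
    }

    // A non-nested query needs 0 buffers, one level of nesting needs 1,
    // and anything deeper is capped at 2.
    static int requiredMergeBuffers(LayeredQuery query) {
        return Math.min(MAX_MERGE_BUFFER_NUM, countLayers(query) - 1);
    }
}
```

With this sketch, a query directly on a table needs 0 buffers, groupBy -> groupBy -> table needs 1, and any deeper nesting needs 2.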

@himanshug
Contributor

It looks like this will solve the first case reported in #3819 too; however, the description seems to be explicit about case two only, no?

{
if (!objects.offer(theObject)) {
log.error("WTF?! Queue offer failed, uh oh...");
offer(theObject);
Contributor

nit: not sure why simple logic is separated into a method

Contributor Author

To avoid forgetting to handle the case where offer() fails. That case means there is a bug in returning unused resources. For now we simply log an error message, but I think it should be improved.

Supplier<GroupByQueryConfig> configSupplier,
GroupByStrategySelector strategySelector,
@Global StupidPool<ByteBuffer> bufferPool,
GroupByQueryBrokerResourceInitializer brokerResourceInitializer,
Contributor

instead of adding this class, can't we just add prepare(..) method to GroupByQueryStrategy and have that return the resource?

Contributor Author

Thanks. It sounds good. I'll change.

final int requiredMergeBufferNum;
if (strategySelector.useStrategyV2(query)) {
final int groupByLayerNum = countGroupByLayers(query, 1);
requiredMergeBufferNum = Math.min(2, groupByLayerNum - 1);
Contributor

Can you add a note or otherwise make it more clear that this GroupByQueryBrokerResource isn't used when running a non-nested query? It threw me off for a bit when I was looking at what happens when requiredMergeBufferNum is 0

Contributor Author

I'll add a note and improve the javadoc.

final ResourceHolder<List<ByteBuffer>> mergeBufferHolders = mergeBufferPool.drain(requiredMergeBufferNum);
if (mergeBufferHolders.get().size() < requiredMergeBufferNum) {
mergeBufferHolders.close();
throw new ResourceLimitExceededException("Cannot acquire enough merge buffers");
Contributor

@jon-wei jon-wei Feb 16, 2017

Failing the query here seems too aggressive to me, e.g.:

  • Suppose there are two merge buffers, with a nested query A and a single-level query B
  • Query B runs first, grabs one of the buffers
  • Query A runs; with this change, query A would fail right there even though it could be executed successfully if it waited for query B to finish

I think the ResourceLimitExceededException should only be thrown if the number of buffers required by a single query exceeds the total number of buffers available, but not for the situation where a buffer is only temporarily unavailable

I think a query that needs more buffers than are currently available should wait for the timeout set in the query before failing. For nested queries that require > 2, this may need an atomic checkSizeAndDrain method that grabs the ArrayBlockingQueue's lock and checks the size before either grabbing resources or waiting for a timeout.

Contributor Author

Thanks for your suggestion.
I think the case you mentioned is the problem of query scheduling. With query scheduling, when a query is submitted, it first waits for its turn until the resources are ready and other queries of higher priority complete.
However, a timeout seems necessary anyway. I'll add it soon.

Contributor Author

@jon-wei, I simply added a timeout parameter to the drain() method. I think this is better for now because adding a checkSizeAndDrain method causes the problems below.

  • The method requires BlockingPool to maintain a lock itself, so the type of objects should be changed from BlockingQueue to something else to avoid unnecessary locking.
  • Even with a checkSizeAndDrain method, the starvation problem still exists, so additional handling for that problem is required.

I think it would be better to redesign BlockingPool to address both problems, but it's not covered by this issue. So, how about opening a new issue for this?

Contributor

@jihoonson @jon-wei with the timeout, it looks good to me. I don't think checkSizeAndDrain is necessary.
Can we reduce the timeout when sending the query to historicals, since we have already used some of the time the user allowed?

Contributor

@jon-wei jon-wei Feb 17, 2017

Hm, without checking the size and draining atomically, two nested groupbys needing 2 buffers each could still in theory block each other:

  • Suppose there are 2 buffers, and two nested queries are issued simultaneously. The window in which both acquire one buffer and block each other should be much smaller now (reduced from the duration of subquery execution to the much shorter drainTo processing time)

  • Suppose there are 2 buffers, with two nested queries, but 1 buffer is currently in use by a non-nested query. Nested query A runs and drains one buffer, but waits for the second one. Nested query B also runs and sees no buffers, so it waits. Now suppose the non-nested query finishes and returns its buffer, but nested query B gets to run before nested query A, and takes the second buffer, leaving both nested queries blocked on each other

I'm okay with using this drain + timeout for now if that's the consensus, and opening a follow on issue about implementing truly atomic buffer acquisition coupled with something to address starvation issues for queries that need > 1 buffers
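
The second scenario above can be replayed deterministically in a single thread. The sketch below is only an illustration: `PartialDrainDemo` is a hypothetical name, the buffers are plain strings, and the interleaving is fixed by hand rather than produced by real concurrent queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

final class PartialDrainDemo {
    // Returns the number of buffers held by nested queries A and B at the end,
    // as {heldByA, heldByB}. Each of them needs 2 buffers to make progress.
    static int[] simulate() {
        ArrayBlockingQueue<String> pool = new ArrayBlockingQueue<>(2);
        pool.add("buffer-1");
        pool.add("buffer-2");

        // A non-nested query takes one buffer.
        String heldByNonNested = pool.poll();

        // Nested query A drains what it can: only 1 of the 2 buffers it needs.
        List<String> heldByA = new ArrayList<>();
        pool.drainTo(heldByA, 2);

        // Nested query B finds the pool empty.
        List<String> heldByB = new ArrayList<>();
        pool.drainTo(heldByB, 2);

        // The non-nested query returns its buffer, and B happens to run before A.
        pool.add(heldByNonNested);
        pool.drainTo(heldByB, 2 - heldByB.size());

        // A retries and finds nothing left.
        pool.drainTo(heldByA, 2 - heldByA.size());

        return new int[] {heldByA.size(), heldByB.size()};
    }
}
```

Both queries end up holding one buffer each while the pool is empty, so neither can proceed until one of them times out and releases its buffer.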

Contributor Author

In the first case, you mean two queries needing 2 buffers can block each other even when there are 2 available buffers in the pool?

Anyway, yes, drain + timeout is not enough. I'll open a follow-up issue if others agree.

Contributor Author

I'm curious why we don't simply use well-known libraries like netty. @gianm, would you share any reasons?

Contributor

Sorry, scratch what I said about the first case, while the BlockingQueue drainTo method contract doesn't specify what happens with concurrent modifications while draining occurs, the ArrayBlockingQueue implementation does have a lock internally

@jihoonson
Contributor Author

@jon-wei, @himanshug thanks for your review. I addressed your comments.

@himanshug, yes, this PR covers case 1 as well. I updated the PR description.

final long timeout = timeoutAt - System.currentTimeMillis();
if (timeout <= 0 || (mergeBufferHolder = mergeBufferPool.take(timeout)) == null) {
- throw new QueryInterruptedException(new TimeoutException());
+ throw new TimeoutException();
Contributor

why this change? i think this is used to propagate issues properly from historicals to brokers

Contributor Author

BlockingPool.take() now throws RuntimeException instead of InterruptedException, and the catch block below catches all kinds of exceptions and rethrows them wrapped in QueryInterruptedException (https://github.com/druid-io/druid/pull/3939/files/4f36f619bdfe5f7084913e77c583cf2f423d304a#diff-852ac93b1541cb9178ad922dc30be4baR176). This line caused QueryInterruptedException to be wrapped twice unnecessarily, so I changed it.

*/
public class GroupByQueryBrokerResource implements Closeable
{
private static final EmittingLogger log = new EmittingLogger(GroupByQueryBrokerResource.class);
Contributor

nit: don't really need the EmittingLogger

public ResourceHolder<ByteBuffer> getMergeBuffer()
{
Preconditions.checkState(mergeBuffers != null);
Preconditions.checkState(mergeBuffers.size() > 0);
Contributor

Preconditions.checkState(mergeBuffers != null && mergeBuffers.size() > 0) ?

Contributor Author

A null mergeBuffers means this resource was initialized with 0 merge buffers, and a mergeBuffers of size 0 means no available merge buffers remain. I would like to check each case explicitly.

return config.withOverrides(query).getDefaultStrategy();
}

public boolean useStrategyV2(GroupByQuery query)
Contributor

not sure why this is introduced. who is using this?

Contributor Author

I forgot to remove. Thanks.

public GroupByStrategy strategize(GroupByQuery query)
{
- final String strategyString = config.withOverrides(query).getDefaultStrategy();
+ final String strategyString = getStrategy(query);
Contributor

again, not sure if there is any advantage in separating this into a method

checkInitialized();
final T theObject;
try {
theObject = timeout >= 0 ? objects.poll(timeout, TimeUnit.MILLISECONDS) : objects.take();
Contributor

timeout > 0 ?

Contributor Author

0 time means no wait.

Member

Why take() rather than poll()?
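
The semantics under discussion can be sketched as below. `PollVersusTake` is a hypothetical helper that mirrors the ternary in the quoted line; it is not the BlockingPool code. With a timeout of 0, the timed `poll` returns immediately (null if the queue is empty), while a negative timeout falls through to `take()`, which blocks until an element is available.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class PollVersusTake {
    // timeoutMs >= 0: wait at most that long, returning null on expiry.
    // timeoutMs < 0: block indefinitely until an element arrives.
    static <T> T takeWithTimeout(BlockingQueue<T> queue, long timeoutMs) {
        try {
            return timeoutMs >= 0
                   ? queue.poll(timeoutMs, TimeUnit.MILLISECONDS)
                   : queue.take();
        }
        catch (InterruptedException e) {
            // Restore the interrupt flag and surface an unchecked exception.
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }
}
```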

Queues.drain(objects, batch, maxElements, timeout, TimeUnit.MILLISECONDS) :
objects.drainTo(batch, maxElements);
if (n < maxElements) {
if (log.isDebugEnabled()) {
Member

The log.isDebugEnabled() check is already baked into Logger, so I suggest:

log.debug("Requested %d elements, but drained %d elements", maxElements, n);

Contributor Author

Done.

Member

Not changed

Contributor Author

Sorry. Changed.

public CloseableGrouperIterator<RowBasedKey, Row> make()
{
final List<Closeable> closeOnFailure = Lists.newArrayList();
final List<Closeable> closeOnExit = Lists.newArrayList();
Member

Why not use Closer for this?

Contributor Author

It seems that the order of closing matters. I don't want to change it in this PR.

Member

Please leave a comment explaining this in the code.

Contributor Author

Left a comment.

{
if (mergeBuffersHolder != null) {
if (mergeBuffers.size() != mergeBuffersHolder.get().size()) {
log.warn((mergeBuffersHolder.get().size() - mergeBuffers.size()) + " resources are not returned yet");
Member

Use log message formatting %d

Contributor Author

Done.

private final ResourceHolder<List<ByteBuffer>> mergeBuffersHolder;
private final List<ByteBuffer> mergeBuffers;

public GroupByQueryBrokerResource()
Member

What's the point of this constructor?

Contributor Author

This default constructor is used when no merge buffers are required for groupBy execution.

I renamed it to GroupByQueryResource, which is a more general name, because Druid's convention generally doesn't distinguish broker-side things from others.
GroupByQueryResource can be used by queryable nodes, i.e., brokers, historicals, and realtime nodes. However, currently it is used only by brokers to get merge buffers atomically if necessary. And even in brokers, merge buffers are not used when groupBy strategy v1 is used.

public ResourceHolder<ByteBuffer> getMergeBuffer()
{
Preconditions.checkState(mergeBuffers != null, "Resource is initialized with empty merge buffers");
Preconditions.checkState(mergeBuffers.size() > 0, "No available merge buffers");
Member

If using ArrayDeque, these two lines could be replaced with buffer = mergeBuffers.pop()

Contributor Author

Done.
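
The ArrayDeque suggestion relies on `pop()` both removing the head and failing loudly when nothing is left. A minimal sketch, assuming a hypothetical `MergeBufferStack` wrapper with strings standing in for ByteBuffers:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class MergeBufferStack {
    private final Deque<String> buffers;

    MergeBufferStack(String... names) {
        this.buffers = new ArrayDeque<>(Arrays.asList(names));
    }

    // pop() removes and returns the head of the deque. On an empty deque it
    // throws java.util.NoSuchElementException, so no separate size check is needed.
    String next() {
        return buffers.pop();
    }

    int remaining() {
        return buffers.size();
    }
}
```

The trade-off is that the failure surfaces as a NoSuchElementException without the custom message that the explicit Preconditions checks provided.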

try {
mergeBufferHolders = mergeBufferPool.drain(requiredMergeBufferNum, timeout.longValue());
if (mergeBufferHolders.get().size() < requiredMergeBufferNum) {
mergeBufferHolders.close();
Member

Why TimeoutException? It's not a timeout. Maybe IllegalStateException

Contributor Author

Changed to InsufficientResourcesException

@Override
public GroupByQueryBrokerResource prepareResource(GroupByQuery query, boolean willMergeRunners)
{
if (!willMergeRunners) {
Member

Please move this comment to countRequiredMergeBufferNum()

Contributor Author

Done.

return new GroupByQueryBrokerResource(mergeBufferHolders);
}
}
catch (Exception e) {
Member

Why QueryInterruptedException?

Contributor Author

QueryInterruptedException is used for all kinds of failed queries.

- Add InsufficientResourcesException
- Renamed GroupByQueryBrokerResource to GroupByQueryResource
@jihoonson
Contributor Author

@leventov thanks for your review. Addressed some of your comments.

I don't know why I can't add an inline reply to your comment. That part is not something I changed. I only changed take() to throw a RuntimeException instead of an InterruptedException.

/**
* This exception is thrown when the requested operation cannot be completed due to a lack of available resources.
*/
public class InsufficientResourcesException extends Exception
Member

Please make new exceptions extend RuntimeException. Checked exceptions only force us to write more boilerplate: try, catch, Throwables.propagate.

Contributor Author

Changed.
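
The effect of the requested change can be sketched as follows. The class names are hypothetical and deliberately differ from the PR's InsufficientResourcesException; the point is only that an unchecked exception lets callers omit the `throws` clause and the wrapping try/catch.

```java
// Hypothetical unchecked exception, modeled on the one discussed above.
final class InsufficientResourcesSketchException extends RuntimeException {
    InsufficientResourcesSketchException(String message) {
        super(message);
    }
}

final class BufferAcquirer {
    // No `throws` clause is required because the exception is unchecked.
    // Returns the number of buffers left after the acquisition.
    static int acquire(int available, int required) {
        if (available < required) {
            throw new InsufficientResourcesSketchException(
                "Cannot acquire enough merge buffers: required " + required
                + ", available " + available
            );
        }
        return available - required;
    }
}
```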

public static final String CTX_KEY_FUDGE_TIMESTAMP = "fudgeTimestamp";
public static final String CTX_KEY_OUTERMOST = "groupByOutermost";

private static final int MAX_MERGE_BUFFER_NUM = 2;
Member

Worth adding at least "see countRequiredMergeBufferNum() for explanation" comment, or move part of countRequiredMergeBufferNum()'s comment to this constant

Contributor Author

Added.

@leventov
Member

@jihoonson and about poll() vs take(): you explained earlier that a timeout of 0 means "no wait", but take() is blocking.

@jihoonson
Contributor Author

@leventov ah, I got your point. That's also not what I changed, but I clarified it.

@jihoonson
Contributor Author

@jon-wei I added a new takeBatch() method to BlockingPool, which is the checkSizeAndDrain() you mentioned. After thinking about this for a while, I decided to add it because this issue focuses on guaranteeing atomicity when acquiring merge buffers, and that cannot be achieved without takeBatch(). But I still believe that we need to improve and maybe redesign BlockingPool.

Sorry for going back and forth. Would you and other reviewers @himanshug @leventov mind reviewing again please?
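
An all-or-nothing batch acquisition of the kind described above could look roughly like this. It is a sketch under stated assumptions, not the merged BlockingPool code: `BatchPoolSketch`, `giveBack`, and `available` are hypothetical names, and the pool guards a plain ArrayDeque with its own ReentrantLock and Condition so that the size check and the removal happen under the same lock.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class BatchPoolSketch<T> {
    private final ArrayDeque<T> objects;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition objectsReturned = lock.newCondition();

    BatchPoolSketch(Collection<T> initial) {
        this.objects = new ArrayDeque<>(initial);
    }

    // Returns exactly `count` objects, or an empty list if that many did not
    // become available within the timeout. It never hands out a partial batch.
    List<T> takeBatch(int count, long timeoutMs) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (objects.size() < count) {
                if (nanos <= 0) {
                    return Collections.emptyList();
                }
                nanos = objectsReturned.awaitNanos(nanos);
            }
            List<T> batch = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                batch.add(objects.pop());
            }
            return batch;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        finally {
            lock.unlock();
        }
    }

    // Returns objects to the pool and wakes up all waiting callers.
    void giveBack(Collection<T> returned) {
        lock.lock();
        try {
            objects.addAll(returned);
            objectsReturned.signalAll();
        }
        finally {
            lock.unlock();
        }
    }

    int available() {
        lock.lock();
        try {
            return objects.size();
        }
        finally {
            lock.unlock();
        }
    }
}
```

Because a caller either gets every buffer it asked for or none, two nested queries can no longer each hold one buffer while waiting for the other. The starvation concern raised earlier in the thread is not addressed by this sketch: a caller asking for 2 can still be overtaken repeatedly by callers asking for 1.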

@gianm
Contributor

gianm commented Feb 22, 2017

Hmm, travis has been done for a while but the PR hasn't figured that out yet. Going to bounce it to see if that helps.

@gianm gianm closed this Feb 22, 2017
@gianm gianm reopened this Feb 22, 2017
@himanshug
Contributor

👍

@himanshug himanshug merged commit 7200dce into apache:master Feb 22, 2017
@jihoonson
Contributor Author

@himanshug thank you for the review and merge.
@gianm @leventov @jon-wei if you have more comments on the newly added part, please feel free to tell me. I'll address them in a follow-up PR.

Development

Successfully merging this pull request may close these issues.

groupBy v2: Deadlock on deeply nested subqueries

5 participants