Atomic merge buffer acquisition for groupBys #3939
Conversation
```java
{
  final int requiredMergeBufferNum;
  if (strategySelector.useStrategyV2(query)) {
    final int groupByLayerNum = countGroupByLayers(query, 1);
```
I think this check could be simplified to just checking whether the top-level query has a table datasource or an inner query datasource, without needing to count the number of layers, since the buffer requirement is always 2 if layers > 1.
Since brokers require merge buffers only for processing the groupBy layers beyond the inner-most one, a nested groupBy (groupBy -> groupBy -> table) requires only a single merge buffer.
It can still be simplified by early-exiting the count recursion (or loop) once the number of found groupBy layers becomes 2, but I wonder how worthwhile that is because most groupBys have very short depth.
@jihoonson even then we don't need to count beyond 3 layers
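The early-exit counting discussed in this thread can be sketched roughly as follows. The `Query` model and names below are hypothetical simplifications (the real `countGroupByLayers` walks Druid datasources); the point is only that the recursion can stop at a cap, since the buffer requirement never grows once the cap is reached:

```java
// Hypothetical minimal model of the idea above: a query either wraps an
// inner query or reads a plain table datasource. Counting stops early once
// the count reaches the cap, because deeper nesting cannot change the
// merge buffer requirement any further.
class LayerCounter
{
  static final class Query
  {
    final Query innerQuery; // null when the datasource is a table

    Query(Query innerQuery)
    {
      this.innerQuery = innerQuery;
    }
  }

  static int countGroupByLayersCapped(Query query, int cap)
  {
    int layers = 1;
    while (query.innerQuery != null && layers < cap) {
      layers++;
      query = query.innerQuery;
    }
    return layers;
  }
}
```

With a cap of 3 the count distinguishes exactly the cases that matter (0, 1, or 2 required merge buffers) without walking arbitrarily deep nestings.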
It looks like this will solve the first case reported in #3819 too; however, the description seems to be very explicit about case two only. No?
```java
{
  if (!objects.offer(theObject)) {
    log.error("WTF?! Queue offer failed, uh oh...");
    offer(theObject);
```
nit: not sure why this simple logic is separated into a method
To avoid forgetting to handle the case when offer() fails. That case means there is a bug in returning unused resources. We currently just log an error message, but I think it should be improved.
```java
Supplier<GroupByQueryConfig> configSupplier,
GroupByStrategySelector strategySelector,
@Global StupidPool<ByteBuffer> bufferPool,
GroupByQueryBrokerResourceInitializer brokerResourceInitializer,
```
Instead of adding this class, can't we just add a prepare(..) method to GroupByQueryStrategy and have that return the resource?
Thanks. It sounds good. I'll change.
```java
final int requiredMergeBufferNum;
if (strategySelector.useStrategyV2(query)) {
  final int groupByLayerNum = countGroupByLayers(query, 1);
  requiredMergeBufferNum = Math.min(2, groupByLayerNum - 1);
```
Can you add a note or otherwise make it more clear that this GroupByQueryBrokerResource isn't used when running a non-nested query? It threw me off for a bit when I was looking at what happens when requiredMergeBufferNum is 0
I'll add a note and improve the javadoc.
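For illustration, the mapping from nesting depth to broker-side merge buffers implied by `Math.min(2, groupByLayerNum - 1)` can be written as a tiny helper (the class name is hypothetical):

```java
// Hypothetical helper mirroring the rule quoted above: the broker merges
// only the layers beyond the inner-most one, and never needs more than two
// buffers at once. A non-nested query (one layer) therefore needs zero
// merge buffers, which is why the broker resource goes unused for it.
class MergeBufferMath
{
  static int requiredMergeBufferNum(int groupByLayerNum)
  {
    return Math.min(2, groupByLayerNum - 1);
  }
}
```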
```java
final ResourceHolder<List<ByteBuffer>> mergeBufferHolders = mergeBufferPool.drain(requiredMergeBufferNum);
if (mergeBufferHolders.get().size() < requiredMergeBufferNum) {
  mergeBufferHolders.close();
  throw new ResourceLimitExceededException("Cannot acquire enough merge buffers");
```
Failing the query here seems too aggressive to me, e.g.:
- Suppose there are two merge buffers, with a nested query A and a single-level query B.
- Query B runs first and grabs one of the buffers.
- Query A then runs; with this change, query A would fail right there even though it could be successfully executed if it waited for query B to finish.

I think the ResourceLimitExceededException should only be thrown if the number of buffers required by a single query exceeds the total number of buffers available, not for the situation where a buffer is only temporarily unavailable.

I think a query that needs more buffers than are currently available should wait, up to the timeout set in the query, before failing. For the nested queries that require > 2, this may need an atomic checkSizeAndDrain method that grabs the ArrayBlockingQueue's lock and checks the size before either grabbing the resources or waiting for the timeout.
Thanks for your suggestion.
I think the case you mentioned is the problem of query scheduling. With query scheduling, when a query is submitted, it first waits for its turn until the resources are ready and other queries of higher priority complete.
However, a timeout seems necessary anyway. I'll add it soon.
@jon-wei, I simply added a timeout parameter to the drain() method. I think this is better for now because adding a checkSizeAndDrain method causes the problems below.

- This method requires `BlockingPool` to maintain a lock itself, and thus the type of `objects` should be changed from BlockingQueue to something else to avoid unnecessary locking.
- Even with a `checkSizeAndDrain` method, the starvation problem still exists, so additional handling for that problem is required.
I think it would be better to redesign BlockingPool to address both problems, but it's not covered by this issue. So, how about opening a new issue for this?
@jihoonson @jon-wei With the timeout, it looks good to me... I don't think checkSizeAndDrain is necessary.
Can we reduce the timeout when sending the query to historicals, since we have already used some of the time that the user allowed?
Hm, without checking the size and draining atomically, two nested groupBys needing 2 buffers each could still in theory block each other:

- Suppose there are 2 buffers, and two nested queries are issued simultaneously; the window for both acquiring one buffer and blocking each other should be much smaller now (from the duration of subquery execution down to the much shorter drainTo processing time).
- Suppose there are 2 buffers, with two nested queries, but 1 buffer is currently in use by a non-nested query. Nested query A runs and drains one buffer, but waits for the second one. Nested query B also runs and sees no buffers, so it waits. Now suppose the non-nested query finishes and returns its buffer, but nested query B gets to run before nested query A, and takes the second buffer, leaving both nested queries blocked on each other.
I'm okay with using this drain + timeout for now if that's the consensus, and opening a follow on issue about implementing truly atomic buffer acquisition coupled with something to address starvation issues for queries that need > 1 buffers
In the first case, you mean two queries needing 2 buffers can block each other even when there are 2 available buffers in the pool?
Anyway, yes, drain + timeout is not enough. I'll open a follow-up issue if others agree.
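The truly atomic acquisition discussed above could look roughly like the following sketch. All names here are hypothetical and this is not the PR's BlockingPool: a single lock guards both the size check and the drain, so a query either takes all of its buffers at once or takes none, closing the partial-acquisition deadlock window described in the second scenario.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical all-or-nothing pool sketch: the lock covers both the size
// check and the drain, so a taker never ends up holding a partial set.
class AllOrNothingPool<T>
{
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition enough = lock.newCondition();
  private final ArrayDeque<T> objects = new ArrayDeque<>();

  void offer(T object)
  {
    lock.lock();
    try {
      objects.push(object);
      enough.signalAll();
    }
    finally {
      lock.unlock();
    }
  }

  // Drains exactly elementNum objects atomically, or returns null on timeout.
  List<T> checkSizeAndDrain(int elementNum, long timeoutMs)
  {
    lock.lock();
    try {
      long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      while (objects.size() < elementNum) {
        if (nanos <= 0L) {
          return null; // timed out before enough objects became available
        }
        nanos = enough.awaitNanos(nanos);
      }
      final List<T> drained = new ArrayList<>(elementNum);
      for (int i = 0; i < elementNum; i++) {
        drained.add(objects.pop());
      }
      return drained;
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    finally {
      lock.unlock();
    }
  }
}
```

Note this sketch does not address starvation: a query needing 2 buffers can wait forever while single-buffer queries keep winning, which is the other problem mentioned above.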
I'm curious why we don't simply use well-known libraries like netty. @gianm, would you share any reasons?
Sorry, scratch what I said about the first case. While the BlockingQueue drainTo method contract doesn't specify what happens with concurrent modifications while draining occurs, the ArrayBlockingQueue implementation does hold a lock internally.
@jon-wei, @himanshug thanks for your review. I addressed your comments. @himanshug, yes, this PR covers case 1 as well. I updated the PR description.
```diff
 final long timeout = timeoutAt - System.currentTimeMillis();
 if (timeout <= 0 || (mergeBufferHolder = mergeBufferPool.take(timeout)) == null) {
-  throw new QueryInterruptedException(new TimeoutException());
+  throw new TimeoutException();
```
Why this change? I think this is used to propagate issues properly from historicals to brokers.
BlockingPool.take() now throws RuntimeException instead of InterruptedException, and the catch block below catches all kinds of exceptions and rethrows them wrapped in QueryInterruptedException (https://github.com/druid-io/druid/pull/3939/files/4f36f619bdfe5f7084913e77c583cf2f423d304a#diff-852ac93b1541cb9178ad922dc30be4baR176). This line caused QueryInterruptedException to be wrapped twice unnecessarily, so I changed it.
```java
 */
public class GroupByQueryBrokerResource implements Closeable
{
  private static final EmittingLogger log = new EmittingLogger(GroupByQueryBrokerResource.class);
```
nit: don't really need the EmittingLogger
```java
public ResourceHolder<ByteBuffer> getMergeBuffer()
{
  Preconditions.checkState(mergeBuffers != null);
  Preconditions.checkState(mergeBuffers.size() > 0);
```
Preconditions.checkState(mergeBuffers != null && mergeBuffers.size() > 0)?
A null mergeBuffers means this resource was initialized with 0 merge buffers, while a mergeBuffers of size 0 means no available merge buffers remain. I would like to keep these cases distinguishable.
```java
  return config.withOverrides(query).getDefaultStrategy();
}

public boolean useStrategyV2(GroupByQuery query)
```
Not sure why this is introduced? Who is using this?
I forgot to remove. Thanks.
```diff
 public GroupByStrategy strategize(GroupByQuery query)
 {
-  final String strategyString = config.withOverrides(query).getDefaultStrategy();
+  final String strategyString = getStrategy(query);
```
again, not sure if there is any advantage in separating this into a method
```java
checkInitialized();
final T theObject;
try {
  theObject = timeout >= 0 ? objects.poll(timeout, TimeUnit.MILLISECONDS) : objects.take();
```
0 time means no wait.
Why take() rather than poll()?
```java
    Queues.drain(objects, batch, maxElements, timeout, TimeUnit.MILLISECONDS) :
    objects.drainTo(batch, maxElements);
if (n < maxElements) {
  if (log.isDebugEnabled()) {
```
log.isDebugEnabled() is baked into Logger already, so suggested:

```java
log.debug("Requested %d elements, but drained %d elements", maxElements, n);
```

```java
public CloseableGrouperIterator<RowBasedKey, Row> make()
{
  final List<Closeable> closeOnFailure = Lists.newArrayList();
  final List<Closeable> closeOnExit = Lists.newArrayList();
```
Why not use Closer for this?
It seems that the order of closing matters. I don't want to change it in this PR.
Please leave a comment explaining this in the code.
```java
{
  if (mergeBuffersHolder != null) {
    if (mergeBuffers.size() != mergeBuffersHolder.get().size()) {
      log.warn((mergeBuffersHolder.get().size() - mergeBuffers.size()) + " resources are not returned yet");
```
```java
private final ResourceHolder<List<ByteBuffer>> mergeBuffersHolder;
private final List<ByteBuffer> mergeBuffers;

public GroupByQueryBrokerResource()
```
What's the point of this constructor?
This default constructor is used when no merge buffers are required for groupBy execution.
I renamed it to GroupByQueryResource, which is a more general name, because Druid's convention generally doesn't distinguish broker-side things from others.
GroupByQueryResource can be used by queryable nodes, i.e., brokers, historicals, and realtimes. However, it is currently used only by brokers, to get merge buffers atomically when necessary. And even in brokers, merge buffers are not used when groupBy strategy v1 is used.
```java
public ResourceHolder<ByteBuffer> getMergeBuffer()
{
  Preconditions.checkState(mergeBuffers != null, "Resource is initialized with empty merge buffers");
  Preconditions.checkState(mergeBuffers.size() > 0, "No available merge buffers");
```
If using ArrayDeque, these two lines could be replaced with buffer = mergeBuffers.pop()
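A hedged sketch of that suggestion (class and field names hypothetical): `ArrayDeque.pop()` already throws `NoSuchElementException` on an empty deque, so the explicit checks become redundant, at the cost of the descriptive error messages.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical sketch of the ArrayDeque suggestion above: pop() throws
// NoSuchElementException on an empty deque, so the two Preconditions
// checks can collapse into the pop() call itself.
class DequeBackedResource
{
  private final ArrayDeque<ByteBuffer> mergeBuffers = new ArrayDeque<>();

  DequeBackedResource(int bufferNum, int bufferSize)
  {
    for (int i = 0; i < bufferNum; i++) {
      mergeBuffers.push(ByteBuffer.allocate(bufferSize));
    }
  }

  ByteBuffer getMergeBuffer()
  {
    // Throws java.util.NoSuchElementException when no buffers remain.
    return mergeBuffers.pop();
  }
}
```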
```java
try {
  mergeBufferHolders = mergeBufferPool.drain(requiredMergeBufferNum, timeout.longValue());
  if (mergeBufferHolders.get().size() < requiredMergeBufferNum) {
    mergeBufferHolders.close();
```
Why TimeoutException? It's not a timeout. Maybe IllegalStateException
Changed to InsufficientResourcesException
```java
@Override
public GroupByQueryBrokerResource prepareResource(GroupByQuery query, boolean willMergeRunners)
{
  if (!willMergeRunners) {
```
Please move this comment to countRequiredMergeBufferNum()
```java
    return new GroupByQueryBrokerResource(mergeBufferHolders);
  }
}
catch (Exception e) {
```
Why QueryInterruptedException?
QueryInterruptedException is used for all kinds of failed queries.
- Added InsufficientResourcesException
- Renamed GroupByQueryBrokerResource to GroupByQueryResource
@leventov thanks for your review. Addressed some of your comments. I don't know why I can't add an inline reply for your comment. That part is not what I changed. I just changed that
```java
/**
 * This exception is thrown when the requested operation cannot be completed due to a lack of available resources.
 */
public class InsufficientResourcesException extends Exception
```
Please make new exceptions to extend RuntimeException. Checked exceptions only force us to write more boilerplate, try-catch-throwables-propagate.
```java
public static final String CTX_KEY_FUDGE_TIMESTAMP = "fudgeTimestamp";
public static final String CTX_KEY_OUTERMOST = "groupByOutermost";

private static final int MAX_MERGE_BUFFER_NUM = 2;
```
Worth adding at least "see countRequiredMergeBufferNum() for explanation" comment, or move part of countRequiredMergeBufferNum()'s comment to this constant
@jihoonson and about

@leventov ah, I got your point. That's also not what I changed, but I clarified it.

@jon-wei I simply added a new Sorry for going back and forth. Would you and other reviewers @himanshug @leventov mind reviewing again please?

Hmm, travis has been done for a while but the PR hasn't figured that out yet. Going to bounce it to see if that helps.

👍

@himanshug thank you for the review and merge.
This patch fixes #3819.
After this patch, when a groupBy query is submitted, all needed merge buffers are first acquired atomically before query execution.
This change is