Time Ordering Option on Small-Result-Set Scan Queries #7024
justinborromeo wants to merge 49 commits into apache:master
Conversation
    private int limit;

    @Param({"none", "descending", "ascending"})
    private static String timeOrdering;
Any reason this is static when the others are not? If not, please adjust it to be in line with the others.
The basicA(), basicB()... query builder functions are static (like in SearchBenchmark). Since timeOrdering is used when building the query, it also needs to be static.
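To make the constraint concrete, here is a minimal sketch (hypothetical class and method names, not the benchmark's real code) of why a field read by a static builder must itself be static:

    public class BenchmarkSketch
    {
      // Must be static because basicA() below is static and reads it.
      private static String timeOrdering = "none";

      private static String basicA()
      {
        // A static method cannot reference instance fields; a non-static
        // timeOrdering here would not compile.
        return "scan query with timeOrder=" + timeOrdering;
      }
    }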
    query is that the Scan query does not retain all the returned rows in memory before they are returned to the client
    (except when time-ordering is used). The Select query _will_ retain the rows in memory, causing memory pressure if too
    many rows are returned. The Scan query can return all the rows without issuing another pagination query, which is
    extremely useful when directly querying against historical or realtime nodes.
We're trying to harmonize language in this area (see #6916); in that context "Historical processes or streaming ingestion tasks" is more Ministry of Truth approved than "historical or realtime nodes". For clarification I'd add something to call out expected usage. One way to tie it all together is:
In addition to straightforward usage where a Scan query is issued to the Broker, the Scan query can also be issued directly to Historical processes or streaming ingestion tasks. This can be useful if you want to retrieve large amounts of data in parallel.
      return this;
    }

    public ScanQueryBuilder timeOrder(String timeOrder)
This'd be better as an enum, like Direction in OrderByColumnSpec. It reduces the likelihood of bugs since it makes invalid values impossible. (And as a minor side benefit, will take up less memory.)
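As a rough sketch (names and Jackson annotations assumed, modeled loosely on OrderByColumnSpec.Direction), such an enum might look like this; invalid values would then fail fast during JSON deserialization instead of surfacing as behavioral bugs later:

    import java.util.Locale;
    import com.fasterxml.jackson.annotation.JsonCreator;
    import com.fasterxml.jackson.annotation.JsonValue;

    // Hypothetical replacement for the String-typed timeOrder field.
    public enum TimeOrder
    {
      ASCENDING,
      DESCENDING,
      NONE;

      // Serialize as the lowercase names already used in query JSON.
      @JsonValue
      public String toJson()
      {
        return name().toLowerCase(Locale.ENGLISH);
      }

      // valueOf() throws on any value outside the three valid ones,
      // which Jackson reports as a deserialization error.
      @JsonCreator
      public static TimeOrder fromString(String name)
      {
        return valueOf(name.toUpperCase(Locale.ENGLISH));
      }
    }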
    public static final String TIME_ORDER_ASCENDING = "ascending";
    public static final String TIME_ORDER_DESCENDING = "descending";
    public static final String TIME_ORDER_NONE = "none";
You'll be able to get rid of these after changing timeOrder to an enum.
Does it make sense to make result format an enum too?
      return this;
    }

    // int should suffice here because no one should be sorting greater than 2B rows in memory
Also, Java collections can't store more than Integer.MAX_VALUE items anyway.
    /**
     * This iterator supports iteration through any Iterable of unbatched ScanResultValues (1 event/SRV) and aggregates
     * events into ScanResultValues with {int batchSize} events. The columns from the first event per ScanResultValue
{@code batchSize} is more javadoc-y.
    ScanResultValue srv = itr.next();
    // Only replace once using the columns from the first event
    columns = columns.isEmpty() ? srv.getColumns() : columns;
    eventsToAdd.add(((List) srv.getEvents()).get(0));
If it is a precondition that srv.getEvents() should only have one element, use Iterables.getOnlyElement((List) srv.getEvents()) instead, which will throw an exception if the precondition is not satisfied. Just doing .get(0) could mask potential bugs.
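For illustration, the suggested change side by side (Guava's Iterables.getOnlyElement throws if the list is empty or contains more than one element):

    // Before: silently ignores any events after the first.
    eventsToAdd.add(((List) srv.getEvents()).get(0));

    // After: fails fast if the one-event-per-ScanResultValue precondition
    // is ever violated.
    eventsToAdd.add(Iterables.getOnlyElement((List) srv.getEvents()));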
      columns = columns.isEmpty() ? srv.getColumns() : columns;
      eventsToAdd.add(((List) srv.getEvents()).get(0));
    }
    return new ScanResultValue(null, columns, eventsToAdd);
Please mark String segmentId (and getSegmentId) as @Nullable in ScanResultValue.
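A trimmed sketch of the requested annotation (the real class carries more fields; only segmentId is shown here):

    import javax.annotation.Nullable;

    public class ScanResultValue
    {
      // Null for ScanResultValues produced by merging batches, as in the
      // snippet above.
      @Nullable
      private final String segmentId;

      public ScanResultValue(@Nullable String segmentId)
      {
        this.segmentId = segmentId;
      }

      @Nullable
      public String getSegmentId()
      {
        return segmentId;
      }
    }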
     * events into ScanResultValues with {int batchSize} events. The columns from the first event per ScanResultValue
     * will be used to populate the column section.
     */
    private static class ScanBatchedTimeOrderedIterator implements CloseableIterator<ScanResultValue>
This doesn't seem to be doing anything related to time-ordering; why include "TimeOrdered" in the name?
    } else if ((scanQuery.getTimeOrder().equals(ScanQuery.TIME_ORDER_ASCENDING) ||
                scanQuery.getTimeOrder().equals(ScanQuery.TIME_ORDER_DESCENDING))
               && scanQuery.getLimit() <= scanQueryConfig.getMaxRowsTimeOrderedInMemory()) {
      Iterator<ScanResultValue> scanResultIterator = scanQueryLimitRowIteratorMaker.make();
This looks like it will apply the limit before sorting takes place. Am I reading it right? It needs to be the other way around, otherwise we won't be guaranteed to get the very earliest or very latest rows.
Also, for a Scan query with a limit, in order to make sure we read the very earliest or very latest rows, we need to make sure to iterate through segments in the right order. Otherwise we are going to need to read the first (or last) limit rows out of every segment to be sure we got the right ones, which we want to avoid. The idea should be to read segments in ascending (or descending) order, and stop reading once we know that no as-yet-unread segments could possibly offer us any earlier (or later) events, based on those segments' data intervals.
Ah shoot, I misunderstood part of the issue. I wrote this under the assumption that it would act like an ORDER BY operator. So if I understand correctly now, this query:
    {
      ...
      "timeOrdering": "descending",
      "limit": 100
    }
should return the latest 100 rows?
Conversely, this query:
    {
      ...
      "timeOrdering": "ascending",
      "limit": 100
    }
should return the earliest 100 rows?
Yes, that's correct. It is how ORDER BY works in SQL, too (ordering happens before limiting).
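To make the sort-before-limit semantics concrete, here is a minimal, self-contained sketch (not the PR's actual classes) that keeps only the latest limit timestamps while scanning, then emits them in descending order:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Iterator;
    import java.util.PriorityQueue;

    public class TopTimeRowsSketch
    {
      public static Deque<Long> latest(Iterator<Long> timestamps, int limit)
      {
        // Min-heap of at most `limit` entries: the earliest retained
        // timestamp sits at the head and is evicted first, so after the
        // scan only the latest `limit` timestamps remain.
        PriorityQueue<Long> q = new PriorityQueue<>(limit);
        while (timestamps.hasNext()) {
          q.offer(timestamps.next());
          if (q.size() > limit) {
            q.poll();
          }
        }
        // poll() returns ascending order; addFirst reverses it so the
        // result reads latest-first (descending).
        Deque<Long> sorted = new ArrayDeque<>(q.size());
        while (!q.isEmpty()) {
          sorted.addFirst(q.poll());
        }
        return sorted;
      }
    }

The drain step mirrors the ArrayDeque#addFirst() pattern discussed in the next thread.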
    final Deque<ScanResultValue> sortedElements = new ArrayDeque<>(q.size());
    while (q.size() != 0) {
      // We add at the front of the list because poll removes the tail of the queue.
      sortedElements.addFirst(q.poll());
Just want to double-check, but ArrayDeque#addFirst() is O(1), right? I initially used a LinkedList, but Forbidden APIs said no.
This is the implementation:

    public void addFirst(E e) {
        if (e == null)
            throw new NullPointerException();
        elements[head = (head - 1) & (elements.length - 1)] = e;
        if (head == tail)
            doubleCapacity();
    }

It looks O(1) on an amortized basis to me. Most of the function is O(1), except for doubleCapacity(), which is O(n) but runs at most once per n additions, so the total cost of all doublings over n insertions is O(n).
Closing in favour of the refactored PR: #7133. The updated PR incorporates the suggestions made here.
This PR addresses point 2 of issue #6088. For scan queries where the limit is less than a configurable threshold (default 100K), users have the option to time-order their results, either ascending or descending. Users can also skip time-ordering by setting timeOrder to none, in which case the existing scan implementation runs. The number of records that can be ordered is limited because the ordering currently happens in memory. There are plans to add on-disk sorting for larger result sets in the future, but those changes are outside the scope of this PR.
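For illustration, a sketch of how the option might be exercised through the ScanQueryBuilder shown in the diff above (setter names other than timeOrder, and the interval/spec helpers, are assumed):

    // Hypothetical usage; imports (MultipleIntervalSegmentSpec, Intervals,
    // Collections) omitted for brevity, and setter names are assumed.
    ScanQuery query = ScanQuery.newScanQueryBuilder()
        .dataSource("wikipedia")
        .intervals(new MultipleIntervalSegmentSpec(
            Collections.singletonList(Intervals.of("2015-09-12/2015-09-13"))))
        .timeOrder(ScanQuery.TIME_ORDER_DESCENDING)  // constant from this PR
        .limit(100)                                   // the latest 100 rows
        .build();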