
add parquet support to native batch #8883

Merged
gianm merged 12 commits into apache:master from clintropolis:parquet-native-batch
Nov 22, 2019

Conversation

@clintropolis (Member) commented Nov 16, 2019

Description

As a follow-up to #8823, this PR adds Parquet support to Druid native batch indexing, largely re-using existing code from the current Hadoop extension. All of the unit tests have been adapted to also run with the new DruidParquetReader.

Parquet can be used in native batch indexing with any InputSource, for example:

   "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/some/path/to/wikipedia/file/",
        "filter": "wiki.parquet"
      },
      "inputFormat": {
        "type": "parquet"
        "flattenSpec": {
            "useFieldDiscovery": true
        },
        "binaryAsString": false
      },
      "appendToExisting": false
    },

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added or updated version, license, or notice information in licenses.yaml
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • DruidParquetReader
  • DruidNativeParquetInputFormat

@vogievetsky (Contributor) commented:

Are you planning to add the .md docs as part of this PR? I am trying to figure out what binaryAsString is.

@clintropolis (Member, Author) replied:

> Are you planning to add the .md docs as part of this PR? I am trying to figure out what binaryAsString is.

I marked this as WIP because I haven't yet updated the docs, or added the toJson to support the sampler.

That said, binaryAsString is just an option I was preserving from the Hadoop extension; see http://druid.apache.org/docs/latest/development/extensions-core/parquet.html. ORC and Avro actually have this property too, but it isn't documented for them. In all cases it simply decodes byte[] columns as UTF-8 strings instead of leaving them as byte[], which would otherwise end up serialized as base64 binary.
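To make that concrete, here is a minimal, hypothetical sketch (not the actual Druid code) of the two paths a byte[] column can take depending on the flag:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryAsStringSketch
{
  // Hypothetical helper illustrating the binaryAsString option: when true,
  // a Parquet byte[] value is decoded as UTF-8 text; when false, the raw
  // bytes survive and end up base64-encoded once serialized to JSON.
  static String convertBinary(byte[] bytes, boolean binaryAsString)
  {
    if (binaryAsString) {
      return new String(bytes, StandardCharsets.UTF_8);
    }
    return Base64.getEncoder().encodeToString(bytes);
  }

  public static void main(String[] args)
  {
    byte[] raw = "wiki".getBytes(StandardCharsets.UTF_8);
    System.out.println(convertBinary(raw, true));   // wiki
    System.out.println(convertBinary(raw, false));  // d2lraQ==
  }
}
```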

@clintropolis clintropolis removed the WIP label Nov 18, 2019
{
final Configuration conf = new Configuration();

// Set explicit CL. Otherwise it'll try to use thread context CL, which may not have all of our dependencies.
Member Author:

This is copied from the HDFS module to initialize Hadoopy things... Not sure of a good way to share this because it requires Hadoop libraries that core Druid doesn't have...

Contributor:

Your github comment may be useful to have in the source code as well

Contributor:

Agree it would be good to mention this code is copied from HdfsStorageDruidModule, and why. (And a similar comment in that file.)

That way, if someone modifies it in the future to fix a problem, they'll hopefully remember to fix both places.
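For readers unfamiliar with the classloader trick being discussed, a generic sketch of the pattern (names hypothetical, not the HdfsStorageDruidModule code itself): Hadoop resolves classes through the thread context classloader, which inside an extension may not see the extension's own dependencies, so the code temporarily swaps in an explicit classloader and restores the original afterwards.

```java
import java.util.function.Supplier;

public class ContextClassLoaderSketch
{
  // Run an action with an explicit thread context classloader, restoring
  // the previous one afterwards, so Hadoop's class lookups resolve against
  // the extension's own dependencies.
  static <T> T withClassLoader(ClassLoader loader, Supplier<T> action)
  {
    final Thread thread = Thread.currentThread();
    final ClassLoader original = thread.getContextClassLoader();
    thread.setContextClassLoader(loader);
    try {
      return action.get();
    }
    finally {
      thread.setContextClassLoader(original);
    }
  }

  public static void main(String[] args)
  {
    String result = withClassLoader(
        ContextClassLoaderSketch.class.getClassLoader(),
        () -> "initialized"
    );
    System.out.println(result);  // initialized
  }
}
```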

@ccaominh (Contributor) left a comment:

LGTM 👍

<artifactId>hadoop-common</artifactId>
<version>${hadoop.compile.version}</version>
<scope>compile</scope>
<!-- heh -->
Contributor:

Perhaps change the comment to mention why the exclusions are needed?

Contributor:

Yeah, please do.


{
private final InputRowSchema inputRowSchema;
private final ObjectFlattener<Group> flattener;
private final byte[] buffer = new byte[InputEntity.DEFAULT_FETCH_BUFFER_SIZE];
Contributor:

Can be converted to a local variable

private final ParquetGroupJsonProvider jsonProvider;

private final ParquetReader<Group> reader;
private final ParquetMetadata metadata;
Contributor:

Unused? It's updated but never read.

private final ParquetMetadata metadata;
private final Closer closer;

public DruidParquetReader(
Contributor:

Can be package-private

return converter.finalizeConversion(actualList);
}
// unknown, just pass it through
return o;
Contributor:

This case may not be covered by unit tests

import java.util.ArrayList;
import java.util.List;

public class BaseParquetReaderTest
Contributor:

Can be package-private

);
List<InputRowListPlusJson> sampled = sampleAllRows(reader);
List<InputRowListPlusJson> sampledAsBinary = sampleAllRows(readerNotAsString);
final String expectedJson = "{\n"
Contributor:

Do you know why InputEntityReader.DEFAULT_JSON_WRITER uses a pretty print writer instead of writing minified JSON?

Contributor:

I talked with @vogievetsky offline and DEFAULT_JSON_WRITER will be removed in my follow up PR.

Contributor:

IntermediateRowParsingReader will have toMap() instead of toJson().


import java.util.Objects;

/**
* heh, DruidParquetInputFormat already exists, so I need another name
Contributor:

Please nix the 'heh', & consider using first-person plural rather than singular.

/**
* heh, DruidParquetInputFormat already exists, so I need another name
*/
public class DruidNativeParquetInputFormat extends NestedInputFormat
Contributor:

How about just ParquetInputFormat? There's one in Hadoop, but who cares. It's not the boss of us.

import java.util.Map;
import java.util.NoSuchElementException;

public class DruidParquetReader extends IntermediateRowParsingReader<Group>
Contributor:

Just ParquetReader should be fine. Everything in here is Druid.


}
}

binder.requestInjection(TypeLiteral.get(Configuration.class), conf);
Contributor:

What's this for? HdfsStorageDruidModule doesn't do it. (Please include a comment.)

return DEFAULT_JSON_WRITER.writeValueAsString(converted);
}

private Object convertObject(Object o)
Contributor:

normalizeObjectForJson?

Contributor:

Actually, this is a pretty cool method that seems like it'd be useful for Avro / ORC too. Could you consider putting it in core so all the extensions can use it?

@gianm (Contributor) commented Nov 19, 2019:

This'll need a merge from master to fix LGTM.

Comment on lines +141 to +146
for native batch indexing with Parquet files, we require a small number of classes provided by hadoop-common and
hadoop-mapreduce-client-core. However, both of these jars have a very large set of dependencies, the majority of
which we do not need (and are provided by Hadoop in that environment). hadoop-common is the biggest offender,
with things like zookeeper, jetty, just .. so much stuff. These exclusions remove ~60 jars from being unnecessarily
bundled with this extension. There might be some alternative arrangement to get what we need, worth looking into if
anyone is feeling adventurous.
Contributor:

This comment is very helpful. When I look at the various POMs, I often wonder why there are lots of exclusions. Thanks for adding!
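As an illustration of the mechanism under discussion (the artifact names here are examples, not the PR's actual exclusion list), a Maven dependency can shed unwanted transitive jars like so:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.compile.version}</version>
  <scope>compile</scope>
  <exclusions>
    <!-- example exclusion: drop a transitive dependency the extension never uses -->
    <exclusion>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```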

private final org.apache.parquet.hadoop.ParquetReader<Group> reader;
private final Closer closer;

public ParquetReader(
Contributor:

Can be package-private

import java.util.concurrent.TimeUnit;

class ParquetGroupConverter
public class ParquetGroupConverter
Contributor:

Can be package-private

Comment on lines +63 to +65
while (iterator.hasNext()) {
rows.add(iterator.next());
}
Contributor:

Another option is to do iterator.forEachRemaining(rows::add) instead. Similar for the method below.
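A self-contained sketch of the suggested alternative (class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class DrainIterator
{
  // Equivalent to the explicit while (iterator.hasNext()) loop in the
  // hunk above: forEachRemaining drains the iterator in one call.
  static List<String> drain(Iterator<String> iterator)
  {
    final List<String> rows = new ArrayList<>();
    iterator.forEachRemaining(rows::add);
    return rows;
  }

  public static void main(String[] args)
  {
    System.out.println(drain(List.of("a", "b", "c").iterator()));  // [a, b, c]
  }
}
```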

@gianm (Contributor) left a comment:

👍 after latest changes, thanks @clintropolis

@gianm gianm merged commit 7250010 into apache:master Nov 22, 2019
@clintropolis clintropolis deleted the parquet-native-batch branch November 22, 2019 23:31
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 26, 2019
* add parquet support to native batch

* cleanup

* implement toJson for sampler support

* better binaryAsString test

* docs

* i hate spellcheck

* refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest

* add comment, fix some stuff

* adjustments

* fix accident

* tweaks
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019