add parquet support to native batch #8883
Conversation
Are you planning to add the .md docs as part of this PR? I am trying to figure out what …
I marked this as WIP because I haven't yet updated the docs, or added the … That said,
```java
{
  final Configuration conf = new Configuration();

  // Set explicit CL. Otherwise it'll try to use thread context CL, which may not have all of our dependencies.
```
This is copied from the HDFS module to initialize Hadoopy things... Not sure of a good way to share this because it requires Hadoop libraries that core Druid doesn't have...
Your github comment may be useful to have in the source code as well
Agree it would be good to mention this code is copied from HdfsStorageDruidModule, and why. (And a similar comment in that file.)
That way, if someone modifies it in the future to fix a problem, they'll hopefully remember to fix both places.
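The pattern under discussion can be shown standalone. This is a hedged sketch, not the actual Druid code (the class and method names here are illustrative): swap the thread context classloader to the one that loaded the extension before touching Hadoop classes, then restore it, since Hadoop's `Configuration` resolves classes via the thread context classloader, which may not see the extension's dependencies.

```java
import java.util.function.Supplier;

// Illustrative sketch of the "set explicit CL" pattern discussed above.
public class ContextClassLoaderDemo
{
  public static <T> T withExtensionClassLoader(ClassLoader extensionLoader, Supplier<T> action)
  {
    final Thread thread = Thread.currentThread();
    final ClassLoader original = thread.getContextClassLoader();
    try {
      // Hadoop resolves classes via the thread context CL, which may not have
      // all of the extension's dependencies; set it explicitly.
      thread.setContextClassLoader(extensionLoader);
      return action.get();
    }
    finally {
      // always restore, so our CL doesn't leak into unrelated code
      thread.setContextClassLoader(original);
    }
  }

  public static void main(String[] args)
  {
    ClassLoader ours = ContextClassLoaderDemo.class.getClassLoader();
    String seen = withExtensionClassLoader(
        ours,
        () -> Thread.currentThread().getContextClassLoader().toString()
    );
    System.out.println(seen.equals(ours.toString()));
  }
}
```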
```xml
<artifactId>hadoop-common</artifactId>
<version>${hadoop.compile.version}</version>
<scope>compile</scope>
<!-- heh -->
```
Perhaps change the comment to mention why the exclusions are needed?
```java
{
  private final InputRowSchema inputRowSchema;
  private final ObjectFlattener<Group> flattener;
  private final byte[] buffer = new byte[InputEntity.DEFAULT_FETCH_BUFFER_SIZE];
```
Can be converted to a local variable
```java
private final ParquetGroupJsonProvider jsonProvider;

private final ParquetReader<Group> reader;
private final ParquetMetadata metadata;
```
Unused? It's updated but never read.
```java
private final ParquetMetadata metadata;
private final Closer closer;

public DruidParquetReader(
```
```java
  return converter.finalizeConversion(actualList);
}
// unknown, just pass it through
return o;
```
This case may not be covered by unit tests
```java
import java.util.ArrayList;
import java.util.List;

public class BaseParquetReaderTest
```
```java
);
List<InputRowListPlusJson> sampled = sampleAllRows(reader);
List<InputRowListPlusJson> sampledAsBinary = sampleAllRows(readerNotAsString);
final String expectedJson = "{\n"
```
Do you know why InputEntityReader.DEFAULT_JSON_WRITER uses a pretty print writer instead of writing minified JSON?
I talked with @vogievetsky offline and DEFAULT_JSON_WRITER will be removed in my follow up PR.
IntermediateRowParsingReader will have toMap() instead of toJson().
```java
import java.util.Objects;

/**
 * heh, DruidParquetInputFormat already exists, so I need another name
```
Please nix the 'heh', & consider using first-person plural rather than singular.
```java
/**
 * heh, DruidParquetInputFormat already exists, so I need another name
 */
public class DruidNativeParquetInputFormat extends NestedInputFormat
```
How about just ParquetInputFormat? There's one in Hadoop, but who cares. It's not the boss of us.
```java
import java.util.Map;
import java.util.NoSuchElementException;

public class DruidParquetReader extends IntermediateRowParsingReader<Group>
```
Just ParquetReader should be fine. Everything in here is Druid.
```java
  }
}

binder.requestInjection(TypeLiteral.get(Configuration.class), conf);
```
What's this for? HdfsStorageDruidModule doesn't do it. (Please include a comment.)
```java
  return DEFAULT_JSON_WRITER.writeValueAsString(converted);
}

private Object convertObject(Object o)
```
Actually, this is a pretty cool method that seems like it'd be useful for Avro / ORC too. Could you consider putting it in core so all the extensions can use it?
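A shared convertObject-style helper, as suggested above, could look roughly like this. This is a hypothetical sketch, not the actual Druid API: the class and method names are illustrative. It recursively converts container types to plain `List`/`Map` values and passes through anything it does not recognize, mirroring the "unknown, just pass it through" case in the diff.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a core helper for converting intermediate rows to
// plain Java objects, reusable across Parquet / Avro / ORC extensions.
public class PlainJavaConverter
{
  public static Object convert(Object o)
  {
    if (o instanceof Map) {
      // recursively convert map values; stringify keys for JSON-friendliness
      Map<String, Object> out = new LinkedHashMap<>();
      for (Map.Entry<?, ?> e : ((Map<?, ?>) o).entrySet()) {
        out.put(String.valueOf(e.getKey()), convert(e.getValue()));
      }
      return out;
    }
    if (o instanceof List) {
      // recursively convert list elements
      List<Object> out = new ArrayList<>();
      for (Object item : (List<?>) o) {
        out.add(convert(item));
      }
      return out;
    }
    // unknown, just pass it through
    return o;
  }
}
```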
This'll need a merge from master to fix LGTM.
```
for native batch indexing with Parquet files, we require a small number of classes provided by hadoop-common and
hadoop-mapreduce-client-core. However, both of these jars have a very large set of dependencies, the majority of
which we do not need (and are provided by Hadoop in that environment). hadoop-common is the biggest offender,
with things like zookeeper, jetty, just .. so much stuff. These exclusions remove ~60 jars from being unnecessarily
bundled with this extension. There might be some alternative arrangement to get what we need, worth looking into if
anyone is feeling adventurous.
```
This comment is very helpful. When I look at the various POMs, I often wonder why there are lots of exclusions. Thanks for adding!
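For readers unfamiliar with the pattern being discussed, Maven dependency exclusions look like the following. This is an illustrative config fragment, not the PR's actual pom; the two excluded artifacts shown are examples of the kind of transitive dependency the comment describes, not an exhaustive list.

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.compile.version}</version>
  <scope>compile</scope>
  <exclusions>
    <!-- transitive dependencies provided by the Hadoop runtime environment,
         excluded so they aren't bundled with the extension (illustrative subset) -->
    <exclusion>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```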
```java
private final org.apache.parquet.hadoop.ParquetReader<Group> reader;
private final Closer closer;

public ParquetReader(
```
```diff
 import java.util.concurrent.TimeUnit;

-class ParquetGroupConverter
+public class ParquetGroupConverter
```
```java
while (iterator.hasNext()) {
  rows.add(iterator.next());
}
```
Another option is to do iterator.forEachRemaining(rows::add) instead. Similar for the method below.
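The suggested alternative, shown standalone with plain JDK types (the `drain` helper and class name here are illustrative, not the PR's code): `Iterator.forEachRemaining` replaces the explicit `hasNext()`/`next()` loop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Demonstrates Iterator.forEachRemaining as a drop-in replacement for a
// manual while (iterator.hasNext()) { rows.add(iterator.next()); } loop.
public class IteratorDrainDemo
{
  public static <T> List<T> drain(Iterator<T> iterator)
  {
    List<T> rows = new ArrayList<>();
    iterator.forEachRemaining(rows::add);
    return rows;
  }
}
```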
gianm left a comment:
👍 after latest changes, thanks @clintropolis
* add parquet support to native batch
* cleanup
* implement toJson for sampler support
* better binaryAsString test
* docs
* i hate spellcheck
* refactor toMap conversion so can be shared through flattenerMaker, default impls should be good enough for orc+avro, fixup for merge with latest
* add comment, fix some stuff
* adjustments
* fix accident
* tweaks
Description
As a follow-up to #8823, this PR adds Parquet support to Druid native batch indexing, largely re-using existing code from the current Hadoop extension. All of the unit tests have been adapted to also run with the new DruidParquetReader. Parquet can be used in native batch indexing with any InputSource, for example: …

This PR has: …

Key changed/added classes in this PR:
- DruidParquetReader
- DruidNativeParquetInputFormat