discover nested columns when using nested column indexer for schemaless ingestion#13672
Conversation
imply-cheddar
left a comment
There was a problem hiding this comment.
I don't think we need the interface change.
| { | ||
| JsonProvider getJsonProvider(); | ||
| /** | ||
| * List all "root" primitive properties and primitive lists (no nested objects, no lists of objects) | ||
| * List all "root" fields, optionally filtering to include only fields that contain primitive and lists of primitive values | ||
| */ | ||
| Iterable<String> discoverRootFields(T obj); | ||
| Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields); |
There was a problem hiding this comment.
I don't think you need to change the interface at all? Can't you add the config as a property on the actual FlattenerMaker objects themselves? Doing that will definitely shrink this PR down and avoid interface changes?
There was a problem hiding this comment.
yeah, i guess I was thinking doing it this way would force implementors to be aware that the contract is a bit different (or at least will be), though maybe it is nicer to push into the FlattenerMaker constructors instead and we just mention in the release notes that implementors should try to honor the mode for their readers.
| this.flattener = ObjectFlatteners.create( | ||
| flattenSpec, | ||
| new JSONFlattenerMaker(keepNullColumns), | ||
| false |
There was a problem hiding this comment.
I'm not certain I know why this is hard-coding to false. Is that covered in a comment or javadoc or something?
There was a problem hiding this comment.
this and other Parser based implementations are hard coded to false to retain existing behavior for Hadoop ingestion. Parser are not created with a InputRowSchema the same way as InputEntityReader are, so I don't have easy access to a TuningConfig, and I wasn't very motivated to fix this up for Hadoop.
| } | ||
|
|
||
| return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig); | ||
| return inputSourceSampler.sample(inputSource, inputFormat, dataSchema, samplerConfig, tuningConfig != null ? tuningConfig.convertToTaskTuningConfig() : null); |
There was a problem hiding this comment.
can you invert the != here pls.
| # The image will intialize the user and DB upon first start. | ||
| metadata: | ||
| # platform: linux/x86_64 - Add when running on M1 Macs | ||
| platform: linux/x86_64 #- Add when running on M1 Macs |
There was a problem hiding this comment.
Should this be checked in?
There was a problem hiding this comment.
idk, i guess it seemed to not make anything fail, and it was sort of annoying to have to hunt this down to make the new integration tests work on an M1 mac... but I can revert just in case
|
|
||
| /** | ||
| * @param fromPigAvroStorage boolean to specify the data file is stored using AvroStorage | ||
| * @param binaryAsString boolean to encode the byte[] as a string. |
There was a problem hiding this comment.
javadocs need an update.
| @Override | ||
| public Iterable<String> discoverRootFields(final JsonNode obj) | ||
| { | ||
| if (discoverNestedFields) { |
There was a problem hiding this comment.
can you drop a one line comment here? I am assuming its like this since each top-level field is a field of its own if we are allowing nested columns.
There was a problem hiding this comment.
do you want similar comments for all the other FlattenerMaker implementations?
There was a problem hiding this comment.
went ahead and added comments to all
| @Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions, | ||
| @JsonProperty("includeAllDimensions") boolean includeAllDimensions | ||
| @JsonProperty("includeAllDimensions") boolean includeAllDimensions, | ||
| @JsonProperty("useNestedColumnIndexerForSchemaDiscovery") Boolean useNestedColumnIndexerForSchemaDiscovery |
There was a problem hiding this comment.
how about a shorter name? "discoverNested"?
| OrcStructFlattenerMaker(boolean binaryAsString) | ||
| private final boolean discoverNestedFields; | ||
|
|
||
| OrcStructFlattenerMaker(boolean binaryAsString, boolean disocverNestedFields) |
There was a problem hiding this comment.
nit: typo in name of parameter
Description
Following up on #13653, this PR improves the flattener machinery to allow discovering nested columns when using druid schemaless ingestion powered by the nested column indexer for discovered columns, and moves the flag from
AppendableIndexSpectoDimensionsSchema.Effectively, whenever
is set, this value is pushed down to the
FlattenerMakerimplementations which power the column discovery.InputEntityReaderare fed anInputRowSchemawhich has access to theDimensionsSchemaand so this value which can be passed intoFlattenerMakerimplementations, which have been updated to honor this setting.This PR also adds a set of integration tests to test schemaless ingestion using
useNestedColumnIndexerForSchemaDiscoveryset to true with a variety of input formats. This test does not actually exercise the changes in this PR since the batch tests contain no nested data, but does at least cover string and numbers. I plan to add streaming integration tests in the future once the streaming tests are moved over to the new integration framework, and since those datas are generated I should be able to add some nested structure and provide integration test coverage for the full set of functionality.This PR has: