Preserve dimension order across indexes during ingestion#2006
Preserve dimension order across indexes during ingestion#2006fjy merged 1 commit intoapache:masterfrom
Conversation
There was a problem hiding this comment.
loadDimensionOrder looks a bit hacky way to do the initialization of dimensionsOrder
can we pass in the schema with the correct dimensionsOrder ? ,
e.g newIndex = new OnheapIncrementalIndex(schema.withDimensionsSpec(dimensionsSpec with custom order from oldIndex),... )
There was a problem hiding this comment.
I considered implementing it that way initially, but it would require a change to druid-api.
The presence of a dimensions list within the DimensionsSpec currently determines the result of hasCustomDimensions(), which controls if schema-less ingestion is used.
Perhaps there should be a later PR that changes the DimensionsSpec semantics so that the user can specify a dimensions list but still have schema-less dimension discovery enabled. Then the index initialization could be changed to use DimensionsSpec to initialize the dimensions order; this would also give the user some control over the dimension ordering during schema-less ingestion.
Thoughts?
There was a problem hiding this comment.
I think it would be nice to have ability to specify ordering for a subset of dimensions which are known prior to ingestion and rest of the dimensions can be discovered during ingestion,
can you submit a github issue to track this and add a comment to the issue here in code indicating that it can be cleaned up once the issue is resolved ?
c7c3a79 to
d6e1693
Compare
There was a problem hiding this comment.
if (rowValue.size() == 1 && rowValue.get(0) == ) {
..
} ?
|
besides minor comments, looks good to me overall. however, this might increase the size of segments for some users. |
There was a problem hiding this comment.
why is this change needed?
There was a problem hiding this comment.
@himanshug It handles the case where the only entry in dimSet is the null string; the seek needs to stop and return if the dimSet has been exhausted. The index == size check is there to distinguish between the initial state of the seeker and when it has retrieved the null string's entry (in both cases currVal is null)
eebb452 to
b0fa2f3
Compare
There was a problem hiding this comment.
This set gets populated by inputRow.getDimensions() which depends on the order in which rows appear. Are we sue the order gets preserved here? If I ask for [a,b,c] in my indexing spec, and if my first row only has dimension c, my second row only has dimension b, and my third row only has dimension a, then this set may end up with [c,b,a] instead of [a,b,c]
There was a problem hiding this comment.
@xvrl In the case where the user has specified a list of dimensions in the indexing spec, the MapInputRowParser initializes all of the MapBasedInputRows that it creates with the specified dimension list, so inputRow.getDimensions() would return the same order every time.
There was a problem hiding this comment.
@jon-wei does it always return the same dimensions for every single row, and is that also the case for other InputRowParsers?
There was a problem hiding this comment.
@xvrl yes, every row returns the same dimensions if the user specifies the dimension list in the ingestion spec. There's only StringInputRowParser and MapInputRowParser which are coupled together.
There was a problem hiding this comment.
in that case, why do we need to add the dimensions to the set for every single row?
There was a problem hiding this comment.
@xvrl for the schemaless case where dimensions are discovered on the fly
There was a problem hiding this comment.
@jon-wei do we have a test for hadoop indexing that covers both:
- schema-less indexing to make sure that dimension are persisted in the order they were seen
- schema-full indexing where we ensuer the order in which dimensions appear when read is different than the ones specified in the spec, and we test that the persisted order corresponds to the spec ?
There was a problem hiding this comment.
@xvrl The test I added to IndexGeneratorJobTest covers the first case, I'll add another test that covers the second
There was a problem hiding this comment.
@xvrl Added a test to IndexGeneratorJobTest that checks original order is maintained for schema-full indexing
0abe702 to
76375e6
Compare
There was a problem hiding this comment.
synchronize on dimensionOrder, return a copy? (dimensionOrder is not a thread safe map)
There was a problem hiding this comment.
Actually… we don't need dimensionOrder at all for this, we can just return dimensions itself. So this can return a List<String> that is actually just ImmutableList.copyOf(dimensions).
That also makes the loading simpler since you can pass that list directly to the one that takes an Iterable.
|
I think it should be possible to do this without writing out the null columns. To avoid that we need to be sure that:
|
293cc52 to
1116f4d
Compare
|
@jon-wei Looks like some more conflicts |
280b9c3 to
37c4b34
Compare
|
merging, also updating IndexMergerV9 with similar logic (still WIP) |
3a501fb to
c05058c
Compare
|
Finished the merge and updated IndexMergerV9. I've also updated IndexMergerV9Test such that the use of V9/legacy IndexMerger is controlled by a test parameter since the tests are identical to those in IndexMergerTest (now deprecated in the PR). |
|
@jon-wei is this a blocker for 0.9.0? |
|
@drcrallen I don't think it's a blocker but I think we should try to get it in. |
There was a problem hiding this comment.
Is dim val 0 guaranteed to be a null/emptystr?
Even if it is, this last check should be dims[i].length == 1 && dims[i][0] == 0 as the dim value could potentially be an empty str + something else.
There was a problem hiding this comment.
Looks like it is guaranteed due to the block below. but, the other check should still be modified, I think.
|
@jon-wei new changes look good other than the one comment I had |
c05058c to
cefb536
Compare
cefb536 to
747343e
Compare
|
rebased again |
|
👍 |
Preserve dimension order across indexes during ingestion
|
Sorry for late incursion but this makes invalid cardinality for segment meta query. If this could not be fixed we might rollback this. |
|
@navis Can you elaborate on the issue you are seeing? |
|
with druid.sample.tsv, before after |
|
I can see |
Partially addresses issue #658
Related to druid-io/druid-api#68
This changes the RealtimePlumber and IndexGeneratorJob so that when an index is persisted, the next index created will inherit the dimension order built up by the previous index.
This is to better support arbitrary dimension orders when schema-less ingestion is used.
When merging indexes, the index merger will now attempt to find the largest common dimension ordering across the indexes, e.g.:
Index 1 dims: DimA, DimB
Index 2 dims: DimC
Index 3 dims: DimC, DimA, DimB
The ordering that will be used will be DimC, DimA, DimB as that encompasses all of the shorter orderings.
If no common order is found, the merger will fall back to the original lexicographic dimension ordering.
To support this change, the persisted indexes now include null dimensions and the dimension encoding dictionaries will now always have an entry for the empty string. This is done so that previously seen dimensions are not dropped in case an index does not receive any events with values for that dimension.
LIMITATIONS: