Add DruidInputSource (replacement for IngestSegmentFirehose) #8982

jon-wei merged 13 commits into apache:master from
Conversation
Marking WIP, need to update the docs, and will add an integration test to

Tested this (as a user) from the console (via #8828) and it works

Added a doc entry for the new input source. We'll need a larger rework of the docs for the new input sources/input formats, but that should be handled in another PR. Adjusted ITParallelIndexTaskTest so that it also runs reingestion using DruidInputSource.
```json
{
  "type": "druid",
  "dataSource": "wikipedia",
  "interval": "2013-01-01/2013-01-02"
}
```

nit: It would be nice if either this example or a 2nd example included dimensions, metrics, and filter

Added a second example with more explanation of what the example specs do
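For context, a fuller spec along the lines the reviewer requested might look like the sketch below. The `dimensions`, `metrics`, and `filter` values here are illustrative, not taken from the PR's actual doc example:

```json
{
  "type": "druid",
  "dataSource": "wikipedia",
  "interval": "2013-01-01/2013-01-02",
  "dimensions": ["page", "user"],
  "metrics": ["added"],
  "filter": {
    "type": "selector",
    "dimension": "channel",
    "value": "#en.wikipedia"
  }
}
```

The field names correspond to the property table discussed below.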
|property|description|required?|
|--------|-----------|---------|
|type|This should be "druid".|yes|
|dataSource|A String defining the data source to fetch rows from, very similar to a table in a relational database|yes|

nit: this part 'very similar to a table in a relational database' feels sort of out of place to me

Reworded this part and other parts of this table
```java
  ),
  source.getIntervalFilter()
);
final Sequence<Map<String, Object>> sequence = Sequences.concat(
```

This logic is pretty beastly to follow and I think could definitely stand to be broken down into a couple of methods. I suggest one that makes a Sequence from a Cursor, and one that makes a CloseableIterator from a Sequence. Then this part can be simplified to something like:

```java
final Sequence<Cursor> cursors = storageAdapter.getAdapter().makeCursors(
    Filters.toFilter(dimFilter),
    storageAdapter.getInterval(),
    VirtualColumns.EMPTY,
    Granularities.ALL,
    false,
    null
);
final Sequence<Map<String, Object>> sequence = Sequences.concat(Sequences.map(cursors, this::cursorToSequence));
return sequenceToCloseableIterator(segmentFile, sequence);
```

which makes this much more approachable imo

Adjusted this to use the suggested structure
```java
public void close()
{
  if (!segmentFile.delete()) {
    // log
```

is this a placeholder for a log message?

Filled in a warning log message
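The cleanup pattern under discussion can be sketched standalone (this is an illustration, not the PR's code; `closeAndReport` is a hypothetical name, and `System.err` stands in for Druid's logger to keep the sketch dependency-free):

```java
import java.io.File;

// Sketch of the close() cleanup pattern: try to delete the temporary segment
// file and emit a warning when deletion fails instead of staying silent.
public class CloseSketch
{
  public static boolean closeAndReport(File segmentFile)
  {
    if (!segmentFile.delete()) {
      // In the PR this would be a logger warning; System.err keeps the sketch self-contained.
      System.err.println("Could not clean up temporary segment file at [" + segmentFile + "]");
      return false;
    }
    return true;
  }
}
```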
```java
Interval lastInterval = null;
for (Interval interval : timeline.keySet()) {
  if (lastInterval != null) {
    if (interval.overlaps(lastInterval)) {
```

nit: I know this isn't really new code, but these if statements can be collapsed
```java
// segments to olders.

// timelineSegments are sorted in order of interval
int[] index = {0};
```

hmm... so ugly, though I guess better than AtomicInteger 😢. This should be final at least I think? Alternatively, should this just be implemented how getUniqueDimensions is so that they are more consistent?

I tried changing it to be the same as the dimensions version, but because there is only a presence check for the metrics version (no "exclusions" like with dimensions), the map.put becomes a style failure (tells me to use computeIfAbsent instead). Made this final.

It could be MutableInt, but I think the original code is ok since it's just reusing the existing code.
```java
{
  Preconditions.checkNotNull(interval);

  // This call used to use the TaskActionClient, so for compatibility we use the same retry configuration
```

Consider this a non-blocking discussion point, but should this maybe switch to using RetryUtils?

I don't know why this has its own retry logic too. I think it's worth considering for future refactoring.
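As a discussion aid, the shared-retry idea can be sketched generically. This is not Druid's RetryUtils API; `retry` and `maxTries` are hypothetical names:

```java
import java.util.concurrent.Callable;

// Generic retry helper: attempt the call up to maxTries times and rethrow the
// last failure if every attempt fails. A shared helper along these lines is
// what the comment suggests replacing the input source's private retry loop with.
public class RetrySketch
{
  public static <T> T retry(Callable<T> call, int maxTries) throws Exception
  {
    Exception last = null;
    for (int attempt = 0; attempt < maxTries; attempt++) {
      try {
        return call.call();
      }
      catch (Exception e) {
        last = e;
      }
    }
    throw last;
  }
}
```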
```java
}

@VisibleForTesting
private static List<String> getUniqueDimensions(
```

nit: Would this method and getUniqueMetrics maybe more appropriately live in VersionedIntervalTimeline?

This method does not seem to be covered by any of the tests in org.apache.druid.indexing

I moved these to VersionedIntervalTimeline, IngestSegmentFirehoseFactory.testGetUniqueDimensionsAndMetrics should be covering these now
```java
}

@VisibleForTesting
private static List<String> getUniqueMetrics(List<TimelineObjectHolder<String, DataSegment>> timelineSegments)
```

This method does not seem to be covered by any of the tests in org.apache.druid.indexing

I moved these to VersionedIntervalTimeline, IngestSegmentFirehoseFactory.testGetUniqueDimensionsAndMetrics should be covering these now
```java
List<String> dimVals = new ArrayList<>(valsSize);
for (int i = 0; i < valsSize; ++i) {
  dimVals.add(selector.lookupName(vals.get(i)));
}
theEvent.put(dim, dimVals);
```

This part does not seem to be covered by any of the tests in org.apache.druid.indexing

I adjusted CompactionRunTaskTest so that there's a multivalued dimension (this case handles that), and added a row correctness check to testRun there
```java
// This segment won't fit in the current non-empty split, so this split is done.
splits.add(new InputSplit<>(currentSplit));
currentSplit = new ArrayList<>();
bytesInCurrentSplit = 0;
```

May be worth having a unit test for this method. Looks like this part is not covered by any of the tests in org.apache.druid.indexing.

I added testDruidInputSourceCreateSplitsWithIndividualSplits to CompactionTaskParallelRunTest
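The packing logic in the diff can be illustrated standalone, with segment sizes as plain longs. `pack` and `maxBytes` are our names for this sketch, not Druid's:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of byte-budget split packing: segments accumulate into the
// current split until adding the next one would exceed maxBytes, at which
// point the split is closed and a new one begins.
public class SplitPackingSketch
{
  public static List<List<Long>> pack(List<Long> segmentSizes, long maxBytes)
  {
    final List<List<Long>> splits = new ArrayList<>();
    List<Long> currentSplit = new ArrayList<>();
    long bytesInCurrentSplit = 0;
    for (long size : segmentSizes) {
      if (!currentSplit.isEmpty() && bytesInCurrentSplit + size > maxBytes) {
        // This segment won't fit in the current non-empty split, so this split is done.
        splits.add(currentSplit);
        currentSplit = new ArrayList<>();
        bytesInCurrentSplit = 0;
      }
      currentSplit.add(size);
      bytesInCurrentSplit += size;
    }
    if (!currentSplit.isEmpty()) {
      splits.add(currentSplit);
    }
    return splits;
  }
}
```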
|dataSource|A String defining the Druid datasource to fetch rows from|yes|
|interval|A String representing the ISO-8601 interval, which defines the time range to fetch the data over.|yes|
|dimensions|A list of Strings containing the names of dimension columns to select from the Druid datasource. If the list is empty, no dimensions are returned. If null, all dimensions are returned.|no|
|metrics|The list of Strings containg the names of metric columns to select. If the list is empty, no metrics are returned. If null, all metrics are returned.|no|

typo: containg -> containing
(Travis failed on docs spellcheck)
```java
}

@VisibleForTesting
public static List<String> getUniqueDimensions(
```

The place of these methods seems a bit unintuitive to me. Their first parameter is a list of TimelineObjectHolders because I think it's easier to use in DruidInputSource and IngestSegmentFirehose. But, in fact, a list of DataSegments would be enough. Is there any other place that needs to use these methods other than DruidInputSource or IngestSegmentFirehose? If not, I guess a new util class such as CompactionUtils or DataSegments might be better?

I moved these to a new class ReingestionTimelineUtils and added some javadocs.

The existing comments within the method mention ordering based on recency of segments, so I kept the input as TimelineObjectHolders.

Hm, so I had suggested moving them to VersionedIntervalTimeline because they seem like generic timeline operations, not specific to re-ingesting segments. Is having a ReingestionTimelineUtils better than their original home if they are still going to live somewhere specific to re-ingesting segments?

Hmm, my only view on this is that it should be somewhere other than IngestSegmentFirehoseFactory since that's headed for eventual deprecation. It could go in VersionedIntervalTimeline, although that class is a bit weird (it accepts generic types but also has methods that specifically use DataSegment). I feel like the utils class is fine for now.

My stance is similar to @jon-wei. It could be used for more general use cases, but I don't see them now. Maybe we can move them from ReingestionTimelineUtils to somewhere else if we need it in the future.
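The ordering behavior being debated can be sketched independently of Druid's types. `uniqueDimensions` here is a hypothetical standalone analogue; the real methods take TimelineObjectHolders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the "unique dimensions" ordering: walk dimension lists from the
// most recent segment to the oldest, record the index at which each dimension
// is first seen (skipping exclusions), then emit names in first-seen order.
public class TimelineUtilsSketch
{
  public static List<String> uniqueDimensions(List<List<String>> dimListsNewestFirst, Set<String> exclusions)
  {
    final Map<String, Integer> firstIndex = new HashMap<>();
    final int[] index = {0}; // mutable counter, mirroring the int[] trick in the diff
    for (List<String> dims : dimListsNewestFirst) {
      for (String dim : dims) {
        if (!exclusions.contains(dim)) {
          firstIndex.computeIfAbsent(dim, d -> index[0]++);
        }
      }
    }
    final String[] ordered = new String[firstIndex.size()];
    firstIndex.forEach((dim, i) -> ordered[i] = dim);
    return Arrays.asList(ordered);
  }
}
```

This also shows why the metrics variant hit the style check mentioned above: with no exclusions there is only a presence check, and `computeIfAbsent` is the natural fit.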
```java
}

@Override
public void remove()
```

nit: the Iterator interface has a default implementation for remove() that throws UnsupportedOperationException.

Removed the redundant override
Based on an initial patch by @jihoonson

This PR adds DruidInputSource and associated classes, a new replacement for IngestSegmentFirehose. This input source always has a fixed input format (DruidInputFormat). The row reading class is DruidSegmentReader and the input entity is DruidInputEntity. The new input source has the same parameters as the existing IngestSegmentFirehose, and has similar code.

Additionally, inputSource.needsFormat() was not being respected; this is addressed here as well.

For testing, this patch relies on the existing CompactionTask tests for checking functionality of the new input source. A test where an IngestSegmentFirehose is used for reindexing with a native batch task (non-parallel) has also been added to CompactionRunTaskTest. The existing ITParallelIndexTaskTest still uses a firehose instead of an input source.
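To make the description concrete, the new input source plugs into a task's ioConfig roughly as sketched below. The surrounding fields are illustrative and vary by task type; they are not taken from this PR:

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "druid",
      "dataSource": "wikipedia",
      "interval": "2013-01-01/2013-01-02"
    }
  }
}
```

Note that no inputFormat is specified: per the description above, this input source always uses a fixed input format.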