Feature for hadoop batch re-ingestion and delta ingestion #1374
Changes from all commits: 4d4aa8b, f1d309a, 1ae56f1, 45947a1, 15fa43d, a3bab5b, cfd81bf
@@ -19,8 +19,16 @@

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.druid.indexer.hadoop.DatasourceIngestionSpec;
import io.druid.indexer.path.UsedSegmentLister;
import io.druid.segment.indexing.DataSchema;
import io.druid.segment.indexing.IngestionSpec;
import io.druid.timeline.DataSegment;

import java.io.IOException;
import java.util.List;
import java.util.Map;

/**
 */
```

@@ -91,4 +99,45 @@ public HadoopIngestionSpec withTuningConfig(HadoopTuningConfig config)

```java
        config
    );
  }

  public static HadoopIngestionSpec updateSegmentListIfDatasourcePathSpecIsUsed(
      HadoopIngestionSpec spec,
      ObjectMapper jsonMapper,
      UsedSegmentLister segmentLister
  ) throws IOException
  {
    String dataSource = "dataSource";
    String type = "type";
    String multi = "multi";
    String children = "children";
    String segments = "segments";
    String ingestionSpec = "ingestionSpec";

    Map<String, Object> pathSpec = spec.getIOConfig().getPathSpec();
    Map<String, Object> datasourcePathSpec = null;
    if (pathSpec.get(type).equals(dataSource)) {
      datasourcePathSpec = pathSpec;
    } else if (pathSpec.get(type).equals(multi)) {
      List<Map<String, Object>> childPathSpecs = (List<Map<String, Object>>) pathSpec.get(children);
      for (Map<String, Object> childPathSpec : childPathSpecs) {
        if (childPathSpec.get(type).equals(dataSource)) {
          datasourcePathSpec = childPathSpec;
          break;
        }
      }
    }
    if (datasourcePathSpec != null) {
      Map<String, Object> ingestionSpecMap = (Map<String, Object>) datasourcePathSpec.get(ingestionSpec);
      DatasourceIngestionSpec ingestionSpecObj = jsonMapper.convertValue(ingestionSpecMap, DatasourceIngestionSpec.class);
      List<DataSegment> segmentsList = segmentLister.getUsedSegmentsForInterval(
          ingestionSpecObj.getDataSource(),
          ingestionSpecObj.getInterval()
      );
      datasourcePathSpec.put(segments, segmentsList);
```
Contributor: Given the proliferation of immutable data I'm kind of surprised this works.

Contributor (author): Hmm, I understand the concern. However, it does work, and there are UTs in HadoopIngestionSpecUpdateDatasourcePathSpecSegmentsTest that will catch it if this ever breaks.

Contributor: Yes, I trust the UTs in this matter.
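The concern above is concrete: `datasourcePathSpec.put(segments, segmentsList)` mutates the map returned by `getPathSpec()` in place, which only works if that map is mutable. A minimal illustration in plain Java (not Druid code; `tryPut` is a hypothetical helper) of the failure mode an immutable map would cause:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MutabilityCheck
{
  // Attempts the same shape of update the PR performs on the pathSpec map.
  public static boolean tryPut(Map<String, Object> pathSpec)
  {
    try {
      pathSpec.put("segments", List.of());
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(tryPut(new HashMap<>()));                              // prints "true"
    System.out.println(tryPut(Map.<String, Object>of("type", "dataSource"))); // prints "false"
  }
}
```

This is exactly the kind of breakage the unit tests mentioned above are there to catch.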
```java
    }

    return spec;
  }
}
```
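The pathSpec traversal in `updateSegmentListIfDatasourcePathSpecIsUsed` (find a `dataSource` pathSpec either at the top level or nested inside a `multi` pathSpec) can be sketched in isolation with plain maps; `findDatasourcePathSpec` is a hypothetical helper name, not part of the PR:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathSpecWalk
{
  // Returns the "dataSource" pathSpec if present, either directly or as a
  // child of a "multi" pathSpec, else null.
  @SuppressWarnings("unchecked")
  public static Map<String, Object> findDatasourcePathSpec(Map<String, Object> pathSpec)
  {
    if ("dataSource".equals(pathSpec.get("type"))) {
      return pathSpec;
    }
    if ("multi".equals(pathSpec.get("type"))) {
      for (Map<String, Object> child : (List<Map<String, Object>>) pathSpec.get("children")) {
        if ("dataSource".equals(child.get("type"))) {
          return child;
        }
      }
    }
    return null;
  }

  public static void main(String[] args)
  {
    Map<String, Object> child = new HashMap<>();
    child.put("type", "dataSource");
    Map<String, Object> multi = new HashMap<>();
    multi.put("type", "multi");
    multi.put("children", List.of(Map.<String, Object>of("type", "static"), child));
    System.out.println(findDatasourcePathSpec(multi) == child); // prints "true"
  }
}
```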
Reviewer: Can metrics be delta-ingested when the combining aggregator is not the same as the normal aggregator? Wondering because input rows from segments seem to be treated the same as input rows from raw data here.
Author: Hmm, that sounds true. It gets a little weird when "name" and "fieldName" in the aggregator are not the same; I will need to think about how to treat rows read from a segment differently. Also, it seems DatasourcePathSpec should really get the list of metrics from "name" and not "fieldName".
Author: The main complication is in serializing the input rows. An InputRow built from raw data will have "fieldName" columns in it, while the same row read from a segment will have "name" columns, so they have to be treated differently.
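To make the mismatch concrete, suppose a doubleSum aggregator with `name = "added_sum"` and `fieldName = "added"` (the column names here are illustrative, not from the PR). A raw-data row carries the input column while a row read back from a segment carries the output column, so code that blindly looks up `fieldName` silently misses the segment's values:

```java
import java.util.Map;

public class RowMismatch
{
  public static void main(String[] args)
  {
    // Hypothetical aggregator: doubleSum(name = "added_sum", fieldName = "added")
    Map<String, Object> rawRow = Map.of("page", "foo", "added", 17.0);       // built from raw input data
    Map<String, Object> segmentRow = Map.of("page", "foo", "added_sum", 17.0); // read back from a segment

    // Reading the aggregator's fieldName works for raw rows only:
    System.out.println(rawRow.get("added"));     // prints "17.0"
    System.out.println(segmentRow.get("added")); // prints "null": the value lives under "added_sum"
  }
}
```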
Author: One option is that DatasourceInputFormat returns a "SegmentInputRow" instead of "InputRow", which would only be a wrapper (and extension) of InputRow, and we put ugly things like […]
Author: @cheddar what do you think?
Reviewer: I haven't fully thought this through, but is there any harm in having the SegmentInputFormat know about the aggregators, such that it can return the "name" value when it is asked for "fieldName" as well?
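A minimal sketch of that suggestion, with hypothetical class and method names (the PR's actual `InputRow` interface is richer than this): wrap a segment-sourced row together with the aggregators' fieldName-to-name mapping, so that a lookup by `fieldName` falls through to the value stored under `name`:

```java
import java.util.HashMap;
import java.util.Map;

public class SegmentRowView
{
  private final Map<String, Object> row;             // columns as read back from a segment ("name")
  private final Map<String, String> fieldNameToName; // aggregator fieldName -> aggregator name

  public SegmentRowView(Map<String, Object> row, Map<String, String> fieldNameToName)
  {
    this.row = row;
    this.fieldNameToName = fieldNameToName;
  }

  // A lookup by fieldName falls through to the aggregator's output name,
  // so segment rows can be handed to code that expects raw-data column names.
  public Object get(String column)
  {
    if (row.containsKey(column)) {
      return row.get(column);
    }
    String name = fieldNameToName.get(column);
    return name == null ? null : row.get(name);
  }

  public static void main(String[] args)
  {
    Map<String, String> mapping = new HashMap<>();
    mapping.put("added", "added_sum"); // hypothetical doubleSum(name="added_sum", fieldName="added")
    SegmentRowView view = new SegmentRowView(Map.of("page", "foo", "added_sum", 17.0), mapping);
    System.out.println(view.get("added")); // prints "17.0", resolved via the mapping
  }
}
```

The trade-off against a plain wrapper is that the row type now needs aggregator metadata at construction time, but callers stay unchanged.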