Enhance Compaction task to be able to write to a different/new datasource #18612

maytasm wants to merge 1 commit into apache:master
Conversation
@maytasm, how is this different from the existing re-index capability that reads from a `druid` input source?

If we want to be able to leverage the exact same capabilities as a compaction task, it may make sense to extend that feature instead.
jtuglu1 left a comment:
first pass – looks good so far, let's just get some UTs in as well as docs.
Ideally we can add an embedded/IT test as well.
Let's avoid writing this if possible
```diff
@@ -517,6 +520,7 @@ public TaskStatus runTask(TaskToolbox toolbox) throws Exception
       emitMetric(toolbox.getEmitter(), "ingest/count", 1);

       final Map<Interval, DataSchema> intervalDataSchemas = createDataSchemasForIntervals(
```
Given we are now ingesting into potentially a new datasource, have we checked whether all associated tasks for ingestion (e.g. lock acquisition, segment alloc) are now working?
```diff
     this.tuningConfig = tuningConfig != null ? getTuningConfig(tuningConfig) : null;
-    this.segmentProvider = new SegmentProvider(dataSource, this.ioConfig.getInputSpec());
+    this.segmentProvider = new SegmentProvider(
+        this.ioConfig.getInputSpec().getDataSource() == null ? dataSource : this.ioConfig.getInputSpec().getDataSource(),
```
Suggested change:

```diff
-        this.ioConfig.getInputSpec().getDataSource() == null ? dataSource : this.ioConfig.getInputSpec().getDataSource(),
+        Configs.getOrDefault(this.ioConfig.getInputSpec().getDataSource(), dataSource)
```
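The suggested `Configs.getOrDefault` reads as a plain null-coalescing helper. A minimal standalone sketch of that semantics (the actual Druid `Configs` utility may differ in name or signature):

```java
// Minimal sketch of a null-coalescing helper with the semantics the
// suggested Configs.getOrDefault call implies. The real Druid utility
// may differ; this only illustrates the fallback behavior.
public class Configs
{
    public static <T> T getOrDefault(T value, T defaultValue)
    {
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args)
    {
        // inputSpec's dataSource unset: fall back to the task-level dataSource
        System.out.println(Configs.<String>getOrDefault(null, "wikipedia"));
        // inputSpec's dataSource set: read from it instead
        System.out.println(Configs.getOrDefault("wikipedia_old", "wikipedia"));
    }
}
```

This keeps the constructor call to a single expression instead of repeating `this.ioConfig.getInputSpec().getDataSource()` twice.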
```java
  boolean validateSegments(LockGranularity lockGranularityInUse, List<DataSegment> latestSegments);

  /**
   * Return the datasource to be used as input to the compaction task.
```
Suggested change:

```diff
-   * Return the datasource to be used as input to the compaction task.
+   * Returns the name of the datasource to ingest the compacted segments to.
```
```diff
         .ofTypeCompact()
         .context("storeCompactionState", true)
-        .ioConfig(new CompactionIntervalSpec(Intervals.of("2013-08-31/2013-09-02"), null), false);
+        .ioConfig(new CompactionIntervalSpec(Intervals.of("2013-08-31/2013-09-02"), null, null), false);
```
Let's add tests where this isn't null
```java
  @Override
  @Nullable
  @JsonProperty
  public String getDataSource()
```
nit: let's add some javadoc to this
@kfaraz I think the intent here is to take advantage of the schema detection, spec auto-fill, etc. I agree that this probably makes more sense in an `index`-type task.
Yeah, that's a fair point. Compaction task does provide those useful capabilities out-of-the-box.
👍🏻
@kfaraz I just added more detail to the PR description. The Compaction task doesn't require you to specify all the specs (metricsSpec, etc.) and can discover them from existing segments. Specifying all the specs of a datasource is a big pain, as Druid doesn't have the concept of a schema/catalog and the schema can evolve/differ between segments. For example, the metricsSpec requires you to specify the combining aggregator instead of the original aggregators. All of this is done for you in the Compaction task.
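To make the combining-aggregator point concrete, here is a hypothetical pair of metricsSpecs (metric and column names are illustrative, not from this PR). At initial ingestion the metric sums a raw input column:

```json
[{ "type": "longSum", "name": "added_sum", "fieldName": "added" }]
```

A manual re-index over the rolled-up segments would instead need the combining form, which sums the already-aggregated column; compaction derives this for you:

```json
[{ "type": "longSum", "name": "added_sum", "fieldName": "added_sum" }]
```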
Drive-by comment (still digesting the main idea of this PR): you should check out projections for this use case, which allow building aggregated views inside segments of the same datasource to automatically serve more performant queries for any query that can match the projection. Projections can be applied to existing datasources via compaction since #17803. Projections are still experimental, and existing released versions have some bugs which will be fixed in the upcoming 35 release, where we plan to document them as part of that release since they should be a lot more stable.
Yes, @maytasm , I agree that we should be able to leverage the auto-discover capabilities of compaction task for re-indexing too. For that reason, it makes sense to extend that feature.
Yeah, more runtime properties and tuning configs often make Druid harder to use and understand. Since the use case here is a new capability altogether, there is no harm in adding a new task type or a new input source type, whichever seems simpler to implement.
Absolutely, this is one of the things we have been discussing, that a
Oh, absolutely. To clarify, I meant that we should try to bring the auto-detection capabilities of the compaction task to re-indexing as well.

Also, to add to the suggestions from @clintropolis, you could also consider using an MSQ INSERT/REPLACE statement.
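For reference, an MSQ REPLACE statement can rewrite one datasource's data into another in a single SQL-based ingestion. A hypothetical sketch (datasource and column names are made up for illustration):

```sql
-- Hypothetical: re-ingest data from "wikipedia" into a new datasource
-- with roll-up, using SQL-based (MSQ) ingestion.
REPLACE INTO "wikipedia_new" OVERWRITE ALL
SELECT
  __time,
  channel,
  SUM(added) AS added
FROM "wikipedia"
GROUP BY 1, 2
PARTITIONED BY DAY
```

Unlike compaction, though, this requires spelling out the full column list and aggregations yourself.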
Ah, I forgot about Projections. Yep, you are right, Projections can achieve the same result too. Although I think Projections don't allow you to have a different queryGranularity.
I think this would actually work and can achieve all the use cases in the PR description. Thanks for the suggestion.
Description
Compaction tasks, unlike reindexing (which is native batch + `druid` input source), are great as they allow users to specify only the specs that are needed and can automatically infer unspecified specs from existing segments of the input datasource. For example, if a user doesn't want to change the metrics, they don't have to set the metricsSpec and the Compaction task will correctly create the metricsSpec from existing segments.
A limitation of Compaction tasks is that the datasource the task is writing to and reading from has to be the same datasource. There are multiple use cases where we may want to write to a different datasource. For example:
This PR adds a new property to the Compaction task `inputSpec`. The new field is `dataSource`. This new field is not required. When not set, the Compaction task will read and write to the same datasource, given by the top-most level `dataSource` field. When set, the Compaction task will read from the `inputSpec`'s `dataSource` and write to the top-most level `dataSource` field. I think it makes sense for this new field to live in the Compaction task `inputSpec`, as the `inputSpec`'s purpose is specifying the input of the task.

This PR has:
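A hypothetical spec exercising the new field, based on the description above (datasource names are illustrative, and the exact `inputSpec` shape may differ from the final implementation):

```json
{
  "type": "compact",
  "dataSource": "wikipedia_new",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2013-08-31/2013-09-02",
      "dataSource": "wikipedia"
    }
  }
}
```

Here the task would read segments from `wikipedia` (the `inputSpec` `dataSource`) and write the compacted segments to `wikipedia_new` (the top-most level `dataSource`).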