Add InputSource and InputFormat interfaces #8823
This is amazing.
This pull request fixes 1 alert when merging d451582 into 3b602da - view on LGTM.com fixed alerts:
One of the most exciting PRs on Druid ingestion in a while. Glad we got it out.
if (firehoseFactory.isSplittable()) {
  return ((FiniteFirehoseFactory) firehoseFactory).getSplits(splitHintSpec);
} else {
  throw new UnsupportedOperationException();
Is supporting unsplittable Firehoses future work?
No, only splittable firehoses can create splits.
}
}
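The exchange above boils down to a guard: splits are requested only from a factory that reports itself as splittable, and anything else is rejected. A minimal, self-contained sketch of that pattern follows. The names `Source`, `SplittableSource`, `ListSource`, and `splitsOf` are hypothetical stand-ins for illustration and are not Druid's actual `FirehoseFactory` or `FiniteFirehoseFactory` API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public final class SplitSketch
{
  /** Hypothetical stand-in for a factory that may or may not be splittable. */
  interface Source
  {
    boolean isSplittable();
  }

  /** Hypothetical stand-in for a finite factory that can enumerate its splits. */
  interface SplittableSource extends Source
  {
    Stream<String> getSplits();
  }

  /** Toy splittable source: every file name is one split. */
  static final class ListSource implements SplittableSource
  {
    private final List<String> files;

    ListSource(List<String> files)
    {
      this.files = files;
    }

    @Override
    public boolean isSplittable()
    {
      return true;
    }

    @Override
    public Stream<String> getSplits()
    {
      return files.stream();
    }
  }

  /** Same guard shape as the reviewed snippet: cast only after the splittable check. */
  static Stream<String> splitsOf(Source source)
  {
    if (source.isSplittable()) {
      return ((SplittableSource) source).getSplits();
    } else {
      throw new UnsupportedOperationException("source is not splittable");
    }
  }

  public static void main(String[] args)
  {
    Source files = new ListSource(Arrays.asList("a.csv", "b.csv"));
    splitsOf(files).forEach(System.out::println);

    Source unsplittable = () -> false;
    try {
      splitsOf(unsplittable);
    }
    catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

In this sketch the unsplittable case fails fast rather than returning an empty stream, which matches the author's reply that only splittable firehoses can create splits.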
private static class TestCsvParseSpec extends CSVParseSpec
Suggestion: Rename the class to something like UnimplementedInputFormatCsvParseSpec. Currently, looking at just the body of testUnimplementedInputFormat, it's not apparent where the unimplemented input format is coming from.
public TimestampSpec(
  @JsonProperty("column") String timestampColumn,
  @JsonProperty("format") String format,
  @JsonProperty("column") @Nullable String timestampColumn,
Preconditions.checkNotNull(schema.getDataSchema().getParser(), "inputRowParser");
Preconditions.checkNotNull(schema.getDataSchema().getParser().getParseSpec(), "parseSpec");
Preconditions.checkNotNull(schema.getDataSchema().getParser().getParseSpec().getTimestampSpec(), "timestampSpec");
Preconditions.checkNotNull(schema.getDataSchema().getNonNullTimestampSpec(), "timestampSpec");
Checking this one for null seems redundant
public File getFirehoseTemporaryDir()
public File getIndexingTmpDir()
{
  return new File(taskWorkDir, "firehose");
Perhaps rename the temporary directory as well
Renamed to indexing-tmp.
  ImmutableList.of(new Property<>("firehose", firehoseFactory), new Property<>("inputSource", inputSource))
);
if (firehoseFactory != null && inputFormat != null) {
  throw new IAE("Cannot use firehose and inputFormat together. Try use inputSource instead of firehose.");
Typo: Try use inputSource -> Try using inputSource
);
if (dataSchema.getParserMap() != null && ioConfig.getInputSource() != null) {
  if (!(ioConfig.getInputSource() instanceof FirehoseFactoryToInputSourceAdaptor)) {
    throw new IAE("Cannot use parser and inputSource together. Try use inputFormat instead of parser.");
Typo: Try use inputFormat -> Try using inputFormat
new Object[]{LockGranularity.TIME_CHUNK, false},
new Object[]{LockGranularity.TIME_CHUNK, true},
new Object[]{LockGranularity.SEGMENT, false},
new Object[]{LockGranularity.SEGMENT, true}
This is a relatively slow test (~15 seconds per parameterized run), so all the permutations may be overkill. Perhaps remove (SEGMENT, false), which will still give coverage of both lock granularities and both with/without the input format API.
new Object[]{LockGranularity.SEGMENT}
new Object[]{LockGranularity.TIME_CHUNK, false},
new Object[]{LockGranularity.TIME_CHUNK, true},
new Object[]{LockGranularity.SEGMENT, false},
Similar comment to IndexingTest about skipping this permutation
new Object[]{LockGranularity.SEGMENT}
new Object[]{LockGranularity.TIME_CHUNK, false},
new Object[]{LockGranularity.TIME_CHUNK, true},
new Object[]{LockGranularity.SEGMENT, false},
Similar comment to IndexingTest about skipping this permutation
This reverts commit 1ea7758.
clintropolis left a comment
lgtm 👍
I'm slightly hesitant since I feel that this will be a moderately disruptive change that further fractures the state of indexing with regard to differences between specs, but I think these new interfaces are nicer going forward, so worth the pain of migrating stuff to this model and fully replacing firehoses.
@@ -74,13 +74,13 @@ public interface Firehose extends Closeable
 *
 * @return an InputRowPlusRaw which may contain any of: an InputRow, the raw data, or a ParseException
javadoc for @return needs to be updated
This method is only for the sampler and will be removed in the follow-up PR.
public boolean isEmpty()
{
  return new InputRowPlusRaw(null, raw, parseException);
  return (inputRows == null || inputRows.isEmpty()) && raw == null && rawJson == null && parseException == null;
Should this also check if rawJson.isEmpty()?
This class is also used only by the sampler and will be cleaned up in the follow-up PR.
@ccaominh @clintropolis @jon-wei thanks for the review!
* Refactor parallel indexing perfect rollup partitioning

Refactoring to make it easier to later add range partitioning for perfect rollup parallel indexing. This is accomplished by adding several new base classes (e.g., PerfectRollupWorkerTask) and new classes for encapsulating logic that needs to be changed for different partitioning strategies (e.g., IndexTaskInputRowIteratorBuilder).

The code is functionally equivalent to before except for the following small behavior changes:

1) PartialSegmentMergeTask: Previously, this task had a priority of DEFAULT_TASK_PRIORITY. It now has a priority of DEFAULT_BATCH_INDEX_TASK_PRIORITY (via the new PerfectRollupWorkerTask base class), since it is a batch index task.

2) ParallelIndexPhaseRunner: A decorator was added to subTaskSpecIterator to ensure the subtasks are generated with unique ids. Previously, only tests (i.e., MultiPhaseParallelIndexingTest) would have this decorator, but this behavior is desired for non-test code as well.

* Fix forbidden apis and pmd warnings
* Fix analyze dependencies warnings
* Fix IndexTask json and add IT diags
* Fix parallel index supervisor<->worker serde
* Fix TeamCity inspection errors/warnings
* Fix TeamCity inspection errors/warnings again
* Integrate changes with those from #8823
* Address review comments
* Address more review comments
* Fix forbidden apis
* Address more review comments
The FiniteFirehoseFactory and InputRowParser classes were deprecated in 0.17.0 (#8823) in favor of InputSource & InputFormat. This PR removes the FiniteFirehoseFactory and all its implementations, along with classes used solely by them, such as Fetcher (used by PrefetchableTextFilesFirehoseFactory). It also refactors classes, including tests, that use FiniteFirehoseFactory to use InputSource instead. Removing InputRowParser may not be as trivial, since many classes that aren't deprecated depend on it (with no alternatives), like EventReceiverFirehoseFactory. Hence FirehoseFactory, EventReceiverFirehoseFactory, and Firehose are marked deprecated.
Description
This is the first PR for #8812, which includes the new interfaces proposed in #8812. A couple of implementations are also included, such as LocalInputSource and HttpInputSource for InputSource, and CsvInputFormat and JsonInputFormat for InputFormat. Their formats are:

Note that both inputSource and inputFormat are in ioConfig as below:

These are supported only by native batch indexing tasks for now. The sampler doesn't support them yet.

The old firehose and parser parameters should still work, but you cannot mix them. Only the combinations of firehose + parser or inputSource + inputFormat are allowed.

Documents will be added after more inputSources and inputFormats are implemented in follow-up PRs.
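The rule that the old and new parameters cannot be mixed can be sketched as a small validation step. This is only an illustration under stated assumptions: `IoSpecCheck`, `Style`, `validate`, and the plain `Object` arguments are hypothetical, and the sketch deliberately ignores special cases such as the adaptor check seen in the review. Druid's real checks live in its own ioConfig and ingestion spec classes.

```java
public final class IoSpecCheck
{
  /** Hypothetical label for which ingestion style a spec ends up using. */
  enum Style
  {
    FIREHOSE_AND_PARSER,
    INPUT_SOURCE_AND_FORMAT
  }

  /**
   * Accepts only the two combinations named in the description:
   * firehose + parser, or inputSource + inputFormat. A null argument
   * means the parameter was not set in the spec.
   */
  static Style validate(Object firehose, Object parser, Object inputSource, Object inputFormat)
  {
    boolean usesOld = firehose != null || parser != null;
    boolean usesNew = inputSource != null || inputFormat != null;

    if (usesOld && usesNew) {
      throw new IllegalArgumentException(
          "Cannot mix firehose/parser with inputSource/inputFormat"
      );
    }
    if (firehose != null && parser != null) {
      return Style.FIREHOSE_AND_PARSER;
    }
    if (inputSource != null && inputFormat != null) {
      return Style.INPUT_SOURCE_AND_FORMAT;
    }
    throw new IllegalArgumentException(
        "Expected either firehose + parser or inputSource + inputFormat"
    );
  }

  public static void main(String[] args)
  {
    // Placeholder strings stand in for real spec objects.
    System.out.println(validate("firehose", "parser", null, null));
    System.out.println(validate(null, null, "inputSource", "inputFormat"));
    try {
      validate("firehose", null, null, "inputFormat");
    }
    catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Checking for a mixed spec before checking for a complete pair keeps the error message specific: a user who sets `firehose` together with `inputFormat` is told about the mix rather than about a missing `parser`.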