Create splits of multiple files for parallel indexing #9360
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package org.apache.druid.data.input; | ||
|
|
||
| /** | ||
| * A class storing some attributes of an input file. | ||
| * This information is used to create splits for parallel indexing. | ||
| * | ||
| * @see SplitHintSpec | ||
| * @see org.apache.druid.data.input.impl.SplittableInputSource | ||
| */ | ||
| public class InputFileAttribute | ||
| { | ||
| /** | ||
| * The size of the input file. | ||
| */ | ||
| private final long size; | ||
|
|
||
| public InputFileAttribute(long size) | ||
| { | ||
| this.size = size; | ||
| } | ||
|
|
||
| public long getSize() | ||
| { | ||
| return size; | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,119 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package org.apache.druid.data.input; | ||
|
|
||
| import com.fasterxml.jackson.annotation.JsonCreator; | ||
| import com.fasterxml.jackson.annotation.JsonProperty; | ||
| import com.google.common.annotations.VisibleForTesting; | ||
|
|
||
| import javax.annotation.Nullable; | ||
| import java.util.ArrayList; | ||
| import java.util.Iterator; | ||
| import java.util.List; | ||
| import java.util.NoSuchElementException; | ||
| import java.util.Objects; | ||
| import java.util.function.Function; | ||
|
|
||
| /** | ||
| * A SplitHintSpec that can create splits of multiple files. | ||
| * A split created by this class can have one or more input files. | ||
| * If there is only one file in the split, its size can be larger than {@link #maxSplitSize}. | ||
| * If there are two or more files in the split, their total size cannot be larger than {@link #maxSplitSize}. | ||
| */ | ||
| public class MaxSizeSplitHintSpec implements SplitHintSpec | ||
| { | ||
| public static final String TYPE = "maxSize"; | ||
|
|
||
| @VisibleForTesting | ||
| static final long DEFAULT_MAX_SPLIT_SIZE = 512 * 1024 * 1024; | ||
|
|
||
| private final long maxSplitSize; | ||
|
|
||
| @JsonCreator | ||
| public MaxSizeSplitHintSpec(@JsonProperty("maxSplitSize") @Nullable Long maxSplitSize) | ||
| { | ||
| this.maxSplitSize = maxSplitSize == null ? DEFAULT_MAX_SPLIT_SIZE : maxSplitSize; | ||
| } | ||
|
|
||
| @JsonProperty | ||
| public long getMaxSplitSize() | ||
| { | ||
| return maxSplitSize; | ||
| } | ||
|
|
||
| @Override | ||
| public <T> Iterator<List<T>> split(Iterator<T> inputIterator, Function<T, InputFileAttribute> inputAttributeExtractor) | ||
| { | ||
| return new Iterator<List<T>>() | ||
| { | ||
| private T peeking; | ||
|
Contributor
I think you can simplify the logic of the next() method below if you initialize peeking to inputIterator.next() and only set peeking to null when inputIterator.hasNext() is false. In next(), you would just keep shifting values from inputIterator into current after each iteration, as long as there are more inputs.
Contributor
Author
I don't understand how it works. |
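A minimal sketch of what the suggestion above might look like, written as a standalone class so it runs without Druid on the classpath. The class name `PeekingSplitSketch`, the `ToLongFunction` size extractor, and the explicit `maxSplitSize` parameter are assumptions made for illustration; they are not part of this PR. The sketch also assumes non-null inputs and non-negative sizes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.ToLongFunction;

// Hypothetical stand-in for the anonymous iterator in the diff; not the PR's code.
class PeekingSplitSketch
{
  static <T> Iterator<List<T>> split(Iterator<T> inputs, ToLongFunction<T> sizeOf, long maxSplitSize)
  {
    return new Iterator<List<T>>()
    {
      // Primed eagerly; becomes null only once the input is exhausted.
      private T peeking = inputs.hasNext() ? inputs.next() : null;

      @Override
      public boolean hasNext()
      {
        return peeking != null;
      }

      @Override
      public List<T> next()
      {
        if (peeking == null) {
          throw new NoSuchElementException();
        }
        final List<T> current = new ArrayList<>();
        long splitSize = 0;
        while (peeking != null) {
          final long size = sizeOf.applyAsLong(peeking);
          // Same rule as the diff: the first input is always taken; later inputs
          // are taken only while the running total stays below maxSplitSize.
          if (!current.isEmpty() && splitSize + size >= maxSplitSize) {
            break;
          }
          current.add(peeking);
          splitSize += size;
          // Shift the next input (or null at the end) into the peek slot.
          peeking = inputs.hasNext() ? inputs.next() : null;
        }
        return current;
      }
    };
  }
}
```

With a limit of 10 and input sizes 3, 4, 5, 12, 2, this groups the inputs as [3, 4], [5], [12], [2]. The trade-off is that the iterator reads one element ahead, including at construction time, so it is slightly less lazy than the version in the diff.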
||
|
|
||
| @Override | ||
| public boolean hasNext() | ||
| { | ||
| return peeking != null || inputIterator.hasNext(); | ||
| } | ||
|
|
||
| @Override | ||
| public List<T> next() | ||
| { | ||
| if (!hasNext()) { | ||
| throw new NoSuchElementException(); | ||
| } | ||
| final List<T> current = new ArrayList<>(); | ||
| long splitSize = 0; | ||
| while (splitSize < maxSplitSize && (peeking != null || inputIterator.hasNext())) { | ||
| if (peeking == null) { | ||
| peeking = inputIterator.next(); | ||
| } | ||
| final long size = inputAttributeExtractor.apply(peeking).getSize(); | ||
| if (current.isEmpty() || splitSize + size < maxSplitSize) { | ||
| current.add(peeking); | ||
| splitSize += size; | ||
| peeking = null; | ||
| } else { | ||
| break; | ||
| } | ||
| } | ||
| assert !current.isEmpty(); | ||
| return current; | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object o) | ||
| { | ||
| if (this == o) { | ||
| return true; | ||
| } | ||
| if (o == null || getClass() != o.getClass()) { | ||
| return false; | ||
| } | ||
| MaxSizeSplitHintSpec that = (MaxSizeSplitHintSpec) o; | ||
| return maxSplitSize == that.maxSplitSize; | ||
| } | ||
|
|
||
| @Override | ||
| public int hashCode() | ||
| { | ||
| return Objects.hash(maxSplitSize); | ||
| } | ||
| } | ||
|
Contributor
equals and hashCode need unit tests
Contributor
Author
Added. |
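For illustration only, the kind of check requested above could be sketched with plain Java as follows. `SizeSpec` is a hypothetical stand-in that copies the equals/hashCode shape from the diff so the example is self-contained; it is not the test that was added to the PR, whose contents are not shown here.

```java
import java.util.Objects;

// Hypothetical, dependency-free sketch of an equals/hashCode contract check.
class EqualsHashCodeSketch
{
  // Stand-in value class mirroring the equals/hashCode pattern in the diff.
  static final class SizeSpec
  {
    private final long maxSplitSize;

    SizeSpec(long maxSplitSize)
    {
      this.maxSplitSize = maxSplitSize;
    }

    @Override
    public boolean equals(Object o)
    {
      if (this == o) {
        return true;
      }
      if (o == null || getClass() != o.getClass()) {
        return false;
      }
      return maxSplitSize == ((SizeSpec) o).maxSplitSize;
    }

    @Override
    public int hashCode()
    {
      return Objects.hash(maxSplitSize);
    }
  }

  // Returns true when reflexivity, symmetry, hash consistency, and inequality
  // with a different value, null, and another type all hold for the samples.
  static boolean contractHolds()
  {
    final SizeSpec a = new SizeSpec(1024L);
    final SizeSpec b = new SizeSpec(1024L);
    final SizeSpec c = new SizeSpec(2048L);
    final Object otherType = "1024";
    return a.equals(a)
           && a.equals(b)
           && b.equals(a)
           && a.hashCode() == b.hashCode()
           && !a.equals(c)
           && !a.equals(null)
           && !a.equals(otherType);
  }
}
```

In a real test suite the same assertions would normally live in a JUnit test against the actual spec class rather than a stand-in.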
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -23,10 +23,19 @@ | |
| import com.fasterxml.jackson.annotation.JsonProperty; | ||
|
|
||
| import javax.annotation.Nullable; | ||
| import java.util.Iterator; | ||
| import java.util.List; | ||
| import java.util.Objects; | ||
| import java.util.function.Function; | ||
|
|
||
| /** | ||
| * {@link SplitHintSpec} for IngestSegmentFirehoseFactory. | ||
| * {@link SplitHintSpec} for IngestSegmentFirehoseFactory and DruidInputSource. | ||
| * | ||
| * In DruidInputSource, this spec is converted into {@link MaxSizeSplitHintSpec}. As a result, its {@link #split} | ||
| * method is never called (IngestSegmentFirehoseFactory creates splits on its own instead of calling the | ||
| * {@code split()} method). This doesn't necessarily mean this class is deprecated in favor of the MaxSizeSplitHintSpec. | ||
| * We may want to create more optimized splits in the future. For example, segments can be split to maximize the rollup | ||
| * ratio if the segments have different sets of columns or even different value ranges of columns. | ||
| */ | ||
| public class SegmentsSplitHintSpec implements SplitHintSpec | ||
| { | ||
|
|
@@ -41,9 +50,7 @@ public class SegmentsSplitHintSpec implements SplitHintSpec | |
| private final long maxInputSegmentBytesPerTask; | ||
|
|
||
| @JsonCreator | ||
| public SegmentsSplitHintSpec( | ||
| @JsonProperty("maxInputSegmentBytesPerTask") @Nullable Long maxInputSegmentBytesPerTask | ||
| ) | ||
| public SegmentsSplitHintSpec(@JsonProperty("maxInputSegmentBytesPerTask") @Nullable Long maxInputSegmentBytesPerTask) | ||
| { | ||
| this.maxInputSegmentBytesPerTask = maxInputSegmentBytesPerTask == null | ||
| ? DEFAULT_MAX_INPUT_SEGMENT_BYTES_PER_TASK | ||
|
|
@@ -56,6 +63,13 @@ public long getMaxInputSegmentBytesPerTask() | |
| return maxInputSegmentBytesPerTask; | ||
| } | ||
|
|
||
| @Override | ||
| public <T> Iterator<List<T>> split(Iterator<T> inputIterator, Function<T, InputFileAttribute> inputAttributeExtractor) | ||
|
Contributor
Seems like this method really doesn't belong here if not all subclasses or implementations need it? Or should this class be abstract instead?
Contributor
Author
Added a comment about it. |
||
| { | ||
| // This method is not supported currently, but we may want to implement it in the future to create optimized splits. | ||
| throw new UnsupportedOperationException(); | ||
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object o) | ||
| { | ||
|
|
||
I think you should make spec classes be pure data objects (or beans). Adding methods like split to them makes them complicated and adds logic that makes it hard to version them in the future. We should think of data objects as literals, not as objects with business logic.

Good point. I agree it is a better structure, but the problem is that too many classes do this kind of thing, especially on the ingestion side. I don't think it's possible to apply the suggested design to all classes anytime soon. Also, I think it's better to promote SQL for ingestion as well so that Druid users don't have to worry about API changes.
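
As a rough illustration of the structure discussed above, the spec could hold only configuration while a separate helper performs the splitting. Everything in this sketch (`DataSpecSketch`, `MaxSizeSpec`, the static `split` helper, and its list-based signature) is hypothetical and is not how the PR is organized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical sketch: the spec is a plain data holder and the logic lives elsewhere.
class DataSpecSketch
{
  // Pure data: carries only the configured limit, no behavior.
  static final class MaxSizeSpec
  {
    final long maxSplitSize;

    MaxSizeSpec(long maxSplitSize)
    {
      this.maxSplitSize = maxSplitSize;
    }
  }

  // Splitting logic kept outside the spec; it only reads the spec's value.
  // Uses the same size rule as the diff and assumes non-negative sizes.
  static <T> List<List<T>> split(MaxSizeSpec spec, List<T> inputs, ToLongFunction<T> sizeOf)
  {
    final List<List<T>> splits = new ArrayList<>();
    List<T> current = new ArrayList<>();
    long splitSize = 0;
    for (T input : inputs) {
      final long size = sizeOf.applyAsLong(input);
      // Close the current split once adding this input would reach the limit.
      if (!current.isEmpty() && splitSize + size >= spec.maxSplitSize) {
        splits.add(current);
        current = new ArrayList<>();
        splitSize = 0;
      }
      current.add(input);
      splitSize += size;
    }
    if (!current.isEmpty()) {
      splits.add(current);
    }
    return splits;
  }
}
```

With this shape, a change to the splitting behavior would not touch the serialized spec class, which is the versioning concern raised in the review.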