-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Native parallel batch indexing without shuffle #5492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
fb5a787
fcaf1bf
6ce2db8
0ace07a
2e8dbe0
a2b44f0
67ca45c
cc8bd38
2d2347b
932f69c
a83c76a
8d68dba
00be56c
b9724a5
7f21927
3a94432
c68b21a
fd7dda3
b142a30
1fe0a05
bd5d5d5
9d5d7c0
702b2a5
5dc8fb1
6d32f0f
185549f
9a8ccc4
2447bbf
bda4fec
ac817c6
4924fa5
dabb8bf
0998e71
f306338
e43adf2
7c7af69
b077c92
3631e20
b2b1de0
acb5305
28e310d
a7fc8e8
00e2663
0a6eb30
c1c6f3f
86b4582
6c85f62
892a0ff
77f6937
bf73f7e
a5254a6
6cd0dd6
509c9b3
6a1bcfd
d700395
550fd38
77935bd
7e6b110
b7266ce
e75baa4
b8ce241
3b29f6e
8da2971
cc111ae
9853cfa
20f7c7b
826ae54
97a7819
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package io.druid.data.input; | ||
|
|
||
| import com.fasterxml.jackson.annotation.JsonIgnore; | ||
| import io.druid.data.input.impl.InputRowParser; | ||
|
|
||
| import java.io.IOException; | ||
| import java.util.stream.Stream; | ||
|
|
||
| /** | ||
| * {@link FiniteFirehoseFactory} designed for batch processing. Its implementations assume that the amount of inputs is | ||
| * limited. | ||
| * | ||
| * @param <T> parser type | ||
| * @param <S> input split type | ||
| */ | ||
| public interface FiniteFirehoseFactory<T extends InputRowParser, S> extends FirehoseFactory<T> | ||
| { | ||
| /** | ||
| * Returns true if this {@link FiniteFirehoseFactory} supports parallel batch indexing. | ||
| */ | ||
| @JsonIgnore | ||
| @Override | ||
| default boolean isSplittable() | ||
| { | ||
| return true; | ||
| } | ||
|
|
||
| /** | ||
| * Returns a {@link Stream} for {@link InputSplit}s. In parallel batch indexing, each {@link InputSplit} is processed | ||
| * by a sub task. | ||
| * | ||
| * Listing splits may cause high overhead in some implementations. In this case, {@link InputSplit}s should be listed | ||
| * lazily so that the listing overhead could be amortized. | ||
| */ | ||
| @JsonIgnore | ||
| Stream<InputSplit<S>> getSplits() throws IOException; | ||
|
|
||
| /** | ||
| * Returns number of splits returned by {@link #getSplits()}. | ||
| */ | ||
| @JsonIgnore | ||
| int getNumSplits() throws IOException; | ||
|
|
||
| /** | ||
| * Returns the same {@link FiniteFirehoseFactory} but with the given {@link InputSplit}. The returned | ||
| * {@link FiniteFirehoseFactory} is used by sub tasks in parallel batch indexing. | ||
| */ | ||
| FiniteFirehoseFactory<T, S> withSplit(InputSplit<S> split); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (minor) would
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to have several implementations for the same split type. Do you have any concrete example?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the most common scenario where something like this comes into play is if you have withSplit(ImmutableList.of(testObject1,testObject2).toSplit()))kind of thing where your test objects inherit the split type. In practice it can sometimes come up, but the method signature can be changed later if needed
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In this specific case, However, this is not what I intended. |
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package io.druid.data.input; | ||
|
|
||
| import com.fasterxml.jackson.annotation.JsonCreator; | ||
| import com.fasterxml.jackson.annotation.JsonProperty; | ||
|
|
||
| /** | ||
| * Input unit for distributed batch ingestion. Used in {@link FiniteFirehoseFactory}. | ||
| * An {@link InputSplit} represents the input data processed by a {@code io.druid.indexing.common.task.Task}. | ||
| */ | ||
| public class InputSplit<T> | ||
| { | ||
| private final T split; | ||
|
|
||
| @JsonCreator | ||
| public InputSplit(@JsonProperty("split") T split) | ||
| { | ||
| this.split = split; | ||
| } | ||
|
|
||
| @JsonProperty("split") | ||
| public T get() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, but I think most of use cases would be like this: final InputSplit split = someMethodToGetSplit();
...
final T actualSplit = split.get();If we rename this method, the code would be like |
||
| { | ||
| return split; | ||
| } | ||
|
|
||
| @Override | ||
| public String toString() | ||
| { | ||
| return "InputSplit{" + | ||
| "split=" + split + | ||
| "}"; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,5 +25,5 @@ public enum RunnerTaskState | |
| WAITING, | ||
| PENDING, | ||
| RUNNING, | ||
| NONE; // is used for a completed task | ||
| NONE // is used for a completed task | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (minor)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like, but not sure. I don't want to change it in this PR. |
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -172,4 +172,20 @@ public int hashCode() | |
| getErrorMsg() | ||
| ); | ||
| } | ||
|
|
||
| @Override | ||
| public String toString() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this toString have errorMsg as well?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks. Added. |
||
| { | ||
| return "TaskStatusPlus{" + | ||
| "id='" + id + '\'' + | ||
| ", type='" + type + '\'' + | ||
| ", createdTime=" + createdTime + | ||
| ", queueInsertionTime=" + queueInsertionTime + | ||
| ", state=" + state + | ||
| ", duration=" + duration + | ||
| ", location=" + location + | ||
| ", dataSource='" + dataSource + '\'' + | ||
| ", errorMsg='" + errorMsg + '\'' + | ||
| '}'; | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) suggest using descriptive type names. I always have a hard time keeping track of what
TandSrepresentThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather stick to the generic parameter naming conventions for Java. See 'Type Parameter Naming Conventions' section in https://docs.oracle.com/javase/tutorial/java/generics/types.html.