Use binary search to improve DimensionRangeShardSpec lookup#12417
Use binary search to improve DimensionRangeShardSpec lookup#12417kfaraz merged 4 commits intoapache:masterfrom
Conversation
745cf67 to
2eff273
Compare
| return (long timestamp, InputRow row) -> { | ||
| StringTuple inputRowTuple = getInputRowTuple(dimensions, row); | ||
| int startIndex = 0; | ||
| int endIndex = shardSpecs.size() - 1; | ||
| while (startIndex <= endIndex) { | ||
| int mid = (startIndex + endIndex) >>> 1; | ||
| BaseDimensionRangeShardSpec rangeShardSpec = rangeShardSpecs[mid]; | ||
| if (startComparator.compare(inputRowTuple, rangeShardSpec.start) < 0) { | ||
| endIndex = mid - 1; | ||
| } else if (endComparator.compare(inputRowTuple, rangeShardSpec.end) < 0) { | ||
| return rangeShardSpec; | ||
| } else { | ||
| startIndex = mid + 1; | ||
| } | ||
| } |
There was a problem hiding this comment.
You could probably simplify this using Arrays.binarySearch(array, key, comparator)
There was a problem hiding this comment.
The array component is BaseDimensionRangeShardSpec while the key is StringTuple, so I can not directly call Arrays.binarySearch
There was a problem hiding this comment.
You could use a dummy key. It would still be cleaner than writing the binary search logic yourself.
Something like this:
final StringTuple searchTuple = getInputRowTuple(dimensions, row);
final BaseDimensionRangeShardSpec searchKey = new DimensionRangeShardSpec(dimensions, searchTuple, searchTuple, 0, 1);
final int searchResult = Arrays.binarySearch(rangeShardSpecs, searchKey, shardSpecComparator);
if (searchResult < 0) {
throw new ISE("row[%s] doesn't fit in any shard[%s]", row, shardSpecs);
} else {
return rangeShardSpecs[searchResult];
}Please let me know if this seems cleaner.
There was a problem hiding this comment.
Arrays.binarySearch requires the searchKey to equals one array element, while actually we do need find one that equals or contains the searchKey. For example:
input:
searchKey: ["a","a"]
shardSpecs:[[null,"c"], ["c", "h"],["h",null]]
then the expect result is [null, "c"], but the Arrays.binarySearch will return -2, not 0.
| final Comparator<StringTuple> startComparator = Comparators.naturalNullsFirst(); | ||
| final Comparator<StringTuple> endComparator = Ordering.natural().nullsLast(); | ||
| final Comparator<BaseDimensionRangeShardSpec> shardSpecComparator = new Comparator<BaseDimensionRangeShardSpec>() | ||
| { | ||
| @Override | ||
| public int compare(BaseDimensionRangeShardSpec o1, BaseDimensionRangeShardSpec o2) | ||
| { | ||
| int startComparison = startComparator.compare(o1.start, o2.start); | ||
| if (startComparison != 0) { | ||
| return startComparison; | ||
| } | ||
| return endComparator.compare(o1.end, o2.end); | ||
| } | ||
| }; |
There was a problem hiding this comment.
You can try chaining the comparator. See below for an example.
There was a problem hiding this comment.
The start comparator is null first while the end comparator is null last. I do not know how to chaining the comparator
There was a problem hiding this comment.
Something like this:
final Comparator<BaseDimensionRangeShardSpec> shardSpecComparator = Comparator
.comparing((BaseDimensionRangeShardSpec spec) -> spec.start, Comparators.naturalNullsFirst())
.thenComparing(spec -> spec.end, Ordering.natural().nullsLast());
Arrays.sort(rangeShardSpecs, shardSpecComparator);| BaseDimensionRangeShardSpec[] rangeShardSpecs = new BaseDimensionRangeShardSpec[shardSpecs.size()]; | ||
| for (int i = 0; i < shardSpecs.size(); i++) { | ||
| rangeShardSpecs[i] = (BaseDimensionRangeShardSpec) shardSpecs.get(i); | ||
| } |
There was a problem hiding this comment.
Nit: You could use shardSpecs.toArray() for cleaner code.
There was a problem hiding this comment.
Directly call toArray cause compile error as the component is not BaseDimensionRangeShardSpec
There was a problem hiding this comment.
Ah, that makes sense. I had missed the cast.
| { | ||
| final String[] inputDimensionValues = new String[dimensions.size()]; | ||
| for (int i = 0; i < dimensions.size(); ++i) { | ||
| List<String> values = inputRow.getDimension(dimensions.get(i)); |
There was a problem hiding this comment.
Nit: please add the comment originally present in this method.
// Get the values of this dimension, treat multiple values as null
| return isInChunk(dimensions, start, end, inputRow); | ||
| } | ||
|
|
||
| public static boolean isInChunk( |
There was a problem hiding this comment.
You could probably remove this method. I don't think it is needed anymore.
da954eb to
506312f
Compare
|
Thanks for you time @kfaraz to review my code |
| private final List<String> dimensions = new ArrayList<>(); | ||
|
|
||
| @Test | ||
| public void testIsInChunk() |
There was a problem hiding this comment.
Please don't remove these tests altogether as there is nothing else testing this behaviour.
You can convert these tests to test getLookup instead.
There was a problem hiding this comment.
You could create multiple shard specs, representing adjacent (or non-adjacent too if you are testing for it) partitions. Then create a ShardSpecLookup using the getLookup method. Then convert the assertions accordingly.
| assertTrue(isInChunk( | ||
| shardSpec, | ||
| createRow("India", "Delhi") | ||
| )); |
There was a problem hiding this comment.
These lines (and all the other assertions) could be translated to something like:
assertEquals(
shard1,
lookup.getShardSpec(timestamp, createRow("India", "Delhi"))
)
@hqx871 , do you mean you have implemented range partitioning for hadoop-based ingestion? |
|
Hi @kfaraz, I will contribute it soon |
I tried to use binary search to improve DimensionRangeShard lookup, detail is discussed at #11848. Hope this is useful to somebody else.
Description
I have implement hadoop based batch ingestion with range shard spec, by referring the native batch ingestion and hadoop based single_dim ingestion.
When I use my hadoop based batch ingestion to load about 70 billion rows, I found the mapper of IndexGeneratorJob will cost more than 11 hours! I checked and found that the most time is spended on calling DimensionRangeShardSpec.isInChunk to lookup target shard, as there are about 6000 shards. Using binary search instead of comparing the row to every shardSpec to decide its shard obviously improve this, the mapper cost only 30 minutes now!
Key changed/added classes in this PR
BaseDimensionRangeShardSpecDimensionRangeShardSpecSingleDimensionShardSpecDimensionRangeBucketShardSpecThis PR has: