Making optimal usage of multiple segment cache locations #8038
dclim merged 39 commits into apache:master from
Conversation
…egments to multiple segment cache locations
|
To me, choosing the segment cache location with the max free size instead of round robin makes more sense. Otherwise, we can make the segment cache location selection strategy configurable and default to max-free-available. |
| @@ -102,6 +105,8 @@ public SegmentLoaderLocalCacheManager( | |||
| ); | |||
| } | |||
| locations.sort(COMPARATOR); | |||
Looks like we are already trying to sort by the available free size.
The issue seems to be that the order is not updated after a segment is loaded.
What do you think about re-sorting the locations after a segment has been loaded?
I think that would probably fix the issue in #7641?
@nishantmonu51 , this probably makes sense. However, one case is when the segment cache location max sizes are skewed (one or a few locations with far more availability than the others). The sort strategy then selects the same location again and again until its availability falls short of the others. This ends up with more or less the same behaviour reported in #7641. Round-robin, on the other hand, will try to distribute the segments across multiple locations, thereby improving I/O if the locations are backed by different physical drives. However, I'm not sure whether the round-robin strategy has any implications for query performance. Let me know your thoughts.
@dclim and others, let us know your thoughts.
I like the idea of making the segment cache location selection strategy configurable. |
|
Ah interesting - I thought I remembered the behavior used to select the least filled disk! Looks like a regression at some point. @sashidhar I do still think there's value in making the selector strategy configurable to something like round-robin for the reason you mentioned. An example - I was setting up a Druid cluster that had two volumes mounted (let's say they were each 10G and called /mnt and /mnt1). I was also using /mnt for other stuff - as a general scratch drive, storing intermediate indexing files, log files, etc. so I needed to reserve some space for this - let's say I reserved 2G. I had 8G left, so I set the size of the segment cache for /mnt to 8G. Now, what do I set the size of the segment cache for /mnt1 to? If I set it to 10G to fully utilize the volume and at a point in time have less than 2G of data, it would all be on /mnt1 and potentially wouldn't be maximizing the I/O throughput available. I could instead set it to 8G to be the same as /mnt and that would evenly distribute the segments, but I'd lose those 2G unnecessarily just to coax the algorithm to utilize both locations. A round-robin strategy (or one that selects the location that has the least bytes used in absolute terms instead of relative to the capacity) would have been what I wanted. |
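The two-volume example above can be sketched numerically. This is a toy simulation (the sizes mirror the /mnt and /mnt1 example; the class and method names are illustrative, not from the PR) showing that a max-free-size selector sends every segment to the larger cache until its free space drops to the smaller cache's level:

```java
import java.util.Arrays;

// Toy model of the /mnt (8G) + /mnt1 (10G) example: a selector that always
// picks the location with the most free space routes every segment to /mnt1
// until less than 2G of headroom separates the two locations.
public class MaxFreeSkewDemo {
    // Index of the location with the most free space (capacity - used).
    static int pickMaxFree(double[] capacityGb, double[] usedGb) {
        int best = 0;
        for (int i = 1; i < capacityGb.length; i++) {
            if (capacityGb[i] - usedGb[i] > capacityGb[best] - usedGb[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] capacityGb = {8.0, 10.0}; // /mnt, /mnt1 segment cache sizes
        double[] usedGb = {0.0, 0.0};
        // Load four 0.5G segments; all of them land on /mnt1 (index 1).
        for (int i = 0; i < 4; i++) {
            int target = pickMaxFree(capacityGb, usedGb);
            usedGb[target] += 0.5;
            System.out.println("segment " + i + " -> location " + target);
        }
        System.out.println(Arrays.toString(usedGb));
    }
}
```

With under 2G of data, only one of the two drives ever serves reads, which is the I/O-throughput concern described above.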
|
@dclim , @nishantmonu51 Here's what I'm thinking. As discussed, the segment cache location selector strategy should be configurable. There could be 3 possible strategies currently.
Questions:
Other things to note:
@gianm FYI. |
|
This sounds like a PR which needs a proposal to me. |
|
I think, ideally in all cases, we want to minimize |
|
@jihoonson , @himanshug , thanks for your inputs. Should I raise a separate proposal PR, or modify this PR to make it a proposal? |
|
I think this kind of issue needs a proposal before writing code so that the author can avoid unnecessary work. However, in this case, I think you don't have to write a proposal at this moment because you already raised this PR. But still, it would be worth getting a design review from 3 or more committers. I added the label. Also, please update the PR description accordingly once the design issue is resolved. |
|
Updated the description with the proposed algorithm and the alternatives discussed. Round-robin and least-bytes-used both seem reasonable. Please review the design. |
|
It doesn't hurt to make the strategy configurable; however, I think "Least-Bytes-Used" should be the default instead of "Round-Robin".
Writes happen in one or very few threads, so write throughput is not impacted; on the contrary, read throughput improves due to similar space utilization in each location, and reads have significantly higher concurrency. Many times users add new segment locations after the node has been in use for a while and already has some data, and then restart the node; with "Round-Robin" the newly added location will likely stay underutilized. Round-Robin wouldn't solve #7641 in that case. |
Makes sense. It seems to me that the negative case I mentioned for Least-Bytes-Used might not be much of a concern. It makes sense for the Least-Bytes-Used to be the default for the write and read throughput reasons mentioned. |
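The underutilized-new-location point can be seen in a toy simulation (all numbers and names here are hypothetical): least-bytes-used directs every new segment to a freshly added, empty location until it catches up with the old ones, whereas round-robin would give it only one segment in N.

```java
import java.util.Arrays;

// Toy model of adding a new, empty cache location to a node that already
// holds data: least-bytes-used routes every new segment to the empty
// location until the locations even out.
public class NewLocationDemo {
    // Index of the location with the fewest bytes used.
    static int pickLeastBytesUsed(long[] bytesUsed) {
        int best = 0;
        for (int i = 1; i < bytesUsed.length; i++) {
            if (bytesUsed[i] < bytesUsed[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] bytesUsed = {800, 800, 0}; // location 2 was just added
        for (int i = 0; i < 8; i++) {
            bytesUsed[pickLeastBytesUsed(bytesUsed)] += 100; // 100-byte segments
        }
        // All eight new segments went to location 2.
        System.out.println(Arrays.toString(bytesUsed)); // [800, 800, 800]
    }
}
```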
|
@jihoonson , @himanshug , @dclim , @nishantmonu51 have you had a chance to review this? |
|
hey @sashidhar - your proposal mentioned:
But I don't see that implemented (I only see the round-robin implementation). Am I missing something? You also said:
Which I agree with. Apologies if you were waiting on further confirmation before implementing the Least-Bytes-Used strategy. Between round-robin and Least-Bytes-Used, I would be okay if you just implemented the latter, as I think it would be the right option in the majority of cases, but I would also be okay if you implemented both and had a configuration parameter to select the strategy. |
|
Hi David,
Sorry for the unclear wording. It should have been "This PR will introduce...". I was waiting for the design approval before implementing the Least-Bytes-Used strategy. I would like to implement both and make the strategy configurable, the default being Least-Bytes-Used. I'll resume working on the implementation.
Thanks,
Sashi
|
…rategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy
… least bytes used. Adding currSizeBytes() method in StorageLocation.
|
| * | ||
| * @return The storage location to load the given segment into or null if no location has the capacity to store the given segment. | ||
| */ | ||
| StorageLocation select(DataSegment dataSegment, String storageDirStr); |
Implemented both the strategies and made the strategy configurable. However, there is one implementation glitch due to which SegmentLoaderLocalCacheManagerTest.testRetrySuccessAtSecondLocation() is failing.
Here's the scenario: assume the configured strategy is least-bytes-used and there are two locations, loc1 and loc2, on different disks, disk1 and disk2 respectively. loc1 has the least bytes used. The strategy picks loc1, and if disk1 fails or is not writable before SegmentLoaderLocalCacheManager loads a segment, the segment loading fails. The strategy has no way (with my impl) to find out that loc1 is bad, which results in the strategy picking loc1 every time, failing all segment load attempts. What is a clean way to handle this?
Good catch. I think, for that reason, the interface here should be something like:
Iterator<StorageLocation> getLocations(..)
so that the caller can go through all of them like it does currently, and the caller should be responsible for calling the reserve(..) method, not the impls of this.
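A minimal sketch of this iterator-based contract, using simplified stand-ins for Druid's StorageLocation, DataSegment, and SegmentLoaderLocalCacheManager (class names and fields here are illustrative, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for Druid's StorageLocation; illustrative only.
class Location {
    final String path;
    long bytesUsed;
    boolean writable = true;

    Location(String path, long bytesUsed) {
        this.path = path;
        this.bytesUsed = bytesUsed;
    }
}

interface LocationSelectorStrategy {
    // Returning an iterator instead of a single location lets the caller
    // fall back to the next candidate when a disk turns out to be bad.
    Iterator<Location> getLocations();
}

class LeastBytesUsedStrategy implements LocationSelectorStrategy {
    private final List<Location> locations;

    LeastBytesUsedStrategy(List<Location> locations) {
        this.locations = locations;
    }

    @Override
    public synchronized Iterator<Location> getLocations() {
        List<Location> sorted = new ArrayList<>(locations);
        sorted.sort(Comparator.comparingLong((Location l) -> l.bytesUsed));
        return sorted.iterator();
    }
}

class Loader {
    // The caller walks the iterator and reserves the first location that
    // works, so one unwritable disk no longer blocks every load attempt.
    static Location loadSegment(LocationSelectorStrategy strategy, long segmentSize) {
        Iterator<Location> it = strategy.getLocations();
        while (it.hasNext()) {
            Location loc = it.next();
            if (loc.writable) { // stand-in for a failed write/reserve
                loc.bytesUsed += segmentSize;
                return loc;
            }
        }
        return null; // no location could store the segment
    }
}
```

The key design point is that the strategy only orders candidates; reservation and failure handling stay with the caller.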
Changed the method contract to return an iterator of StorageLocations as suggested. If the changes look good, I will add a few more tests.
|
@sashidhar thanks for the quick update! I'll finish my review once #8038 (comment) is addressed. Would you take a look? |
Addressed the comment, please review. |
| { | ||
| for (StorageLocation loc : locations) { | ||
| Iterator<StorageLocation> locationsIterator = strategy.getLocations(); | ||
| int numLocationsToTry = this.locations.size(); |
numLocationsToTry is not necessary now.
Oops! will fix it.
Removed numLocationsToTry and updated the Java docs. Let me know if the description isn't clear or any change is required.
| @Override | ||
| public Iterator<StorageLocation> getLocations(DataSegment dataSegment, String storageDirStr) | ||
| { | ||
| return cyclicIterator; |
Oh, now I know what kind of round robin you want. What I thought was that each caller would get an iterator with a different startIndex, advanced in a round-robin fashion. Okay, your implementation makes sense. Please add a more detailed description of the behavior of this strategy, especially when multiple threads use it.
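The per-caller variant described here could look roughly like the following (an illustrative sketch, not the PR's implementation): each getLocations() call returns an iterator whose start index advances atomically, so concurrent callers spread their first choice across locations.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative round-robin selector: every call to getLocations() yields an
// iterator over all locations, starting one position later than the previous
// call. The start index is advanced atomically so concurrent callers are
// handed different starting locations.
class RoundRobinSelector<T> {
    private final List<T> items;
    private final AtomicLong startIndex = new AtomicLong(0);

    RoundRobinSelector(List<T> items) {
        this.items = items;
    }

    Iterator<T> getLocations() {
        final int size = items.size();
        final long start = startIndex.getAndIncrement();
        return new Iterator<T>() {
            private int served = 0;

            @Override
            public boolean hasNext() {
                return served < size;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Wrap around so every iterator still covers all locations.
                return items.get((int) ((start + served++) % size));
            }
        };
    }
}
```

Each iterator still visits every location, preserving the fallback behavior when a disk is unwritable.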
jihoonson left a comment
+1 after CI. Thank you @sashidhar!
|
Thanks a lot @jihoonson for your thorough and patient review. |
|
@dclim @himanshug @nishantmonu51 do you have more comments? |
| * https://github.com/apache/incubator-druid/pull/8038#discussion_r325520829 of PR https://github | ||
| * .com/apache/incubator-druid/pull/8038 for more details. | ||
| */ | ||
| @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "tier", defaultImpl = LeastBytesUsedStorageLocationSelectorStrategy.class) |
Is property = "tier" here required, or is it copy/pasted from another location (like TierSelectorStrategy)?
good catch, should probably be "type"
@dclim, let me know if it needs to be removed or changed to type.
Changed it to type.
| localStorageFolder1, loc1.getPath()); | ||
|
|
||
| StorageLocation loc2 = locations.next(); | ||
| Assert.assertEquals("The next element of the iterator should point to path local_storage_folder_1", |
The assert message is wrong here
| localStorageFolder2, loc2.getPath()); | ||
|
|
||
| StorageLocation loc3 = locations.next(); | ||
| Assert.assertEquals("The next element of the iterator should point to path local_storage_folder_1", |
The assert message is wrong here
|
+1 after minor assert message change |
|
Thank you @sashidhar and @jihoonson for working through this |
|
Thanks @jihoonson, @himanshug, @dclim, @nishantmonu51 for your review and suggestions. |
|
Please add Release Notes label as this PR introduces a new Historical runtime property. |
|
@sashidhar oh yes I added. Thank you! |
* apache#7641 - Changing segment distribution algorithm to distribute segments to multiple segment cache locations
* Fixing indentation
* WIP
* Adding interface for location strategy selection, least bytes used strategy impl, round-robin strategy impl, locationSelectorStrategy config with least bytes used strategy as the default strategy
* fixing code style
* Fixing test
* Adding a method visible only for testing, fixing tests
* 1. Changing the method contract to return an iterator of locations instead of a single best location. 2. Check style fixes
* fixing the conditional statement
* Added testSegmentDistributionUsingLeastBytesUsedStrategy, fixed testSegmentDistributionUsingRoundRobinStrategy
* to trigger CI build
* Add documentation for the selection strategy configuration
* to re trigger CI build
* updated docs as per review comments, made LeastBytesUsedStorageLocationSelectorStrategy.getLocations a synchronzied method, other minor fixes
* In checkLocationConfigForNull method, using getLocations() to check for null instead of directly referring to the locations variable so that tests overriding getLocations() method do not fail
* Implementing review comments. Added tests for StorageLocationSelectorStrategy
* Checkstyle fixes
* Adding java doc comments for StorageLocationSelectorStrategy interface
* checkstyle
* empty commit to retrigger build
* Empty commit
* Adding suppressions for words leastBytesUsed and roundRobin of ../docs/configuration/index.md file
* Impl review comments including updating docs as suggested
* Removing checkLocationConfigForNull(), @notempty annotation serves the purpose
* Round robin iterator to keep track of the no. of iterations, impl review comments, added tests for round robin strategy
* Fixing the round robin iterator
* Removed numLocationsToTry, updated java docs
* changing property attribute value from tier to type
* Fixing assert messages
Design proposal for #7641.
Description
Making optimal usage of multiple segment cache locations to distribute segments. See #7641 for more details.
Proposed Algorithm
Alternative Algorithms Considered
The following alternative algorithms have been discussed.
Least bytes used algorithm (or least-filled-disk) approach: This algorithm picks the location with the least bytes used. This to me seems reasonable in most cases. See Making optimal usage of multiple segment cache locations #8038 (comment). In practice, the distribution of segment sizes is not very even, for several reasons (an interval having less or more data, an improperly tuned cluster, etc.). For example, segment sizes across intervals could be anywhere from 100MB to 1GB, with most intervals having very similar segment sizes and a few intervals having outliers like, say, 100MB or 1GB. Let us consider three locations. If a location (location 1) loads a segment of size 1GB, the subsequent calls to load smaller segments will be distributed between locations 2 and 3 until both of them reach/cross 1GB. This repeats every time a particular location loads a larger segment. This might not have optimal write throughput in such a scenario. However, I'm not sure how much of a problem this is.
Max free size algorithm: Choose the segment cache location with the max free size each time. This algorithm has a possible shortcoming, as explained in Making optimal usage of multiple segment cache locations #8038 (comment).
New configuration
This PR introduces an optional new Historical runtime property druid.segmentCache.locationSelectorStrategy to make the segment cache location selection strategy configurable. Possible values for the above property - round-robin, least-bytes-used.
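For illustration, a Historical's runtime.properties might then look roughly like this. The property name and values follow the description above; the locations value and paths are made-up examples, and the merged documentation defines the final accepted syntax:

```properties
# Hypothetical Historical runtime.properties fragment (paths and sizes are
# illustrative; consult the merged docs for the exact final syntax).
druid.segmentCache.locations=[{"path":"/mnt/druid/segment-cache","maxSize":8000000000},{"path":"/mnt1/druid/segment-cache","maxSize":10000000000}]
druid.segmentCache.locationSelectorStrategy=round-robin
```

Omitting the property would fall back to the default strategy (least-bytes-used, per the discussion above).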
Test plan
Unit tests to be added.
Documentation
Documentation needs to be updated with the new property if the location selection strategy is made configurable, along with release notes for the same.