Add shuffleSegmentPusher for data shuffle #8115
/subscribe
@himanshug I requested a review from you :)
@jihoonson sure, I did want to go through it.
final String relativeSegmentPath = localtionPath
    .relativize(eachFile.toPath().toAbsolutePath())
    .toString();
final File reservedFile = location.reserve(
I couldn't understand why we need to do this. Can you please add some comments?
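For context, the flagged lines turn an absolute file path into one relative to a base directory using `Path.relativize`. The thread does not state why the PR needs this, so the sketch below only shows what the call yields; the class name, method name, and paths are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo (not from the PR): shows the result of Path.relativize.
final class RelativizeDemo {
  // Returns the path of 'file' relative to 'base', e.g. the part of a segment
  // file's path that sits below a base directory.
  static String relativeTo(Path base, Path file) {
    return base.toAbsolutePath().relativize(file.toAbsolutePath()).toString();
  }

  public static void main(String[] args) {
    Path base = Paths.get("/data/shuffle");                       // made-up base directory
    Path file = Paths.get("/data/shuffle/supervisor1/seg/0.zip"); // made-up file below it
    System.out.println(relativeTo(base, file));
  }
}
```

A relative path like this can be re-resolved against a different base directory, which is one common reason to compute it.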
// Create a zipped segment in a temp directory.
final File taskTempDir = taskConfig.getTaskTempDir(subTaskId);
if (taskTempDir.mkdirs()) {
  taskTempDir.deleteOnExit();
This will delete the directory only on JVM exit. That is probably OK for existing peon processes, since they do exit, but nothing would be logged if the JVM was not able to delete this location at exit.
It wouldn't work for the long-running indexer process.
I think we should do the cleanup in the code.
Hmm, yeah. Don't remember why I wrote this code. Fixed to clean up properly.
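A minimal sketch of what "cleanup in the code" could look like, assuming plain `java.nio` and no Druid classes. The names `TempDirCleanup`, `IoWork`, and `withTempDir` are hypothetical and are not the code that was actually committed in this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not from the PR): runs work inside a temp directory and
// always removes the directory afterwards, so no JVM exit hook is needed.
final class TempDirCleanup {
  // Work that may fail with an IOException, e.g. zipping a segment.
  interface IoWork<T> {
    T run(Path dir) throws IOException;
  }

  static <T> T withTempDir(Path tempDir, IoWork<T> work) {
    try {
      Files.createDirectories(tempDir);
      return work.run(tempDir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      // Runs on success and on failure; a failed delete is reported, not hidden.
      if (!deleteRecursively(tempDir)) {
        System.err.println("Could not fully delete temp dir " + tempDir);
      }
    }
  }

  // Deletes children before parents; returns false if anything could not be removed.
  static boolean deleteRecursively(Path dir) {
    if (!Files.exists(dir)) {
      return true;
    }
    try (Stream<Path> walk = Files.walk(dir)) {
      List<Path> deepestFirst = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      for (Path p : deepestFirst) {
        Files.delete(p);
      }
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}
```

Unlike `deleteOnExit()`, this frees the disk space as soon as the work finishes, which matters for a long-running process, and it gives the caller a place to log a failed delete.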
  taskTempDir.deleteOnExit();
}
final File tempZippedFile = new File(taskTempDir, segment.getId().toString());
final long unzippedSizeBytes = CompressionUtils.zip(segmentDir, tempZippedFile, true);
fsync=true is unnecessary here, as this is only a temp location.
if (destFile != null) {
  try {
    FileUtils.forceMkdirParent(destFile);
    StreamUtils.retryCopy(
here we should use FileUtils.writeAtomically(..)
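The signature of `FileUtils.writeAtomically` is not shown in this thread, so the sketch below does not call it. It illustrates the general technique such a helper typically relies on, using only `java.nio`: write to a temp file in the destination directory, then rename it into place. `AtomicCopy` and `copyAtomically` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not Druid's FileUtils): readers never observe a
// partially written destination file.
final class AtomicCopy {
  static void copyAtomically(InputStream in, Path dest) {
    Path parent = dest.toAbsolutePath().getParent();
    Path tmp = null;
    try {
      Files.createDirectories(parent);
      // The temp file lives in the same directory so the rename stays on one file system.
      tmp = Files.createTempFile(parent, ".tmp-", ".part");
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      // Whether an existing 'dest' is replaced by ATOMIC_MOVE is platform-specific.
      Files.move(tmp, dest, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      // No-op after a successful move; removes the partial file after a failure.
      if (tmp != null) {
        try {
          Files.deleteIfExists(tmp);
        } catch (IOException ignored) {
          // Best effort: the original failure, if any, is the one worth reporting.
        }
      }
    }
  }
}
```

If the copy fails halfway, only the temp file is affected, so a reader of the destination path sees either nothing or the complete file.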
clintropolis left a comment
overall lgtm, +1 after @himanshug's comments are addressed
import java.util.Map;

/**
 * DataSegmentPusher used for storing intermeidary data in local storage during data shuffle of native parallel
typo: intermeidary-> intermediary
supervisorTaskId,
k -> {
  for (File eachFile : FileUtils.listFiles(supervisorTaskDir, null, true)) {
    final String relativeSegmentPath = localtionPath
typo: localtionPath -> locationPath
@himanshug do you have more comments?
@himanshug @clintropolis thank you for the review.
This PR is for #8061 and based on #8114.
Description
ShuffleDataSegmentPusher is a DataSegmentPusher used for writing shuffle data to local storage. ShuffleDataSegmentPusher uses IntermediaryDataManager internally, which coordinates the segment writes in a round-robin fashion per supervisor task across sub tasks. This is to fully utilize the local disk bandwidth for shuffle.
The middleManager and the indexer can both use this. However, with the middleManager, each task uses a separate IntermediaryDataManager instance. This could potentially result in two issues:
IntermediaryDataSegment needs to smoosh segment files into larger ones to avoid the "too many open files" problem. This could also be an issue if there are a lot of tasks, since IntermediaryDataSegment can't smoosh files across tasks with the middleManager.
I think this would be ok for now and could be improved if required in the future.
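The round-robin coordination described above can be sketched as follows. This is an illustration only, assuming a fixed list of local storage locations and one cursor per supervisor task ID. `RoundRobinLocationPicker` and its method are hypothetical and do not reflect the actual IntermediaryDataManager code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch (not Druid code): each supervisor task cycles through
// the configured locations independently, spreading its sub tasks' segment
// writes across local disks.
final class RoundRobinLocationPicker {
  private final List<String> locations;
  private final ConcurrentHashMap<String, AtomicInteger> cursors = new ConcurrentHashMap<>();

  RoundRobinLocationPicker(List<String> locations) {
    if (locations.isEmpty()) {
      throw new IllegalArgumentException("at least one location is required");
    }
    this.locations = new ArrayList<>(locations);
  }

  // Returns the location for the next segment written on behalf of the given supervisor task.
  String next(String supervisorTaskId) {
    AtomicInteger cursor = cursors.computeIfAbsent(supervisorTaskId, k -> new AtomicInteger());
    // floorMod keeps the index non-negative even if the counter overflows.
    int index = Math.floorMod(cursor.getAndIncrement(), locations.size());
    return locations.get(index);
  }
}
```

Because the cursor lives in one shared object, this only balances writes when all sub tasks go through the same instance. That is the limitation the description points out for the middleManager, where each task has its own instance.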
This PR has: