[SPARK-3000][CORE] drop old blocks to disk in parallel when free memory is not enough for caching new blocks #2134
Conversation
QA tests have started for PR 2134 at commit
QA tests have finished for PR 2134 at commit
@andrewor14
test this please
@liyezhang556520 I'm a little swamped with the 1.1 release at the moment, but I'll try to look at this soon after we put out some fires there. Thanks for your PR.
incorrect indentation.
@ScrapCodes
updated, thanks~
There is something similar in #791.

And some of the comments there apply to this patch as well.
@ScrapCodes Thanks for your comment!
@mridulm, @tdas, @andrewor14, @ScrapCodes Can one of you help review the code?
Can you explain how this differs from SPARK-1888/#791? Is this just a duplicate?
@pwendell I think they are duplicates in JIRA (I didn't discover the similar JIRA before I opened a new one). But the two PRs are based on different code bases. This PR is based on [SPARK-1777]/#1165, whose logic differs a lot from what came before.
But this one is 5X more code, so I'm just wondering if there is a difference in the feature set...
@pwendell
@pwendell And also, SPARK-1888/#791 has a problem maintaining freeMemory: freeMemory is not updated for the next blocks entering tryToPut after the previous blocks have finished selecting their to-be-dropped blocks (the previous blocks effectively reserve that freeMemory, so freeMemory should be reduced before the next blocks are considered).
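The accounting problem can be illustrated with a toy model (plain Python; this is not Spark's actual MemoryStore API, and all names here are illustrative): memory claimed by a pending put must be subtracted from the free-memory counter under a lock before the next put is evaluated, otherwise two puts can both "see" the same free memory.

```python
import threading

class ToyMemoryStore:
    """Toy model (NOT Spark's MemoryStore): memory reserved by an
    in-flight put is subtracted from the free counter immediately,
    so later puts cannot claim the same space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0       # memory held by fully stored blocks
        self.reserved = 0   # memory claimed by in-flight puts
        self.lock = threading.Lock()

    @property
    def free(self):
        # Free memory must account for reservations, not just stored blocks.
        return self.capacity - self.used - self.reserved

    def try_to_put(self, size):
        """Reserve space for a block; return False if it does not fit."""
        with self.lock:
            if self.free < size:
                return False          # caller would have to drop old blocks
            self.reserved += size     # reserve under the lock
        # ... the slow unroll/serialize work would happen here, unlocked ...
        with self.lock:
            self.reserved -= size     # convert the reservation
            self.used += size
        return True

store = ToyMemoryStore(capacity=100)
print(store.try_to_put(60))  # True: 100 bytes free
print(store.try_to_put(60))  # False: only 40 bytes remain
```

Without the `reserved` counter, the second put would still see 100 bytes free while the first put was in flight, which is exactly the stale-freeMemory behavior described above.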
Force-pushed ebe899b to 73b3339
@andrewor14, I have rebased the code and updated the SPARK-3000 design doc. Would you please take a look and help review the code? I think the current code has eliminated the OOM risk.
Test build #22998 has finished for PR 2134 at commit
Hey @liyezhang556520, sorry, I've been swamped with the 1.2 release. I will look at this shortly after that's out of the window.
Hi @andrewor14, do you have time to take a look at this patch? SPARK-4777 is supposed to have been fixed here.
Yes, it's a duplicate of your patch.
Hey @liyezhang556520, sorry for the delay. Thanks for writing up a detailed design doc on the JIRA. I'll take a look at it in a day or two.
Hi @andrewor14,
Hi @andrewor14, I don't know if you have reproduced this issue. As far as I know, most of your cases are tested on Amazon EC2 instances equipped with SSDs, and even one SSD's throughput can be higher than that of more than 3 HDDs, so this problem may not be that obvious on your cluster.
@liyezhang556520 I would like to fix the issue you raised in #1165 first (i.e. SPARK-4777) before looking at SPARK-3000, which seems to me more like an optimization. Let's agree on a solution in #3629 before making more progress in this PR, since it seems that there are logical conflicts between the two PRs. |
… not large enough for caching new blocks. Currently, dropping old blocks to make room for caching new blocks is processed by a single thread, which cannot fully utilize the disk throughput. If the to-be-dropped blocks are large, the dropping time will be very long. We need to process it in parallel. In this patch, the block-dropping operation is processed by multiple threads; before dropping, each thread selects the blocks to be dropped for itself.
…ig old blocks can only be used by the current blocks, in this way avoiding the OOM risk
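The per-thread selection idea in the commit message above can be sketched as a toy model (plain Python with threads; this is not the actual Spark BlockManager code, and every identifier is illustrative): each thread claims a disjoint set of victim blocks under a short selection lock, then does the slow drop-to-disk work in parallel, outside the lock.

```python
import threading

# Toy model of per-thread victim selection (NOT Spark's API).
blocks = {f"rdd_{i}": 10 for i in range(8)}  # block id -> size in bytes
selection_lock = threading.Lock()
pending_drop = set()                          # block ids already claimed

def select_victims(space_needed):
    """Claim enough unclaimed blocks to free `space_needed` bytes.
    Selection is serialized by the lock, so two threads can never
    claim the same block."""
    victims, freed = [], 0
    with selection_lock:
        for block_id, size in blocks.items():
            if freed >= space_needed:
                break
            if block_id not in pending_drop:
                pending_drop.add(block_id)    # claim it for this thread only
                victims.append(block_id)
                freed += size
    return victims

dropped = []
drop_log_lock = threading.Lock()

def drop_to_disk(victims):
    # The slow disk writes happen here, concurrently across threads.
    with drop_log_lock:
        dropped.extend(victims)

threads = [threading.Thread(target=lambda: drop_to_disk(select_victims(20)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each block was claimed by exactly one thread, so no duplicates.
assert len(dropped) == len(set(dropped))
```

Keeping only the selection step under the lock is what lets the expensive disk writes proceed in parallel, while the claim set (`pending_drop`) prevents two threads from dropping, and therefore double-counting, the same block.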
Force-pushed 73b3339 to 3192a6d
Test build #30397 has finished for PR 2134 at commit
jenkins, retest this please
Test build #30398 has finished for PR 2134 at commit
Test build #30400 has finished for PR 2134 at commit
@liyezhang556520 this issue has mostly gone stale at this point, and I'm not sure if it's applicable anymore given some of the latest changes in master. Unfortunately I won't have the bandwidth to review this further and push it forward. I think we should close this patch for now and reopen it later if there's interest.
ok, I'll close this PR |