@liyezhang556520
Contributor

Currently, when old blocks must be dropped to make room for caching new blocks, the dropping is handled by only one thread at a time, which cannot fully utilize the disk throughput. If the total size of the blocks to be dropped is large, dropping takes a very long time. We need to process the drops in parallel. In this patch, block dropping is performed by multiple threads; before dropping, each thread selects the blocks that it will drop itself.
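
To illustrate the idea described above, here is a minimal sketch, not the patch's actual code. It is written in Python for brevity rather than Scala, and every name in it (`MemoryStoreSketch`, `select_blocks`, `drop`, `make_room`) is hypothetical and does not correspond to Spark's API. It assumes that selection happens under a short lock so that no two threads claim the same block, while the slow disk write happens outside the lock so that drops can overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MemoryStoreSketch:
    """Toy model of a memory store that drops blocks to disk in parallel."""

    def __init__(self, blocks):
        # blocks: block id -> size in bytes, in eviction order (hypothetical layout)
        self._blocks = dict(blocks)
        self._selected = set()         # blocks already claimed by some thread
        self._lock = threading.Lock()  # guards bookkeeping only, not disk I/O
        self.dropped = {}              # block id -> name of the thread that dropped it

    def select_blocks(self, space_needed):
        """Claim enough unclaimed blocks to free `space_needed` bytes."""
        claimed, freed = [], 0
        with self._lock:
            for block_id, size in self._blocks.items():
                if freed >= space_needed:
                    break
                if block_id not in self._selected:
                    self._selected.add(block_id)
                    claimed.append(block_id)
                    freed += size
        return claimed

    def drop(self, block_ids):
        """Drop the claimed blocks; only the bookkeeping takes the lock."""
        for block_id in block_ids:
            # A real store would write the block to disk here, outside the lock.
            with self._lock:
                self._blocks.pop(block_id)
                self.dropped[block_id] = threading.current_thread().name

    def make_room(self, space_needed):
        claimed = self.select_blocks(space_needed)
        self.drop(claimed)
        return claimed


# Four concurrent requests, each needing 200 bytes from eight 100-byte blocks.
store = MemoryStoreSketch({f"rdd_0_{i}": 100 for i in range(8)})
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(store.make_room, [200] * 4))
```

Because each thread marks its blocks as selected before it starts dropping, the four requests end up with disjoint sets of blocks regardless of how the threads interleave.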

@liyezhang556520 liyezhang556520 changed the title [SPARK-3000][CORE] drop old blocks to disk in parallel when memory is no... [SPARK-3000][CORE] drop old blocks to disk in parallel when memory is not large enough for caching new blocks Aug 26, 2014
@SparkQA
SparkQA commented Aug 26, 2014

QA tests have started for PR 2134 at commit 357dae8.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Aug 26, 2014

QA tests have finished for PR 2134 at commit 357dae8.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Aug 27, 2014

QA tests have started for PR 2134 at commit 3299414.

  • This patch merges cleanly.

@liyezhang556520
Contributor Author

@andrewor14
Can you help review the code?

@liyezhang556520 liyezhang556520 changed the title [SPARK-3000][CORE] drop old blocks to disk in parallel when memory is not large enough for caching new blocks [SPARK-3000][CORE] drop old blocks to disk in parallel when free memory is not enough for caching new blocks Aug 27, 2014
@SparkQA
SparkQA commented Aug 27, 2014

QA tests have finished for PR 2134 at commit 3299414.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExternalSorter(object):
    • protected class AttributeEquals(val a: Attribute)

@andrewor14
Contributor

test this please

@SparkQA
SparkQA commented Aug 27, 2014

QA tests have started for PR 2134 at commit 3299414.

  • This patch merges cleanly.

@andrewor14
Contributor

@liyezhang556520 I'm a little swamped with the 1.1 release at the moment, but I'll try to look at this soon after we put out some fires there. Thanks for your PR.

@SparkQA
SparkQA commented Aug 27, 2014

QA tests have finished for PR 2134 at commit 3299414.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • class ExternalSorter(object):
    • "$FWDIR"/bin/spark-submit --class $CLASS "$
    • protected class AttributeEquals(val a: Attribute)

Member

incorrect indentation.

Contributor Author

@ScrapCodes
updated, thanks~

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have started for PR 2134 at commit 9ec7d36.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Aug 29, 2014

QA tests have finished for PR 2134 at commit 9ec7d36.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ScrapCodes
Member

There is something similar in #791.

@ScrapCodes
Member

And some of the comments there apply to this patch as well.

@SparkQA
SparkQA commented Sep 2, 2014

QA tests have started for PR 2134 at commit f2f2c62.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Sep 2, 2014

QA tests have started for PR 2134 at commit 6604e9a.

  • This patch merges cleanly.

@liyezhang556520
Contributor Author

@ScrapCodes Thanks for your comment!

@SparkQA
SparkQA commented Sep 2, 2014

QA tests have finished for PR 2134 at commit f2f2c62.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ByteArrayChunkOutputStream(chunkSize: Int) extends OutputStream

@SparkQA
SparkQA commented Sep 2, 2014

QA tests have finished for PR 2134 at commit 6604e9a.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liyezhang556520
Contributor Author

@mridulm , @tdas , @andrewor14 , @ScrapCodes Can one of you help review the code?

@pwendell
Contributor

pwendell commented Sep 2, 2014

Can you explain how this differs from SPARK-1888/#791? Is this just a duplicate?

@liyezhang556520
Contributor Author

@pwendell I think they are duplicates in JIRA (I did not discover the similar JIRA before I opened a new one). But the two PRs are based on different code bases. This PR is based on [SPARK-1777]/#1165, whose logic differs a lot from the earlier code.

@pwendell
Contributor

pwendell commented Sep 2, 2014

But this one is 5X more code, so I'm just wondering if there is a difference in the feature set...

@liyezhang556520
Contributor Author

@pwendell
This patch also fixes some existing bugs introduced by [SPARK-1777]. Because [SPARK-1777] had to resolve the OOM problem, it changed the logic of the original code a lot, which makes it more complicated to drop blocks in parallel. That is why this patch needs 5X more code.

@liyezhang556520
Contributor Author

@pwendell Also, SPARK-1888/#791 has a problem maintaining freeMemory: freeMemory is not updated for the next blocks that call tryToPut after the previous blocks have finished selecting their to-be-dropped blocks. The previous blocks effectively reserve part of freeMemory, so freeMemory should be reduced for the blocks that follow.
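
The freeMemory point above can be sketched as follows. This is only an illustration under stated assumptions, not code from either PR: it is written in Python for brevity, and `MemoryAccountingSketch`, `reserve`, `evicted`, and `commit` are hypothetical names. The assumption is that a put reserves its bytes as soon as it has decided what to evict, so that the next put sees a smaller free amount instead of the stale one.

```python
import threading


class MemoryAccountingSketch:
    """Toy free-memory accounting with reservations for in-flight puts."""

    def __init__(self, max_memory, used_memory):
        self.max_memory = max_memory
        self.used_memory = used_memory
        self.reserved = 0              # bytes promised to puts that have not finished
        self._lock = threading.Lock()

    def free_memory(self):
        with self._lock:
            return max(0, self.max_memory - self.used_memory - self.reserved)

    def reserve(self, size):
        """Reserve `size` bytes; return how many bytes this put must evict."""
        with self._lock:
            # Clamp at zero: a deficit is already covered by earlier puts' evictions.
            free = max(0, self.max_memory - self.used_memory - self.reserved)
            self.reserved += size
            return max(0, size - free)

    def evicted(self, size):
        """Record that `size` bytes of old blocks were dropped."""
        with self._lock:
            self.used_memory -= size

    def commit(self, size):
        """Turn a reservation into used memory once the new block is stored."""
        with self._lock:
            self.reserved -= size
            self.used_memory += size


acct = MemoryAccountingSketch(max_memory=1000, used_memory=700)
shortfall_a = acct.reserve(200)    # fits in the 300 free bytes
free_after_a = acct.free_memory()  # the next put sees only the remainder
shortfall_b = acct.reserve(200)    # must evict the difference
acct.evicted(shortfall_b)
acct.commit(200)
acct.commit(200)
```

Without the reservation step, both puts would see 300 free bytes and conclude that neither needs to evict anything, even though together they need 400.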

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have started for PR 2134 at commit 71765eb.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Sep 3, 2014

QA tests have finished for PR 2134 at commit 71765eb.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liyezhang556520
Contributor Author

@andrewor14, I have rebased the code and updated the SPARK-3000 design doc. Would you please take a look and help review the code? I think the current code has gotten rid of the OOM risk.

@SparkQA
SparkQA commented Nov 6, 2014

Test build #22998 has finished for PR 2134 at commit 73b3339.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Hey @liyezhang556520 sorry I've been swamped with the 1.2 release. I will look at this shortly after that's out of the window

@liyezhang556520
Contributor Author

Hi @andrewor14, do you have time to take a look at this patch? SPARK-4777 should also be fixed by it.

@suyanNone
Contributor

Yes, it duplicates your patch.
I only saw your patch title, "parallel drop to disk", so I didn't look at the code in detail. I have already closed my patch.

@andrewor14
Contributor

Hey @liyezhang556520 sorry for the delay. Thanks for writing up a detailed design doc on the JIRA. I'll take a look at it in a day or two.

@andrewor14
Contributor

Quick question though, what does this patch provide that #3629 doesn't? It seems that they're both trying to solve the same problem but this one is much bigger (I haven't looked at the code in detail yet). Is there a particular issue that is addressed in this PR but not in #3629?

@liyezhang556520
Contributor Author

Hi @andrewor14 ,
PR#3629 solves the problem that I pointed out in your original patch, PR#1165; you can check the comment history from Aug 12th.
This PR does not mainly focus on that bug, but it resolves the bug as well.
This PR mainly focuses on the disk I/O issue, that is, the problem of dropping blocks from memory: only one thread drops blocks when cached RDD blocks need to be evicted to disk. This problem was also pointed out in PR#791. The main difference between this PR and PR#791 is that this PR also makes the tryToPut process run in parallel, which makes the memory accounting more complex. This PR also makes some changes to the test suite file.

@liyezhang556520
Contributor Author

Hi @andrewor14, I don't know whether you have reproduced this issue. As far as I know, most of your tests run on Amazon EC2 instances equipped with SSDs, and a single SSD's throughput may exceed that of more than 3 HDDs, so this problem may not be that obvious on your cluster.

@andrewor14
Contributor

@liyezhang556520 I would like to fix the issue you raised in #1165 first (i.e. SPARK-4777) before looking at SPARK-3000, which seems to me more like an optimization. Let's agree on a solution in #3629 before making more progress in this PR, since it seems that there are logical conflicts between the two PRs.

… not large enough for caching new blocks

Currently, when old blocks must be dropped to make room for caching new blocks, the dropping is handled by only one thread at a time, which cannot fully utilize the disk throughput. If the total size of the blocks to be dropped is large, dropping takes a very long time. We need to process the drops in parallel. In this patch, block dropping is performed by multiple threads; before dropping, each thread selects the blocks that it will drop itself.
…ig old blocks can only be used by current blocks, in this way to avoid OOM risk
@SparkQA
SparkQA commented Apr 16, 2015

Test build #30397 has finished for PR 2134 at commit 3192a6d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@liyezhang556520
Contributor Author

jenkins, retest this please

@SparkQA
SparkQA commented Apr 16, 2015

Test build #30398 has finished for PR 2134 at commit 3192a6d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA
SparkQA commented Apr 16, 2015

Test build #30400 has finished for PR 2134 at commit c248156.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@andrewor14
Contributor

@liyezhang556520 this issue has mostly gone stale at this point, and I'm not sure if it's applicable anymore given some of the latest changes in master. Unfortunately I won't have the bandwidth to review this further and push it forward. I think we should close this patch for now and reopen it later if there's interest.

@liyezhang556520
Contributor Author

ok, I'll close this PR

6 participants