Conversation

@milesgranger
Contributor

No description provided.

@milesgranger milesgranger force-pushed the milesgranger/pyspark-tpch-benchmarks branch 3 times, most recently from 1a09820 to be20e23 on September 28, 2023 11:37
@mrocklin
Member

Ha! This is fun. Is this working?

@milesgranger
Contributor Author

> Ha! This is fun. Is this working?

Works in a notebook, adapted from Florian's. Fails now for different, unrelated reasons.

@milesgranger milesgranger force-pushed the milesgranger/pyspark-tpch-benchmarks branch from be20e23 to e9a0b10 on September 29, 2023 12:06
@milesgranger milesgranger changed the title from [WIP] Add pyspark tpch benchmarks to Add pyspark tpch benchmarks on Sep 29, 2023
@milesgranger milesgranger marked this pull request as ready for review September 29, 2023 12:06
@milesgranger milesgranger force-pushed the milesgranger/pyspark-tpch-benchmarks branch from e9a0b10 to 083756e on September 29, 2023 13:02
@mrocklin
Member

It looks like this might work now? If so, I'll be curious to see what performance is like. (although I appreciate that that might take more work than what's here so far)

@milesgranger
Contributor Author

milesgranger commented Sep 29, 2023

Does indeed work; planning to do preliminary comparison to #971 on Monday. I'm sure some things will need to be changed/adjusted. :)

@fjetter
Contributor

fjetter commented Sep 29, 2023

I believe this was the cluster https://cloud.coiled.io/clusters/281807/information?account=dask-benchmarks&tab=Code

The hardware metrics all look a little disappointing. Nothing is utilized properly. I guess the 100 GB dataset is just too small to see any interesting activity. Well, memory seems to increase pretty consistently, but I assume this is JVM foo.

@ntabris
Member

ntabris commented Sep 29, 2023

> Well, memory seems to increase pretty consistently, but I assume this is JVM foo.

At my old job, there were QA folks using some automated tools, and I think I literally had to tell them at least 20 times that this pattern from the JVM didn't mean there was a memory leak. (We had the JVM configured not to deallocate to the OS, which I think is a pretty typical config.)
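For reference, a minimal sketch of how one could ask the executor JVMs to hand freed heap back to the OS, should smoother memory graphs ever matter for these benchmarks. The flags are standard HotSpot options; whether they are worth setting here is an open question:

```python
from pyspark.sql import SparkSession

# Sketch only: G1GC plus heap-shrinking ratios encourages the JVM to return
# freed pages to the OS, so worker memory graphs track real usage instead
# of the grow-and-hold curve discussed above.
spark = (
    SparkSession.builder
    .config(
        "spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30",
    )
    .getOrCreate()
)
```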

@mrocklin
Member

I'm guessing that the troughs between the peaks are some setup/teardown code. Is that correct? Maybe this is because the WorkerPlugin is running repeatedly and downloading stuff repeatedly. Maybe we can make this a bit smoother by avoiding repeats there? There are probably some other things we could be sensitive to there.
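A minimal sketch of the kind of idempotence guard meant here, assuming a Dask WorkerPlugin does the provisioning; `download_and_install_spark` is a hypothetical placeholder for the expensive step:

```python
import os
from distributed.diagnostics.plugin import WorkerPlugin

def download_and_install_spark():
    """Hypothetical placeholder for the expensive per-worker download."""

class SparkSetup(WorkerPlugin):
    # Idempotence sketch: a marker file lets setup() short-circuit when a
    # previous benchmark on the same worker already provisioned everything.
    name = "spark-setup"
    marker = "/tmp/.spark-setup-done"

    def setup(self, worker):
        if os.path.exists(self.marker):
            return  # already provisioned on this worker; skip the download
        download_and_install_spark()
        open(self.marker, "w").close()

# client.register_worker_plugin(SparkSetup(), name=SparkSetup.name)
```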

@mrocklin
Member

Also, every time I've tried to use Spark I've missed some critical configuration parameter. We should run this by a Spark person. @fjetter do you have that covered or should I go hunting?

@milesgranger milesgranger force-pushed the milesgranger/pyspark-tpch-benchmarks branch from 5000543 to 45cfc40 on September 30, 2023 04:43
@milesgranger
Contributor Author

milesgranger commented Sep 30, 2023

There were some setup/teardown calls being made between the queries; I moved those to the cluster setup and then noticed it was running too fast, which turned out to be because we needed to materialize the dataframe. This cluster is more representative now: https://cloud.coiled.io/clusters/282553/information?account=dask-engineering&tab=Metrics


Edit: and I agree, there are surely some Spark configurations we ought to investigate and trial.
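For context, a minimal sketch of the materialization step meant above, assuming an existing `spark` session; the query itself is illustrative. Spark builds plans lazily, so without a terminal action the benchmark times plan construction rather than execution:

```python
# Spark evaluates lazily: defining a query costs almost nothing, so a
# benchmark that never triggers an action measures nearly nothing. Any
# terminal action forces the run; count() pulls every partition through
# the full plan without shipping rows back to the driver.
df = spark.read.parquet("s3a://coiled-runtime-ci/tpch_scale_100/lineitem")
query = df.groupBy("l_returnflag").count()  # lazy: nothing has run yet
query.count()                               # action: materializes the result
```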

@milesgranger milesgranger force-pushed the milesgranger/pyspark-tpch-benchmarks branch 2 times, most recently from fdc5aec to 623bf40 on September 30, 2023 05:43
@milesgranger
Contributor Author

milesgranger commented Sep 30, 2023

As a side note, occasionally I'm getting the following error with the queries, specifically when materializing the Spark DF:

java.nio.file.AccessDeniedException: s3a://coiled-runtime-ci/tpch_scale_100/part/part.0.parquet: getFileStatus on s3a://coiled-runtime-ci/tpch_scale_100/part/part.0.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: RWF629STDDBDBT3Q

Full error Exception: ('Long error message', , 'An error occurred while calling o36.showString.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 49 in stage 1.0 failed 4 times, most recent failure: Lost task 49.3 in stage 1.0 (TID 63) (10.0.29.67 executor 6): java.nio.file.AccessDeniedException: s3a://coiled-runtime-ci/tpch_scale_100/lineitem/part.765.parquet: getFileStatus on s3a://coiled-runtime-ci/tpch_scale_100/lineitem/part.765.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: JKG9XPAYKWF912MY; S3 Extended Request ID: vuyILTZotzrJpWUFiTd6I5GPIt4guzoNkjQ2qfxOEbJoKLVZlKWluRDd457/sPLvSqNYqNnzUXH8Mll4NR1qyg==; Proxy: null), S3 Extended Request ID: vuyILTZotzrJpWUFiTd6I5GPIt4guzoNkjQ2qfxOEbJoKLVZlKWluRDd457/sPLvSqNYqNnzUXH8Mll4NR1qyg==:403 Forbidden\n\tat org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)\n\tat org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3796)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556)\n\tat org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)\n\tat org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3554)\n\tat org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:211)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:210)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)\n\tat org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)\n\tat org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)\n\tat org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)\n\tat org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:594)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithKeys_0$(Unknown Source)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)\n\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)\n\tat org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)\n\tat 
scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)\n\tat org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)\n\tat org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:139)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1623)\nCaused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: JKG9XPAYKWF912MY; S3 Extended Request ID: vuyILTZotzrJpWUFiTd6I5GPIt4guzoNkjQ2qfxOEbJoKLVZlKWluRDd457/sPLvSqNYqNnzUXH8Mll4NR1qyg==; Proxy: null), S3 Extended Request ID: vuyILTZotzrJpWUFiTd6I5GPIt4guzoNkjQ2qfxOEbJoKLVZlKWluRDd457/sPLvSqNYqNnzUXH8Mll4NR1qyg==\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)\n\tat com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)\n\tat com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)\n\tat com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)\n\tat com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)\n\tat com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)\n\tat com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$10(S3AFileSystem.java:2545)\n\tat org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:414)\n\tat org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:377)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2533)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2513)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3776)\n\t... 
34 more\n\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)\n\tat scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)\n\tat scala.Option.foreach(Option.scala:407)\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\nCaused by: java.nio.file.AccessDeniedException: s3a://coiled-runtime-ci/tpch_scale_100/lineitem/part.765.parquet: getFileStatus on s3a://coiled-runtime-ci/tpch_scale_100/lineitem/part.765.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: JKG9XPAYKWF912MY; S3 Extended Request ID: vuyILTZotzrJpWUFiTd6I5GPIt4guzoNkjQ2qfxOEbJoKLVZlKWluRDd457/sPLvSqNYqNnzUXH8Mll4NR1qyg==; Proxy: null), S3 Extended Request ID: vuyILTZotzrJpWUFiTd6I5GPIt4guzoNkjQ2qfxOEbJoKLVZlKWluRDd457/sPLvSqNYqNnzUXH8Mll4NR1qyg==:403 Forbidden\n\tat org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:255)\n\tat org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:175)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3796)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getFileStatus$24(S3AFileSystem.java:3556)\n\tat org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)\n\tat org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)\n\tat org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3554)\n\tat org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:211)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:210)\n\tat org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.')

/opt/coiled/env/lib/python3.10/site-packages/py4j/protocol.py:326: Exception

In this cluster for example: https://cloud.coiled.io/clusters/282430/information?account=dask-engineering&tab=Logs&filterPattern=&computation=&sinceMs=1696055246817&untilMs=1696055251817
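Possibly unrelated to the actual cause here, but a common source of intermittent S3A 403s is temporary STS credentials: the Hadoop connector ignores the session token unless it is told to use the temporary-credentials provider, and an expired token also surfaces as a blanket 403 on getFileStatus. A hedged sketch, assuming credentials arrive via the usual environment variables:

```python
import os
from pyspark.sql import SparkSession

# S3A defaults to plain access-key authentication; with STS temporary
# credentials the session token must be forwarded explicitly, otherwise
# getFileStatus calls come back 403 Forbidden.
spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    )
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .config("spark.hadoop.fs.s3a.session.token", os.environ["AWS_SESSION_TOKEN"])
    .getOrCreate()
)
```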

@milesgranger milesgranger force-pushed the milesgranger/pyspark-tpch-benchmarks branch from 623bf40 to 11539b6 on September 30, 2023 11:07
@fjetter
Contributor

fjetter commented Oct 2, 2023

> At my old job, there were QA folks using some automated tools, and I think I literally had to tell them at least 20 times that this pattern from the JVM didn't mean there was a memory leak. (We had the JVM configured not to deallocate to the OS, which I think is a pretty typical config.)

Thanks for confirming. That's what I thought.

> Also, every time I've tried to use Spark I've missed some critical configuration parameter. We should run this by a Spark person. @fjetter do you have that covered or should I go hunting?

Yeah, I wouldn't be surprised if that was the case. We're only using the defaults.

> do you have that covered or should I go hunting?

I don't have a lot of Spark contacts. If you can find somebody, that'll be helpful. I'll poke Powers again, but that's it on my end, unfortunately.
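Until a Spark person weighs in, a sketch of the settings they would likely check first; the parameter names are standard Spark SQL configs, and the values are illustrative placeholders rather than recommendations:

```python
from pyspark.sql import SparkSession

# Knobs that a default install leaves at one-size-fits-all values; the
# numbers below are placeholders to be sized against the actual cluster.
conf = {
    "spark.sql.shuffle.partitions": "400",   # default 200; often ~2-3x total cores
    "spark.sql.adaptive.enabled": "true",    # AQE re-plans shuffles at runtime
    "spark.executor.memory": "8g",           # size to the instance type
    "spark.executor.memoryOverhead": "2g",   # off-heap headroom (shuffle, Netty)
}
builder = SparkSession.builder
for key, value in conf.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```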

@mrocklin
Member

mrocklin commented Oct 2, 2023 via email

@milesgranger
Contributor Author

Closing in favor of the work done in #1044.

@milesgranger milesgranger deleted the milesgranger/pyspark-tpch-benchmarks branch October 5, 2023 11:21