[SPARK-29259][SQL] call fs.exists only when necessary #25928

rahij · 2019-09-25T16:13:14Z

What changes were proposed in this pull request?

Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.

Why are the changes needed?

When saving a DataFrame to a Hadoop-compatible file system, Spark checks whether the output path exists before it inspects the SaveMode to decide whether it should actually insert data. However, the pathExists variable is never used in the case of SaveMode.Append. On some file systems the exists call can be expensive, so this PR makes that call only when necessary.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests should cover it since this doesn't change the behavior.

dongjoon-hyun · 2019-09-25T17:27:41Z

Hi, @rahij . Thank you for making a PR.

BTW, how long does it take?

> the exists call can be expensive

rahij · 2019-09-25T19:40:22Z

@dongjoon-hyun the exists method does:

return getFileStatus(f) != null;

The getFileStatus method can be slow depending on the implementation. The one I'm working with can take more than a minute for data sources with a lot of files.
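To make the cost argument concrete, here is a minimal sketch in which every exists call pays for one metadata lookup. `SketchFs` and `CountingFs` are hypothetical stand-ins written for this illustration, not the Hadoop `FileSystem` API, and treating `FileNotFoundException` as "does not exist" is an assumption about how such a wrapper typically behaves.

```scala
import java.io.FileNotFoundException

// Hypothetical stand-in for a file system client; not the Hadoop API.
abstract class SketchFs {
  // The potentially slow metadata lookup; throws if the path is missing.
  def getFileStatus(path: String): String

  // exists is only as cheap as the getFileStatus call it delegates to.
  def exists(path: String): Boolean =
    try getFileStatus(path) != null
    catch { case _: FileNotFoundException => false }
}

// Counts lookups so the cost of each exists call is visible.
class CountingFs(known: Set[String]) extends SketchFs {
  var lookups = 0

  def getFileStatus(path: String): String = {
    lookups += 1
    if (known.contains(path)) s"status:$path"
    else throw new FileNotFoundException(path)
  }
}
```

If a single lookup takes a minute on a given backend, each avoidable exists call adds that minute to the write.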

dongjoon-hyun · 2019-09-25T21:30:51Z

ok to test

dongjoon-hyun · 2019-09-25T21:32:03Z

@rahij . Could you create an Apache Spark JIRA issue for this? Then, you can use the prefix [SPARK-XXX][SQL]. You need to put the newly created JIRA ID in place of XXX.

HeartSaVioR · 2019-09-25T21:34:50Z

...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

        fs, catalogTable.get, qualifiedOutputPath, matchingPartitions)
    }

-    val pathExists = fs.exists(qualifiedOutputPath)


Would making this a lazy val do the same? It isn't even needed if we agree to follow my next suggestion.
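For readers unfamiliar with the lazy val idea, a minimal sketch follows. The `Mode` hierarchy and the `exists` function parameter are hypothetical simplifications introduced here (only three modes are modelled), not Spark's `SaveMode` or the code in this PR.

```scala
object LazyExistsSketch {
  // Hypothetical stand-ins for three of Spark's SaveMode values.
  sealed trait Mode
  case object Append extends Mode
  case object ErrorIfExists extends Mode
  case object Ignore extends Mode

  // `exists` models the expensive fs.exists(qualifiedOutputPath) call.
  def doInsertion(mode: Mode, exists: () => Boolean): Boolean = {
    // Evaluated at most once, and only if a branch below reads it.
    lazy val pathExists = exists()
    mode match {
      case Append => true        // never touches pathExists
      case Ignore => !pathExists // forces the check
      case ErrorIfExists =>
        if (pathExists) throw new IllegalStateException("path already exists")
        else true
    }
  }
}
```

Note that this only helps when the match is on `mode` alone: building a tuple such as `(mode, pathExists)` as the match scrutinee would force the lazy val for every mode, including Append.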

HeartSaVioR · 2019-09-25T21:44:34Z

...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

-          true
+    val pathExists = () => fs.exists(qualifiedOutputPath)
+
+    val doInsertion = mode match {


I think the only exceptional case is Append: it doesn't need to know about pathExists.

Personally, I find the current pattern matching cleaner and more concise. It might be better to exclude Append via an if statement and keep the current pattern matching (without Append) in the else branch. pathExists could then even be a local variable in the else branch, since I can't find any other usage.
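The shape suggested above can be sketched roughly as follows. This is a simplified illustration, not the merged code: the `Mode` hierarchy, the `exists` and `cleanup` function parameters, and the exception type are assumptions made so the example is self-contained.

```scala
object AppendFirstSketch {
  // Hypothetical stand-ins for Spark's SaveMode values.
  sealed trait Mode
  case object Append extends Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode
  case object Ignore extends Mode

  // `exists` models fs.exists(qualifiedOutputPath); `cleanup` models the
  // deletion of existing output before an overwrite.
  def doInsertion(mode: Mode, exists: () => Boolean, cleanup: () => Unit): Boolean =
    if (mode == Append) {
      true // no file system round trip at all
    } else {
      val pathExists = exists() // local to the only branch that needs it
      (mode, pathExists) match {
        case (ErrorIfExists, true) =>
          throw new IllegalStateException("path already exists")
        case (Overwrite, true) =>
          cleanup()
          true
        case (Overwrite, _) | (ErrorIfExists, false) =>
          true
        case (Ignore, e) =>
          !e
        case (other, e) =>
          throw new IllegalStateException(s"unsupported save mode $other ($e)")
      }
    }
}
```

Keeping the tuple match intact means the non-Append branches read the same as before, while Append returns before any file system call is made.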

SparkQA · 2019-09-26T00:04:30Z

Test build #111368 has finished for PR 25928 at commit e477da8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rahij · 2019-09-26T09:09:15Z

@dongjoon-hyun I created a ticket and updated the PR title.

@HeartSaVioR I've now updated it to do what you describe.

HeartSaVioR

LGTM

SparkQA · 2019-09-26T13:05:27Z

Test build #111417 has finished for PR 25928 at commit 9edfa7b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

rahij · 2019-09-26T14:00:09Z

Thanks for the approval @srowen. Would you be able to merge it when you get a chance?

viirya · 2019-09-26T15:42:19Z

...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala

+            deleteMatchingPartitions(fs, qualifiedOutputPath, customPartitionLocations, committer)
+            true
+          }
+        case (SaveMode.Overwrite, _) | (SaveMode.ErrorIfExists, false) =>


nit: we can simply put false here instead of _. It is clearer.

@viirya I just kept the existing code for these since I didn't want to make unnecessary changes.
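The nit is purely about readability. Because an earlier case already handles Overwrite with an existing path, the wildcard can only ever match false at that point. The sketch below uses hypothetical `Mode` values and string labels (neither is from the PR) to show that the two spellings select the same branch.

```scala
object WildcardSketch {
  // Hypothetical stand-ins for two of Spark's SaveMode values.
  sealed trait Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode

  // Returns a label for the branch taken, using the wildcard form.
  def withWildcard(mode: Mode, pathExists: Boolean): String =
    (mode, pathExists) match {
      case (ErrorIfExists, true) => "error"
      case (Overwrite, true)     => "overwrite-existing"
      // (Overwrite, true) was matched above, so _ can only be false here.
      case (Overwrite, _) | (ErrorIfExists, false) => "plain-insert"
    }

  // Same match with the explicit literal suggested in the review.
  def withFalse(mode: Mode, pathExists: Boolean): String =
    (mode, pathExists) match {
      case (ErrorIfExists, true) => "error"
      case (Overwrite, true)     => "overwrite-existing"
      case (Overwrite, false) | (ErrorIfExists, false) => "plain-insert"
    }
}
```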

dongjoon-hyun

+1, LGTM. Merged to master.

dongjoon-hyun · 2019-09-26T22:49:10Z

Welcome to the Apache Spark community, @rahij . You have been added to the Apache Spark contributor group.

Thank you, @srowen , @HeartSaVioR , @viirya , too!

dongjoon-hyun · 2019-09-26T22:50:25Z

BTW, @rahij . You can add rramsharan@palantir.com as an additional email in your GitHub settings. Then, your image will appear in the GitHub commit history.

rahij · 2019-09-27T10:46:45Z

Thanks @dongjoon-hyun, will do!

### What changes were proposed in this pull request?

Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.

### Why are the changes needed?

When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests should cover it since this doesn't change the behavior.

Closes apache#25928 from rahij/rr/exists-upstream.

Authored-by: Rahij Ramsharan <rramsharan@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

apache#25928

## What changes were proposed in this pull request?

When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the `pathExists` variable is actually not used in the case of SaveMode.Append. In some file systems, the `exists` call can be expensive and hence this PR makes that call only when necessary.

## How was this patch tested?

Existing unit tests should cover this.

Please review http://spark.apache.org/contributing.html before opening a pull request.

optimization: call fs.exists only when necessary

e477da8

rahij mentioned this pull request Sep 25, 2019

[SPARK-29259][SQL] call fs.exists only when necessary palantir/spark#607

Merged

HeartSaVioR reviewed Sep 25, 2019

dongjoon-hyun added the SQL label Sep 26, 2019

rahij changed the title ~~optimization: call fs.exists only when necessary~~ [SPARK-29259][SQL] call fs.exists only when necessary Sep 26, 2019

rahij added 3 commits September 26, 2019 10:06

CR comments

f9d008e

revert

2a4dbf0

fix imports

9edfa7b

HeartSaVioR approved these changes Sep 26, 2019

srowen reviewed Sep 26, 2019

srowen approved these changes Sep 26, 2019

viirya reviewed Sep 26, 2019

dongjoon-hyun approved these changes Sep 26, 2019

dongjoon-hyun closed this in 9f3c821 Sep 26, 2019

rahij deleted the rr/exists-upstream branch September 27, 2019 10:46

6 participants
