@rahij
Contributor

@rahij rahij commented Sep 25, 2019

What changes were proposed in this pull request?

Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.

Why are the changes needed?

When saving a DataFrame to a Hadoop file system, Spark first checks whether the output path exists and only then inspects the SaveMode to decide whether to insert data. However, the pathExists variable is not used when the mode is SaveMode.Append. On some file systems the exists call can be expensive, so this PR makes that call only when necessary.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests should cover it since this doesn't change the behavior.

@dongjoon-hyun
Member

Hi, @rahij . Thank you for making a PR.

BTW, regarding "the exists call can be expensive": how long does it take?

@rahij
Contributor Author

rahij commented Sep 25, 2019

@dongjoon-hyun the exists method does:

return getFileStatus(f) != null;

The getFileStatus method can be slow depending on the FileSystem implementation. The one I'm working with can take more than a minute for data sources with a lot of files.
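
To illustrate the delegation described above, here is a minimal sketch in Scala. `SimpleFs` and `CountingFs` are hypothetical stand-ins, not Hadoop's real `FileSystem` API; the `FileNotFoundException` handling is an assumption about how a default `exists` would treat a missing path. The point is only that every `exists` call costs one status lookup, whatever that lookup costs on the underlying store.

```scala
import java.io.FileNotFoundException

object ExistsSketch {
  /** Hypothetical minimal file-system interface; not Hadoop's real API. */
  trait SimpleFs {
    /** Returns a status string, or throws FileNotFoundException if the path is missing. */
    def getFileStatus(path: String): String

    /** Existence is derived from a status lookup, as in the line quoted above. */
    def exists(path: String): Boolean =
      try getFileStatus(path) != null
      catch { case _: FileNotFoundException => false }
  }

  /** In-memory implementation that records how many status lookups were made. */
  final class CountingFs(paths: Set[String]) extends SimpleFs {
    var lookups = 0

    def getFileStatus(path: String): String = {
      lookups += 1
      if (paths.contains(path)) s"status($path)"
      else throw new FileNotFoundException(path)
    }
  }
}
```

If `getFileStatus` is slow for a given store, each `exists` call pays that cost in full, which is why skipping the call for Append matters.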

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

@rahij . Could you create an Apache Spark JIRA issue for this? Then, you can use the prefix [SPARK-XXX][SQL]. You need to put the newly created JIRA ID in place of XXX.

fs, catalogTable.get, qualifiedOutputPath, matchingPartitions)
}

val pathExists = fs.exists(qualifiedOutputPath)
Contributor

@HeartSaVioR HeartSaVioR Sep 25, 2019

Would making this a lazy val do the same? It would not even be needed if we agree to follow my next suggestion.
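
As a rough comparison of the two options mentioned here, the sketch below counts how often a simulated expensive check runs when the flag is declared as a `lazy val` versus as a `() => Boolean` function. `Probe` and both helper methods are hypothetical names made up for this illustration; nothing here comes from Spark.

```scala
object LazyDemo {
  /** Counts how many times the simulated, expensive check runs. */
  final class Probe {
    var calls = 0
    def exists(): Boolean = { calls += 1; true }
  }

  /** Reads a lazy val `uses` times and returns the number of checks performed. */
  def withLazyVal(uses: Int): Int = {
    val probe = new Probe
    lazy val pathExists = probe.exists() // evaluated on first read, then cached
    (1 to uses).foreach(_ => pathExists)
    probe.calls
  }

  /** Calls a function value `uses` times and returns the number of checks performed. */
  def withFunction(uses: Int): Int = {
    val probe = new Probe
    val pathExists = () => probe.exists() // evaluated on every call
    (1 to uses).foreach(_ => pathExists())
    probe.calls
  }
}
```

Both forms skip the check entirely when the flag is never read. They differ only when it is read more than once: the lazy val runs the check once, while the function runs it every time.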

true
val pathExists = () => fs.exists(qualifiedOutputPath)

val doInsertion = mode match {
Contributor

@HeartSaVioR HeartSaVioR Sep 25, 2019

I think the only exceptional case is Append: it doesn't need to know about pathExists.

Personally, I find the current pattern matching cleaner and more concise, so it might be better to exclude Append with an if statement and keep the current pattern matching (without Append) in the else branch. Then pathExists could even be a local variable in the else branch, since I can't find any other usage.
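
The shape suggested in this comment could look roughly like the sketch below. It is a simplified illustration, not the merged Spark code: `Mode` and its case objects are hypothetical stand-ins for `SaveMode`, the by-name `exists` parameter stands in for `fs.exists(qualifiedOutputPath)`, and partition handling and deletion for Overwrite are left out.

```scala
object InsertDecision {
  // Hypothetical stand-ins for Spark's SaveMode values.
  sealed trait Mode
  case object Append extends Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode
  case object Ignore extends Mode

  /**
   * Returns true when data should be written.
   * `exists` is passed by name, so it is evaluated only in the else branch.
   */
  def shouldInsert(mode: Mode, exists: => Boolean): Boolean =
    if (mode == Append) {
      true // Append never needs to know whether the path exists
    } else {
      val pathExists = exists // the single, possibly expensive, check
      (mode, pathExists) match {
        case (ErrorIfExists, true) =>
          throw new IllegalStateException("path already exists")
        case (Overwrite, _) | (ErrorIfExists, false) => true
        case (Ignore, e) => !e
        case (Append, _) => true // unreachable here; keeps the match exhaustive
      }
    }
}
```

With this structure the existence check is a local value of the else branch, so the Append path never triggers it.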

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111368 has finished for PR 25928 at commit e477da8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rahij rahij changed the title from "optimization: call fs.exists only when necessary" to "[SPARK-29259][SQL] call fs.exists only when necessary" Sep 26, 2019
@rahij
Contributor Author

rahij commented Sep 26, 2019

@dongjoon-hyun I created a ticket and updated the PR title.

@HeartSaVioR I've now updated it to do what you describe.

Contributor

@HeartSaVioR HeartSaVioR left a comment

LGTM

@SparkQA

SparkQA commented Sep 26, 2019

Test build #111417 has finished for PR 25928 at commit 9edfa7b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rahij
Contributor Author

rahij commented Sep 26, 2019

Thanks for the approval @srowen. Would you be able to merge it when you get a chance?

deleteMatchingPartitions(fs, qualifiedOutputPath, customPartitionLocations, committer)
true
}
case (SaveMode.Overwrite, _) | (SaveMode.ErrorIfExists, false) =>
Member

nit: we can simply put false here instead of _. It is clearer.

Contributor Author

@viirya I just kept the existing code for these since I didn't want to make unnecessary changes.
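
For readers following this nit, the sketch below shows why the two spellings behave the same when an earlier case already handles the "Overwrite and path exists" combination. The `Mode` type and the string results are hypothetical and exist only to make the comparison checkable; they are not Spark's types or return values.

```scala
object WildcardVsFalse {
  // Hypothetical stand-ins for the two SaveMode values in the quoted case.
  sealed trait Mode
  case object Overwrite extends Mode
  case object ErrorIfExists extends Mode

  /** Variant that keeps the wildcard, as in the existing code. */
  def withWildcard(mode: Mode, pathExists: Boolean): String =
    (mode, pathExists) match {
      case (ErrorIfExists, true) => "fail"
      case (Overwrite, true) => "delete-then-insert"
      case (Overwrite, _) | (ErrorIfExists, false) => "insert"
    }

  /** Variant that spells out false, as the nit proposes. */
  def withFalse(mode: Mode, pathExists: Boolean): String =
    (mode, pathExists) match {
      case (ErrorIfExists, true) => "fail"
      case (Overwrite, true) => "delete-then-insert"
      case (Overwrite, false) | (ErrorIfExists, false) => "insert"
    }
}
```

Because cases are tried in order, the wildcard can only ever see `false` at that position, so the choice is about readability rather than behavior.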

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master.

@dongjoon-hyun
Member

Welcome to the Apache Spark community, @rahij . You have been added to the Apache Spark contributor group.

Thank you, @srowen , @HeartSaVioR , @viirya , too!

@dongjoon-hyun
Member

BTW, @rahij . You can add rramsharan@palantir.com as an additional email in your GitHub settings. Then, you can see your image in the GitHub commit history.

@rahij rahij deleted the rr/exists-upstream branch September 27, 2019 10:46
@rahij
Contributor Author

rahij commented Sep 27, 2019

Thanks @dongjoon-hyun, will do!

rahij added a commit to palantir/spark that referenced this pull request Sep 27, 2019
Closes apache#25928 from rahij/rr/exists-upstream.

Authored-by: Rahij Ramsharan <rramsharan@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
bulldozer-bot bot pushed a commit to palantir/spark that referenced this pull request Sep 27, 2019
## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain)

apache#25928

## What changes were proposed in this pull request?

When saving a DataFrame to a Hadoop file system, Spark first checks whether the output path exists and only then inspects the SaveMode to decide whether to insert data. However, the `pathExists` variable is not used when the mode is SaveMode.Append. On some file systems the `exists` call can be expensive, so this PR makes that call only when necessary.

## How was this patch tested?

Existing unit tests should cover this.
