-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29259][SQL] call fs.exists only when necessary #25928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi, @rahij . Thank you for making a PR. BTW, how long does it take?
|
|
@dongjoon-hyun the exists method does: the |
|
ok to test |
|
@rahij . Could you create an Apache Spark JIRA issue for this? Then, you can use the prefix |
| fs, catalogTable.get, qualifiedOutputPath, matchingPartitions) | ||
| } | ||
|
|
||
| val pathExists = fs.exists(qualifiedOutputPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would making this as lazy val do the same? Even not needed if we agree to follow my next suggestion.
| true | ||
| val pathExists = () => fs.exists(qualifiedOutputPath) | ||
|
|
||
| val doInsertion = mode match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only exceptional case is Append - it doesn't need to know about pathExists.
Personally current pattern matching looks cleaner and concise, it might be better to exclude Append via if statement and keep current pattern matching (without Append) in else statement. And then seems like pathExists even could be a local variable in else statement - I can't find the usage otherwise.
|
Test build #111368 has finished for PR 25928 at commit
|
|
@dongjoon-hyun created a ticket and update the PR title. @HeartSaVioR I've now updated it to do what you describe. |
HeartSaVioR
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Test build #111417 has finished for PR 25928 at commit
|
...ain/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala
Show resolved
Hide resolved
|
Thanks for the approval @srowen. Would you be able to merge it when you get a chance? |
| deleteMatchingPartitions(fs, qualifiedOutputPath, customPartitionLocations, committer) | ||
| true | ||
| } | ||
| case (SaveMode.Overwrite, _) | (SaveMode.ErrorIfExists, false) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can simply put false here instead of _. It is more clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I just kept the existing code for these since I didn't want to make unnecessary changes.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master.
|
Welcome to the Apache Spark community, @rahij . You are added to the Apache Spark contributor group. Thank you, @srowen , @HeartSaVioR , @viirya , too! |
|
BTW, @rahij . You can add |
|
Thanks @dongjoon-hyun, will do! |
### What changes were proposed in this pull request? Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand. ### Why are the changes needed? When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests should cover it since this doesn't change the behavior. Closes apache#25928 from rahij/rr/exists-upstream. Authored-by: Rahij Ramsharan <rramsharan@palantir.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## Upstream SPARK-XXXXX ticket and PR link (if not applicable, explain) apache#25928 ## What changes were proposed in this pull request? When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the `pathExists` variable is actually not used in the case of SaveMode.Append. In some file systems, the `exists` call can be expensive and hence this PR makes that call only when necessary. ## How was this patch tested? Existing unit tests should cover this. Please review http://spark.apache.org/contributing.html before opening a pull request.
What changes were proposed in this pull request?
Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.
Why are the changes needed?
When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing unit tests should cover it since this doesn't change the behavior.