GH-41493: [C++][S3] Add a new option to check existence before CreateDir #41822
Conversation
@pitrou @westonpace Could you please review when you have time?
cpp/src/arrow/filesystem/s3fs.cc (Outdated)
I'm curious, don't you want to do the checks by walking up instead of walking down?
For example, when CreateDir("a/b/c") is called, it would seem more logical to first check for a/b/c's existence, then a/b if it doesn't exist, etc.
I just reflected on it, and my two cents is to keep the walking-down logic. I'm doing the check inside the creation step: I can only build a wall from the bottom up, otherwise the world is upside down :) So I must check `a`, then `a/b`, then `a/b/c`.
When CreateDir("a/b/c") is called in non-recursive mode, `a/b/c`'s existence is checked immediately.
What I'm proposing is (example in the case of `a/b/c`):
- check which ancestors exist by walking up the directory tree: first `a/b/c`, then `a/b`... until you find the first existing ancestor;
- create the missing descendants by walking down the directory tree: for example, if you just found that `a` exists, create `a/b`, then `a/b/c`.
The idea is that, most of the time, almost the entire directory chain will already exist, especially if your workload is writing into a deeply partitioned dataset. So doing the directory checks from leaf to root should issue fewer requests and incur less latency.
More formally, let's call n the depth of the path given to CreateDir, and m the number of directories missing along that path.
- with the current approach, we're calling `HeadObject` O(n) times and `PutObject` O(m) times;
- with my proposal, we're calling `HeadObject` O(m) times and `PutObject` O(m) times.
Assuming that m is on average much smaller than n (m would be 0 or 1 most of the time), the approach I'm proposing should be much more efficient, and it would never be less efficient anyway.
Does that make sense or am I missing something?
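To make the proposal concrete, here is a minimal standalone sketch of the walk-up/walk-down strategy. This is not Arrow's actual implementation: `DirectoryExists` and `CreateEmptyDir` are hypothetical stand-ins for one `HeadObject` and one `PutObject` request, and the in-memory set simulates the bucket contents.

```cpp
// Sketch only: simulates the proposed CreateDir strategy against an
// in-memory "bucket" instead of real S3 requests.
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Directory keys that already exist in the simulated bucket.
std::set<std::string> existing = {"a", "a/b"};

bool DirectoryExists(const std::string& path) {  // stands in for one HeadObject
  return existing.count(path) > 0;
}

void CreateEmptyDir(const std::string& path) {   // stands in for one PutObject
  existing.insert(path);
  std::cout << "PutObject " << path << "/\n";
}

void CreateDirRecursive(const std::string& full_path) {
  // Build the ancestor chain, e.g. "a/b/c" -> {"a", "a/b", "a/b/c"}.
  std::vector<std::string> chain;
  for (size_t slash = full_path.find('/'); ;
       slash = full_path.find('/', slash + 1)) {
    if (slash == std::string::npos) {
      chain.push_back(full_path);
      break;
    }
    chain.push_back(full_path.substr(0, slash));
  }

  // Walk *up* from the leaf until the first existing ancestor is found:
  // roughly O(m) HeadObject calls when m trailing components are missing.
  size_t first_missing = chain.size();
  while (first_missing > 0 && !DirectoryExists(chain[first_missing - 1])) {
    --first_missing;
  }

  // Walk *down* and create only the missing components: O(m) PutObject calls.
  for (size_t i = first_missing; i < chain.size(); ++i) {
    CreateEmptyDir(chain[i]);
  }
}

int main() {
  // With "a" and "a/b" already present, only "a/b/c" triggers a PutObject,
  // after existence checks on "a/b/c" and "a/b".
  CreateDirRecursive("a/b/c");
}
```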
@pitrou Could you please review again?
In this round:
- Added the above logic.
- Ensured my tests are working, as I forgot to call `MakeFileSystem()`.
- Added missing tests for the non-recursive CreateDir scenario.
pitrou left a comment:
Thanks for submitting this @HaochengLIU! Here are some comments.
pitrou left a comment:
Thanks for the update @HaochengLIU. This looks good to me, bar some minor comments.
@bkietz Will we have to support this new option in S3 URIs as well?
Whoa, this is insane!
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit a44b537. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
I have a use case where thousands of jobs write Hive-partitioned Parquet files daily to the same bucket via the S3FS filesystem. The gist is that a lot of keys are being created at the same time, so jobs hit the `AWS Error SLOW_DOWN during Put Object operation: The object exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate.` error frequently throughout the day, since the code creates directories pessimistically.
What changes are included in this PR?
Add a new S3Option to check the existence of a directory before creating it in CreateDir. It is disabled by default. When enabled, CreateDir checks whether the directory already exists before creating it, so the create operation is only performed when necessary. Although this can issue more I/O calls, it avoids hitting the cloud vendor's put-object rate limit.
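For illustration, here is a rough sketch of how such an option might be enabled on `S3Options` before building the filesystem. The exact field name is not shown in this PR description, so `check_directory_existence_before_creation` below is an assumption, as is the example region.

```cpp
// Sketch only: enabling the new behaviour when constructing an S3FileSystem.
#include <arrow/filesystem/s3fs.h>
#include <arrow/result.h>

#include <memory>

arrow::Result<std::shared_ptr<arrow::fs::S3FileSystem>> MakeFs() {
  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  options.region = "us-east-1";  // example region
  // Opt in to checking directory existence before issuing PutObject calls.
  // (Assumed field name; per the description, the option is off by default.)
  options.check_directory_existence_before_creation = true;
  return arrow::fs::S3FileSystem::Make(options);
}
```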
Are these changes tested?
Added test cases for when the flag is set to true. Off the top of my head, I don't know how to verify in these tests that the existence checks actually happen, but in our production environment we have very similar code and it has worked well.
Are there any user-facing changes?