ARROW-453: [C++] Filesystem implementation for Amazon S3 #5167
Conversation
s3fs test passed on Travis-CI: https://travis-ci.org/pitrou/arrow/jobs/575384780#L5243
Force-pushed from adba89c to 2392160
s3fs test passed on AppVeyor: https://ci.appveyor.com/project/pitrou/arrow/build/job/o639sped92wndo5f#L1740
Codecov Report
    @@             Coverage Diff             @@
    ##            master    #5167      +/-   ##
    ============================================
    + Coverage    87.53%   88.75%    +1.21%
    ============================================
      Files          923      938       +15
      Lines       140358   122444    -17914
      Branches         0     1437     +1437
    ============================================
    - Hits        122868   108679    -14189
    + Misses       17138    13403     -3735
    - Partials       352      362       +10
Continue to review full report at Codecov.
Will endeavor to get a review to you in the next couple of working days
wesm left a comment:
Thanks for getting this off the ground! S3 is a fairly complex beast. I left some sparse comments; I didn't scrutinize the specifics of the unit tests in too great detail.
Suffice to say, a next stage of this project will be making sure we can run the test suite against "the real thing" (aka Amazon's S3), since that's how most people will be using this software in practice. We should probably also do some basic in-EC2 performance evaluations to get a baseline understanding of in-data-center and outside-of-data-center performance in different scenarios. Maybe we can start a benchmark suite that can be modestly configured (probably using a JSON configuration file of some kind). Anyway, plenty of follow-up JIRA issues can be created.
    macro(build_awssdk)
      message(
          FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong libcrypto")
That's super fun.
cpp/src/arrow/filesystem/s3fs.h
Outdated
    struct ARROW_EXPORT S3Options {
      // AWS region to connect to
      std::string region = "us-east-1";
Do you know how boto/boto3, Python s3fs, or other S3 libraries handle setting the region? Do they automatically determine the region for a bucket?
I have no idea. But it seems to be configuration-based.
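For reference, boto3 and the AWS CLI resolve the region from configuration rather than from the bucket: the AWS_REGION / AWS_DEFAULT_REGION environment variables and the ~/.aws/config file are consulted before falling back to a default. A minimal, self-contained sketch of that kind of fallback (the helper name is hypothetical, not part of this PR):

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve the S3 region the way the AWS CLI / boto3 do,
// i.e. from configuration rather than by querying the bucket.
std::string ResolveRegion() {
  // 1. Explicit environment variables take precedence.
  for (const char* var : {"AWS_REGION", "AWS_DEFAULT_REGION"}) {
    if (const char* value = std::getenv(var)) {
      return value;
    }
  }
  // 2. (Elided) The ~/.aws/config [default] section would be consulted here.
  // 3. Fall back to the library default, matching S3Options above.
  return "us-east-1";
}
```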
cpp/src/arrow/filesystem/s3fs.cc
Outdated
    if (path.key.empty()) {
      // Create bucket
      return impl_->CreateBucket(path.bucket);
It seems like we should not make it too easy to create buckets; but under AWS access controls, creating buckets may be disallowed (so this would simply fail). I don't think there's necessarily anything to change here.
cpp/src/arrow/filesystem/s3fs.cc
Outdated
    }

    // XXX Should we check that no non-directory entry exists?
    // Minio does it for us, not sure about other S3 implementations.
Definitely checking against Amazon S3 would be a good idea
cpp/src/arrow/filesystem/s3fs.cc
Outdated
    S3Model::DeleteBucketRequest req;
    req.SetBucket(ToAwsString(path.bucket));
    return OutcomeToStatus(impl_->client_->DeleteBucket(req));
    }
Unclear if we want to allow this function to delete buckets (since these are the "heaviest" of S3 objects) by default. It might be better to error when trying to delete a bucket and instead offer a separate DeleteBucket API.
At this point, the bucket is already empty, though...
    #define ARROW_AWS_ASSIGN_OR_FAIL(lhs, rexpr) \
      ARROW_AWS_ASSIGN_OR_FAIL_IMPL(             \
          ARROW_AWS_ASSIGN_OR_FAIL_NAME(_aws_error_or_value, __COUNTER__), lhs, rexpr);
Is this the same as what's in s3_internal.h?
No, this one fails in the GTest sense.
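For context, an assign-or-fail macro of this shape typically expands to a uniquely named temporary holding the SDK Outcome, an error check, and an assignment. The toy illustration below is self-contained and hypothetical: Outcome here is a stand-in for Aws::Utils::Outcome, and the failure path returns false rather than triggering a GTest failure or an arrow::Status error as the two real variants do:

```cpp
#include <string>

// Toy stand-in for Aws::Utils::Outcome<Result, Error>.
template <typename T>
struct Outcome {
  bool ok;
  T value;
  bool IsSuccess() const { return ok; }
  const T& GetResult() const { return value; }
};

// Token-pasting helpers so each expansion gets a unique temporary name
// (this is what the __COUNTER__ argument above is for).
#define ASSIGN_OR_FAIL_NAME_INNER(x, y) x##y
#define ASSIGN_OR_FAIL_NAME(x, y) ASSIGN_OR_FAIL_NAME_INNER(x, y)

// On failure, bail out of the enclosing function; on success, assign
// the unwrapped result to `lhs`.
#define ASSIGN_OR_FAIL_IMPL(tmp, lhs, rexpr) \
  auto tmp = (rexpr);                        \
  if (!tmp.IsSuccess()) return false;        \
  lhs = tmp.GetResult();

#define ASSIGN_OR_FAIL(lhs, rexpr) \
  ASSIGN_OR_FAIL_IMPL(ASSIGN_OR_FAIL_NAME(_outcome_, __COUNTER__), lhs, rexpr)

// Example consumer: unwraps the outcome into *out, or returns false.
bool UseOutcome(Outcome<std::string> outcome, std::string* out) {
  std::string value;
  ASSIGN_OR_FAIL(value, outcome);
  *out = value;
  return true;
}
```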
Force-pushed from 0b4a7ec to ca8c7d6
OK, I added a narrative test and checked it against Amazon. Also addressed some review comments.
cpp/src/arrow/filesystem/s3fs.cc
Outdated
    if (current_part_size_ >= part_upload_threshold_) {
      // Current part large enough, upload it
      RETURN_NOT_OK(CommitCurrentPart());
In this specific case, can't we always re-use the same buffer and adjust capacity, and use the size as indicator?
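The suggestion can be illustrated with a plain std::string used as the part buffer: clear() resets the size to zero while keeping the allocation, so the size alone indicates whether data is pending and the next part reuses the same capacity. This is a sketch of the idea only, not the PR's actual output-stream code (the PartBuffer type is hypothetical):

```cpp
#include <string>

// Sketch: reuse one buffer across multipart-upload parts. clear() drops
// the size but keeps the capacity, so appending the next part's data
// does not have to reallocate.
struct PartBuffer {
  std::string data;

  void Append(const std::string& chunk) { data += chunk; }

  // Hand off the current contents for upload, then reuse the allocation.
  std::string TakePart() {
    std::string part = data;  // copy out (a real impl might move instead)
    data.clear();             // size -> 0, capacity retained
    return part;
  }

  // The size doubles as the "is a part pending?" indicator.
  bool HasPending() const { return !data.empty(); }
};
```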
cpp/src/arrow/filesystem/s3fs.h
Outdated
    std::string scheme = "https";

    std::string access_key;
    std::string secret_key;
I think we should align with what the usual AWS CLI provides via AWSCredentialsProvider and AWSCredentialsProviderChain. I'm not sure how we should expose this in the options struct.
To add more, I suspect that users will expect the usual mechanism to work: environment variables, file-backed config, and then explicit auth credentials.
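The fall-through order described above is roughly what aws-sdk-cpp's DefaultAWSCredentialsProviderChain implements. A self-contained sketch of that resolution logic (the Credentials struct and helper are hypothetical illustrations, and the config-file step is elided; only the environment variable names are the real AWS ones):

```cpp
#include <cstdlib>
#include <string>

struct Credentials {
  std::string access_key;
  std::string secret_key;
};

// Hypothetical sketch of provider-chain resolution: explicit options win,
// then the standard environment variables, then (elided) the
// ~/.aws/credentials file, then anonymous access.
Credentials ResolveCredentials(const std::string& explicit_key,
                               const std::string& explicit_secret) {
  // 1. Explicitly configured credentials take precedence.
  if (!explicit_key.empty()) {
    return {explicit_key, explicit_secret};
  }
  // 2. Standard environment variables, as the AWS CLI uses.
  const char* key = std::getenv("AWS_ACCESS_KEY_ID");
  const char* secret = std::getenv("AWS_SECRET_ACCESS_KEY");
  if (key && secret) {
    return {key, secret};
  }
  // 3. (Elided) ~/.aws/credentials would be read here.
  return {"", ""};  // anonymous fallback
}
```

With this shape, leaving access_key/secret_key empty in the options struct would fall through to the chain, which matches the "usual mechanism works" expectation.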
Force-pushed from 9303131 to 9c1abaa
Unit testing is done using the Minio standalone S3 server.
Thanks @fsaintjacques :-)
Currently missing from this PR:
- Building the AWS C++ SDK from source (it currently links with the wrong libcrypto, ending up loading two different OpenSSL versions at runtime)