Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Aug 22, 2019

Unit testing is done using the Minio standalone S3 server.

Currently missing from this PR:

  • fix the bundled building (currently it links against the wrong version of libcrypto, ending up loading two different OpenSSL versions at runtime)

@pitrou pitrou changed the title ARROW-453: [C++] Filesystem implementation for S3 [WIP] ARROW-453: [C++] Filesystem implementation for S3 Aug 22, 2019
@pitrou
Copy link
Member Author

pitrou commented Aug 22, 2019

s3fs test passed on Travis-CI: https://travis-ci.org/pitrou/arrow/jobs/575384780#L5243

@pitrou pitrou force-pushed the ARROW-453-s3fs branch 3 times, most recently from adba89c to 2392160 Compare August 22, 2019 16:27
@pitrou
Copy link
Member Author

pitrou commented Aug 22, 2019

@wesm wesm changed the title [WIP] ARROW-453: [C++] Filesystem implementation for S3 [WIP] ARROW-453: [C++] Filesystem implementation for Amazon S3 Aug 22, 2019
@codecov-io
Copy link

codecov-io commented Aug 22, 2019

Codecov Report

Merging #5167 into master will increase coverage by 1.21%.
The diff coverage is 94.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5167      +/-   ##
==========================================
+ Coverage   87.53%   88.75%   +1.21%     
==========================================
  Files         923      938      +15     
  Lines      140358   122444   -17914     
  Branches        0     1437    +1437     
==========================================
- Hits       122868   108679   -14189     
+ Misses      17138    13403    -3735     
- Partials      352      362      +10
Impacted Files Coverage Δ
cpp/src/arrow/testing/util.h 91.3% <ø> (ø) ⬆️
cpp/src/arrow/filesystem/filesystem.h 100% <ø> (ø) ⬆️
cpp/src/arrow/filesystem/filesystem.cc 80% <0%> (-6.8%) ⬇️
cpp/src/arrow/filesystem/test_util.cc 99.4% <100%> (+0.05%) ⬆️
cpp/src/arrow/filesystem/filesystem_test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/filesystem/path_util.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/filesystem/path_util.h 100% <100%> (ø) ⬆️
cpp/src/arrow/filesystem/test_util.h 100% <100%> (ø) ⬆️
cpp/src/arrow/filesystem/s3_internal.h 100% <100%> (ø)
cpp/src/arrow/filesystem/s3fs.h 100% <100%> (ø)
... and 207 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17a0709...de1795a. Read the comment docs.

@wesm
Copy link
Member

wesm commented Aug 22, 2019

Will endeavor to get a review to you in the next couple working days

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this off the ground! S3 is a fairly complex beast. I left some sparse comments. I didn't scrutinize the specifics of the unit tests in too great detail

Suffice to say a next stage of this project will be making sure we can the test suite against "the real thing" (aka Amazon's S3), since that's how most people will be using this software in practice. We should probably also do some basic in-EC2 performance evaluations to get a baseline understanding of in-data-center and outside-of-datacenter performance in different scenarios. Maybe we can start a benchmark suite that can be modestly configured (probably using a JSON configuration file of some kind). Anyway plenty of follow up JIRA issues that can be created


macro(build_awssdk)
message(
FATAL_ERROR "FIXME: Building AWS C++ SDK from source will link with wrong libcrypto")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's super fun.


struct ARROW_EXPORT S3Options {
// AWS region to connect to
std::string region = "us-east-1";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know how boto/boto3, Python s3fs, or other S3 libraries handle setting the region? Do they automatically determine the region for a bucket?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea. But it seems to be configuration-based.


if (path.key.empty()) {
// Create bucket
return impl_->CreateBucket(path.bucket);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should not make it too easy to create buckets, but in AWS access controls, creating buckets may be disallowed (so this would simply fail). I don't think there's necessarily anything to change here

}

// XXX Should we check that no non-directory entry exists?
// Minio does it for us, not sure about other S3 implementations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely checking against Amazon S3 would be a good idea

S3Model::DeleteBucketRequest req;
req.SetBucket(ToAwsString(path.bucket));
return OutcomeToStatus(impl_->client_->DeleteBucket(req));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear if we want to allow this function to delete buckets (since these are the "heaviest" of S3 objects) by default. It might be better to error when trying to delete a bucket and instead offset a DeleteBucket API

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, the bucket is already empty, though...


#define ARROW_AWS_ASSIGN_OR_FAIL(lhs, rexpr) \
ARROW_AWS_ASSIGN_OR_FAIL_IMPL( \
ARROW_AWS_ASSIGN_OR_FAIL_NAME(_aws_error_or_value, __COUNTER__), lhs, rexpr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as what's in s3_internal.h?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this one fails in the GTest sense.

@pitrou pitrou force-pushed the ARROW-453-s3fs branch 2 times, most recently from 0b4a7ec to ca8c7d6 Compare August 26, 2019 15:40
@pitrou
Copy link
Member Author

pitrou commented Aug 26, 2019

Ok, I added a narrative test and checked it against Amazon. Also addressed some review comments.

@pitrou pitrou changed the title [WIP] ARROW-453: [C++] Filesystem implementation for Amazon S3 ARROW-453: [C++] Filesystem implementation for Amazon S3 Aug 26, 2019

if (current_part_size_ >= part_upload_threshold_) {
// Current part large enough, upload it
RETURN_NOT_OK(CommitCurrentPart());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this specific case, can't we always re-use the same buffer and adjust capacity, and use the size as indicator?

std::string scheme = "https";

std::string access_key;
std::string secret_key;
Copy link
Contributor

@fsaintjacques fsaintjacques Aug 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should align with what usual aws cli provide via AWSCredentialsProvider and AWSCredentialsProviderChain. I'm not sure how we should expose this in the option struct.

To add more, I suspect that users will expect that the usual mecanism works, env var, file backed config and then explicit auth creds.

@pitrou pitrou force-pushed the ARROW-453-s3fs branch 2 times, most recently from 9303131 to 9c1abaa Compare August 26, 2019 17:46
Unit testing is done using the Minio standalone S3 server.
@pitrou pitrou closed this in 7ec1731 Aug 29, 2019
@pitrou
Copy link
Member Author

pitrou commented Aug 29, 2019

Thanks @fsaintjacques :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants