Skip to content

Conversation

@coryan
Copy link
Contributor

@coryan coryan commented Dec 20, 2021

No description provided.

@github-actions
Copy link

@coryan
Copy link
Contributor Author

coryan commented Dec 20, 2021

The appveyor failures look unrelated.

@coryan coryan marked this pull request as ready for review December 20, 2021 15:32
@coryan
Copy link
Contributor Author

coryan commented Dec 21, 2021

Rebased to resolve conflicts.

@coryan
Copy link
Contributor Author

coryan commented Dec 22, 2021

The failure on Java JNI / AMD64 Debian 9 Java JNI (Gandiva, Plasma, ORC, Dataset) looks unrelated, please take a look.

@coryan
Copy link
Contributor Author

coryan commented Jan 3, 2022

Ping

@coryan
Copy link
Contributor Author

coryan commented Jan 6, 2022

@emkornfield can you take a look?

const google::cloud::Status& status) {
using ::google::cloud::StatusCode;
auto canonical = internal::EnsureTrailingSlash(path.full_path);
static bool IsDirectory(gcs::ObjectMetadata const& o) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static bool IsDirectory(gcs::ObjectMetadata const& o) {
static bool IsDirectory(const gcs::ObjectMetadata& o) {

nit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there might be a lot more branches add here then tested in the changes, not sure how critical these are if they are covered by the generic file system tests?

"arrow/gcsfs", "directory")))
.status();
const auto canonical = internal::RemoveTrailingSlash(name).to_string();
auto object = client_.InsertObject(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, I think it would be clearer to spell out type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if (status.code() == GcsCode::kAlreadyExists) {
break;
if (dir.empty()) {
// We could not find any of the parent directories in the bucket, the last step is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this consistent with what the s3 connector does. bucket creation seems pretty heavy weight to maybe do it accidentally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is what you do for s3 too:

// Create object
if (recursive) {
// Ensure bucket exists
ARROW_ASSIGN_OR_RAISE(bool bucket_exists, impl_->BucketExists(path.bucket));
if (!bucket_exists) {
RETURN_NOT_OK(impl_->CreateBucket(path.bucket));
}

if (dir.empty()) {
// We could not find any of the parent directories in the bucket, the last step is
// to find out if the bucket exists, and if necessary, create it
auto b = client_.GetBucketMetadata(bucket);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling out type here would be more readable I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Come join the almost-always-auto world, it is nice here 😀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not that it isn't tempting, just not in the style guide used for this project.

Status Move(const GcsPath& src, const GcsPath& dest) {
if (src.full_path.empty() || src.object.empty() ||
src.object.back() == internal::kSep) {
if (src == dest) return {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: return Status::OK() is more idiomatic for the code base

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here and elsewhere.

if (!info.IsFile()) {
return Status::IOError("Only files can be opened as input streams");
if (info.IsDirectory()) {
return Status::IOError("Cannot open a directory as an input stream");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe include the path requested here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I also searched for any other places where I was not returning the affected path with the IOError.

Result<std::shared_ptr<io::RandomAccessFile>> GcsFileSystem::OpenInputFile(
const FileInfo& info) {
if (info.IsDirectory()) {
return Status::IOError("Cannot open a directory as an input stream");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment above about providing the path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed too.

@emkornfield
Copy link
Contributor

Looks like all failures are unrelated. I'll merge this on Monday if there are no further comments.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern here is: does this reduce interoperability with filesystem "hierarchies" created by other GCS clients?

// To find out if the directory exists we need to perform an additional query.
ARROW_ASSIGN_OR_RAISE(auto directory, GetFileInfo(p));
if (directory.IsDirectory()) return result;
return Status::IOError("No such file or directory '", select.base_dir, "'");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the error message accurate? If base_dir is an existing file, do we still get this message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if (!status.ok()) {
return status;
}
for (auto d = missing_parents.rbegin(); d != missing_parents.rend(); ++d) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, is it useful to iterate these in reverse order? Is it to avoid creating a/b if a exists as a file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Added a comment to explain that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment a bit off? It seems missing_parents is created in reverse depth order (from child to parent), so iterating it in reverse order visits the ancestors first and the descendents last?

[](FileInfo const& info) {
if (!info.IsDirectory()) return info;
return Dir(internal::RemoveTrailingSlash(info.path()).to_string());
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is non-trivial enough to be factored out in a separate function, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@coryan
Copy link
Contributor Author

coryan commented Jan 10, 2022

My main concern here is: does this reduce interoperability with filesystem "hierarchies" created by other GCS clients?

This is how I understand the problem:

  • The Apache/Arrow APIs want to be able to query if a directory "exists", even if empty.
  • That means we need to put some kind of marker in the GCS file system or any query will return kNotFound (even listing)
  • Most tools and libraries using GCS natively do not bother with these markers at all, so whatever we do will need to work when there are no markers (by basically listing all the files prefixed with directory/ whether a marker is found or not).
  • We can use markers using a trailing slash name/, that makes things more similar to the GCS UI.
  • That breaks the generic FS tests, and one may need two RPCs to check if name is a file or a directory.
    • And still get in trouble because something could create name and name/ in GCS (and name// for that matter).
  • Using metadata works for the generic FS tests, requires fewer RPCs, and generally seems to fit the APIs in Apache "better"
    • It makes things less similar to the GCS UI 🤷

That is, both solutions have downsides. If you want things to be more compatible with the GCS UI we can certainly do that. I think the really interesting case is working when there are no markers at all, which we can get to work in both cases.

@pitrou
Copy link
Member

pitrou commented Jan 11, 2022

I think the really interesting case is working when there are no markers at all, which we can get to work in both cases.

Great, I have no problem with this approach then.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @coryan

@pitrou pitrou closed this in 2164b0b Jan 11, 2022
@ursabot
Copy link

ursabot commented Jan 11, 2022

Benchmark runs are scheduled for baseline = aa59d17 and contender = 2164b0b. 2164b0b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.04% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants