-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15114: [C++] GcsFileSystem uses metadata for directory markers #11996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15114: [C++] GcsFileSystem uses metadata for directory markers #11996
Conversation
|
The appveyor failures look unrelated. |
|
Rebased to resolve conflicts. |
|
The failure on |
|
Ping |
|
@emkornfield can you take a look? |
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| const google::cloud::Status& status) { | ||
| using ::google::cloud::StatusCode; | ||
| auto canonical = internal::EnsureTrailingSlash(path.full_path); | ||
| static bool IsDirectory(gcs::ObjectMetadata const& o) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static bool IsDirectory(gcs::ObjectMetadata const& o) { | |
| static bool IsDirectory(const gcs::ObjectMetadata& o) { |
nit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like there might be a lot more branches add here then tested in the changes, not sure how critical these are if they are covered by the generic file system tests?
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| "arrow/gcsfs", "directory"))) | ||
| .status(); | ||
| const auto canonical = internal::RemoveTrailingSlash(name).to_string(); | ||
| auto object = client_.InsertObject( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, I think it would be clearer to spell out type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| if (status.code() == GcsCode::kAlreadyExists) { | ||
| break; | ||
| if (dir.empty()) { | ||
| // We could not find any of the parent directories in the bucket, the last step is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this consistent with what the s3 connector does. bucket creation seems pretty heavy weight to maybe do it accidentally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is what you do for s3 too:
arrow/cpp/src/arrow/filesystem/s3fs.cc
Lines 2305 to 2311 in b325ef7
| // Create object | |
| if (recursive) { | |
| // Ensure bucket exists | |
| ARROW_ASSIGN_OR_RAISE(bool bucket_exists, impl_->BucketExists(path.bucket)); | |
| if (!bucket_exists) { | |
| RETURN_NOT_OK(impl_->CreateBucket(path.bucket)); | |
| } |
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| if (dir.empty()) { | ||
| // We could not find any of the parent directories in the bucket, the last step is | ||
| // to find out if the bucket exists, and if necessary, create it | ||
| auto b = client_.GetBucketMetadata(bucket); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spelling out type here would be more readable I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Come join the almost-always-auto world, it is nice here 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its not that it isn't tempting, just not in the style guide used for this project.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| Status Move(const GcsPath& src, const GcsPath& dest) { | ||
| if (src.full_path.empty() || src.object.empty() || | ||
| src.object.back() == internal::kSep) { | ||
| if (src == dest) return {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: return Status::OK() is more idiomatic for the code base
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here and elsewhere.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| if (!info.IsFile()) { | ||
| return Status::IOError("Only files can be opened as input streams"); | ||
| if (info.IsDirectory()) { | ||
| return Status::IOError("Cannot open a directory as an input stream"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe include the path requested here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. I also searched for any other places where I was not returning the affected path with the IOError.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| Result<std::shared_ptr<io::RandomAccessFile>> GcsFileSystem::OpenInputFile( | ||
| const FileInfo& info) { | ||
| if (info.IsDirectory()) { | ||
| return Status::IOError("Cannot open a directory as an input stream"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment above about providing the path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed too.
|
Looks like all failures are unrelated. I'll merge this on Monday if there are no further comments. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern here is: does this reduce interoperability with filesystem "hierarchies" created by other GCS clients?
| // To find out if the directory exists we need to perform an additional query. | ||
| ARROW_ASSIGN_OR_RAISE(auto directory, GetFileInfo(p)); | ||
| if (directory.IsDirectory()) return result; | ||
| return Status::IOError("No such file or directory '", select.base_dir, "'"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the error message accurate? If base_dir is an existing file, do we still get this message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
| if (!status.ok()) { | ||
| return status; | ||
| } | ||
| for (auto d = missing_parents.rbegin(); d != missing_parents.rend(); ++d) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, is it useful to iterate these in reverse order? Is it to avoid creating a/b if a exists as a file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Added a comment to explain that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comment a bit off? It seems missing_parents is created in reverse depth order (from child to parent), so iterating it in reverse order visits the ancestors first and the descendents last?
| [](FileInfo const& info) { | ||
| if (!info.IsDirectory()) return info; | ||
| return Dir(internal::RemoveTrailingSlash(info.path()).to_string()); | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is non-trivial enough to be factored out in a separate function, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This is how I understand the problem:
That is, both solutions have downsides. If you want things to be more compatible with the GCS UI we can certainly do that. I think the really interesting case is working when there are no markers at all, which we can get to work in both cases. |
Great, I have no problem with this approach then. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @coryan
|
Benchmark runs are scheduled for baseline = aa59d17 and contender = 2164b0b. 2164b0b is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.