-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code #45998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code #45998
Conversation
|
Ah, this will not work. If we want to remove cc @pitrou |
Well, ARROW_HDFS=ON could imply ARROW_FILESYSTEM=ON. I don't think that's a problem.
Yes, indeed. |
|
OK, I will then move |
e95f3f3 to
287cb9b
Compare
1bd3f43 to
8e26730
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @AlenkaF! Here are some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all these declarations actually needed by PyArrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, most of them aren't and are copied from libarrow.pxd. I can remove the unused ones - but am not sure if some external application can actually use them?
cpp/src/arrow/CMakeLists.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to link to arrow::hadoop as was done above? cc @kou for advice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, yeah. I will add a link as above as it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it is there already (that explains why nothing failed =) )
arrow/cpp/src/arrow/CMakeLists.txt
Lines 857 to 861 in 53ef438
| if(ARROW_HDFS) | |
| foreach(ARROW_FILESYSTEM_TARGET ${ARROW_FILESYSTEM_TARGETS}) | |
| target_link_libraries(${ARROW_FILESYSTEM_TARGET} PRIVATE arrow::hadoop) | |
| endforeach() | |
| endif() |
Not sure if the line with CMAKE_DL_LIBS is also needed here then?
cpp/src/arrow/filesystem/hdfs_io.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but we don't want to keep those two unofficial FileSystem and HadoopFileSystem classes which create confusion with the other (public) filesystem classes.
Ideally, those two classes disappear and their implementation code gets folded into the public HadoopFileSystem class.
If that's too annoying, we should at least merge those two classes and give them a less ambiguous name, for example HdfsClient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will go with the disappearing =)
IIUC hdfs_io.h will be removed altogether:
FileSystemandHadoopFileSystemwill go intohdfs.cc, folded into the publicHadoopFileSystemHdfsConnectionConfigwill also go intohdfs.cc- declarations that are left will go into
hdfs_internal.cc
cpp/src/arrow/filesystem/hdfs_io.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these declarations should IMHO go into the arrow::filesystem::internal namespace, except for HdfsConnectionConfig which can go into arrow::filesystem.
|
Hi @pitrou, could you please take a quick look at the changes when you have a moment? I've done my best to implement the suggested changes, but am sure there's still room for improvement.
The Python and MATLAB test failures are not related. |
|
Hi @AlenkaF
I think you're misreading the output, the test is actually skipped when the driver fails unloading, which is normal: The problem is in the other tests, because it seems a destructor crashes: |
Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods, for example this: ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size,
const io::IOContext& io_context, LibHdfsShim* driver,
hdfsFS fs, hdfsFile file,
std::shared_ptr<HdfsReadableFile>* out);would become: class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
public:
(...)
static Result<std::shared_ptr<HdfsReadableFile>> Make(
const std::string& path, int32_t buffer_size,
const io::IOContext& io_context, LibHdfsShim* driver,
hdfsFS fs, hdfsFile file); |
|
Aha, I see! Thanks, will look into it. |
|
@pitrou I cleaned up the CI failures (others are not related) and am hoping this changes will not be too bad to review :) |
72eae6e to
7810940
Compare
benibus
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This looks pretty good to me. Just a few comments.
c7beefc to
281b51a
Compare
|
@pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0? |
281b51a to
f09d6d9
Compare
Rationale for this change
ObjectTypeandFileStatisticsin io/hdfs.h have been deprecated for a while and can be removed.What changes are included in this PR?
ObjectTypeandFileStatisticsstructs are removed and instead FileSystem API inarrow::fsis used. Together with this change, the hdfs connected code is moved fromcpp/src/arrow/iotocpp/src/arrow/filesystemmergingFileSystemandHadoopFileSystemclasses fromarrow::iointo the publicHadoopFileSystemclass.Are these changes tested?
Existing tests should pass.
Are there any user-facing changes?
Deprecated structs are removed and all hdfs related code is now a part of the filesystem module.
Also closes: #22457 (not sure about
io/interfaces.h?)