AlenkaF (Member) commented Apr 1, 2025

Rationale for this change

ObjectType and FileStatistics in io/hdfs.h have been deprecated for a while and can be removed.

What changes are included in this PR?

The ObjectType and FileStatistics structs are removed, and the FileSystem API in arrow::fs is used instead. Together with this change, the HDFS-related code is moved from cpp/src/arrow/io to cpp/src/arrow/filesystem, merging the FileSystem and HadoopFileSystem classes from arrow::io into the public HadoopFileSystem class.

Are these changes tested?

Existing tests should pass.

Are there any user-facing changes?

The deprecated structs are removed, and all HDFS-related code is now part of the filesystem module.

Also closes: #22457 (not sure about io/interfaces.h?)

AlenkaF (Member Author) commented Apr 1, 2025

Ah, this will not work. If we want to remove ObjectType and FileStatistics from io/hdfs.h and instead use FileSystem API, then filesystem component would have to be enabled by default. Also, if I understand correctly, we want to do a refactoring of io/hdfs anyways which would include the changes in this PR? (#22457)

cc @pitrou

pitrou (Member) commented Apr 2, 2025

> Ah, this will not work. If we want to remove ObjectType and FileStatistics from io/hdfs.h and instead use FileSystem API, then filesystem component would have to be enabled by default.

Well, ARROW_HDFS=ON could imply ARROW_FILESYSTEM=ON. I don't think that's a problem.

> Also, if I understand correctly, we want to do a refactoring of io/hdfs anyways which would include the changes in this PR?

Yes, indeed. io/hdfs could be moved to filesystem/hdfs_internal or something similar.

AlenkaF (Member Author) commented Apr 2, 2025

OK, I will then move io/hdfs to filesystem/hdfs_internal. Thanks!

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from e95f3f3 to 287cb9b on April 8, 2025 13:24
AlenkaF changed the title from "GH-45747: [C++] Remove deprecated ObjectType and FileStatistics" to "GH-45747: [C++] Remove deprecated ObjectType and FileStatistics, refactor hdfs code" on Apr 9, 2025
AlenkaF marked this pull request as ready for review on April 9, 2025 11:40
AlenkaF (Member Author) commented Apr 9, 2025

@pitrou I think this is ready for review. The failing builds have an issue opened: #46077

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from 1bd3f43 to 8e26730 on April 28, 2025 05:10
AlenkaF requested review from raulcd and rok as code owners on April 28, 2025 05:10
pitrou (Member) left a comment

Thanks for doing this @AlenkaF! Here are some comments.

Member

Are all these declarations actually needed by PyArrow?

Member Author

No, most of them aren't; they were copied from libarrow.pxd. I can remove the unused ones, but I am not sure whether some external application might actually use them.

Member

Don't we need to link to arrow::hadoop as was done above? cc @kou for advice

Member Author

Hm, yeah. I will add a link as above as it makes sense.

Member Author

Ah, it is there already (that explains why nothing failed =) )

if(ARROW_HDFS)
  foreach(ARROW_FILESYSTEM_TARGET ${ARROW_FILESYSTEM_TARGETS})
    target_link_libraries(${ARROW_FILESYSTEM_TARGET} PRIVATE arrow::hadoop)
  endforeach()
endif()

Not sure if the line with CMAKE_DL_LIBS is also needed here then?

Member

Ok, but we don't want to keep those two unofficial FileSystem and HadoopFileSystem classes which create confusion with the other (public) filesystem classes.

Ideally, those two classes disappear and their implementation code gets folded into the public HadoopFileSystem class.

If that's too annoying, we should at least merge those two classes and give them a less ambiguous name, for example HdfsClient.

Member Author

Ok, will go with the disappearing =)
IIUC hdfs_io.h will be removed altogether:

  • FileSystem and HadoopFileSystem will go into hdfs.cc, folded into the public HadoopFileSystem
  • HdfsConnectionConfig will also go into hdfs.cc
  • declarations that are left will go into hdfs_internal.cc

Member

Most of these declarations should IMHO go into the arrow::filesystem::internal namespace, except for HdfsConnectionConfig which can go into arrow::filesystem.

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Apr 29, 2025
AlenkaF marked this pull request as draft on May 19, 2025 07:36
AlenkaF (Member Author) commented May 19, 2025

Hi @pitrou, could you please take a quick look at the changes when you have a moment?

I've done my best to implement the suggested changes, but am sure there's still room for improvement.
A couple of issues I could use your input on:

  • Some C++ tests are failing in hdfs_test, with the following error:

    /arrow/cpp/src/arrow/filesystem/hdfs_test.cc:90: HadoopFileSystem::Make failed, it is possible when we don't have proper 
    driver on this node, err msg is IOError: Unable to load libjvm
    

    I'm not sure how best to resolve this — any guidance would be appreciated.

  • The MSVC compiler is complaining about a forward-declared friend function I'm using in hdfs.cc. Do you have any advice on how to better organise this?

The Python and MATLAB test failures are not related.
Thanks in advance!

pitrou (Member) commented May 19, 2025

Hi @AlenkaF

> • Some C++ tests are failing in hdfs_test, with the following error:

I think you're misreading the output. The test is actually skipped when the driver fails to load, which is normal:

(...)
dlopen(/usr/java/latest//lib/amd64/server/libjvm.so) failed: /usr/java/latest//lib/amd64/server/libjvm.so: cannot open shared object file: No such file or directory
/arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:294  try_dlopen(libjvm_potential_paths, "libjvm")
/arrow/cpp/src/arrow/filesystem/hdfs.cc:95  ConnectLibHdfs(&driver_)
/arrow/cpp/src/arrow/filesystem/hdfs.cc:725  ptr->impl_->Init()
/arrow/cpp/src/arrow/filesystem/hdfs_test.cc:205: Skipped
Driver not loaded, skipping

https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3277

The problem is in the other tests, because it seems a destructor crashes:


[ RUN      ] TestHadoopFileSystem.DeleteDirContents
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Running '/build/cpp/debug/arrow-hdfs-test' produced core dump at '/tmp/core.arrow-hdfs-test.25630', printing backtrace:
[New LWP 25630]
[New LWP 25631]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/build/cpp/debug/arrow-hdfs-test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc37b23d5be in arrow::io::internal::LibHdfsShim::Disconnect () at /arrow/cpp/src/arrow/filesystem/hdfs_internal.cc:340
340	int LibHdfsShim::Disconnect(hdfsFS fs) { return this->hdfsDisconnect(fs); }
[Current thread is 1 (Thread 0x7fc374633dc0 (LWP 25630))]
(...)

https://github.com/apache/arrow/actions/runs/15109276550/job/42464862030?pr=45998#step:7:3281
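
The backtrace above ends in a call made through LibHdfsShim during teardown. One plausible reading, which this thread does not confirm, is that a destructor disconnects even though the driver was never loaded or no connection was ever opened. The sketch below illustrates a guard for that situation. It is not the Arrow code: `FakeShim`, `FakeDisconnect`, `DisconnectCount`, and `Connection` are hypothetical stand-ins, and the `void*` handle merely plays the role of `hdfsFS`.

```cpp
#include <cassert>

// Hypothetical stand-in for LibHdfsShim: a table of function pointers that
// stay null until the shared library has been loaded.
struct FakeShim {
  int (*disconnect_fn)(void* fs) = nullptr;

  bool loaded() const { return disconnect_fn != nullptr; }
  // Calls through the pointer without checking it, so this crashes when the
  // library was never loaded.
  int Disconnect(void* fs) { return disconnect_fn(fs); }
};

// Counts disconnect calls so the behaviour can be observed.
inline int& DisconnectCount() {
  static int count = 0;
  return count;
}

// Hypothetical replacement for the real hdfsDisconnect symbol.
inline int FakeDisconnect(void* /*fs*/) {
  ++DisconnectCount();
  return 0;
}

// Hypothetical owner of a connection handle.
class Connection {
 public:
  Connection(FakeShim* driver, void* fs) : driver_(driver), fs_(fs) {}
  Connection(const Connection&) = delete;
  Connection& operator=(const Connection&) = delete;

  ~Connection() { Close(); }

  // Disconnects only when there is a loaded driver and an open handle.
  // Clearing the handle makes repeated calls harmless.
  void Close() {
    if (driver_ != nullptr && driver_->loaded() && fs_ != nullptr) {
      driver_->Disconnect(fs_);
    }
    fs_ = nullptr;
  }

 private:
  FakeShim* driver_;
  void* fs_;
};
```

With this shape, destroying a `Connection` whose driver never loaded does nothing, and an explicit `Close()` followed by the destructor disconnects only once.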

pitrou (Member) commented May 19, 2025

> • The MSVC compiler is complaining about a forward-declared friend function I'm using in hdfs.cc. Do you have any advice on how to better organise this?

Hmm, rather than trying to find the exact explanation, a simple solution would be to change these functions into static methods, for example this:

ARROW_EXPORT Status MakeReadableFile(const std::string& path, int32_t buffer_size,
                                     const io::IOContext& io_context, LibHdfsShim* driver,
                                     hdfsFS fs, hdfsFile file,
                                     std::shared_ptr<HdfsReadableFile>* out);

would become:

class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile {
 public:
  // (...)

  static Result<std::shared_ptr<HdfsReadableFile>> Make(
      const std::string& path, int32_t buffer_size,
      const io::IOContext& io_context, LibHdfsShim* driver,
      hdfsFS fs, hdfsFile file);
};
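
A minimal, self-contained sketch of the static-factory suggestion follows. It does not use Arrow: `Result`, `ReadableFile`, and the integer `handle` are hypothetical stand-ins for `arrow::Result`, `HdfsReadableFile`, and the libhdfs handles, so only the shape of the pattern carries over. The point it illustrates is that a static member function can reach a private constructor without any friend declaration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for arrow::Result<T>: holds a value or an error message.
template <typename T>
class Result {
 public:
  static Result Ok(T value) {
    Result r;
    r.value_ = std::move(value);
    r.ok_ = true;
    return r;
  }
  static Result Error(std::string message) {
    Result r;
    r.error_ = std::move(message);
    return r;
  }
  bool ok() const { return ok_; }
  const T& value() const { return value_; }
  const std::string& error() const { return error_; }

 private:
  Result() = default;
  T value_{};
  std::string error_;
  bool ok_ = false;
};

// Hypothetical stand-in for HdfsReadableFile.
class ReadableFile {
 public:
  // Static factory: as a member it can call the private constructor,
  // so no forward-declared friend function is needed.
  static Result<std::shared_ptr<ReadableFile>> Make(const std::string& path,
                                                    int buffer_size, int handle) {
    using ResultType = Result<std::shared_ptr<ReadableFile>>;
    if (handle < 0) {
      return ResultType::Error("invalid handle for " + path);
    }
    // std::make_shared cannot call a private constructor, hence the explicit new.
    return ResultType::Ok(
        std::shared_ptr<ReadableFile>(new ReadableFile(path, buffer_size, handle)));
  }

  const std::string& path() const { return path_; }
  int buffer_size() const { return buffer_size_; }
  int handle() const { return handle_; }

 private:
  ReadableFile(std::string path, int buffer_size, int handle)
      : path_(std::move(path)), buffer_size_(buffer_size), handle_(handle) {}

  std::string path_;
  int buffer_size_;
  int handle_;
};
```

Callers would write something like `auto file = ReadableFile::Make("/tmp/data.bin", 4096, 3);` and check `file.ok()` before using `file.value()`, instead of passing an out-parameter and inspecting a separate status.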

AlenkaF (Member Author) commented May 19, 2025

Aha, I see! Thanks, will look into it.

AlenkaF (Member Author) commented May 20, 2025

@pitrou I cleaned up the CI failures (others are not related) and am hoping these changes will not be too bad to review :)

AlenkaF marked this pull request as ready for review on May 27, 2025 09:37
AlenkaF requested a review from pitrou on May 27, 2025 09:37
AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from 72eae6e to 7810940 on June 5, 2025 08:38
benibus (Contributor) left a comment

Thanks! This looks pretty good to me. Just a few comments.

AlenkaF (Member Author) commented Jun 10, 2025

@benibus @pitrou would you mind taking another look?

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from c7beefc to 281b51a on June 23, 2025 13:13
AlenkaF (Member Author) commented Jun 23, 2025

@pitrou gentle ping. Would I be too optimistic to try to get it into 21.0.0?

AlenkaF force-pushed the gh-45747-remove-deprecated-ObjectType-FileStatistics branch from 281b51a to f09d6d9 on September 9, 2025 07:34
Development

Successfully merging this pull request may close these issues.

[C++] Refactor arrow/io/hdfs.h to use common FileSystem API

3 participants