
[C++] write_dataset: max_open_files does not close least recently used file #45038

@xWaita

Description

On the latest version of pyarrow (18.0.0), the docs for pyarrow.dataset.write_dataset state that when max_open_files is reached, the least recently used file will be closed. (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html)

However, in the C++ dataset writer, what actually happens is that the largest open file is closed. (https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/dataset_writer.cc#L656)

For long-running dataset writes with many partitions, this means that once max_open_files is saturated, the set of open files trends toward smaller and smaller files over time: actively written (and therefore growing) files are repeatedly evicted, while a file that received one small write long ago is never the largest, so it can occupy a slot indefinitely. Recently opened files can be closed prematurely, while stale open files hang around and take up file slots.

It would probably make sense to bring the C++ behaviour in line with the docs, i.e. close the least recently used file rather than the largest one, on the rationale that the least recently used file is the most likely to be finished for good.

A solution could be to record the last write time whenever DatasetWriterDirectoryQueue::StartWrite() is called (maybe someone more knowledgeable about the codebase can suggest a better place), and use that to replace DatasetWriter::TryCloseLargestFile() with something like DatasetWriter::TryCloseLeastRecentlyUsedFile(); a rough sketch of the idea follows.
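To make the proposal concrete, here is a minimal, self-contained C++ sketch of LRU eviction over open files. All names here (`OpenFileState`, `LruFileTracker`, `RecordWrite`, `TryCloseLeastRecentlyUsed`) are hypothetical stand-ins, not Arrow APIs; in the real code the bookkeeping would live in DatasetWriter's existing open-file state rather than a separate map.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the state the writer keeps per open file.
// The real DatasetWriterDirectoryQueue tracks much more; only the fields
// needed to choose an eviction victim are modeled here.
struct OpenFileState {
  uint64_t bytes_written = 0;
  uint64_t last_write_tick = 0;  // bumped on every write to this file
};

class LruFileTracker {
 public:
  // Called whenever a write starts on a file, i.e. roughly where
  // DatasetWriterDirectoryQueue::StartWrite() runs in the real code.
  void RecordWrite(const std::string& path, uint64_t bytes) {
    auto& state = open_files_[path];
    state.bytes_written += bytes;
    state.last_write_tick = ++tick_;  // monotonic counter, no clock needed
  }

  // Proposed replacement for TryCloseLargestFile(): close the file whose
  // last write is oldest, since it is the least likely to see more rows.
  bool TryCloseLeastRecentlyUsed() {
    if (open_files_.empty()) return false;
    auto victim = open_files_.begin();
    for (auto it = open_files_.begin(); it != open_files_.end(); ++it) {
      if (it->second.last_write_tick < victim->second.last_write_tick) {
        victim = it;
      }
    }
    std::cout << "closing " << victim->first << "\n";  // real code: flush + close
    open_files_.erase(victim);
    return true;
  }

  size_t num_open() const { return open_files_.size(); }

 private:
  std::unordered_map<std::string, OpenFileState> open_files_;
  uint64_t tick_ = 0;
};

int main() {
  LruFileTracker tracker;
  tracker.RecordWrite("part=a/data.parquet", 1 << 20);  // large but stale
  tracker.RecordWrite("part=b/data.parquet", 1 << 10);  // small but active
  tracker.RecordWrite("part=b/data.parquet", 1 << 10);
  // LRU eviction closes part=a even though part=b is smaller, which is
  // the behaviour the docs describe; largest-first would close part=b.
  tracker.TryCloseLeastRecentlyUsed();
}
```

A monotonic counter rather than wall-clock time keeps the eviction order stable and cheap to maintain; the linear scan keeps the sketch short and is comparable to what scanning for the largest file costs today, though an intrusive LRU list would give O(1) eviction if that ever mattered.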

Component(s)

C++, Python
