Sorting 1m large rows runs out of disk. #18011

@ivankelly

Describe the bug

An external sort uses more than 10 times as much disk space as the raw data occupies.

To Reproduce

Create a test file in datafusion-cli with

copy (select digest(value::string, 'md5') as key, repeat('x', 10000) as text from generate_series(1000000)) to 'data.parquet';

The file itself is fairly small due to compression, but the raw data should be around 10 GB.
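As a rough check on the 10 GB figure, the sketch below multiplies the row count by the per-row payload. It assumes about 1,000,000 rows, a 16-byte MD5 digest for `key`, and 10,000 bytes of `text` per row, and it ignores Arrow offset and validity buffers. The function name `raw_payload_bytes` is made up for this illustration.

```rust
/// Estimate the uncompressed payload of the generated table:
/// rows * (bytes in the key column + bytes in the text column).
/// Arrow offsets and validity bitmaps are deliberately ignored.
pub fn raw_payload_bytes(rows: u64, key_bytes: u64, text_bytes: u64) -> u64 {
    rows * (key_bytes + text_bytes)
}
```

Calling `raw_payload_bytes(1_000_000, 16, 10_000)` gives 10,016,000,000 bytes, so "around 10 GB" looks right under these assumptions.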

Then get the repro code from https://github.com/ivankelly/datafusion/tree/out_of_disk and run the sort:

RUST_LOG=trace target/release/out_of_disk sort --input data.parquet --output data-1m-1.parquet --batch-size 128 --sort-partitions 1 --sort-mem-limit 1000000000
...
Error: Resources exhausted: The used disk space during the spilling process has exceeded the allowable limit of 100.0 GB. Try increasing the `max_temp_directory_size` in the disk manager configuration.

Strangely, this succeeds if I set batch size to 8192 (because no spilling seems to occur).

Next, I create a file with purely random text data (so that it doesn't compress):

RUST_LOG=trace target/release/out_of_disk gen-input --output input.parquet --num-rows 1000000

With this file, the sort succeeds with a batch size of 128:

RUST_LOG=trace ~/datafusion/target/release/out_of_disk sort --input input.parquet --output data-1m-1.parquet --batch-size 128 --sort-partitions 1 --sort-mem-limit 1000000000

but it fails with a batch size of 8192:

Error: Failed to allocate additional 1573.7 MB for ExternalSorterMerge[0] with 0.0 B already allocated for this reservation - 953.7 MB remain available for the total pool
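The "953.7 MB remain available" figure appears to be the whole `--sort-mem-limit` of 1,000,000,000 bytes, assuming the error message reports sizes in units of 1024 * 1024 bytes while labelling them "MB". The sketch below checks that arithmetic; `to_mib` and `fits_in_pool` are hypothetical helpers, not DataFusion APIs.

```rust
/// Bytes per binary megabyte (MiB).
const MIB: f64 = 1024.0 * 1024.0;

/// Convert a byte count to binary megabytes.
pub fn to_mib(bytes: u64) -> f64 {
    bytes as f64 / MIB
}

/// Return true if a request of `requested_mib` could fit in an
/// otherwise empty pool of `pool_bytes`.
pub fn fits_in_pool(requested_mib: f64, pool_bytes: u64) -> bool {
    requested_mib <= to_mib(pool_bytes)
}
```

Under that assumption, `to_mib(1_000_000_000)` rounds to 953.7, and `fits_in_pool(1573.7, 1_000_000_000)` is false. That would mean the single merge reservation is larger than the entire pool, so the request could not succeed even with nothing else allocated.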

Expected behavior

I would expect an external sort to use 2-3x as much disk as the raw data, so roughly 20-30 GB here. Instead, usage passes 10x (the 100 GB limit) and would probably keep growing. Are spill files being left behind after they're no longer needed?
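One way to check whether spill files accumulate is to sample the total size of the spill directory while the sort runs. The helper below is a minimal sketch using only the Rust standard library. The directory to watch is an assumption that depends on how the disk manager is configured, and `dir_size_bytes` is a name made up for this illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum the sizes of all regular files under `dir`.
/// Symlinks are not followed.
pub fn dir_size_bytes(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            // Descend into subdirectories.
            total += dir_size_bytes(&entry.path())?;
        } else if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}
```

Calling this periodically and logging the result would show whether disk usage drops after each merge pass or only ever grows. A file can be deleted between listing and `metadata()`, in which case the call returns an error, so a monitoring loop should tolerate occasional failures.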

Additional context

No response

Labels: bug
