Description
Describe the bug
Doing an external sort is using over 10 times the amount of disk relative to the raw data.
To Reproduce
Create a test file in datafusion-cli with:
copy (select digest(value::string, 'md5') as key, repeat('x', 10000) as text from generate_series(1000000)) to 'data.parquet';
The file itself is fairly small due to compression, but the raw data should be around 10 GB.
Then get the repro code from https://github.com/ivankelly/datafusion/tree/out_of_disk and run the sort:
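As a rough check of the "around 10 GB" figure, here is a minimal sketch of the arithmetic. It assumes about one million rows, a 16-byte MD5 digest per key, and a 10,000-byte text value per row. It ignores Arrow offset and validity buffers, so it is a lower bound rather than an exact in-memory size. The function name is illustrative only.

```python
MD5_DIGEST_BYTES = 16  # an MD5 digest is always 16 bytes


def estimate_raw_bytes(num_rows, text_len, key_len=MD5_DIGEST_BYTES):
    """Lower-bound estimate of uncompressed payload: key plus text per row."""
    return num_rows * (text_len + key_len)


# Assumed inputs: ~1,000,000 rows, repeat('x', 10000) as the text column.
raw = estimate_raw_bytes(1_000_000, 10_000)
print(f"{raw} bytes (~{raw / 1e9:.2f} GB)")
```

The key column adds well under 1% here, so the text column dominates the estimate.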
RUST_LOG=trace target/release/out_of_disk sort --input data.parquet --output data-1m-1.parquet --batch-size 128 --sort-partitions 1 --sort-mem-limit 1000000000
...
Error: Resources exhausted: The used disk space during the spilling process has exceeded the allowable limit of 100.0 GB. Try increasing the `max_temp_directory_size` in the disk manager configuration.
Strangely, this succeeds if I set batch size to 8192 (because no spilling seems to occur).
If I instead create a file with purely random text data (so that there's no compression):
RUST_LOG=trace target/release/out_of_disk gen-input --output input.parquet --num-rows 1000000
the sort succeeds with a batch size of 128:
RUST_LOG=trace ~/datafusion/target/release/out_of_disk sort --input input.parquet --output data-1m-1.parquet --batch-size 128 --sort-partitions 1 --sort-mem-limit 1000000000
but fails with a batch size of 8192:
Error: Failed to allocate additional 1573.7 MB for ExternalSorterMerge[0] with 0.0 B already allocated for this reservation - 953.7 MB remain available for the total pool
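The numbers in this error can be related back to the `--sort-mem-limit 1000000000` flag with a small conversion. This is a sketch under two assumptions: that the flag is in bytes, and that the "MB" in the message is 1024-based (1 MB = 1,048,576 bytes). If both hold, the 953.7 MB "total pool" is the whole 1,000,000,000-byte limit, and the single 1573.7 MB request is larger than the entire pool.

```python
MIB = 1024 ** 2  # assumed unit behind "MB" in the error message


def to_mib(num_bytes):
    """Convert a byte count to 1024-based megabytes."""
    return num_bytes / MIB


def from_mib(mib):
    """Convert 1024-based megabytes back to bytes."""
    return mib * MIB


limit_bytes = 1_000_000_000          # value passed as --sort-mem-limit
requested_bytes = from_mib(1573.7)   # size of the failed allocation

print(f"pool size: {to_mib(limit_bytes):.1f} MB")
print(f"request exceeds pool: {requested_bytes > limit_bytes}")
```

Under these assumptions the merge step asks for roughly 1.65 GB in one reservation, which cannot fit in a 1 GB pool even when nothing else is allocated.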
Expected behavior
I would expect external sort to use 2-3x as much disk as the raw data occupies. Instead it reaches 10x and probably more. Are spill files being left behind after they're no longer needed?
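To put the expectation in numbers, the sketch below compares a 2-3x budget against the point at which the sort was aborted. It assumes about 10 GB of raw data and that the "100.0 GB" limit in the error is 1024-based. Because the sort failed on hitting the limit, the result is only a lower bound on the real amplification.

```python
GIB = 1024 ** 3  # assumed unit behind the "100.0 GB" disk limit


def amplification(spilled_bytes, raw_bytes):
    """Ratio of disk used for spilling to the raw data size."""
    return spilled_bytes / raw_bytes


raw_bytes = 10 * 10**9    # "around 10GB" of raw data, per the report
disk_limit = 100 * GIB    # limit at which the sort was aborted

expected_low, expected_high = 2 * raw_bytes, 3 * raw_bytes
print(f"expected spill: {expected_low / 1e9:.0f}-{expected_high / 1e9:.0f} GB")
print(f"observed: at least {amplification(disk_limit, raw_bytes):.1f}x")
```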
Additional context
No response