perf: disable auto FSST for binary fields#6047
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
ReviewClean PR — the logic is correct, the refactoring reduces branching duplication, and the tests cover the three key scenarios (binary no auto-FSST, utf8 still auto-FSST, binary with explicit FSST). No P0/P1 issues found. One minor note for consideration (not blocking):
LGTM. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
westonpace
left a comment
There was a problem hiding this comment.
Seems like a good rule to me.
Disable automatic FSST selection for `Binary` and `LargeBinary` fields by default. FSST performs well on UTF‑8 data, but not on already compressed content such as images and videos. In my benchmark using the `le-robot-push-t-image` dataset (which contains many small images), Lance 2.2 ingests data about 10 times slower than Lance 2.0 when FSST is enabled. | data storage version | median wall time (ms) | mean wall time (ms) | rows/s (median) | MiB/s (median) | compared to 2.0 | |---|---:|---:|---:|---:|---:| | 2.0 | 11 | 11.2 | 2,331,818 | 4,496 | 1.00x | | 2.1 | 125 | 125.2 | 205,200 | 395 | 11.36x slower | | 2.2 | 125 | 124.8 | 205,200 | 395 | 11.36x slower | At the same time, the file size is not much smaller (31,246,208 → 31,053,120, about 0.62% reduction), and the read performance does not improve significantly (614 ms → 581 ms, about 5%). Therefore, I recommend disabling FSST for binary fields by default, while still allowing users to override this setting via field metadata. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
Disable automatic FSST selection for
BinaryandLargeBinaryfields by default. FSST performs well on UTF‑8 data, but not on already compressed content such as images and videos.In my benchmark using the
le-robot-push-t-imagedataset (which contains many small images), Lance 2.2 ingests data about 10 times slower than Lance 2.0 when FSST is enabled.At the same time, the file size is not much smaller (31,246,208 → 31,053,120, about 0.62% reduction), and the read performance does not improve significantly (614 ms → 581 ms, about 5%).
Therefore, I recommend disabling FSST for binary fields by default, while still allowing users to override this setting via field metadata.
Parts of this PR were drafted with assistance from Codex (with
gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.