perf: disable auto FSST for binary fields#6047

Merged
Xuanwo merged 2 commits into main from xuanwo/disable-fsst-binary-default-reimpl on Feb 27, 2026

Xuanwo (Collaborator) commented Feb 27, 2026

Disable automatic FSST selection for `Binary` and `LargeBinary` fields by default. FSST performs well on UTF‑8 data, but not on already compressed content such as images and videos.

In my benchmark using the `le-robot-push-t-image` dataset (which contains many small images), Lance 2.2 ingests data about 11 times slower than Lance 2.0 when FSST is enabled (125 ms vs. 11 ms median wall time).

| data storage version | median wall time (ms) | mean wall time (ms) | rows/s (median) | MiB/s (median) | compared to 2.0 |
|---|---:|---:|---:|---:|---:|
| 2.0 | 11 | 11.2 | 2,331,818 | 4,496 | 1.00x |
| 2.1 | 125 | 125.2 | 205,200 | 395 | 11.36x slower |
| 2.2 | 125 | 124.8 | 205,200 | 395 | 11.36x slower |

At the same time, the file size is not much smaller (31,246,208 → 31,053,120, about 0.62% reduction), and the read performance does not improve significantly (614 ms → 581 ms, about 5%).
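The ratios quoted in this description can be rechecked from the raw numbers. The snippet below is only an arithmetic sketch: the helper name `percent_reduction` is made up for illustration, and the inputs are the median wall times, file sizes, and read times reported above.

```rust
/// Relative reduction from `before` to `after`, as a percentage.
/// (Hypothetical helper, not part of Lance.)
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // Median write wall time in ms: 2.1/2.2 (auto FSST) vs. 2.0.
    let slowdown = 125.0 / 11.0;
    // File size in bytes, without and with FSST.
    let size = percent_reduction(31_246_208.0, 31_053_120.0);
    // Read time in ms, without and with FSST.
    let read = percent_reduction(614.0, 581.0);

    println!("write slowdown: {:.2}x", slowdown);
    println!("file size reduction: {:.2}%", size);
    println!("read time reduction: {:.1}%", read);
}
```

The three values come out at roughly 11.36x, 0.62%, and 5.4%, which matches the figures reported in the description.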

Therefore, I recommend disabling FSST for binary fields by default, while still allowing users to override this setting via field metadata.
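As a rough illustration of the proposed rule, the sketch below models the decision with a local stand-in enum rather than the real Arrow or Lance types. The metadata key `example:compression`, the value `fsst`, and the function `use_fsst` are all hypothetical names chosen for this example. The actual key and selection logic in `lance-encoding` may differ, and the real selection likely also considers properties of the data that are ignored here.

```rust
use std::collections::HashMap;

/// Minimal stand-in for the data types relevant to this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
}

/// Hypothetical metadata key and value; the real names in Lance may differ.
const COMPRESSION_KEY: &str = "example:compression";
const FSST: &str = "fsst";

/// Decide whether FSST should be used for a field (simplified assumption).
fn use_fsst(data_type: DataType, metadata: &HashMap<String, String>) -> bool {
    // An explicit setting in field metadata always wins.
    if let Some(value) = metadata.get(COMPRESSION_KEY) {
        return value.eq_ignore_ascii_case(FSST);
    }
    // Without an explicit setting, binary fields are excluded from auto FSST.
    !matches!(data_type, DataType::Binary | DataType::LargeBinary)
}

fn main() {
    let no_metadata: HashMap<String, String> = HashMap::new();
    let mut explicit: HashMap<String, String> = HashMap::new();
    explicit.insert(COMPRESSION_KEY.to_string(), FSST.to_string());

    for dt in [
        DataType::Utf8,
        DataType::LargeUtf8,
        DataType::Binary,
        DataType::LargeBinary,
    ] {
        println!(
            "{:?}: auto = {}, explicit = {}",
            dt,
            use_fsst(dt, &no_metadata),
            use_fsst(dt, &explicit)
        );
    }
}
```

Checking the explicit setting first keeps the default conservative for binary data while leaving an opt-in path for binary columns that actually hold text-like payloads.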


Parts of this PR were drafted with assistance from Codex (with gpt-5.3-codex) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions (Contributor) commented

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

github-actions (Contributor) commented

Review

Clean PR — the logic is correct, the refactoring reduces branching duplication, and the tests cover the three key scenarios (binary no auto-FSST, utf8 still auto-FSST, binary with explicit FSST). No P0/P1 issues found.

One minor note for consideration (not blocking):

`BinaryView`: The auto-FSST gate checks `DataType::Binary | DataType::LargeBinary`. If `BinaryView` data ever flows through this path, it would still auto-select FSST. If the same reasoning applies to the view type, you may want to include it in the `matches!` check. If it doesn't go through this path today, this is a non-issue.
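A minimal sketch of what widening the gate could look like, assuming a view type should be treated like the other binary types. The enum is a local stand-in, and the function name `skips_auto_fsst` is hypothetical; this is not the code from `compression.rs`.

```rust
/// Local stand-in for the data types discussed in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Utf8,
    Utf8View,
    Binary,
    LargeBinary,
    BinaryView,
}

/// Returns true when automatic FSST selection should be skipped.
/// Including `BinaryView` here is the suggested extension, not current behavior.
fn skips_auto_fsst(data_type: DataType) -> bool {
    matches!(
        data_type,
        DataType::Binary | DataType::LargeBinary | DataType::BinaryView
    )
}

fn main() {
    for dt in [
        DataType::Utf8,
        DataType::Utf8View,
        DataType::Binary,
        DataType::LargeBinary,
        DataType::BinaryView,
    ] {
        println!("{:?}: skip auto FSST = {}", dt, skips_auto_fsst(dt));
    }
}
```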

LGTM.

Xuanwo changed the title from "encoding: disable auto FSST for binary fields" to "perf: disable auto FSST for binary fields" on Feb 27, 2026

codecov Bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 98.78049% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---:|---|
| rust/lance-encoding/src/compression.rs | 98.78% | 0 Missing and 1 partial ⚠️ |


westonpace (Member) left a comment

Seems like a good rule to me.

Xuanwo merged commit a97170a into main on Feb 27, 2026 (29 checks passed)
Xuanwo deleted the xuanwo/disable-fsst-binary-default-reimpl branch on February 27, 2026 18:42
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 4, 2026