Skip to content

fix: avoid panic when repdef serializes empty offsets#5890

Merged
westonpace merged 1 commit intolance-format:mainfrom
fenfeng9:fix/repdef-empty-offsets
Feb 5, 2026
Merged

fix: avoid panic when repdef serializes empty offsets#5890
westonpace merged 1 commit intolance-format:mainfrom
fenfeng9:fix/repdef-empty-offsets

Conversation

@fenfeng9
Copy link
Copy Markdown
Contributor

@fenfeng9 fenfeng9 commented Feb 5, 2026

Fixes panic on empty list offsets ([0]).

The assertion expected_len - 1 underflows when expected_len == 0. We early-return because zero values produce no rep/def levels.

@github-actions github-actions Bot added the bug Something isn't working label Feb 5, 2026
@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Feb 5, 2026

PTAL @westonpace .

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good fix for repdef, thanks! Did you run into this scenario with actual data? Or is it theoretical? I wonder if we had checks for empty arrays higher in the code somewhere?

@westonpace westonpace merged commit 0c48dba into lance-format:main Feb 5, 2026
26 of 28 checks passed
@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Feb 5, 2026

This is a good fix for repdef, thanks! Did you run into this scenario with actual data? Or is it theoretical? I wonder if we had checks for empty arrays higher in the code somewhere?

You're right that FileWriter::write_batch does check for empty batches and returns early

pub async fn write_batch(&mut self, batch: &RecordBatch) -> Result<()> {
debug!(
"write_batch called with {} rows, {} columns, and {} bytes of data",
batch.num_rows(),
batch.num_columns(),
batch.get_array_memory_size()
);
self.ensure_initialized(batch)?;
self.verify_nullability_constraints(batch)?;
let num_rows = batch.num_rows() as u64;
if num_rows == 0 {
return Ok(());
}

The panic I encountered was triggered through the encode_batch function

/// Helper method to encode a batch of data into memory
///
/// This is primarily for testing and benchmarking but could be useful in other
/// niche situations like IPC.
pub async fn encode_batch(
batch: &RecordBatch,
schema: Arc<Schema>,
encoding_strategy: &dyn FieldEncodingStrategy,
options: &EncodingOptions,
) -> Result<EncodedBatch> {

@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Feb 5, 2026

This is a good fix for repdef, thanks! Did you run into this scenario with actual data? Or is it theoretical? I wonder if we had checks for empty arrays higher in the code somewhere?

Inspired by issue #5141 regarding fuzz testing for the file format, I wrote a simple local fuzzer. This panic occurred while generating seed files.

@fenfeng9
Copy link
Copy Markdown
Contributor Author

fenfeng9 commented Feb 5, 2026

BTW, could you help review this PR #5867 when you have a moment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants