fix: substring with negative start index #4017

Open
kazuyukitanimura wants to merge 10 commits into apache:main from kazuyukitanimura:fix-3919

kazuyukitanimura commented Apr 21, 2026

Which issue does this PR close?

Closes #3919
Closes #3337

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review April 24, 2026 21:35
let result = DictionaryArray::try_new(dict.keys().clone(), values)?;
Ok(Arc::new(result) as ArrayRef)
}
_ => Ok(Arc::clone(array)),
Member

Should this be an error rather than just returning the input data?

Contributor Author

Thank you, updated

Ok(Arc::new(builder.finish()) as ArrayRef)
}
DataType::Dictionary(_, _) => {
let dict = as_dictionary_array::<Int32Type>(array);
Member

This would panic for a dictionary with Int64Type. Can we add a check for the type?

Contributor Author

Parquet dictionaries use 32-bit integer keys, so we do not handle Int64Type here or in other locations. E.g., https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/static_invoke/char_varchar_utils/read_side_padding.rs#L68
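
For illustration, the two review concerns in this file (an unsupported input silently passed through, and a dictionary with non-Int32 keys reaching an Int32-only downcast) can both be handled by matching on the key type up front and returning an error for everything else. The sketch below uses only the Rust standard library; `KeyType`, `ColumnType`, and `check_supported` are hypothetical stand-ins, not Arrow or Comet types, and this is not necessarily what the PR ended up doing.

```rust
/// Hypothetical stand-in for a dictionary key type (not Arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyType {
    Int32,
    Int64,
}

/// Hypothetical stand-in for the type of an input column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Utf8,
    Dictionary(KeyType),
    Boolean,
}

/// Decide which code path handles `ty`, or report why none does.
/// Only Int32-keyed dictionaries are accepted, so an Int32-only downcast
/// is never attempted on a dictionary with a different key type.
fn check_supported(ty: ColumnType) -> Result<&'static str, String> {
    match ty {
        ColumnType::Utf8 => Ok("plain string path"),
        ColumnType::Dictionary(KeyType::Int32) => Ok("dictionary path"),
        // Any other key type is rejected instead of panicking later.
        ColumnType::Dictionary(other) => {
            Err(format!("unsupported dictionary key type: {:?}", other))
        }
        // Fallback arm: return an error rather than the unchanged input.
        other => Err(format!("unsupported input type: {:?}", other)),
    }
}
```

With this shape, `check_supported(ColumnType::Dictionary(KeyType::Int64))` returns an `Err` rather than reaching a downcast that assumes Int32 keys. If Int64-keyed dictionaries can never occur in practice, the extra arm costs little and documents the assumption.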

Comment on lines +202 to +216
fn spark_substr_negative(s: &str, pos: i64, len: u64) -> String {
    let num_chars = s.chars().count() as i64;
    let start = num_chars + pos;
    let end = start.saturating_add(len as i64).min(num_chars);
    let start = start.max(0);

    if start >= end {
        return String::new();
    }

    s.chars()
        .skip(start as usize)
        .take((end - start) as usize)
        .collect()
}
Member

Claude recommended an optimized version to avoid an intermediate string allocation per row. I have not verified.

fn spark_substr_negative(s: &str, pos: i64, len: u64) -> &str {
    let num_chars = s.chars().count() as i64;
    let end = (num_chars + pos).saturating_add(len as i64).min(num_chars);
    let start = (num_chars + pos).max(0);
    if start >= end {
        return "";
    }

    // Translate char indices [start, end) to byte offsets in a single forward pass.
    let mut it = s.char_indices();
    let byte_start = it.by_ref().nth(start as usize).map(|(b, _)| b).unwrap_or(s.len());
    let span = (end - start - 1) as usize;
    let byte_end = it.nth(span).map(|(b, _)| b).unwrap_or(s.len());

    &s[byte_start..byte_end]
}

Contributor Author

Updated
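
Since the borrowing variant in this thread was posted as unverified, a quick way to gain confidence is to place it next to the allocating variant and compare results on ASCII and multi-byte inputs. The sketch below does that with the standard library only. The names `substr_negative_owned` and `substr_negative_borrowed` are renamed for illustration, and it assumes the caller only passes a negative `pos` and a `len` that fits in `i64`.

```rust
/// Allocating variant: collects the selected chars into a new String.
fn substr_negative_owned(s: &str, pos: i64, len: u64) -> String {
    let num_chars = s.chars().count() as i64;
    let start = num_chars + pos; // may be negative if |pos| > num_chars
    let end = start.saturating_add(len as i64).min(num_chars);
    let start = start.max(0);
    if start >= end {
        return String::new();
    }
    s.chars()
        .skip(start as usize)
        .take((end - start) as usize)
        .collect()
}

/// Borrowing variant: returns a slice of the input, no new String.
fn substr_negative_borrowed(s: &str, pos: i64, len: u64) -> &str {
    let num_chars = s.chars().count() as i64;
    let end = (num_chars + pos).saturating_add(len as i64).min(num_chars);
    let start = (num_chars + pos).max(0);
    if start >= end {
        return "";
    }
    // Map the char range [start, end) to byte offsets in one forward pass.
    let mut it = s.char_indices();
    let byte_start = it
        .by_ref()
        .nth(start as usize)
        .map(|(b, _)| b)
        .unwrap_or(s.len());
    // `it` now sits at char index start + 1, so char index `end` is
    // (end - start - 1) items ahead; running off the end means s.len().
    let span = (end - start - 1) as usize;
    let byte_end = it.nth(span).map(|(b, _)| b).unwrap_or(s.len());
    &s[byte_start..byte_end]
}

/// Returns true when both variants agree on every sample input.
fn variants_agree() -> bool {
    let cases: [(&str, i64, u64); 6] = [
        ("hello", -3, 2),
        ("hello", -10, 7),
        ("héllo wörld", -5, 3),
        ("日本語", -2, 5),
        ("", -1, 1),
        ("abc", -1, 0),
    ];
    cases
        .iter()
        .all(|&(s, p, l)| substr_negative_owned(s, p, l) == substr_negative_borrowed(s, p, l))
}
```

For example, `substr_negative_borrowed("hello", -3, 2)` yields `"ll"`, and a start that falls before the beginning of the string is clamped, so `("hello", -10, 7)` yields `"he"`. Both variants still count the characters once up front, so the saving is the per-row `String` allocation rather than the scan.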

Development

Successfully merging this pull request may close these issues.

substring incompatible with spark for negative start index
Native engine panics on all-scalar inputs for Substring and StringSpace
