-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11651: [Rust][DataFusion] Implement Postgres String Functions: Length Functions #9509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @seddonm1 . This looks great to me.
I left two comments that imo should be addressed related to reducing the maintenance cost by DRY.
| fn bit_length_string<OffsetSize>(array: &Array, data_type: DataType) -> ArrayRef | ||
| where | ||
| OffsetSize: OffsetSizeTrait, | ||
| { | ||
| // note: offsets are stored as u8, but they can be interpreted as OffsetSize | ||
| let offsets = &array.data_ref().buffers()[0]; | ||
| // this is a 30% improvement over iterating over u8s and building OffsetSize, which | ||
| // justifies the usage of `unsafe`. | ||
| let slice: &[OffsetSize] = | ||
| &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..]; | ||
|
|
||
| let bit_size = OffsetSize::from_usize(8).unwrap(); | ||
| let lengths = slice | ||
| .windows(2) | ||
| .map(|offset| (offset[1] - offset[0]) * bit_size); | ||
|
|
||
| // JUSTIFICATION | ||
| // Benefit | ||
| // ~60% speedup | ||
| // Soundness | ||
| // `values` is an iterator with a known size. | ||
| let buffer = unsafe { Buffer::from_trusted_len_iter(lengths) }; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a copy of length, with a single difference, that we multiply the lengths by bit_size?
If that is the case, can we create a generic or something that avoids code duplication? Copy-pasting code like this is IMO an unnecessary maintenance burden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i can have a look at doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have created a macro allowing just the core function to be replaced.
| /// $ARRAY_TYPE is the column type after function applied | ||
| macro_rules! test_function { | ||
| ($FUNC:ident, $ARGS:expr, $EXPECTED:expr, $EXPECTED_TYPE:ty, $DATA_TYPE: ident, $ARRAY_TYPE:ident) => { | ||
| println!("{:?}", BuiltinScalarFunction::$FUNC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this println! intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is just there to help debugging but happy to remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
|
||
| /// Returns number of characters in the string. | ||
| /// character_length_i64('josé') = 4 | ||
| pub fn character_length_i64(args: &[ArrayRef]) -> Result<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this code can be reduced to a single generic. Something like this (untested):
pub fn character_length<T: ArrowPrimitiveType>(args: &[ArrayRef]) -> Result<ArrayRef>
where
T::Native: StringOffsetSizeTrait {
let string_array: &GenericStringArray<T::Native> = args[0]
.as_any()
.downcast_ref::<GenericStringArray<T::Native>>()
.unwrap();
string_array
.iter()
.map(|x| x.map(|x: &str| T::Native::from_usize(x.graphemes(true).count())).unwrap())
.collect()
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I will have a go. There are a couple more instances this generic from_usize could help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. Impressive you can come up with this in your head. I have incorporated this code.
|
Thanks @jorgecarleitao . I will update soon. |
|
@jorgecarleitao Thanks for the review. I have addressed all of your points. The |
|
Thanks a lot, @seddonm1 . I went through this PR again and I pull the code locally to understand the reason for the macro. I now get it that it is not easy to write this. :) I PRed something to your branch with a suggestion to reduce the risk of using that function (which is unsafe due to the |
Added some safety.
Codecov Report
@@ Coverage Diff @@
## master #9509 +/- ##
==========================================
+ Coverage 82.15% 82.28% +0.12%
==========================================
Files 243 244 +1
Lines 54908 55517 +609
==========================================
+ Hits 45108 45680 +572
- Misses 9800 9837 +37
Continue to review full report at Codecov.
|
|
@jorgecarleitao Sorry I updated this but forgot to update you. Once this is merged I can do more of the big Postgres string functions PRs. |
|
The integration test failure appears unrelated https://github.com/apache/arrow/pull/9509/checks?check_run_id=1921983282 I will restart it to try and get a clean run |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @seddonm1 -- I think this is ready to go once it is green
|
@seddonm1 we are so close -- now there is a clippy error: https://github.com/apache/arrow/pull/9509/checks?check_run_id=1922005826 |
|
@alamb done and ready. Thanks to @jorgecarleitao for his elite knowledge of trait implementations. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, @seddonm1 ! Ready to go from my side.
|
Thanks again @seddonm1 -- this is epic work |
Splitting up #9243
This implements the following functions: