Skip to content

Conversation

@seddonm1
Copy link
Contributor

Splitting up #9243

This implements the following functions:

  • String functions
    • bit_Length
    • char_length
    • character_length
    • length
    • octet_length

@github-actions
Copy link

@seddonm1 seddonm1 changed the title ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions: Length Functions ARROW-11651: [Rust][DataFusion] Implement Postgres String Functions: Length Functions Feb 16, 2021
@github-actions
Copy link

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @seddonm1 . This looks great to me.

I left two comments that imo should be addressed related to reducing the maintenance cost by DRY.

Comment on lines 27 to 49
fn bit_length_string<OffsetSize>(array: &Array, data_type: DataType) -> ArrayRef
where
OffsetSize: OffsetSizeTrait,
{
// note: offsets are stored as u8, but they can be interpreted as OffsetSize
let offsets = &array.data_ref().buffers()[0];
// this is a 30% improvement over iterating over u8s and building OffsetSize, which
// justifies the usage of `unsafe`.
let slice: &[OffsetSize] =
&unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..];

let bit_size = OffsetSize::from_usize(8).unwrap();
let lengths = slice
.windows(2)
.map(|offset| (offset[1] - offset[0]) * bit_size);

// JUSTIFICATION
// Benefit
// ~60% speedup
// Soundness
// `values` is an iterator with a known size.
let buffer = unsafe { Buffer::from_trusted_len_iter(lengths) };

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a copy of length, with a single difference, that we multiply the lengths by bit_size?

If that is the case, can we create a generic or something that avoids code duplication? Copy-pasting code like this is IMO an unnecessary maintenance burden.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i can have a look at doing that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a macro allowing just the core function to be replaced.

/// $ARRAY_TYPE is the column type after function applied
macro_rules! test_function {
($FUNC:ident, $ARGS:expr, $EXPECTED:expr, $EXPECTED_TYPE:ty, $DATA_TYPE: ident, $ARRAY_TYPE:ident) => {
println!("{:?}", BuiltinScalarFunction::$FUNC);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this println! intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is just there to help debugging but happy to remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


/// Returns number of characters in the string.
/// character_length_i64('josé') = 4
pub fn character_length_i64(args: &[ArrayRef]) -> Result<ArrayRef> {
Copy link
Member

@jorgecarleitao jorgecarleitao Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this code can be reduced to a single generic. Something like this (untested):

pub fn character_length<T: ArrowPrimitiveType>(args: &[ArrayRef]) -> Result<ArrayRef> 
where 
T::Native: StringOffsetSizeTrait {
    let string_array: &GenericStringArray<T::Native> = args[0]
        .as_any()
        .downcast_ref::<GenericStringArray<T::Native>>()
        .unwrap();
    
     string_array
        .iter()
        .map(|x| x.map(|x: &str| T::Native::from_usize(x.graphemes(true).count())).unwrap())
        .collect()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I will have a go. There are a couple more instances this generic from_usize could help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. Impressive you can come up with this in your head. I have incorporated this code.

@seddonm1
Copy link
Contributor Author

Thanks @jorgecarleitao . I will update soon.

@seddonm1
Copy link
Contributor Author

@jorgecarleitao Thanks for the review. I have addressed all of your points.

The T::Native: StringOffsetSizeTrait will be useful in some of the other string functions so thanks 🙏

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Feb 17, 2021

Thanks a lot, @seddonm1 . I went through this PR again and I pull the code locally to understand the reason for the macro.

I now get it that it is not easy to write this. :) I PRed something to your branch with a suggestion to reduce the risk of using that function (which is unsafe due to the typed) on non-string arrays and make it just use generics, that imo are usually easier to read because the constraints are specific.

@codecov-io
Copy link

codecov-io commented Feb 17, 2021

Codecov Report

Merging #9509 (17740e9) into master (e2d6c05) will increase coverage by 0.12%.
The diff coverage is 88.05%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9509      +/-   ##
==========================================
+ Coverage   82.15%   82.28%   +0.12%     
==========================================
  Files         243      244       +1     
  Lines       54908    55517     +609     
==========================================
+ Hits        45108    45680     +572     
- Misses       9800     9837      +37     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/functions.rs 72.36% <68.49%> (-3.95%) ⬇️
rust/datafusion/src/logical_plan/expr.rs 81.13% <80.00%> (+0.94%) ⬆️
rust/arrow/src/compute/kernels/length.rs 100.00% <100.00%> (ø)
...datafusion/src/physical_plan/string_expressions.rs 69.62% <100.00%> (+2.49%) ⬆️
rust/arrow/src/array/equal/utils.rs 75.49% <0.00%> (-0.99%) ⬇️
rust/arrow/src/record_batch.rs 83.70% <0.00%> (-0.63%) ⬇️
rust/datafusion/src/physical_plan/limit.rs 57.47% <0.00%> (-0.49%) ⬇️
rust/datafusion/src/physical_plan/planner.rs 79.58% <0.00%> (-0.05%) ⬇️
rust/datafusion/src/logical_plan/plan.rs 82.45% <0.00%> (-0.04%) ⬇️
rust/benchmarks/src/bin/tpch.rs 38.33% <0.00%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e2d6c05...17740e9. Read the comment docs.

@seddonm1
Copy link
Contributor Author

@jorgecarleitao Sorry I updated this but forgot to update you. Once this is merged I can do more of the big Postgres string functions PRs.

@alamb
Copy link
Contributor

alamb commented Feb 19, 2021

The integration test failure appears unrelated https://github.com/apache/arrow/pull/9509/checks?check_run_id=1921983282

I will restart it to try and get a clean run


[INFO] Final Memory: 176M/780M
[INFO] ------------------------------------------------------------------------
Error:  Failed to execute goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.0:compile (src) on project flight-core: Execution src of goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.0:compile failed: Plugin org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.0 or one of its dependencies could not be resolved: Could not transfer artifact org.apache.maven.plugin-tools:maven-plugin-annotations:jar:3.4 from/to central (https://repo.maven.apache.org/maven2): Connection reset -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
Error:  
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <goals> -rf :flight-core
Destroying 3 processes
Destroying process..
Destroying process..
Destroying process..

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @seddonm1 -- I think this is ready to go once it is green

@alamb
Copy link
Contributor

alamb commented Feb 20, 2021

@seddonm1 we are so close -- now there is a clippy error: https://github.com/apache/arrow/pull/9509/checks?check_run_id=1922005826

ror: this function's return value is unnecessarily wrapped by `Result`
   --> arrow/src/compute/kernels/length.rs:86:1
    |
86  | / fn bit_length_impl<O: StringOffsetSizeTrait, T: ArrowPrimitiveType>(
87  | |     array: &dyn Array,
88  | | ) -> Result<ArrayRef>
89  | | where
...   |
99  | |     }))
100 | | }
    | |_^
    |
    = note: `-D clippy::unnecessary-wraps` implied by `-D warnings`
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_wraps
help: remove `Result` from the return type...
    |
88  | ) -> std::sync::Arc<(dyn array::array::Array + 'static)>
    |      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: ...and change the returning expressions
    |
97  |     unary_offsets_string::<O, _>(array, T::DATA_TYPE, |x| {
98  |         x * bits_in_bytes
99  |     })
    |

error: aborting due to previous error

error: could not compile `arrow`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: this function's return value is unnecessarily wrapped by `Result`
   --> arrow/src/compute/kernels/length.rs:86:1
    |
86  | / fn bit_length_impl<O: StringOffsetSizeTrait, T: ArrowPrimitiveType>(
87  | |     array: &dyn Array,
88  | | ) -> Result<ArrayRef>
89  | | where
...   |
99  | |     }))
100 | | }
    | |_^
    |
    = note: `-D clippy::unnecessary-wraps` implied by `-D warnings`
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_wraps
help: remove `Result` from the return type...
    |
88  | ) -> std::sync::Arc<(dyn array::array::Array + 'static)>
    |      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: ...and change the returning expressions
    |
97  |     unary_offsets_string::<O, _>(array, T::DATA_TYPE, |x| {
98  |         x * bits_in_bytes
99  |     })
    |

@seddonm1
Copy link
Contributor Author

@alamb done and ready. Thanks to @jorgecarleitao for his elite knowledge of trait implementations.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, @seddonm1 ! Ready to go from my side.

@alamb alamb closed this in abf967d Feb 21, 2021
@alamb
Copy link
Contributor

alamb commented Feb 21, 2021

Thanks again @seddonm1 -- this is epic work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants