ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs] #9243

seddonm1 · 2021-01-18T06:55:18Z

This PR starts the large work of implementing the Postgres String functions. Most of these are naive implementations but the tests should allow rapid performance enhancement without regressions.

| function | ansi | postgres | done | notes |
|---|---|---|---|---|
| \|\| | x | | | need to test parser |
| IS NORMALIZED | x | | | need to test parser |
| bit_length | x | | x | |
| char_length | x | | x | |
| character_length | x | | x | |
| lower | x | | x | |
| normalize | x | | | need to understand unicode normalization |
| octet_length | x | | x | |
| overlay | x | | | requires parser change but logic is implemented below |
| position | x | | | requires parser change but logic is implemented below |
| substring | x | | | requires parser change but logic is implemented below |
| trim | x | | | requires parser change but logic is implemented below |
| upper | x | | x | |
| ascii | | x | x | |
| btrim | | x | x | |
| chr | | x | x | |
| concat | | x | x | |
| concat_ws | | x | x | |
| format | | x | | this will take significant effort without external crates as it needs a `sprintf` implementation |
| initcap | | x | x | |
| left | | x | x | |
| length | | x | x | |
| lpad | | x | x | |
| ltrim | | x | x | |
| md5 | | x | x | |
| parse_ident | | x | | need to read postgres code |
| pg_client_encoding | | x | | N/A |
| quote_ident | | x | | need to read postgres code |
| quote_literal | | x | | need to read postgres code |
| quote_nullable | | x | | need to read postgres code |
| regexp_match | | x | | requires FromIterator[ListArray] |
| regexp_matches | | x | | requires setof |
| regexp_replace | | x | x | |
| regexp_split_to_array | | x | | requires FromIterator[ListArray] |
| regexp_split_to_table | | x | | requires setof |
| repeat | | x | x | |
| replace | | x | x | |
| reverse | | x | x | |
| right | | x | x | |
| rpad | | x | x | |
| rtrim | | x | x | |
| split_part | | x | x | |
| strpos | | x | x | |
| substr | | x | x | |
| starts_with | | x | x | |
| to_ascii | | x | | this will need an external crate |
| to_hex | | x | x | |
| translate | | x | x | |

Changes
I have had to make some changes to the existing implementations:

concat handled NULLs incorrectly: any NULL argument made the whole result NULL, whereas the Postgres documentation states that NULL arguments are ignored.
ltrim and rtrim were implemented to support only the default space character whereas Postgres supports an optional second parameter: ltrim('zzzytest', 'xyz') so that has been updated.
The length kernel returns bytes, not characters. character_length has been reimplemented but requires importing the unicode-segmentation crate. The comments for length have been updated.
I have reworked the tests considerably so that they are easier to add and maintain at a slight performance penalty.
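
To make the behaviour changes above concrete, here is a minimal sketch on plain Rust values. The helper functions are hypothetical stand-ins, not the Arrow/DataFusion kernels from this PR, and `char_length` counts Unicode scalar values using only the standard library (the PR itself relies on the unicode-segmentation crate).

```rust
// Hypothetical helpers modelling the Postgres semantics described above.
// They operate on plain Rust values, not on Arrow arrays.

/// concat: NULL (`None`) arguments are ignored instead of making the result NULL.
fn concat(args: &[Option<&str>]) -> String {
    args.iter().flatten().copied().collect()
}

/// ltrim: strip every leading character found in `chars` (default: a single space).
fn ltrim<'a>(s: &'a str, chars: Option<&str>) -> &'a str {
    let set: Vec<char> = chars.unwrap_or(" ").chars().collect();
    s.trim_start_matches(set.as_slice())
}

/// octet_length: number of bytes in the UTF-8 encoding.
fn octet_length(s: &str) -> usize {
    s.len()
}

/// char_length: number of characters (assumption: counted as Unicode scalar values).
fn char_length(s: &str) -> usize {
    s.chars().count()
}
```

With these definitions, `concat(&[Some("a"), None, Some("b")])` yields `"ab"`, `ltrim("zzzytest", Some("xyz"))` yields `"test"`, and a string containing one two-byte character reports an `octet_length` one larger than its `char_length`.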

Questions

@jorgecarleitao I think we need this Signature::OneOf vs Signature::Uniform. This came up with a left function that takes a (utf8, int64) signature and it is not correct to try to cast both to utf8. You can see my implementation here but perhaps you have a better method.

github-actions · 2021-01-18T06:55:42Z

https://issues.apache.org/jira/browse/ARROW-11298

rust/datafusion/src/physical_plan/string_expressions.rs

codecov-io · 2021-01-21T07:53:57Z

Codecov Report

Merging #9243 (fa02182) into master (924449e) will increase coverage by 0.35%.
The diff coverage is 92.03%.

@@            Coverage Diff             @@
##           master    #9243      +/-   ##
==========================================
+ Coverage   82.29%   82.64%   +0.35%     
==========================================
  Files         244      245       +1     
  Lines       55616    57408    +1792     
==========================================
+ Hits        45767    47443    +1676     
- Misses       9849     9965     +116

Impacted Files	Coverage Δ
rust/datafusion/src/logical_plan/expr.rs	`81.56% <ø> (+0.42%)`	⬆️
...datafusion/src/physical_plan/string_expressions.rs	`83.21% <85.35%> (+13.59%)`	⬆️
rust/datafusion/src/physical_plan/functions.rs	`89.43% <92.49%> (+15.61%)`	⬆️
rust/datafusion/src/physical_plan/type_coercion.rs	`96.91% <97.05%> (-1.71%)`	⬇️
rust/arrow/src/compute/kernels/bit_length.rs	`100.00% <100.00%> (ø)`
rust/datafusion/tests/sql.rs	`99.93% <100.00%> (+<0.01%)`	⬆️
rust/parquet/src/encodings/encoding.rs	`94.86% <0.00%> (-0.20%)`	⬇️
rust/arrow/src/compute/kernels/cast.rs	`97.40% <0.00%> (+0.12%)`	⬆️
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao

Hey @seddonm1 , I went through this and it looks great so far. Impressive work 💯

I left some comments.

rust/datafusion/src/physical_plan/string_expressions.rs

rust/datafusion/src/physical_plan/functions.rs

rust/datafusion/src/physical_plan/string_expressions.rs

rust/datafusion/src/physical_plan/functions.rs

jorgecarleitao · 2021-01-22T05:00:13Z

rust/datafusion/src/physical_plan/type_coercion.rs

Won't this coerce any type to the first variant, even if a later variant is accepted?

I.e. if we use

Uniform(vec![ vec![vec![A]], vec![vec![B]], ])

and pass arg types vec![B], I would expect that no coercion would happen, but I suspect that this will coerce B to A, because the first entry with the same number of arguments is vec![vec![A]].

I suggest that we PR this separately with a single function that requires this type of signature, as it requires much more care than the other parts of this PR: it affects all future functions that use it.
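
The concern can be sketched with simplified, hypothetical types; this is not DataFusion's actual `Signature` or coercion code, and the single coercion rule (Utf8 widens to LargeUtf8) is an assumption made only for the example. `first_match` models the suspected behaviour, while `coerce` prefers a variant that matches exactly before falling back to coercion.

```rust
// Hypothetical, simplified model of signature matching (not DataFusion's API).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int64,
}

/// Assumed rule for this sketch only: Utf8 may be widened to LargeUtf8.
fn can_coerce(from: DataType, to: DataType) -> bool {
    from == to || (from == DataType::Utf8 && to == DataType::LargeUtf8)
}

/// Does `args` fit `variant`, either exactly or via the coercion rule?
fn fits(args: &[DataType], variant: &[DataType], exact: bool) -> bool {
    args.len() == variant.len()
        && args
            .iter()
            .zip(variant)
            .all(|(a, v)| if exact { a == v } else { can_coerce(*a, *v) })
}

/// Suspected behaviour: take the first variant the arguments can be coerced to.
fn first_match(args: &[DataType], variants: &[Vec<DataType>]) -> Option<Vec<DataType>> {
    variants.iter().find(|v| fits(args, v, false)).cloned()
}

/// Alternative: an exact match wins; coercion is only the fallback.
fn coerce(args: &[DataType], variants: &[Vec<DataType>]) -> Option<Vec<DataType>> {
    variants
        .iter()
        .find(|v| fits(args, v, true))
        .or_else(|| variants.iter().find(|v| fits(args, v, false)))
        .cloned()
}
```

Given the variants `[[LargeUtf8], [Utf8]]` and the argument types `[Utf8]`, `first_match` selects `[LargeUtf8]` (an unnecessary cast), whereas `coerce` selects `[Utf8]` and leaves the argument untouched.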

Thanks @jorgecarleitao . Yes I will split this out.

A good example is lpad which is either:
[string, int] or [string, int, string]. I am away for a couple of days but will split this out so we can work through it methodically.

@jorgecarleitao as above: https://github.com/seddonm1/arrow/tree/oneof-function-signature

As @jorgecarleitao says, this is another change from this PR that would be great to break out into its own PR.

rust/datafusion/src/physical_plan/string_expressions.rs

seddonm1 · 2021-02-04T21:25:44Z

@alamb @jorgecarleitao @andygrove

I think these are mostly implemented now. Not sure how we want to do the merge given this change is so large.

seddonm1 · 2021-02-10T21:24:17Z

@andygrove @alamb @jorgecarleitao
Here is the big PR that I was talking about in the Arrow call. I can rebase easily enough but I guess apart from the significant number of new lines (a lot of boilerplate) the key question is (from above):

I think we need this Signature::OneOf. A good example is lpad which is either:
[[utf8, largeutf8], int] or [[utf8, largeutf8], int, [utf8, largeutf8]] signature. You can see my implementation here, but perhaps you have a better idea; I don't know who wrote the original code.
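
As an illustration of why lpad needs more than one accepted argument list, here is a small sketch on plain Rust values. The `Arg` enum and `call_lpad` dispatcher are hypothetical and only model the two arities; they are not the `Signature::OneOf` implementation referred to above.

```rust
// Hypothetical argument type for this sketch; not a DataFusion type.
enum Arg<'a> {
    Str(&'a str),
    Int(i64),
}

/// Postgres-style lpad: left-pad `s` to `length` characters with `fill`
/// (default: a space); a string that is already longer is truncated on the right.
fn lpad(s: &str, length: i64, fill: Option<&str>) -> String {
    let target = length.max(0) as usize;
    let chars: Vec<char> = s.chars().collect();
    if chars.len() >= target {
        return chars[..target].iter().collect();
    }
    let fill: Vec<char> = fill.unwrap_or(" ").chars().collect();
    if fill.is_empty() {
        // Nothing to pad with, so return the input unchanged.
        return s.to_string();
    }
    let mut out: String = fill.iter().cycle().take(target - chars.len()).collect();
    out.push_str(s);
    out
}

/// Accept either (string, int) or (string, int, string), mirroring the two signatures.
fn call_lpad(args: &[Arg<'_>]) -> Result<String, String> {
    match args {
        [Arg::Str(s), Arg::Int(n)] => Ok(lpad(*s, *n, None)),
        [Arg::Str(s), Arg::Int(n), Arg::Str(f)] => Ok(lpad(*s, *n, Some(*f))),
        _ => Err("lpad expects (string, int) or (string, int, string)".to_string()),
    }
}
```

Any other argument list, such as a lone integer, is rejected instead of being coerced into one of the two shapes.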

alamb · 2021-02-11T20:31:57Z

I plan to try and review this probably this weekend. I wonder if we should update the title to remove the "WIP"

alamb · 2021-02-11T20:32:36Z

I think the Clippy CI check on this PR is failing due to a new stable rust being released. I am working on a fix here #9476

seddonm1 · 2021-02-11T20:43:37Z

Thanks @alamb . I know the prospect of doing a review like this is not something to look forward to. I will rebase and push soon.

alamb · 2021-02-13T10:59:45Z

I am going in to review this PR -- I am getting second cup of ☕ and settling down for a good read 👓 ...

alamb

First of all, THANK YOU so much @seddonm1 -- this is an Epic body of work and will really drive DataFusion forward and make it so much more useful. I found the code well structured, easy to follow, and well commented. ❤️ ❤️

I spent an hour reviewing this PR -- what I saw was great, but I will be honest that by the end of that hour my mind was quite exhausted and I am not sure I would have caught every little thing

My opinion is we should break the content of this PR into separate pieces (rather than try and merge the whole thing) and start putting them in piece by piece.

Here is one possible way to split it up:

bit_length kernels
Signature::Uniform
Length functions (BitLength, etc)
Ascii/unicode functions
Regex functions
Pad/trim functions

If you would like help doing so / creating the tickets I think I can find the time to do so as I think this is a really important PR.

If you don't want to split it up or disagree, I think I would be ok with merging this in as is once @jorgecarleitao is satisfied with the changes to type signatures if we commit ourselves to some post-merge cleanup / splitting up some of this code into smaller modules.

rust/arrow/src/compute/kernels/bit_length.rs

rust/arrow/src/compute/kernels/length.rs

alamb · 2021-02-13T11:16:11Z

rust/datafusion/Cargo.toml

This is something I have been thinking a lot about -- how can we keep DataFusion's dependency stack reasonable (it is already pretty large and it just keeps getting larger).

One thing I was thinking about was making some of these dependencies optional, so that we had features like regex, unicode and hash which would only pull in the dependencies (and provide those functions) if the features were enabled.

What do you think @jorgecarleitao / @andygrove / @ovr ? If it is a reasonable idea (I think we mentioned it before) I will file a JIRA to track?

Yes I was nervous about additional dependencies. Perhaps this topic can be raised at the next Arrow Rust call to agree some sort of assessment criteria.

FWIW, regex / lazy_static are already non-optional dependencies of arrow, so I think not that much can be gained there, unless we make it optional in Arrow as well.

I think it is a good idea to make some features optional, to reduce compile times whenever you are not working on them.

Another thing we can do is to split benchmarks / examples / etc. out of the crate to make compile times a bit shorter, which I started doing here: #9494 and #9493
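
A rough sketch of what feature-gating one of these functions could look like. The feature name `regex_expressions` is hypothetical, and the sketch assumes the `regex` crate would be declared as an optional dependency behind that feature in Cargo.toml; nothing here is taken from the PR.

```rust
// Compiled only when the (hypothetical) `regex_expressions` feature is enabled,
// which is assumed to pull in the optional `regex` dependency.
#[cfg(feature = "regex_expressions")]
fn regexp_replace(input: &str, pattern: &str, replacement: &str) -> Result<String, String> {
    let re = regex::Regex::new(pattern).map_err(|e| e.to_string())?;
    // Replaces the first match only.
    Ok(re.replace(input, replacement).into_owned())
}

// Fallback when the feature is disabled: the function still exists,
// but reports that support was not compiled in.
#[cfg(not(feature = "regex_expressions"))]
fn regexp_replace(_input: &str, _pattern: &str, _replacement: &str) -> Result<String, String> {
    Err("regexp_replace requires the `regex_expressions` feature".to_string())
}
```

Keeping a stub for the disabled case lets callers get a clear error at run time rather than a missing symbol at compile time; whether that trade-off is wanted is a design choice for the project.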

alamb · 2021-02-13T11:18:25Z

rust/datafusion/src/physical_plan/type_coercion.rs

As @jorgecarleitao says, this is another change from this PR that would be great to break out into its own PR.

rust/datafusion/src/physical_plan/string_expressions.rs

rust/datafusion/tests/sql.rs

seddonm1 · 2021-02-14T03:21:22Z

@alamb Thanks for your extreme attention to detail and yes it is absolutely IMPLEMENT ALL THE FUNCTIONS 😆

I have addressed and resolved most of the comments you have made. The remaining unresolved comments do require further discussion.

I am happy to do the split based on your suggestions and I'm ok to raise the tickets:

bit_length kernels + length comments
Signature::OneOf
Length functions (BitLength, etc)
Ascii/unicode functions
Regex functions
Pad/trim functions

Obviously this is a lot of work but this should allow us to split up the reviews more fairly. I will start the PR-mageddon.

alamb · 2021-02-14T11:03:04Z

@seddonm1 -- I merged #9376, which, as you predicted, causes a bunch of conflicts.

Given this PR probably needs a bunch of rework now anyway, splitting it up into pieces while doing so might not be that much extra work.

jorgecarleitao · 2021-02-14T11:10:04Z

@seddonm1, if you need hands, just ping me with the function names you would like me to work on and I will pick them up and PR them to this branch.

…Length Functions

Splitting up #9243

This implements the following functions:

- String functions
  - [x] bit_length
  - [x] char_length
  - [x] character_length
  - [x] length
  - [x] octet_length

Closes #9509 from seddonm1/length-functions

Lead-authored-by: Mike Seddon <seddonm1@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

seddonm1 · 2021-02-21T22:16:43Z

@alamb FYI i have just rebased against master. @Dandandan has already added the Signature::OneOf functionality so I have rebased against that.

alamb · 2021-02-22T13:24:22Z

@seddonm1 just to be clear, your plan is still to merge this branch in in smaller chunks -- e.g. #9509?

@alamb

This PR is a child of #9243

It does a few things that are hard to separate:

- fixes the behavior of `concat` and `trim` functions to be in line with the Postgres implementations
- restructures some of the code base (mainly sorting and adding tests) to facilitate easier testing and implementation of the remainder of #9243

@alamb @jorgecarleitao please review but merging will be dependent on #9507

Closes #9551 from seddonm1/concat

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

@alamb

…, right, rpad

@alamb Another one. Please pay close attention to the type coercion.

It does two things:

- fixes the behavior of the **type coercion**.
- adds the simple functions `left`, `lpad`, `right`, `rpad` following the Postgres style.

This PR is a child of #9243

Closes #9565 from seddonm1/left_ltrim_right_rpad

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

alamb · 2021-03-04T18:57:09Z

switching to draft so it is clear this is not being merged as is and instead is being broken up

@alamb

…, initcap, repeat, reverse, to_hex

@alamb This is the second last of the current string functions but I think there may be one after that with new code.

This implements some of the miscellaneous string functions `ascii`, `chr`, `initcap`, `repeat`, `reverse`, `to_hex`. The next PR will have more useful functions (including regex).

A little bit of tidying for consistency to the other functions was applied.

This PR is a child of #9243

Closes #9625 from seddonm1/ascii-chr-initcap-repeat-reverse-tohex

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>

seddonm1 · 2021-03-17T22:41:29Z

Closed after splitting.

seddonm1 changed the title ~~ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions~~ ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP] Jan 18, 2021

github-actions bot added Component: Rust - DataFusion Component: Rust labels Jan 18, 2021

seddonm1 marked this pull request as draft January 18, 2021 06:56

seddonm1 force-pushed the postgres-string-functions branch from 6691eac to 32333bb Compare January 18, 2021 06:57

seddonm1 commented Jan 21, 2021

rust/datafusion/src/physical_plan/string_expressions.rs (outdated)

jorgecarleitao reviewed Jan 22, 2021

rust/datafusion/src/physical_plan/string_expressions.rs (outdated)

seddonm1 force-pushed the postgres-string-functions branch from 9cdf641 to 5a90cf8 Compare January 27, 2021 21:38

seddonm1 mentioned this pull request Jan 29, 2021

ARROW-11434: [Rust][DataFusion] Rename length kernel to octet_length #9366

Closed

seddonm1 force-pushed the postgres-string-functions branch 2 times, most recently from 799cf68 to a69e099 Compare February 4, 2021 21:24

seddonm1 marked this pull request as ready for review February 8, 2021 21:18

This was referenced Feb 8, 2021

ARROW-11503: [Rust][DataFusion] implement string splitting #9437

Closed

ARROW-10354: [Rust][DataFusion] regexp_extract function to select regex groups from strings #9428

Closed

seddonm1 changed the title ~~ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]~~ ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions Feb 11, 2021

seddonm1 force-pushed the postgres-string-functions branch from a69e099 to b799b66 Compare February 11, 2021 20:59

jorgecarleitao mentioned this pull request Feb 11, 2021

ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions. #9376

Closed

alamb reviewed Feb 13, 2021

seddonm1 mentioned this pull request Feb 16, 2021

ARROW-11651: [Rust][DataFusion] Implement Postgres String Functions: Length Functions #9509

Closed

5 tasks

seddonm1 changed the title ~~ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions~~ ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs] Feb 16, 2021

seddonm1 mentioned this pull request Feb 18, 2021

ARROW-11650: [Rust][DataFusion] Add Postgres License #9507

Closed

rebase

145b108

seddonm1 force-pushed the postgres-string-functions branch from e133704 to 145b108 Compare February 21, 2021 21:33

seddonm1 added 2 commits February 22, 2021 08:35

add license

fa02182

apply generics for DRY

71b9a7a

seddonm1 mentioned this pull request Feb 22, 2021

ARROW-11738: [Rust][DataFusion] Fix Concat and Trim Functions #9551

Closed

seddonm1 mentioned this pull request Feb 24, 2021

ARROW-11655: [Rust][DataFusion] Postgres String Functions: left, lpad, right, rpad #9565

Closed

seddonm1 mentioned this pull request Mar 3, 2021

ARROW-11653: [Rust][DataFusion] Postgres String Functions: ascii, chr, initcap, repeat, reverse, to_hex #9625

Closed

alamb marked this pull request as draft March 4, 2021 18:56

seddonm1 closed this Mar 17, 2021

asfimport mentioned this pull request Apr 26, 2021

[Rust][DataFusion] Implement Postgres String Functions #27198

Closed

7 tasks
