@seddonm1
Contributor

@seddonm1 seddonm1 commented Jan 18, 2021

This PR starts the large work of implementing the Postgres String functions. Most of these are naive implementations but the tests should allow rapid performance enhancement without regressions.

| function | ansi | postgres | done | notes |
|---|---|---|---|---|
| `\|\|` | x | | | need to test parser |
| IS NORMALIZED | x | | | need to test parser |
| bit_length | x | | x | |
| char_length | x | | x | |
| character_length | x | | x | |
| lower | x | | x | |
| normalize | x | | | need to understand unicode normalization |
| octet_length | x | | x | |
| overlay | x | | | requires parser change but logic is implemented below |
| position | x | | | requires parser change but logic is implemented below |
| substring | x | | | requires parser change but logic is implemented below |
| trim | x | | | requires parser change but logic is implemented below |
| upper | x | | x | |
| ascii | | x | x | |
| btrim | | x | x | |
| chr | | x | x | |
| concat | | x | x | |
| concat_ws | | x | x | |
| format | | x | | this will take significant effort without external crates as it needs sprintf implementation |
| initcap | | x | x | |
| left | | x | x | |
| length | | x | x | |
| lpad | | x | x | |
| ltrim | | x | x | |
| md5 | | x | x | |
| parse_ident | | x | | need to read postgres code |
| pg_client_encoding | | x | | N/A |
| quote_ident | | x | | need to read postgres code |
| quote_literal | | x | | need to read postgres code |
| quote_nullable | | x | | need to read postgres code |
| regexp_match | | x | | requires FromIterator[ListArray] |
| regexp_matches | | x | | requires setof |
| regexp_replace | | x | x | |
| regexp_split_to_array | | x | | requires FromIterator[ListArray] |
| regexp_split_to_table | | x | | requires setof |
| repeat | | x | x | |
| replace | | x | x | |
| reverse | | x | x | |
| right | | x | x | |
| rpad | | x | x | |
| rtrim | | x | x | |
| split_part | | x | x | |
| strpos | | x | x | |
| substr | | x | x | |
| starts_with | | x | x | |
| to_ascii | | x | | this will need an external crate |
| to_hex | | x | x | |
| translate | | x | x | |

Changes
I have had to make some changes to the existing implementations:

  • concat handled NULLs incorrectly: any NULL argument produced a NULL result, whereas the Postgres documentation states that NULL arguments are ignored.
  • ltrim and rtrim supported only the default space character, whereas Postgres accepts an optional second parameter, e.g. ltrim('zzzytest', 'xyz'), so both have been updated.
  • length kernel returns bytes not characters. character_length has been reimplemented but requires an import of the unicode-segmentation crate. The comments have been updated for length.
  • I have reworked the tests considerably so that they are easier to add and maintain at a slight performance penalty.
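
As a rough illustration of the first three behavior changes listed above, here is a std-only Rust sketch. It is not the DataFusion code: the function names simply mirror the SQL functions, `Option<&str>` is an assumed stand-in for a nullable Arrow value, and `chars().count()` (Unicode scalar values) is an assumed substitute for the grapheme counting that the unicode-segmentation crate provides.

```rust
/// concat: NULL (None) arguments are skipped rather than poisoning the result.
fn concat(args: &[Option<&str>]) -> String {
    args.iter().flatten().copied().collect()
}

/// ltrim: strips any leading character found in `characters`
/// (defaults to a single space when the second argument is omitted).
fn ltrim<'a>(s: &'a str, characters: Option<&str>) -> &'a str {
    let set = characters.unwrap_or(" ");
    s.trim_start_matches(|c: char| set.contains(c))
}

/// octet_length: number of bytes in the UTF-8 encoding.
fn octet_length(s: &str) -> usize {
    s.len()
}

/// character_length: number of characters (here: Unicode scalar values).
fn character_length(s: &str) -> usize {
    s.chars().count()
}
```

For a string such as "jos\u{e9}" the two length functions disagree (5 bytes, 4 characters), which is the distinction the updated comments on length are meant to make explicit.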

Questions

  • @jorgecarleitao I think we need this Signature::OneOf vs Signature::Uniform. This came up with a left function that takes a (utf8, int64) signature and it is not correct to try to cast both to utf8. You can see my implementation here but perhaps you have a better method.

@seddonm1 seddonm1 changed the title ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP] Jan 18, 2021
@codecov-io

codecov-io commented Jan 21, 2021

Codecov Report

Merging #9243 (fa02182) into master (924449e) will increase coverage by 0.35%.
The diff coverage is 92.03%.

@@            Coverage Diff             @@
##           master    #9243      +/-   ##
==========================================
+ Coverage   82.29%   82.64%   +0.35%     
==========================================
  Files         244      245       +1     
  Lines       55616    57408    +1792     
==========================================
+ Hits        45767    47443    +1676     
- Misses       9849     9965     +116     
| Impacted Files | Coverage Δ | |
|---|---|---|
| rust/datafusion/src/logical_plan/expr.rs | 81.56% <ø> (+0.42%) | ⬆️ |
| ...datafusion/src/physical_plan/string_expressions.rs | 83.21% <85.35%> (+13.59%) | ⬆️ |
| rust/datafusion/src/physical_plan/functions.rs | 89.43% <92.49%> (+15.61%) | ⬆️ |
| rust/datafusion/src/physical_plan/type_coercion.rs | 96.91% <97.05%> (-1.71%) | ⬇️ |
| rust/arrow/src/compute/kernels/bit_length.rs | 100.00% <100.00%> (ø) | |
| rust/datafusion/tests/sql.rs | 99.93% <100.00%> (+<0.01%) | ⬆️ |
| rust/parquet/src/encodings/encoding.rs | 94.86% <0.00%> (-0.20%) | ⬇️ |
| rust/arrow/src/compute/kernels/cast.rs | 97.40% <0.00%> (+0.12%) | ⬆️ |
| ... and 7 more | | |

Member

@jorgecarleitao jorgecarleitao left a comment

Hey @seddonm1 , I went through this and it looks great so far. Impressive work 💯

I left some comments.

Member

Won't this coerce any type to the first variant, even if the latter variant is accepted?

I.e. if we use

Uniform(vec![
    vec![vec![A]],
    vec![vec![B]],
])

and pass arg types vec![B], I would expect that no coercion would happen, but I suspect that this will coerce B to A, because the first entry with the same number of arguments is vec![vec![A]].
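
To make the concern concrete, here is a small self-contained Rust sketch of one possible resolution order: try an exact match across all variants first, and only fall back to the first castable variant when no exact match exists. Everything in it (the `DataType` enum, `can_cast`, `coerce`) is hypothetical and written for illustration only; it is not DataFusion's type-coercion code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Int64,
}

/// Hypothetical cast rules: strings cast to each other, integers cast to strings.
fn can_cast(from: DataType, to: DataType) -> bool {
    from == to
        || matches!(
            (from, to),
            (DataType::Utf8, DataType::LargeUtf8)
                | (DataType::LargeUtf8, DataType::Utf8)
                | (DataType::Int64, DataType::Utf8)
                | (DataType::Int64, DataType::LargeUtf8)
        )
}

/// Returns the argument types to coerce to, or None if no variant accepts `args`.
fn coerce(args: &[DataType], variants: &[Vec<DataType>]) -> Option<Vec<DataType>> {
    // Pass 1: an exact match means no coercion at all, whatever its position.
    if let Some(v) = variants.iter().find(|v| v.as_slice() == args) {
        return Some(v.clone());
    }
    // Pass 2: otherwise take the first variant of the same arity we can cast to.
    variants
        .iter()
        .find(|v| {
            v.len() == args.len()
                && args.iter().zip(v.iter()).all(|(a, t)| can_cast(*a, *t))
        })
        .cloned()
}
```

With variants `[[Utf8], [LargeUtf8]]`, passing `[LargeUtf8]` stays `LargeUtf8` under this ordering, whereas a pure first-match strategy would have cast it to `Utf8`.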

Member

I suggest that we PR this separately, with a single function that requires this type of signature. It requires much more care than the other parts of this PR because it affects all future functions that use it.

Contributor Author

Thanks @jorgecarleitao . Yes I will split this out.

A good example is lpad which is either:
[string, int] or [string, int, string]. I am away a couple of days but will split this out so we can work through it methodically.
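
For reference, the two lpad call shapes mentioned above can be sketched in plain Rust with an optional third argument. This is an illustrative stand-in following the documented Postgres behavior (pad on the left, truncate on the right when the input is too long), not the implementation in this PR; the use of `usize` for the length and `Option<&str>` for the fill are simplifying assumptions.

```rust
/// lpad(string, length [, fill]): left-pads `s` to `length` characters by
/// repeating `fill` (a space by default). If `s` is already longer than
/// `length`, it is truncated on the right.
fn lpad(s: &str, length: usize, fill: Option<&str>) -> String {
    let fill = fill.unwrap_or(" ");
    let chars: Vec<char> = s.chars().collect();

    // Input is long enough: keep only the first `length` characters.
    if length <= chars.len() {
        return chars[..length].iter().collect();
    }
    // Assumption: an empty fill string leaves the input unchanged.
    if fill.is_empty() {
        return s.to_string();
    }

    // Cycle through the fill characters until the gap is filled.
    let pad_len = length - chars.len();
    let mut out: String = fill.chars().cycle().take(pad_len).collect();
    out.push_str(s);
    out
}
```

The two-argument form corresponds to `lpad("hi", 5, None)` and the three-argument form to `lpad("hi", 5, Some("xy"))`, which is why a single uniform signature cannot describe the function.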

Contributor

As @jorgecarleitao says, this is another change from this PR that would be great to break out into its own PR.

@seddonm1 seddonm1 force-pushed the postgres-string-functions branch from 9cdf641 to 5a90cf8 Compare January 27, 2021 21:38
@seddonm1 seddonm1 force-pushed the postgres-string-functions branch 2 times, most recently from 799cf68 to a69e099 Compare February 4, 2021 21:24
@seddonm1
Contributor Author

seddonm1 commented Feb 4, 2021

@alamb @jorgecarleitao @andygrove

I think these are mostly implemented now. Not sure how we want to do the merge given this change is so large.

@seddonm1
Contributor Author

seddonm1 commented Feb 10, 2021

@andygrove @alamb @jorgecarleitao
Here is the big PR that I was talking about in the Arrow call. I can rebase easily enough but I guess apart from the significant number of new lines (a lot of boilerplate) the key question is (from above):

I think we need this Signature::OneOf. A good example is lpad which is either:
[[utf8, largeutf8], int] or [[utf8, largeutf8], int, [utf8, largeutf8]] signature. You can see my implementation here but perhaps you have a better idea; I don't know who wrote the original code.

@alamb
Contributor

alamb commented Feb 11, 2021

I plan to try and review this probably this weekend. I wonder if we should update the title to remove the "WIP"

@alamb
Contributor

alamb commented Feb 11, 2021

I think the Clippy CI check on this PR is failing due to a new stable rust being released. I am working on a fix here #9476

@seddonm1 seddonm1 changed the title ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP] ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions Feb 11, 2021
@seddonm1
Contributor Author

Thanks @alamb . I know the prospect of doing a review like this is not something to look forward to. I will rebase and push soon.

@alamb
Contributor

alamb commented Feb 13, 2021

I am going in to review this PR -- I am getting second cup of ☕ and settling down for a good read 👓 ...

Contributor

@alamb alamb left a comment

First of all, THANK YOU so much @seddonm1 -- this is an Epic body of work and will really drive DataFusion forward and make it so much more useful. I found the code well structured, easy to follow, and well commented. ❤️ ❤️

I spent an hour reviewing this PR -- what I saw was great, but I will be honest that by the end of that hour my mind was quite exhausted and I am not sure I would have caught every little thing.

My opinion is we should break the content of this PR into separate pieces (rather than try and merge the whole thing) and start putting them in piece by piece.

Here is one possible way to split it up:

  1. bit_length kernels
  2. Signature::Uniform
  3. Length functions (BitLength, etc)
  4. Ascii/unicode functions
  5. Regex functions
  6. Pad/trim functions

If you would like help doing so / creating the tickets I think I can find the time to do so as I think this is a really important PR.

If you don't want to split it up or disagree, I think I would be ok with merging this in as is once @jorgecarleitao is satisfied with the changes to type signatures if we commit ourselves to some post-merge cleanup / splitting up some of this code into smaller modules.

Contributor

This is something I have been thinking a lot about -- how can we keep DataFusion's dependency stack reasonable (it is already pretty large and it just keeps getting larger).

One thing I was thinking about was making some of these dependencies optional (so that we had features like regex and unicode and hash, which would only pull in the dependencies / have those functions if the features were enabled).

What do you think @jorgecarleitao / @andygrove / @ovr ? If it is a reasonable idea (I think we mentioned it before) I will file a JIRA to track?
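
As a sketch of what the optional-feature idea could look like in Rust, the snippet below gates a function body behind a Cargo feature. The feature name `unicode_expressions` is an assumption made for illustration (it would also need to be declared in Cargo.toml), and the error-returning fallback is just one possible design, not something agreed in this thread.

```rust
/// Built only when the (hypothetical) `unicode_expressions` feature is enabled.
#[cfg(feature = "unicode_expressions")]
pub fn reverse(s: &str) -> Result<String, String> {
    // A real build could call into an optional Unicode crate here.
    Ok(s.chars().rev().collect())
}

/// Fallback when the feature is disabled: the function still exists,
/// but reports that support was compiled out.
#[cfg(not(feature = "unicode_expressions"))]
pub fn reverse(_s: &str) -> Result<String, String> {
    Err("reverse requires the `unicode_expressions` feature".to_string())
}
```

Keeping a stub under `cfg(not(...))` means callers compile either way and get a clear runtime error, at the cost of a little duplicated signature code.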

Contributor Author

Yes I was nervous about additional dependencies. Perhaps this topic can be raised at the next Arrow Rust call to agree some sort of assessment criteria.

Contributor

FWIW, regex / lazy_static are already non-optional dependencies of arrow, so I think not that much can be gained there, unless we make it optional in Arrow as well.

I think it is a good idea to make some features optional, to reduce compile times whenever you are not working on them.

Another thing we can do is split benchmarks / examples / etc. out of the crate to make compile times a bit shorter, which I started doing here: #9494 and #9493

@seddonm1
Contributor Author

@alamb Thanks for your extreme attention to detail and yes it is absolutely IMPLEMENT ALL THE FUNCTIONS 😆

I have addressed and resolved most of the comments you have made. The remaining unresolved comments do require further discussion.

I am happy to do the split based on your suggestions and I'm ok to raise the tickets:

  • bit_length kernels + length comments
  • Signature::OneOf
  • Length functions (BitLength, etc)
  • Ascii/unicode functions
  • Regex functions
  • Pad/trim functions

Obviously this is a lot of work but this should allow us to split up the reviews more fairly. I will start the PR-mageddon.

@alamb
Contributor

alamb commented Feb 14, 2021

@seddonm1 -- I merged #9376, which, as you predicted, causes a bunch of conflicts.

Given this PR probably needs a bunch of rework now anyway, splitting it up into pieces while doing so might not be that much extra work.

@jorgecarleitao
Member

@seddonm1, if you need hands, just ping me with function names you would like me to work on and I will pick them up and PR them to this branch.

@seddonm1 seddonm1 changed the title ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs] Feb 16, 2021
alamb pushed a commit that referenced this pull request Feb 21, 2021
…Length Functions

Splitting up #9243

This implements the following functions:

- String functions
  - [x] bit_length
  - [x] char_length
  - [x] character_length
  - [x] length
  - [x] octet_length

Closes #9509 from seddonm1/length-functions

Lead-authored-by: Mike Seddon <seddonm1@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@seddonm1 seddonm1 force-pushed the postgres-string-functions branch from e133704 to 145b108 Compare February 21, 2021 21:33
@seddonm1
Contributor Author

@alamb FYI I have just rebased against master. @Dandandan has already added the Signature::OneOf functionality so I have rebased against that.

@alamb
Contributor

alamb commented Feb 22, 2021

@seddonm1 just to be clear, your plan is still to merge this branch in in smaller chunks -- e.g. #9509?

alamb pushed a commit that referenced this pull request Feb 24, 2021
This PR is a child of #9243

It does a few things that are hard to separate:

- fixes the behavior of `concat` and `trim` functions to be in line with the Postgres implementations
- restructures some of the code base (mainly sorting and adding tests) to facilitate easier testing and implementation of the remainder of #9243

@alamb @jorgecarleitao
please review but merging will be dependent on #9507

Closes #9551 from seddonm1/concat

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb pushed a commit that referenced this pull request Mar 3, 2021
…, right, rpad

@alamb Another one. Please pay close attention to the type coercion.

It does two things:
- fixes the behavior of the **type coercion**.
- adds the simple functions `left`, `lpad`, `right`, `rpad` following the Postgres style.

This PR is a child of #9243

Closes #9565 from seddonm1/left_ltrim_right_rpad

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb alamb marked this pull request as draft March 4, 2021 18:56
@alamb
Contributor

alamb commented Mar 4, 2021

switching to draft so it is clear this is not being merged as is and instead is being broken up

alamb pushed a commit that referenced this pull request Mar 5, 2021
…, initcap, repeat, reverse, to_hex

@alamb This is the second last of the current string functions but I think there may be one after that with new code.

This implements some of the miscellaneous string functions `ascii`, `chr`, `initcap`, `repeat`, `reverse`, `to_hex`. The next PR will have more useful functions (including regex).
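
For readers unfamiliar with these functions, here is a minimal std-only Rust sketch of four of them on scalar values. It follows the behavior described in the Postgres documentation and is an illustration only: the real implementations operate on Arrow arrays, and the signatures below (plain `&str`, `u32`, `i64`) are assumptions.

```rust
/// ascii: numeric code of the first character, 0 for an empty string.
fn ascii(s: &str) -> i32 {
    s.chars().next().map_or(0, |c| c as i32)
}

/// chr: character with the given code point, None if the code is invalid.
fn chr(code: u32) -> Option<char> {
    char::from_u32(code)
}

/// initcap: upper-cases the first letter of each word and lower-cases the
/// rest; words are runs of alphanumeric characters.
fn initcap(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut start_of_word = true;
    for c in s.chars() {
        if c.is_alphanumeric() {
            if start_of_word {
                out.extend(c.to_uppercase());
            } else {
                out.extend(c.to_lowercase());
            }
            start_of_word = false;
        } else {
            out.push(c);
            start_of_word = true;
        }
    }
    out
}

/// to_hex: lower-case hexadecimal representation of an integer.
fn to_hex(n: i64) -> String {
    format!("{:x}", n)
}
```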

A little bit of tidying for consistency to the other functions was applied.

This PR is a child of #9243

Closes #9625 from seddonm1/ascii-chr-initcap-repeat-reverse-tohex

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@seddonm1
Contributor Author

Closed after splitting.

@seddonm1 seddonm1 closed this Mar 17, 2021
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…Length Functions

Splitting up apache/arrow#9243

This implements the following functions:

- String functions
  - [x] bit_length
  - [x] char_length
  - [x] character_length
  - [x] length
  - [x] octet_length

Closes #9509 from seddonm1/length-functions

Lead-authored-by: Mike Seddon <seddonm1@gmail.com>
Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>