Skip to content

Conversation

@seddonm1
Copy link
Contributor

This PR renames the length kernel to octet_length to clearly indicate what it returns and allows differentiation from character_length. The use of the term octet could be replaced with bytes but was chosen given there is an ANSI SQL function octet_length.

I have created the correct character_length function as part of #9243.

Issue
The rust length kernel currently counts number of bytes/octets which may or may not be the same as the number of characters given that Arrow uses UTF8 encoding. This means that the result of the length kernel on a string like josé will be 5 bytes rather than 4 characters.

@github-actions
Copy link

@jorgecarleitao
Copy link
Member

I agree that it may be misleading, but from Rust's perspective, it is not "incorrect" to use length to denote the number of bytes of a string: String::len uses the same convention, and you need to use s.chars().count() to call the number of characters.

This also collides with #9353, where length is extended to support ListArray and BinaryArray.

One idea is to keep the name as is on the arrow crate, but name it octet_length on DataFusion's SQL and API (to be consistent with Postgres).

@seddonm1
Copy link
Contributor Author

No problem. I will close this PR and raise one with the function comments updated to clarify it's intended behavior.

@seddonm1 seddonm1 closed this Jan 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants