ARROW-9100: [C++] Add ascii_lower kernel #7357
@@ -48,6 +48,16 @@ struct AsciiUpper {
   }
 };

+struct AsciiLower {
+  template <typename... Ignored>
+  static std::string Call(KernelContext*, const util::string_view& val) {
+    std::string result = val.to_string();
+    std::transform(result.begin(), result.end(), result.begin(),
+                   [](unsigned char c) { return std::tolower(c); });
Member
Please don't use
Member
Also, please fix
Member
Note that I used
+    return result;
+  }
+};
+
 void AddAsciiLength(FunctionRegistry* registry) {
   auto func = std::make_shared<ScalarFunction>("ascii_length", Arity::Unary());
   ArrayKernelExec exec_offset_32 =
@@ -108,6 +118,7 @@ void AddStrptime(FunctionRegistry* registry) {

 void RegisterScalarStringAscii(FunctionRegistry* registry) {
   MakeUnaryStringToString<AsciiUpper>("ascii_upper", registry);
+  MakeUnaryStringToString<AsciiLower>("ascii_lower", registry);
   AddAsciiLength(registry);
   AddStrptime(registry);
 }
This is completely inefficient. We should write directly into the allocated array. The way this is architected needs rethinking.
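To make "write directly into the allocated array" concrete, here is a minimal sketch that is not Arrow code. ASCII lowercasing never changes a string's byte length, so the whole contiguous data buffer can be transformed in one pass into a preallocated output of the same size, with no per-value `std::string` temporaries. `AsciiToLower` and `AsciiLowerBuffer` are hypothetical names used only for illustration, and the byte-range check is an assumed locale-independent way to lowercase, not what this PR does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Locale-independent ASCII lowercasing of a single byte: only 'A'..'Z'
// are changed, every other byte (including bytes >= 0x80) is returned as-is.
inline uint8_t AsciiToLower(uint8_t c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<uint8_t>(c + ('a' - 'A')) : c;
}

// Hypothetical helper (not an Arrow API): lowercase `length` bytes from
// `input` into the preallocated `output`. No intermediate strings are built;
// the caller owns both buffers and guarantees `output` holds `length` bytes.
inline void AsciiLowerBuffer(const uint8_t* input, std::size_t length,
                             uint8_t* output) {
  assert(length == 0 || (input != nullptr && output != nullptr));
  for (std::size_t i = 0; i < length; ++i) {
    output[i] = AsciiToLower(input[i]);
  }
}
```

In a real kernel, `input` would be the values buffer of the string array and `output` a freshly allocated buffer of the same size; how that allocation is done inside Arrow is outside this sketch.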
Yes, @xhochy and I discussed this above. The question is: do we want to get a few functions in and then start iterating on a strategy (for ascii it's trivial, for utf8 less so), or not? I'm happy to explore different strategies first as well.
I think we should start finding a sane strategy right now for the existing `ascii_upper` kernel. Indeed, a different strategy will be needed for utf8upper.
In both cases, this probably means implementing those kernels as "vector" kernels, not "scalar" kernels (because the latter implies you only process one item at a time).
For `ascii_upper`, the strategy should be simple:
Agree. Happy to look into turning this into a 'vector' kernel; if you have some pointers, that would be great. This also answers a question I had about whether it is possible to operate on a full array/chunk in the existing framework: I guess that's a yes.
Any reason not to reuse the offset buffer, even if the offset is not 0?
What is the advantage of scalar kernels in Arrow? Apart from avoiding boilerplate? Also wondering if this is related to the Gandiva connection.
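The offset-buffer question can be pictured with a toy model, sketched here under stated assumptions. `ToyStringArray` is a hypothetical stand-in and not Arrow's array layout or API. Because ASCII case changes keep every value's byte length, the output can share the input's offsets buffer as long as the output data keeps each byte at the same position. With a nonzero array offset, only the byte range the slice actually references needs to be transformed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a sliced string array (NOT an Arrow type):
// value i occupies data[offsets[offset + i] .. offsets[offset + i + 1]).
struct ToyStringArray {
  std::shared_ptr<const std::vector<int32_t>> offsets;
  std::shared_ptr<const std::vector<uint8_t>> data;
  std::size_t offset;  // index of the first logical value
  std::size_t length;  // number of logical values

  std::string Get(std::size_t i) const {
    assert(i < length);
    const int32_t begin = (*offsets)[offset + i];
    const int32_t end = (*offsets)[offset + i + 1];
    return std::string(data->begin() + begin, data->begin() + end);
  }
};

// Lowercase the values of `in`, sharing (not copying) its offsets buffer.
inline ToyStringArray AsciiLowerSharingOffsets(const ToyStringArray& in) {
  assert(in.offsets->size() >= in.offset + in.length + 1);
  const std::size_t begin = static_cast<std::size_t>((*in.offsets)[in.offset]);
  const std::size_t end =
      static_cast<std::size_t>((*in.offsets)[in.offset + in.length]);
  // Same size as the input data so that the shared offsets remain valid;
  // bytes outside [begin, end) are never read through this slice.
  auto out_data = std::make_shared<std::vector<uint8_t>>(in.data->size());
  for (std::size_t i = begin; i < end; ++i) {
    const uint8_t c = (*in.data)[i];
    (*out_data)[i] =
        (c >= 'A' && c <= 'Z') ? static_cast<uint8_t>(c + ('a' - 'A')) : c;
  }
  return ToyStringArray{in.offsets, out_data, in.offset, in.length};
}
```

The trade-off in this sketch is that the output data buffer is as large as the input's even for a small slice; rebasing the offsets would avoid that at the cost of writing a new offsets buffer.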
Avoiding boilerplate is the main motivation for scalar kernels. It really saves a lot of repetition in kernel implementations.
You can find a simple vector kernel example in `compute/kernels/vector_sort.cc`. Take a look especially at `RegisterVectorSort`, `AddSortingKernels` and `SortIndices`.
I'm afraid you are mistaking what is meant by "ScalarFunction" and "VectorFunction". "Scalar" and "Vector" refer to the semantics of the function, not the implementation.
The only requirement for these string functions is that you provide `std::function<void(KernelContext*, const ExecBatch&, Datum*)>` that implements the kernel. What is inside can be anything. The function is still a ScalarFunction, though.
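As a rough illustration of that point, the sketch below uses toy stand-in types (`ToyKernelContext`, `ToyExecBatch`, `ToyDatum`), which are hypothetical and deliberately much simpler than Arrow's `KernelContext`, `ExecBatch` and `Datum`. The callable has the same shape as the signature quoted above and processes the whole batch in one call, yet its semantics stay elementwise: output value i depends only on input value i.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are NOT the Arrow types.
struct ToyKernelContext {};
struct ToyExecBatch {
  std::vector<std::string> values;
};
struct ToyDatum {
  std::vector<std::string> values;
};

// Same shape as the exec signature quoted above, using the toy types.
using ToyKernelExec =
    std::function<void(ToyKernelContext*, const ToyExecBatch&, ToyDatum*)>;

// Returns a batch-level implementation of an elementwise ("scalar") function:
// it sees every value of the batch at once, but output[i] is computed from
// input[i] alone and the output has the same length as the input.
inline ToyKernelExec MakeAsciiLowerExec() {
  return [](ToyKernelContext*, const ToyExecBatch& batch, ToyDatum* out) {
    assert(out != nullptr);
    out->values = batch.values;  // copy once, then lowercase in place
    for (std::string& s : out->values) {
      for (char& ch : s) {
        if (ch >= 'A' && ch <= 'Z') ch = static_cast<char>(ch + ('a' - 'A'));
      }
    }
  };
}
```

The toy version still copies strings for brevity; the point is only that "scalar" describes what the function computes, while the body is free to work on the batch as a whole.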