Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions cpp/src/arrow/compute/kernels/scalar_string.cc
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,16 @@ struct AsciiUpper {
}
};

struct AsciiLower {
template <typename... Ignored>
static std::string Call(KernelContext*, const util::string_view& val) {
std::string result = val.to_string();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is completely inefficient. We should write directly into the allocated array. The way this is architected needs rethinking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @xhochy and I discussed this above. The question is, do we want a few functions in, and then start iterating on a strategy (for ascii it's trivial, for utf8 less so), or not, I'm happy to explore different strategies first as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should start finding a sane strategy right now for the existing ascii_upper kernel. Indeed, a different strategy will be needed for utf8 upper.

In both cases, this probably means implementing those kernels as "vector" kernels, not "scalar" kernels (because the latter implies you only process one item at a time).

Copy link
Member

@pitrou pitrou Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ascii_upper, the strategy should be simple:

  • reuse the same null bitmap buffer
  • recompute a similar offsets buffer, but without the base offset (or reuse the same offsets buffer, if the base offset is 0)
  • allocate a same-size values buffer, and convert it in one pass (no need to iterate over individual array elements)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Happy to take a look into turning this into a 'vector' kernel, if you have some pointer that would be great. This also answers a question I had if it would be possible to operate on a full array/chunk in the existing framework, I guess that's a yes.

Any reason not to reuse the offset buffer, even if the offset is not 0?

What is the advantage of scalar kernels in Arrow? Apart from avoiding boilerplate? Also wondering if this is related to the Gandiva connection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding boilerplate is the main motivation for scalar kernels. It really saves a lot of repetition in kernel implementations.

You can find a simple vector kernel example in compute/kernels/vector_sort.cc. Take a look especially at RegisterVectorSort, AddSortingKernels and SortIndices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid you are mistaking what is meant by "ScalarFunction" and "VectorFunction". "Scalar" and "Vector" refer to the semantics of the function, not the implementation.

The only requirement for these string functions is that you provide std::function<void(KernelContext*, const ExecBatch&, Datum*)> that implements the kernel. What is inside can be anything. The function is still a ScalarFunction though

std::transform(result.begin(), result.end(), result.begin(),
[](unsigned char c) { return std::tolower(c); });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't use std::tolower as it has locale-dependent behaviour. Instead, just create a hardcoded lookup table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please fix AsciiUpper similarly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I used std::toupper just for illustration purposes, not because I thought it was the correct way to implement the kernel.

return result;
}
};

void AddAsciiLength(FunctionRegistry* registry) {
auto func = std::make_shared<ScalarFunction>("ascii_length", Arity::Unary());
ArrayKernelExec exec_offset_32 =
Expand Down Expand Up @@ -108,6 +118,7 @@ void AddStrptime(FunctionRegistry* registry) {

void RegisterScalarStringAscii(FunctionRegistry* registry) {
MakeUnaryStringToString<AsciiUpper>("ascii_upper", registry);
MakeUnaryStringToString<AsciiLower>("ascii_lower", registry);
AddAsciiLength(registry);
AddStrptime(registry);
}
Expand Down
5 changes: 5 additions & 0 deletions cpp/src/arrow/compute/kernels/scalar_string_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,11 @@ TYPED_TEST(TestStringKernels, AsciiUpper) {
"[\"AAA&\", null, \"\", \"B\"]");
}

TYPED_TEST(TestStringKernels, AsciiLower) {
this->CheckUnary("ascii_lower", "[\"aAa&\", null, \"\", \"b\"]", this->string_type(),
"[\"aaa&\", null, \"\", \"b\"]");
}

TYPED_TEST(TestStringKernels, Strptime) {
std::string input1 = R"(["5/1/2020", null, "12/11/1900"])";
std::string output1 = R"(["2020-05-01", null, "1900-12-11"])";
Expand Down
1 change: 1 addition & 0 deletions python/pyarrow/compute.py
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,7 @@ def func(arg):

ascii_length = _simple_unary_function('ascii_length')
ascii_upper = _simple_unary_function('ascii_upper')
ascii_lower = _simple_unary_function('ascii_lower')


def sum(array):
Expand Down