Closed
84 commits
282c6ba
add kernel executor for string binary functions
edponce Sep 3, 2021
b390acd
add comments/notes
edponce Sep 4, 2021
e327e62
add second input to MaxCodeunits
edponce Sep 7, 2021
91baa30
add inheritance to have PreExec and InvalidStatus
edponce Sep 7, 2021
673b413
add RepeatOptions
edponce Aug 27, 2021
543174c
add str repeat kernel
edponce Aug 27, 2021
00be5a0
add tests
edponce Aug 27, 2021
cabe6a7
update docs
edponce Aug 27, 2021
2d4fcb6
fix linter error
edponce Aug 28, 2021
dfa931d
add doubling approach and benchmark
edponce Aug 28, 2021
d32574d
remove naive approach and add check for repeats option
edponce Aug 29, 2021
8daa01f
add support for array of repeats
edponce Aug 30, 2021
2e22648
add RepeatOptions to PyArrow
edponce Aug 30, 2021
8e394c6
add R bindings
edponce Aug 30, 2021
8f3c977
set repeats to std::vector<int>
edponce Aug 30, 2021
f895769
update pyarrow bindings and tests
edponce Aug 30, 2021
512d5b8
fix lint error
edponce Aug 30, 2021
e2744cc
fix typo
edponce Aug 30, 2021
81afdaf
remove RepeatOptions
edponce Sep 4, 2021
b652a07
update kernel to conform to StringBinaryTransformExec
edponce Sep 4, 2021
a527b08
fix lint errors
edponce Sep 4, 2021
7173929
remove benchmark case
edponce Sep 6, 2021
86ed8d3
cast output of string transform
edponce Sep 6, 2021
0a7cea0
remove StrRepeat benchmark
edponce Sep 6, 2021
faf1612
use Unbox to get scalar C values
edponce Sep 6, 2021
b508d89
updates based on StringBinaryTransform updates
edponce Sep 7, 2021
4998c08
fix lint errors
edponce Sep 7, 2021
5626938
remove checks for repeats and array length, update tests
edponce Sep 9, 2021
d4d2df1
add cast to transform output
edponce Sep 9, 2021
830be80
remove RepeatOptions from docs
edponce Sep 9, 2021
0e7bb2f
remove RepeatOptions from R bindings
edponce Sep 9, 2021
fdad773
remove inheritance from StringTransformBase
edponce Sep 9, 2021
c70fb6e
(WIP) resolving reviewer comments
edponce Sep 10, 2021
662722a
remove tests with invalid repeats (negative)
edponce Sep 13, 2021
3b4cbc5
update string binary infrastructure
edponce Sep 13, 2021
e418b3f
remove invalid tests (negative repeats)
edponce Sep 13, 2021
e5e43bd
add <vector> and remove support for negative repeats
edponce Sep 13, 2021
bb2e6d3
still getting undefined ref for XTypes()
edponce Sep 13, 2021
4d47a5d
add virtual destructor as per ARROW-13670
edponce Sep 20, 2021
d20fcd8
rename to long form (string_repeat) and fix test
edponce Sep 27, 2021
60f0a23
update kernel registration and StringTransformBase virtual methods
edponce Oct 18, 2021
00204f3
support negative repeat counts
edponce Oct 18, 2021
e35926d
add transform wrappers for simple/doubling repeat implementations
edponce Oct 18, 2021
1df1c71
support FixedSizeBinary, remove static_casts, and update comments
edponce Oct 19, 2021
2a38b36
use visitor for ArrayScalar exec
edponce Oct 19, 2021
a2b5044
refactor StringBinaryTransformExecBase with visitors
edponce Oct 19, 2021
ac56168
update docs
edponce Oct 19, 2021
54e6d37
support numeric types for repeat count and rename to num_repeat
edponce Oct 19, 2021
871a7f6
add R binding and tests
edponce Oct 19, 2021
9ce7319
delete old FunctionOptions from R binding
edponce Oct 19, 2021
38cc27e
delete pyarrow test
edponce Oct 19, 2021
609f055
remove comment on generic lambdas in C++14
edponce Oct 19, 2021
d0c6130
add boolean test
edponce Oct 19, 2021
092a335
remove implicit casting for repeat count and add validation check for…
edponce Oct 20, 2021
34c366e
update R binding, remove validation
edponce Oct 20, 2021
0da3c47
rebase and add more invalid tests (negative and float repeat count)
edponce Oct 26, 2021
e95fcfd
fix lint errors
edponce Oct 27, 2021
f5d18c5
remove invalid redundant strrep test from R
edponce Oct 27, 2021
a2b1c15
update comments and fix RETURN_NOT_OK
edponce Oct 27, 2021
a994c21
add Status data member to StringTransformBase and checks for invalid …
edponce Oct 27, 2021
156ae79
improve comments, add Status as a parameter, and minor consistency ch…
edponce Oct 30, 2021
454bd49
wrap long lines and rename tests
edponce Nov 1, 2021
72ec626
add Result return types and minor changes
edponce Nov 2, 2021
df17787
add StringRepeat benchmark
edponce Nov 2, 2021
eade790
remove std::function indirection
edponce Nov 2, 2021
8de20dc
add static_cast to Transform return value
edponce Nov 2, 2021
9e28894
update R test
edponce Nov 2, 2021
7c0673d
fix lint error
edponce Nov 2, 2021
f441b15
revert include statement
edponce Nov 2, 2021
225cc68
change xor to subtraction and rename var in doubling approach
edponce Nov 2, 2021
62969b3
rename function to binary_repeat
edponce Nov 2, 2021
6ec68ae
update function name in benchmark
edponce Nov 2, 2021
9dbc421
fix R func name error
edponce Nov 2, 2021
73a6cfc
R changes: input -> .input, transmute -> mutate
edponce Nov 2, 2021
a28f25d
update function name in R expressions
edponce Nov 2, 2021
9ea2d4d
add R str_dup binding and test
edponce Nov 3, 2021
eb72d63
use different vars in R tests
edponce Nov 3, 2021
6aabb4f
add test with num_repeat=null
edponce Nov 3, 2021
cc47c65
use transmute instead of mutate?
edponce Nov 3, 2021
6bce7c2
add str_dup to test title and revert to mutate()
edponce Nov 3, 2021
c555bd8
add L to number range
edponce Nov 3, 2021
29df265
split tests
edponce Nov 3, 2021
c5e83a9
use different var names
edponce Nov 3, 2021
97a15ad
rename R func str_dup
edponce Nov 3, 2021
567 changes: 531 additions & 36 deletions cpp/src/arrow/compute/kernels/scalar_string.cc

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc
@@ -210,6 +210,29 @@ static void BinaryJoinElementWiseArrayArray(benchmark::State& state) {
});
}

static void BinaryRepeat(benchmark::State& state) {
  const int64_t array_length = 1 << 20;
  const int64_t value_min_size = 0;
  const int64_t value_max_size = 32;
  const double null_probability = 0.01;
  const int64_t repeat_min_size = 0;
  const int64_t repeat_max_size = 8;
  random::RandomArrayGenerator rng(kSeed);

  // NOTE: this produces only-Ascii data
  auto values =
      rng.String(array_length, value_min_size, value_max_size, null_probability);
  auto num_repeats = rng.Int64(array_length, repeat_min_size, repeat_max_size, 0);
  // Make sure lookup tables are initialized before measuring
  ABORT_NOT_OK(CallFunction("binary_repeat", {values, num_repeats}));

  for (auto _ : state) {
    ABORT_NOT_OK(CallFunction("binary_repeat", {values, num_repeats}));
  }
  state.SetItemsProcessed(state.iterations() * array_length);
  state.SetBytesProcessed(state.iterations() * values->data()->buffers[2]->size());
}

BENCHMARK(AsciiLower);
BENCHMARK(AsciiUpper);
BENCHMARK(IsAlphaNumericAscii);
@@ -236,5 +259,7 @@ BENCHMARK(BinaryJoinArrayArray);
BENCHMARK(BinaryJoinElementWiseArrayScalar)->RangeMultiplier(8)->Range(2, 128);
BENCHMARK(BinaryJoinElementWiseArrayArray)->RangeMultiplier(8)->Range(2, 128);

BENCHMARK(BinaryRepeat);

} // namespace compute
} // namespace arrow
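
The commit list mentions a "doubling approach" for the repeat kernel, but the scalar_string.cc diff is not rendered on this page. As a rough illustration only, the idea might look like the standalone sketch below. `RepeatByDoubling` is a hypothetical name, the sketch works on `std::string` rather than Arrow buffers, and throwing on a negative count is an assumption modeled on the "Repeat count must be a non-negative integer" error checked in scalar_string_test.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not taken from the PR): repeat `input` `num_repeats`
// times by repeatedly copying the already-written prefix of the output.
std::string RepeatByDoubling(const std::string& input, int64_t num_repeats) {
  if (num_repeats < 0) {
    // Assumption: mirror the error message expected by the tests.
    throw std::invalid_argument("Repeat count must be a non-negative integer");
  }
  if (num_repeats == 0 || input.empty()) return std::string();

  const size_t total = input.size() * static_cast<size_t>(num_repeats);
  std::string out(total, '\0');

  // Seed the output with a single copy of the input.
  std::memcpy(&out[0], input.data(), input.size());
  size_t written = input.size();

  // Copy the written prefix onto the tail; the prefix doubles each round
  // until the last round, which copies only the remaining bytes.
  while (written < total) {
    const size_t chunk = std::min(written, total - written);
    std::memcpy(&out[written], out.data(), chunk);  // ranges never overlap
    written += chunk;
  }
  assert(written == total);
  return out;
}
```

With this layout the number of `memcpy` calls grows logarithmically with the repeat count instead of linearly, which is presumably why the benchmark above exists to compare approaches.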
79 changes: 71 additions & 8 deletions cpp/src/arrow/compute/kernels/scalar_string_test.cc
@@ -17,6 +17,8 @@

#include <memory>
#include <string>
#include <utility>
#include <vector>

#include <gmock/gmock.h>
#include <gtest/gtest.h>
@@ -26,8 +28,10 @@
#endif

#include "arrow/compute/api_scalar.h"
#include "arrow/compute/kernels/codegen_internal.h"
#include "arrow/compute/kernels/test_util.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/type.h"

namespace arrow {
namespace compute {
@@ -64,14 +68,6 @@ class BaseTestStringKernels : public ::testing::Test {
CheckScalar(func_name, {Datum(input)}, Datum(expected), options);
}

void CheckBinaryScalar(std::string func_name, std::string json_left_input,
std::string json_right_scalar, std::shared_ptr<DataType> out_ty,
std::string json_expected,
const FunctionOptions* options = nullptr) {
CheckScalarBinaryScalar(func_name, type(), json_left_input, json_right_scalar, out_ty,
json_expected, options);
}

void CheckVarArgsScalar(std::string func_name, std::string json_input,
std::shared_ptr<DataType> out_ty, std::string json_expected,
const FunctionOptions* options = nullptr) {
@@ -1041,6 +1037,73 @@ TYPED_TEST(TestStringKernels, Utf8Title) {
R"([null, "", "B", "Aaaz;Zææ&", "Ɑɽɽow", "Ii", "Ⱥ.Ⱥ.Ⱥ..Ⱥ", "Hello, World!", "Foo Bar;Héhé0Zop", "!%$^.,;"])");
}

TYPED_TEST(TestStringKernels, BinaryRepeatWithScalarRepeat) {
  auto values = ArrayFromJSON(this->type(),
                              R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI",
                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
  std::vector<std::pair<int, std::string>> nrepeats_and_expected{{
      {0, R"(["", null, "", "", "", "", "", "", "", ""])"},
      {1, R"(["aAazZæÆ&", null, "", "b", "ɑɽⱤoW", "ıI", "ⱥⱥⱥȺ", "hEllO, WoRld!",
              "$. A3", "!ɑⱤⱤow"])"},
      {4, R"(["aAazZæÆ&aAazZæÆ&aAazZæÆ&aAazZæÆ&", null, "", "bbbb",
              "ɑɽⱤoWɑɽⱤoWɑɽⱤoWɑɽⱤoW", "ıIıIıIıI", "ⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺⱥⱥⱥȺ",
              "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!",
              "$. A3$. A3$. A3$. A3", "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])"},
  }};

  for (const auto& pair : nrepeats_and_expected) {
    auto num_repeat = pair.first;
    auto expected = pair.second;
    for (const auto& ty : IntTypes()) {
      this->CheckVarArgs("binary_repeat",
                         {values, Datum(*arrow::MakeScalar(ty, num_repeat))},
                         this->type(), expected);
    }
  }

  // Negative repeat count
  for (auto num_repeat_ : {-1, -2, -5}) {
    auto num_repeat = *arrow::MakeScalar(int64(), num_repeat_);
    EXPECT_RAISES_WITH_MESSAGE_THAT(
        Invalid, ::testing::HasSubstr("Repeat count must be a non-negative integer"),
        CallFunction("binary_repeat", {values, num_repeat}));
  }

  // Floating-point repeat count
  for (auto num_repeat_ : {0.0, 1.2, -1.3}) {
    auto num_repeat = *arrow::MakeScalar(float64(), num_repeat_);
    EXPECT_RAISES_WITH_MESSAGE_THAT(
        NotImplemented, ::testing::HasSubstr("has no kernel matching input types"),
        CallFunction("binary_repeat", {values, num_repeat}));
  }
}

TYPED_TEST(TestStringKernels, BinaryRepeatWithArrayRepeat) {
  auto values = ArrayFromJSON(this->type(),
                              R"([null, "aAazZæÆ&", "", "b", "ɑɽⱤoW", "ıI",
                                  "ⱥⱥⱥȺ", "hEllO, WoRld!", "$. A3", "!ɑⱤⱤow"])");
  for (const auto& ty : IntTypes()) {
    auto num_repeats = ArrayFromJSON(ty, R"([100, 1, 2, 5, 2, 0, 1, 3, null, 3])");
    std::string expected =
        R"([null, "aAazZæÆ&", "", "bbbbb", "ɑɽⱤoWɑɽⱤoW", "", "ⱥⱥⱥȺ",
            "hEllO, WoRld!hEllO, WoRld!hEllO, WoRld!", null,
            "!ɑⱤⱤow!ɑⱤⱤow!ɑⱤⱤow"])";
    this->CheckVarArgs("binary_repeat", {values, num_repeats}, this->type(), expected);
  }

  // Negative repeat count
  auto num_repeats = ArrayFromJSON(int64(), R"([100, -1, 2, -5, 2, -1, 3, -2, 3, -100])");
  EXPECT_RAISES_WITH_MESSAGE_THAT(
      Invalid, ::testing::HasSubstr("Repeat count must be a non-negative integer"),
      CallFunction("binary_repeat", {values, num_repeats}));

  // Floating-point repeat count
  num_repeats = ArrayFromJSON(float64(), R"([0.0, 1.2, -1.3])");
  EXPECT_RAISES_WITH_MESSAGE_THAT(
      NotImplemented, ::testing::HasSubstr("has no kernel matching input types"),
      CallFunction("binary_repeat", {values, num_repeats}));
}

TYPED_TEST(TestStringKernels, IsAlphaNumericUnicode) {
// U+08BE (utf8: \xE0\xA2\xBE) is undefined, but utf8proc thinks it is
// UTF8PROC_CATEGORY_LO
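
The two `BinaryRepeat*` tests above pin down the observable behavior of `binary_repeat`: a null in either input yields a null output, a count of zero yields an empty string, and a negative count is an error. A minimal reference model of those rules, independent of Arrow, could look like the sketch below. `NullableString`, `NullableCount`, and `BinaryRepeatReference` are hypothetical names introduced for illustration, and checking nulls before the sign of the count is an assumption the tests do not settle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for nullable Arrow array slots.
struct NullableString {
  bool valid;
  std::string value;
};
struct NullableCount {
  bool valid;
  int64_t value;
};

// Element-wise reference model of the behavior exercised by the tests.
std::vector<NullableString> BinaryRepeatReference(
    const std::vector<NullableString>& values,
    const std::vector<NullableCount>& num_repeats) {
  if (values.size() != num_repeats.size()) {
    throw std::invalid_argument("Inputs must have the same length");
  }
  std::vector<NullableString> out;
  out.reserve(values.size());
  for (size_t i = 0; i < values.size(); ++i) {
    // Assumption: a null on either side produces null before any validation.
    if (!values[i].valid || !num_repeats[i].valid) {
      out.push_back({false, ""});
      continue;
    }
    if (num_repeats[i].value < 0) {
      throw std::invalid_argument("Repeat count must be a non-negative integer");
    }
    std::string repeated;
    repeated.reserve(values[i].value.size() *
                     static_cast<size_t>(num_repeats[i].value));
    for (int64_t r = 0; r < num_repeats[i].value; ++r) {
      repeated += values[i].value;
    }
    out.push_back({true, repeated});
  }
  assert(out.size() == values.size());
  return out;
}
```

A scalar repeat count, as in `BinaryRepeatWithScalarRepeat`, corresponds to passing the same `NullableCount` for every slot.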
48 changes: 48 additions & 0 deletions cpp/src/arrow/util/bit_block_counter.h
@@ -491,6 +491,54 @@ static void VisitBitBlocksVoid(const std::shared_ptr<Buffer>& bitmap_buf, int64_
}
}

template <typename VisitNotNull, typename VisitNull>
static Status VisitTwoBitBlocks(const std::shared_ptr<Buffer>& left_bitmap_buf,
                                int64_t left_offset,
                                const std::shared_ptr<Buffer>& right_bitmap_buf,
                                int64_t right_offset, int64_t length,
                                VisitNotNull&& visit_not_null, VisitNull&& visit_null) {
  if (left_bitmap_buf == NULLPTR || right_bitmap_buf == NULLPTR) {
    // At most one bitmap is present
    if (left_bitmap_buf == NULLPTR) {
      return VisitBitBlocks(right_bitmap_buf, right_offset, length,
                            std::forward<VisitNotNull>(visit_not_null),
                            std::forward<VisitNull>(visit_null));
    } else {
      return VisitBitBlocks(left_bitmap_buf, left_offset, length,
                            std::forward<VisitNotNull>(visit_not_null),
                            std::forward<VisitNull>(visit_null));
    }
  }
  // Both bitmaps are present
  const uint8_t* left_bitmap = left_bitmap_buf->data();
  const uint8_t* right_bitmap = right_bitmap_buf->data();
  BinaryBitBlockCounter bit_counter(left_bitmap, left_offset, right_bitmap, right_offset,
                                    length);
  int64_t position = 0;
  while (position < length) {
    BitBlockCount block = bit_counter.NextAndWord();
    if (block.AllSet()) {
      for (int64_t i = 0; i < block.length; ++i, ++position) {
        ARROW_RETURN_NOT_OK(visit_not_null(position));
      }
    } else if (block.NoneSet()) {
      for (int64_t i = 0; i < block.length; ++i, ++position) {
        ARROW_RETURN_NOT_OK(visit_null());
      }
    } else {
      for (int64_t i = 0; i < block.length; ++i, ++position) {
        if (BitUtil::GetBit(left_bitmap, left_offset + position) &&
            BitUtil::GetBit(right_bitmap, right_offset + position)) {
          ARROW_RETURN_NOT_OK(visit_not_null(position));
        } else {
          ARROW_RETURN_NOT_OK(visit_null());
        }
      }
    }
  }
  return Status::OK();
}

template <typename VisitNotNull, typename VisitNull>
static void VisitTwoBitBlocksVoid(const std::shared_ptr<Buffer>& left_bitmap_buf,
int64_t left_offset,
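
`VisitTwoBitBlocks` above walks two validity bitmaps together and treats a slot as non-null only when both bits are set, with fast paths for blocks that are entirely valid or entirely null. The control flow can be tried outside Arrow with the simplified sketch below. `GetBit` and `VisitBothValid` are hypothetical stand-ins, the callbacks return `void` instead of `Status`, and the per-block count is computed bit by bit for readability, whereas `BinaryBitBlockCounter` presumably operates on whole words.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: read bit `i` from a least-significant-bit-first bitmap,
// the bit order used by Arrow validity bitmaps.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return ((bits[i >> 3] >> (i & 7)) & 1) != 0;
}

// Simplified model of the three-way block dispatch in VisitTwoBitBlocks.
template <typename VisitNotNull, typename VisitNull>
void VisitBothValid(const uint8_t* left, int64_t left_offset, const uint8_t* right,
                    int64_t right_offset, int64_t length,
                    VisitNotNull&& visit_not_null, VisitNull&& visit_null) {
  assert(length >= 0);
  const int64_t kBlockSize = 64;
  int64_t position = 0;
  while (position < length) {
    const int64_t block_length = std::min<int64_t>(kBlockSize, length - position);
    // Count the slots in this block that are valid in both inputs.
    int64_t both_set = 0;
    for (int64_t i = 0; i < block_length; ++i) {
      if (GetBit(left, left_offset + position + i) &&
          GetBit(right, right_offset + position + i)) {
        ++both_set;
      }
    }
    if (both_set == block_length) {
      // Every slot is valid: no per-slot bit checks needed.
      for (int64_t i = 0; i < block_length; ++i, ++position) visit_not_null(position);
    } else if (both_set == 0) {
      // Every slot is null.
      for (int64_t i = 0; i < block_length; ++i, ++position) visit_null();
    } else {
      // Mixed block: test each slot individually.
      for (int64_t i = 0; i < block_length; ++i, ++position) {
        if (GetBit(left, left_offset + position) &&
            GetBit(right, right_offset + position)) {
          visit_not_null(position);
        } else {
          visit_null();
        }
      }
    }
  }
}
```

As in the original, `visit_not_null` receives the logical position within the visited range, while `visit_null` takes no argument.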