
[C++] Performance of building up HashTable (MemoTable) in is_in kernel #36059

@jorisvandenbossche

Description


From some benchmarking in pandas, @phofl noticed that Arrow's "is_in" implementation is considerably slower than pandas' (khash-based) implementation when the value_set is large (and thus when execution time is mostly dominated by building the hash table rather than by the actual lookups).

A small illustration comparing pc.is_in with pandas.core.algorithms.isin (the equivalent pandas function), using a large value_set and a small array (so that we are mostly timing the hash table creation):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

# small lookup array, large value_set -> timing dominated by hash table creation
arr = pa.array(np.random.choice(np.arange(1000), 100))
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
    print(n)
    value_set = pa.array(np.arange(n))
    %timeit pc.is_in(arr, value_set)
    %timeit pd.core.algorithms.isin(np.asarray(arr), np.asarray(value_set))

gives

value_set size   pc.is_in            pandas isin
10,000           158 µs ± 581 ns     66.8 µs ± 473 ns
100,000          8.51 ms ± 110 µs    627 µs ± 41.8 µs
1,000,000        103 ms ± 8.74 ms    28.5 ms ± 4.09 ms
10,000,000       1.26 s ± 33 ms      171 ms ± 22 ms

(each timing is the mean ± std. dev. of 7 runs)

So pandas is often 4-10x faster. I am not familiar with our HashTable implementation, nor whether this is a fully fair comparison, but it seems to be at least an indication that there is room for performance improvement (and I suppose this might apply to HashTable-based functions in general, not just "is_in").
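For context, here is a minimal pure-Python model of the two phases of an "is_in" kernel. This is only an illustration of the reasoning above, not Arrow's actual MemoTable or pandas' khash code, and the function name is_in_model is made up:

def is_in_model(arr, value_set):
    # Build phase: one hash-table insert per element of value_set.
    # With a large value_set and a small arr, this phase dominates the runtime;
    # it is the part where Arrow appears slower than pandas' khash-based table.
    table = set()
    for v in value_set:
        table.add(v)
    # Probe phase: one lookup per element of arr.
    return [x in table for x in arr]

# e.g. is_in_model([1, 5, 42], range(1_000_000)) spends almost all of its
# time in the build loop, not in the lookups.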
