
[C++] Performance of building up HashTable (MemoTable) in is_in kernel #36059

@jorisvandenbossche

Description


From some benchmarking in pandas, @phofl noticed that Arrow's "is_in" implementation is considerably slower than pandas' (khash-based) implementation when the value_set is large (and thus when execution time is mostly dominated by building the hash table rather than by the actual lookups).

A small illustration comparing pc.is_in with pandas.core.algorithms.isin (the equivalent pandas function), using a large value_set and a small array (so that we are mostly timing the hash table creation):

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

# small lookup array, large value_set -> timing dominated by hash table creation
arr = pa.array(np.random.choice(np.arange(1000), 100))
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
    print(n)
    value_set = pa.array(np.arange(n))
    %timeit pc.is_in(arr, value_set)
    %timeit pd.core.algorithms.isin(np.asarray(arr), np.asarray(value_set))

gives

value_set size   pc.is_in            pandas isin
10,000           158 µs ± 581 ns     66.8 µs ± 473 ns
100,000          8.51 ms ± 110 µs    627 µs ± 41.8 µs
1,000,000        103 ms ± 8.74 ms    28.5 ms ± 4.09 ms
10,000,000       1.26 s ± 33 ms      171 ms ± 22 ms

(each timing is the mean ± std. dev. of 7 runs)

So pandas is often 4-10x faster. I am not familiar with our HashTable implementation, nor whether this is a fully fair comparison, but it seems to be at least an indication that there is room for performance improvement (and I suppose this might apply to HashTable-based functions in general, not just "is_in").
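For context, here is a minimal pure-Python model of the two phases of an "is_in" kernel. This is only an illustration of the reasoning above, not Arrow's actual MemoTable or pandas' khash code, and the function name is_in_model is made up:

def is_in_model(arr, value_set):
    # Build phase: one hash-table insert per element of value_set.
    # With a large value_set and a small arr, this phase dominates the runtime;
    # it is the part where Arrow appears slower than pandas' khash-based table.
    table = set()
    for v in value_set:
        table.add(v)
    # Probe phase: one lookup per element of arr.
    return [x in table for x in arr]

# e.g. is_in_model([1, 5, 42], range(1_000_000)) spends almost all of its
# time in the build loop, not in the lookups.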
