Description
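Reading a selection of columns with feather.read_table (passing columns=...) is drastically slower on Arrow 7.0.0-9.0.0 than it was on 6.0.0: in the benchmarks below, 33.8 ms versus 224 µs for the same file. Reading the whole file without the columns argument is unaffected.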
Profiling the code below shows that the bottleneck is somewhere in the read_names function of pyarrow._feather.FeatherReader.
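The report does not say which profiler was used. As a minimal sketch, assuming the standard-library cProfile is good enough, a read can be profiled as shown here. The helper names profile_call and profile_feather_read are hypothetical, not part of pyarrow. Note that functions implemented in Cython, such as those in pyarrow._feather, may show up only as opaque built-in calls in the report.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def profile_feather_read(path, columns):
    """Hypothetical helper: profile a column-selecting Feather read.

    pyarrow is imported lazily so profile_call stays usable without it.
    """
    from pyarrow import feather

    _, report = profile_call(
        feather.read_table, path, columns=columns, memory_map=True
    )
    return report
```

Calling print(profile_feather_read('test.feather', ['c0', 'c1'])) after running the setup code should list the most expensive calls first.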
Example
Setup code:
import pandas as pd
from pyarrow import feather
rows, cols = (1_000_000, 10)
data = {f'c{c}': range(rows) for c in range(cols)}
df = pd.DataFrame(data=data)
feather.write_feather(df, 'test.feather', compression="uncompressed")
Benchmarks Arrow 9.0.0:
%timeit feather.read_table('test.feather', memory_map=True)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 178 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
33.8 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Benchmarks Arrow 6.0.0:
%timeit feather.read_table('test.feather', memory_map=True)
%timeit feather.read_table('test.feather', columns=list(df.columns), memory_map=True)
> 173 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
224 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Environment: Python 3.9, Ubuntu 20.04
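The %timeit measurements above need IPython. A rough equivalent for a plain Python script, using the standard-library timeit module, might look like the sketch below. The helpers best_time, slowdown and benchmark_feather are hypothetical names introduced for illustration, and the repeat and loop counts are arbitrary choices, so the absolute numbers will not match %timeit exactly.

```python
import timeit


def best_time(func, repeat=5, number=10):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def slowdown(slow_seconds, fast_seconds):
    """How many times slower the slow measurement is than the fast one."""
    return slow_seconds / fast_seconds


def benchmark_feather(path, columns):
    """Time a full read and a column-selecting read of the same file.

    pyarrow is imported lazily so the timing helpers work without it.
    Returns (full_read_seconds, column_read_seconds).
    """
    from pyarrow import feather

    full = best_time(lambda: feather.read_table(path, memory_map=True))
    selected = best_time(
        lambda: feather.read_table(path, columns=columns, memory_map=True)
    )
    return full, selected


# Slowdown implied by the column-read timings reported above:
# 33.8 ms on Arrow 9.0.0 versus 224 µs on Arrow 6.0.0.
reported_ratio = slowdown(33.8e-3, 224e-6)
print(f"column read is about {reported_ratio:.0f}x slower")
```

Running benchmark_feather('test.feather', list(df.columns)) under each Arrow version gives the two numbers to compare.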
Reporter: Håkon Magne Holmen
Related issues:
- [C++] Implement a read range process without caching (is related to)
Note: This issue was originally created as ARROW-17913. Please see the migration documentation for further details.