Closed
Description
On the current master branch:
$ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet
...
lseek(3, -8, SEEK_END) = 2937
read(3, ",\10\0\0PAR1", 8192) = 8
lseek(3, 845, SEEK_SET) = 845
read(3, "\25\2\31\334H schema"..., 8192) = 2100
...
lseek(5, 4, SEEK_SET) = 4
read(5, "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000"..., 8192) = 2941
lseek(5, 5, SEEK_SET) = 5
read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000"..., 8192) = 2940
lseek(5, 6, SEEK_SET) = 6
read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000"..., 8192) = 2939
lseek(5, 7, SEEK_SET) = 7
read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000000000000000"..., 8192) = 2938
lseek(5, 8, SEEK_SET) = 8
read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020000000000000000"..., 8192) = 2937
lseek(5, 9, SEEK_SET) = 9
read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004"..., 8192) = 2936
lseek(5, 10, SEEK_SET) = 10
read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\0200000000000000004\30"..., 8192) = 2935

Notice that the seek position is incremented by one on every call, even though each read requests up to 8192 bytes. Interestingly, this does not seem to have a big performance impact on a local file system on Linux, but it becomes a problem when working with a custom implementation of ParquetReader, for example one that reads from S3.
The problem seems to be in
impl<R: ParquetReader> Read for FileSource<R>
which is unconditionally calling
reader.seek(SeekFrom::Start(self.start as u64))?
Instead, it should probably keep track of the current position and only seek on the first read.
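To illustrate the idea, here is a minimal sketch of a source that remembers where the underlying reader is and only seeks when that position differs from the next offset it wants to read. It is not the actual FileSource code: `PositionTrackingSource`, `CountingReader` and `read_window` are hypothetical names made up for this example, it uses plain `Read + Seek` instead of `ParquetReader`, and it assumes the source has exclusive use of the reader (if the reader is shared with other sources, the remembered position cannot be trusted).

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical wrapper that counts seek calls; stands in for a reader
/// where every seek is expensive (for example a remote object store).
pub struct CountingReader<R> {
    inner: R,
    pub seeks: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        self.seeks += 1;
        self.inner.seek(pos)
    }
}

/// Hypothetical stand-in for FileSource: a window [start, end) over a reader.
/// Assumption: nothing else moves the reader while this source is in use.
pub struct PositionTrackingSource<R> {
    reader: R,
    start: u64,              // next absolute offset to read
    end: u64,                // exclusive end of the window
    reader_pos: Option<u64>, // last known reader position, None if unknown
}

impl<R: Read + Seek> PositionTrackingSource<R> {
    pub fn new(reader: R, start: u64, end: u64) -> Self {
        Self { reader, start, end, reader_pos: None }
    }
}

impl<R: Read + Seek> Read for PositionTrackingSource<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let remaining = self.end.saturating_sub(self.start);
        let to_read = (buf.len() as u64).min(remaining) as usize;
        if to_read == 0 {
            return Ok(0);
        }
        // Treat the position as unknown until this read succeeds, so a
        // failed seek or read forces a fresh seek on the next call.
        let known = self.reader_pos.take();
        if known != Some(self.start) {
            self.reader.seek(SeekFrom::Start(self.start))?;
        }
        let n = self.reader.read(&mut buf[..to_read])?;
        self.start += n as u64;
        self.reader_pos = Some(self.start);
        Ok(n)
    }
}

/// Reads the whole window in `chunk`-sized reads and returns the bytes
/// together with the number of seeks issued on the underlying reader.
pub fn read_window(
    data: Vec<u8>,
    start: u64,
    end: u64,
    chunk: usize,
) -> io::Result<(Vec<u8>, usize)> {
    let counting = CountingReader { inner: Cursor::new(data), seeks: 0 };
    let mut source = PositionTrackingSource::new(counting, start, end);
    let mut out = Vec::new();
    let mut buf = vec![0u8; chunk];
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&buf[..n]);
    }
    Ok((out, source.reader.seeks))
}
```

With this approach, calling `read_window` on an in-memory buffer with a small chunk size should issue a single seek for the whole window instead of one per read. Whether the same bookkeeping is safe inside FileSource depends on how the reader is shared there, which this sketch does not model.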
Reporter: Jörn Horstmann / @jhorstmann
Note: This issue was originally created as ARROW-7574. Please see the migration documentation for further details.