Description
We are using Shrine to handle large CSV uploads that are then stream-processed. We have a progress meter for this which works off the underlying IO object's #pos values. For local files, this works perfectly. Once we went into our Staging environment with S3 as the storage engine, using Down under the hood, it all broke. It seems that after the first 1K of data, Down::ChunkedIO#pos starts returning values much, much higher than they should be - far beyond the end of the file.
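For context, the progress figure is computed roughly like this (a minimal sketch; the method name and variables here are illustrative rather than our actual code):

# Minimal sketch of the #pos-based progress calculation (illustrative only).
# io is the underlying IO object; total_size is the byte size reported for the file.
def progress_percentage(io, total_size)
  return 0.0 if total_size.zero?

  # #pos is expected to be the number of bytes consumed so far,
  # so the result should never exceed 100%.
  (io.pos.to_f / total_size * 100.0).round(1)
end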
For a particular test file of only 3669 bytes comprising around 55 CSV rows plus header, the size reported by the IO object was consistently correct. However, inside the CSV row iterator, the results of #pos were:
0
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
1024
3736
6268
8732
11134
13466
15730
17923
20045
22103
24087
26017
27888
29698
31455
33155
34794
36363
37878
39313
40687
41998
43249
44431
45549
46598
47581
48498
49349
50137
50861
51519
52117
52656
53138
53562
53924
54220
54465
54647
54774
54840
54840
The start offset is 0. The 1024 offset was presumed to be a chunk size used by the CSV parser, but when I rewound to zero and read 1024 bytes, I actually got a rather strange 1057 bytes back, perfectly aligned to a row end, instead. In any event, #pos then sits at 1024 for a while, and once the CSV parsing seems to have gone past that first "chunk" - be it 1024 or 1057 bytes - the positions reported become, as you can see, very wrong.
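For reference, the rewind check mentioned above amounted to something like this (a sketch; the exact read call we used may have differed slightly):

# Sketch of the rewind-and-read check (illustrative only).
io_obj.rewind
chunk = io_obj.read(1024)   # ask for the first 1024 bytes
puts chunk.bytesize         # came back as 1057, ending exactly on a row boundary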
The position listing above was generated with no rewinding or other shenanigans (the rewind check was a separate, one-off experiment); in pseudocode we have:
# shrine_file is our Shrine subclass instance representing the S3 object.
# The encoding specifier is typically UTF-8.
#
# Inside the block, io_obj is the Down::ChunkedIO instance.

require "csv"

options = { headers: true, header_converters: :symbol, liberal_parsing: true }

shrine_file.open(encoding: encoding_specifier) do |io_obj|
  csv = CSV.new(io_obj, **options)

  csv.each do |row|
    puts io_obj.pos
  end
end
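One way to narrow this down further, bypassing CSV entirely, might be something along these lines (a sketch only; CHUNK is an arbitrary read size chosen for illustration, and shrine_file / encoding_specifier are the same as above):

# Read the Down::ChunkedIO directly in fixed-size chunks and compare #pos
# against the number of bytes actually consumed. If #pos drifts here as well,
# the CSV parser is not a factor.
CHUNK = 512

shrine_file.open(encoding: encoding_specifier) do |io_obj|
  consumed = 0

  while (data = io_obj.read(CHUNK))
    consumed += data.bytesize
    puts "pos=#{io_obj.pos} consumed=#{consumed}"
  end
end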