ARROW-16000: [C++][Python] Dataset: Alternative implementation for adding transcoding function option to CSV scanner #13820
Conversation
lidavidm
left a comment
Ok, I agree with Antoine - this approach is cleaner.
python/pyarrow/_dataset.pyx
Outdated
nit: redundant parens
Ok, I'll close the other PR then, and let's focus on this. I'm seeing some errors in pytest, but I'm also getting those on the commit I branched off of. More concerning is this error in a Windows build, https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/44413284/job/ic6l7a0ehrnpk21g#L2654 : it seems the linker is not able to locate the added C++ function.
There is 1 failure in the CI; it seems to have been a spurious time-out.
lidavidm
left a comment
I suppose what's left is adding a test of the transcoder?
I added a test by copying and modifying the one for the CSV reader, but ran into the following problem: […] This looks to me like a different problem, with detecting the schema of a UTF-16 encoded file. Should I try to create a reproducible example and file a new JIRA, or is this something we should address here?
That sounds like dataset inspection is being done without the transcoder actually being set. I think we do need the test to work. I would expect latin-1 happens to work because the header row has identical encoding between UTF-8 and latin-1.
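The latin-1 observation can be checked directly. This is an illustration (not code from the PR): an ASCII-only header row is byte-identical in UTF-8 and latin-1, so schema inspection accidentally succeeds, while a non-ASCII column name produces latin-1 bytes that are not valid UTF-8.

```python
# ASCII-only header: identical bytes in UTF-8 and latin-1,
# so uninformed schema inspection happens to work.
header = "col1,col2"
assert header.encode("utf-8") == header.encode("latin-1")

# A non-ASCII column name diverges: its latin-1 bytes are invalid UTF-8.
try:
    "café".encode("latin-1").decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```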
I added a test with a non-UTF-8 character in the column name (latin-1). That works, so it looks like something a bit more specific to UTF-16. I'll investigate further.
After having a better look, here's what seems to be happening: […]
I've removed those 2 additional checks, and now just check that the data is transcoded properly. The 2nd check is still present in the new […]
Hi guys, I think the current state should be OK; the failures seem unrelated to me. Is there anything else left for me to do?
lidavidm
left a comment
Sorry for the delay. This looks good to me. @pitrou what do you think about the approach here?
pitrou
left a comment
The approach looks fine to me. Just a couple comments.
python/pyarrow/_dataset.pyx
Outdated
This does not need to be visible to the user, how about renaming it to stress it's an internal detail?
Suggested change: `public ReadOptions read_options_py` → `public ReadOptions _read_options_py`
Thanks! Changed in b9982c8.
python/pyarrow/_dataset.pyx
Outdated
Same here.
Changed in b9982c8
python/pyarrow/_dataset.pyx
Outdated
Note that there can be aliases, for example:

```python
>>> codecs.lookup('utf-8').name
'utf-8'
>>> codecs.lookup('utf8').name
'utf-8'
>>> codecs.lookup('UTF8').name
'utf-8'
```
I've added a lookup to deal with this in 1e621fa
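As a sketch of what such a lookup could look like (hypothetical helper name, not the PR's actual code): the codec registry canonicalizes aliases like `'utf8'` and `'UTF8'` to a single name, so comparing normalized names avoids false mismatches.

```python
import codecs

def needs_transcoding(encoding: str) -> bool:
    # Normalize the user-supplied name through the codec registry so
    # aliases ('utf8', 'UTF8', 'UTF-8', ...) all compare equal.
    return codecs.lookup(encoding).name != "utf-8"

assert not needs_transcoding("UTF8")
assert not needs_transcoding("utf-8")
assert needs_transcoding("iso8859")   # an alias for latin-1
```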
python/pyarrow/tests/test_dataset.py
Outdated
`expected_table` here is unused.
Thanks for catching that, it's been removed in b3ac697
Instead of duplicating the encoding field in the `CsvFileFormat`, we store the encoding in a private field in the `CsvFragmentScanOptions`. In that class, the `read_options.encoding` field gets lost when the object is initialized from the C struct (which doesn't have an encoding field), so when the `read_options` are read, we restore it again.
It needs to be stored in both `CsvFileFormat` and `CsvFragmentScanOptions` because, if the user holds references to these as separate objects, they would otherwise become inconsistent: one would report the default `'utf8'` (forgetting the user's encoding choice), while the other would still report the requested encoding. It would then be unclear to the user which of these values the transcoding would eventually use.
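The pattern can be sketched in plain Python with hypothetical names (the real code lives in Cython and wraps a C struct): because the C-level options struct has no `encoding` field, a Python-level copy is kept privately and the encoding restored whenever `read_options` is accessed.

```python
class _CReadOptions:
    """Stand-in for the C struct, which has no encoding field."""
    def __init__(self, block_size=1 << 20):
        self.block_size = block_size

class CsvScanOptionsSketch:
    def __init__(self, encoding="utf8"):
        self._c = _CReadOptions()
        self._encoding = encoding  # stored privately so it survives

    @property
    def read_options(self):
        # Rebuilding from the C struct alone would forget the encoding;
        # restore it from the privately stored value.
        return {"block_size": self._c.block_size,
                "encoding": self._encoding}

opts = CsvScanOptionsSketch(encoding="iso8859")
assert opts.read_options["encoding"] == "iso8859"
```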
…now we're setting the transcoder in the read_options setter
Schema detection does not seem to be working properly for UTF16.
This reverts commit 47a3462b756cf92594470cedcd0f56eaf6248016.
Testing whether reading a UTF-16 file as binary works fails, because the column names are not UTF-8, which causes issues parsing the schema. Testing whether reading a UTF-16 file without a transcoder fails does not work either, because the characters are not invalid UTF-8 (meaning no error is triggered).
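The second point can be demonstrated in isolation: UTF-16-LE bytes of ASCII text are ASCII characters interleaved with NUL bytes, and NUL bytes are valid UTF-8, so decoding succeeds silently instead of raising. (A BOM, `b'\xff\xfe'`, would be invalid UTF-8, but a BOM-less file triggers no error at all.)

```python
# UTF-16-LE encoding of ASCII CSV data: every byte is < 0x80 or NUL,
# all of which are valid UTF-8, so no UnicodeDecodeError is raised.
data = "a,b\n1,2\n".encode("utf-16-le")
decoded = data.decode("utf-8")  # succeeds silently
assert "\x00" in decoded        # result is garbled with NULs, not an error
```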
Anything I can do to help move this forward? The failure seems unrelated to me.
Benchmark runs are scheduled for baseline = a5ecb0f and contender = cbf0ec0. cbf0ec0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
ARROW-16000: [C++][Python] Dataset: Alternative implementation for adding transcoding function option to CSV scanner (apache#13820)

This is an alternative version of apache#13709, to compare what the best approach is. Instead of extending the C++ `ReadOptions` struct with an `encoding` field, this implementation adds a Python version of the `ReadOptions` object to both `CsvFileFormat` and `CsvFragmentScanOptions`. The reason it is needed in both places is to prevent these kinds of inconsistencies:

```python
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>