Conversation

@joosthooz
Contributor

This is an alternative version of #13709, to compare which approach is best.

Instead of extending the C++ ReadOptions struct with an encoding field, this implementation adds a Python version of the ReadOptions object to both CsvFileFormat and CsvFragmentScanOptions. It is needed in both places to prevent these kinds of inconsistencies:

```
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

@github-actions

github-actions bot commented Aug 9, 2022

Member

@lidavidm lidavidm left a comment


Ok, I agree with Antoine - this approach is cleaner.

Member


nit: redundant parens

@joosthooz
Contributor Author

joosthooz commented Aug 9, 2022

Ok, I'll close the other PR then, and let's focus on this one. I'm seeing some errors in pytest, but I'm also getting those on the commit I branched from. More concerning is this error in a Windows build, here https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/44413284/job/ic6l7a0ehrnpk21g#L2654 :

FAILED: lib.cp38-win_amd64.pyd 
cmd.exe /C "cd . && C:\Miniconda37-x64\envs\arrow\Library\bin\cmake.exe -E vs_link_dll --intdir=CMakeFiles\lib.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100183~1.0\x64\rc.exe --mt=C:\PROGRA~2\WI3CF2~1\10\bin\100183~1.0\x64\mt.exe --manifests  -- C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo CMakeFiles\lib.dir\lib.cpp.obj  /out:lib.cp38-win_amd64.pyd /implib:lib.lib /pdb:lib.pdb /dll /version:0.0 /machine:x64  /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO  C:\Miniconda37-x64\envs\arrow\libs\python38.lib  C:\Miniconda37-x64\envs\arrow\Library\lib\arrow.lib  C:\Miniconda37-x64\envs\arrow\Library\lib\arrow_python.lib  kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib  && cd ."
LINK: command "C:\PROGRA~2\MIB055~1\2017\COMMUN~1\VC\Tools\MSVC\1416~1.270\bin\Hostx64\x64\link.exe /nologo CMakeFiles\lib.dir\lib.cpp.obj /out:lib.cp38-win_amd64.pyd /implib:lib.lib /pdb:lib.pdb /dll /version:0.0 /machine:x64 /NODEFAULTLIB:LIBCMT /INCREMENTAL:NO C:\Miniconda37-x64\envs\arrow\libs\python38.lib C:\Miniconda37-x64\envs\arrow\Library\lib\arrow.lib C:\Miniconda37-x64\envs\arrow\Library\lib\arrow_python.lib kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib /MANIFEST /MANIFESTFILE:lib.cp38-win_amd64.pyd.manifest" failed (exit code 1120) with the following output:
   Creating library lib.lib and object lib.exp
lib.cpp.obj : error LNK2019: unresolved external symbol "class std::shared_ptr<class std::function<class arrow::Result<class std::shared_ptr<class arrow::io::InputStream> > __cdecl(class std::shared_ptr<class arrow::io::InputStream>)> > __cdecl arrow::py::MakeStreamTransformFunc(struct arrow::py::TransformInputStreamVTable,struct _object *)" (?MakeStreamTransformFunc@py@arrow@@YA?AV?$shared_ptr@V?$function@$$A6A?AV?$Result@V?$shared_ptr@VInputStream@io@arrow@@@std@@@arrow@@V?$shared_ptr@VInputStream@io@arrow@@@std@@@Z@std@@@std@@UTransformInputStreamVTable@12@PEAU_object@@@Z) referenced in function "class std::shared_ptr<class std::function<class arrow::Result<class std::shared_ptr<class arrow::io::InputStream> > __cdecl(class std::shared_ptr<class arrow::io::InputStream>)> > __cdecl __pyx_f_7pyarrow_3lib_make_streamwrap_func(struct _object *,struct _object *)" (?__pyx_f_7pyarrow_3lib_make_streamwrap_func@@YA?AV?$shared_ptr@V?$function@$$A6A?AV?$Result@V?$shared_ptr@VInputStream@io@arrow@@@std@@@arrow@@V?$shared_ptr@VInputStream@io@arrow@@@std@@@Z@std@@@std@@PEAU_object@@0@Z)
lib.cp38-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals

It seems the linker is unable to locate the added C++ function arrow::py::MakeStreamTransformFunc when it is called from the generated Cython code. I do notice that the namespace in C++ is just arrow, not arrow::py, but the same goes for MakeTransformInputStream, so I don't know why that discrepancy would cause problems for one but not the other.

@joosthooz joosthooz marked this pull request as ready for review August 10, 2022 10:07
@joosthooz
Contributor Author

There is one failure in the CI; it seems to have been a strange time-out.

Member

@lidavidm lidavidm left a comment


I suppose what's left is adding a test of the transcoder?

@joosthooz
Contributor Author

joosthooz commented Aug 15, 2022

I added a test by copying and modifying the one for the CSV reader, but ran into the following problem:
The latin-1 test works fine. However, the UTF-16 test fails because something seems to be wrong with the schema detection. The equality test between the detected pyarrow Schema and the expected schema fails, even though they appear identical from the Python side (the diffs that pytest prints are identical). Adding some printfs in the C++ code shows that the fields of dataset.schema seem to be empty:

Schema Equals():
this fp: 'S{Fn', other fp: 'S{Fna{@N};Fnb{@O};L}'
this field(0): '', other field(0): 'a: string'

But this looks to me like a separate problem with detecting the schema of a UTF-16 encoded file. Should I create a reproducible example and file a new JIRA, or is this something we should address here?
In the meantime, I removed the test in 47a3462.

@lidavidm
Member

That sounds like dataset inspection is being done without the transcoder actually being set. I think we do need the test to work. I would expect latin-1 to work only because the header row happens to have identical bytes in UTF-8 and latin-1.
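David's point here can be checked with the stdlib alone. This is an illustration of the encoding behavior being discussed, not Arrow code: an ASCII-only header row encodes to identical bytes in latin-1 and UTF-8, so schema inspection without a transcoder still reads the right column names, while a header with a non-ASCII character would not.

```python
header = "a,b,c"  # ASCII-only header row: byte-identical in latin-1 and UTF-8
assert header.encode("latin-1") == header.encode("utf-8")

accented = "café"  # non-ASCII: 'é' is one byte in latin-1 but two bytes in UTF-8
assert accented.encode("latin-1") != accented.encode("utf-8")
```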

@joosthooz
Contributor Author

I added a test with a non-UTF-8 character in the column name (latin-1), and that works. So the problem looks more specific to UTF-16. I'll investigate further.

@joosthooz
Contributor Author

After having a better look, here's what seems to be happening:

  • The first part of the test checks that parsing the file as binary still works. That doesn't work for UTF-16, because the column names are not UTF-8, so parsing the column names into the schema fails (silently!).
  • The second part tries to read the file without specifying an encoding and expects an exception. However, the dataset reader apparently has no problem with the null byte after every other character; it just interprets the data as a strange UTF-8 string.

I've removed those two additional checks and now only check that the data is transcoded properly. The second check is still present in the new test_column_names_encoding test (which only tests latin-1).
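The second bullet can be demonstrated with the stdlib alone (an illustration of the behavior described, not the actual test): ASCII text encoded as UTF-16-LE interleaves NUL bytes, and every resulting byte is also valid UTF-8, so decoding raises no error.

```python
data = "a,b\n1,2\n".encode("utf-16-le")
assert b"\x00" in data           # every other byte is a NUL

decoded = data.decode("utf-8")   # succeeds: all bytes are < 0x80, valid UTF-8
assert decoded == "a\x00,\x00b\x00\n\x001\x00,\x002\x00\n\x00"
```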

@joosthooz
Contributor Author

Hi guys, I think the current state should be OK; the remaining failures seem unrelated to me. Is there anything else left for me to do?

@joosthooz joosthooz requested a review from lidavidm August 19, 2022 10:12
Member

@lidavidm lidavidm left a comment


Sorry for the delay. This looks good to me. @pitrou what do you think about the approach here?

Member

@pitrou pitrou left a comment


The approach looks fine to me. Just a couple comments.

Member


This does not need to be visible to the user; how about renaming it to stress that it's an internal detail?

Suggested change:

```diff
- public ReadOptions read_options_py
+ public ReadOptions _read_options_py
```

Contributor Author


Thanks! Changed in b9982c8.

Member


Same here.

Contributor Author


Changed in b9982c8

Member


Note that there can be aliases, for example:

```
>>> codecs.lookup('utf-8').name
'utf-8'
>>> codecs.lookup('utf8').name
'utf-8'
>>> codecs.lookup('UTF8').name
'utf-8'
```

Contributor Author


I've added a lookup to deal with this in 1e621fa
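Such a normalization can be expressed with codecs.lookup, which resolves aliases to one canonical name. A minimal stdlib sketch of the idea; the helper name is hypothetical and this is not the code from 1e621fa:

```python
import codecs

def canonical_encoding(name: str) -> str:
    # codecs.lookup resolves aliases ("UTF8", "utf_8", "latin1", ...)
    # to a single canonical codec name.
    return codecs.lookup(name).name

assert canonical_encoding("UTF8") == "utf-8"
assert canonical_encoding("utf-8") == canonical_encoding("utf8")
assert canonical_encoding("latin1") == canonical_encoding("iso8859-1")
```

Comparing canonical names rather than raw strings avoids treating 'UTF8' and 'utf-8' as different encodings.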

Member


expected_table here is unused.

Contributor Author


Thanks for catching that; it's been removed in b3ac697.

Instead of duplicating the encoding field in the CsvFileFormat, we store the encoding in a private field in the CsvFragmentScanOptions.
In that class, the read_options.encoding field gets lost when the object is initialized from the C struct (which doesn't have an encoding field), so we restore it whenever read_options is read.
The encoding needs to be stored in both CsvFileFormat and CsvFragmentScanOptions, because if the user holds references to these separate objects, they would otherwise become inconsistent:
one would report the default 'utf8' (forgetting the user's encoding choice), while the other would still report the requested encoding.
It would then be unclear to the user which of the two values the transcoding would eventually use.
…now we're setting the transcoder in the read_options setter
Schema detection does not seem to be working properly for UTF-16.
This reverts commit 47a3462b756cf92594470cedcd0f56eaf6248016.
Testing that reading a UTF-16 file as binary works fails, because the column names are not UTF-8, causing issues parsing the schema.
Testing that reading a UTF-16 file without a transcoder fails does not work, because the bytes are not invalid UTF-8 (meaning no error is triggered).
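The store-and-restore pattern described in the commit messages above can be sketched with stand-in classes. The names below are hypothetical and this is not the actual Cython code; it only illustrates remembering a Python-only field that the underlying C struct would drop.

```python
class ReadOptionsStub:
    """Stand-in for pyarrow's ReadOptions; only the fields needed here."""
    def __init__(self, encoding="utf8", use_threads=True):
        self.encoding = encoding
        self.use_threads = use_threads

class CsvFragmentScanOptionsSketch:
    """Remembers the Python-only encoding, since the C struct drops it."""
    def __init__(self, read_options):
        self._encoding = read_options.encoding        # saved before it is lost
        self._use_threads = read_options.use_threads  # a field the C struct keeps

    @property
    def read_options(self):
        # Rebuilding purely from C-level state would report the default 'utf8';
        # restoring the saved value keeps the wrapper objects consistent.
        ro = ReadOptionsStub(use_threads=self._use_threads)
        ro.encoding = self._encoding
        return ro

opts = CsvFragmentScanOptionsSketch(ReadOptionsStub(encoding="iso8859"))
```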
@joosthooz joosthooz force-pushed the ARROW-16000-duplicate-python-readoptions branch from 3768fe1 to 1e621fa Compare August 26, 2022 14:10
@joosthooz
Contributor Author

Anything I can do to help move this forward? The failure seems unrelated to me (cat: r/check/arrow.Rcheck/00install.out: No such file or directory)

@lidavidm lidavidm merged commit cbf0ec0 into apache:master Sep 6, 2022
@ursabot

ursabot commented Sep 7, 2022

Benchmark runs are scheduled for baseline = a5ecb0f and contender = cbf0ec0. cbf0ec0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.61% ⬆️0.41%] test-mac-arm
[Failed ⬇️2.74% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] cbf0ec0d ec2-t3-xlarge-us-east-2
[Failed] cbf0ec0d test-mac-arm
[Failed] cbf0ec0d ursa-i9-9960x
[Finished] cbf0ec0d ursa-thinkcentre-m75q
[Finished] a5ecb0ff ec2-t3-xlarge-us-east-2
[Failed] a5ecb0ff test-mac-arm
[Failed] a5ecb0ff ursa-i9-9960x
[Finished] a5ecb0ff ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Sep 7, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…ding transcoding function option to CSV scanner (apache#13820)

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
