ARROW-16000: [C++][Python] Dataset: Alternative implementation for adding transcoding function option to CSV scanner #13820
Conversation
lidavidm
left a comment
Ok, I agree with Antoine - this approach is cleaner.
python/pyarrow/_dataset.pyx
Outdated
nit: redundant parens
Ok, I'll close the other PR then, and let's focus on this. I'm seeing some errors in pytest, but I'm also getting those on the commit I branched off of. More concerning is this error in a Windows build, https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/44413284/job/ic6l7a0ehrnpk21g#L2654 : it seems the linker is not able to locate the added C++ function.
There is 1 failure in the CI; it seems to have been a spurious time-out.
lidavidm
left a comment
I suppose what's left is adding a test of the transcoder?
I added a test by copying and modifying the one for the CSV reader, but ran into the following problem: […] This looks to me like a different problem, with detecting the schema of a UTF-16 encoded file. Should I try to create a reproducible example and file a new JIRA, or is this something we should address here?
That sounds like dataset inspection is being done without the transcoder actually being set. I think we do need the test to work. I would expect latin-1 happens to work because the header row has identical encoding between UTF-8 and latin-1.
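The latin-1 observation can be checked directly. This is an illustration (not code from the PR): an ASCII-only header row is byte-identical in UTF-8 and latin-1, so schema inspection accidentally succeeds, while a non-ASCII column name produces latin-1 bytes that are not valid UTF-8.

```python
# ASCII-only header: identical bytes in UTF-8 and latin-1,
# so uninformed schema inspection happens to work.
header = "col1,col2"
assert header.encode("utf-8") == header.encode("latin-1")

# A non-ASCII column name diverges: its latin-1 bytes are invalid UTF-8.
try:
    "café".encode("latin-1").decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```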
I added a test with a non-UTF-8 character in the column name (latin-1). That works, so it looks like something a bit more specific to UTF-16. I'll investigate further.
After having a better look, here's what seems to be happening: […]
I've removed those 2 additional checks, and now just check that the data is transcoded properly. The 2nd check is still present in the new […]
Hi guys, I think the current state should be OK; the failures seem unrelated to me. Is there anything else left for me to do?
lidavidm
left a comment
Sorry for the delay. This looks good to me. @pitrou what do you think about the approach here?
pitrou
left a comment
The approach looks fine to me. Just a couple comments.
python/pyarrow/_dataset.pyx
Outdated
This does not need to be visible to the user, how about renaming it to stress it's an internal detail?
Suggested change: `public ReadOptions read_options_py` → `public ReadOptions _read_options_py`
Thanks! Changed in b9982c8.
python/pyarrow/_dataset.pyx
Outdated
Same here.
Changed in b9982c8
python/pyarrow/_dataset.pyx
Outdated
Note that there can be aliases, for example:

```python
>>> codecs.lookup('utf-8').name
'utf-8'
>>> codecs.lookup('utf8').name
'utf-8'
>>> codecs.lookup('UTF8').name
'utf-8'
```
I've added a lookup to deal with this in 1e621fa
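As a sketch of what such a lookup could look like (hypothetical helper name, not the PR's actual code): the codec registry canonicalizes aliases like `'utf8'` and `'UTF8'` to a single name, so comparing normalized names avoids false mismatches.

```python
import codecs

def needs_transcoding(encoding: str) -> bool:
    # Normalize the user-supplied name through the codec registry so
    # aliases ('utf8', 'UTF8', 'UTF-8', ...) all compare equal.
    return codecs.lookup(encoding).name != "utf-8"

assert not needs_transcoding("UTF8")
assert not needs_transcoding("utf-8")
assert needs_transcoding("iso8859")   # an alias for latin-1
```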
python/pyarrow/tests/test_dataset.py
Outdated
`expected_table` here is unused.
Thanks for catching that, it's been removed in b3ac697
Instead of duplicating the encoding field in the `CsvFileFormat`, we store the encoding in a private field in the `CsvFragmentScanOptions`. In that class, the `read_options.encoding` field gets lost when the object is initialized from the C struct (which doesn't have an encoding field), so when the `read_options` are read, we restore it again.
It needs to be stored in both `CsvFileFormat` and `CsvFragmentScanOptions` because, if the user holds references to these as separate objects, they would otherwise become inconsistent: one would report the default `'utf8'` (forgetting the user's encoding choice), while the other would still report the requested encoding. It would then be unclear to the user which of these values the transcoding would eventually use.
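The pattern can be sketched in plain Python with hypothetical names (the real code lives in Cython and wraps a C struct): because the C-level options struct has no `encoding` field, a Python-level copy is kept privately and the encoding restored whenever `read_options` is accessed.

```python
class _CReadOptions:
    """Stand-in for the C struct, which has no encoding field."""
    def __init__(self, block_size=1 << 20):
        self.block_size = block_size

class CsvScanOptionsSketch:
    def __init__(self, encoding="utf8"):
        self._c = _CReadOptions()
        self._encoding = encoding  # stored privately so it survives

    @property
    def read_options(self):
        # Rebuilding from the C struct alone would forget the encoding;
        # restore it from the privately stored value.
        return {"block_size": self._c.block_size,
                "encoding": self._encoding}

opts = CsvScanOptionsSketch(encoding="iso8859")
assert opts.read_options["encoding"] == "iso8859"
```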
…now we're setting the transcoder in the read_options setter
Schema detection does not seem to be working properly for UTF16.
This reverts commit 47a3462b756cf92594470cedcd0f56eaf6248016.
Testing whether reading a UTF-16 file as binary works fails, because the column names are not UTF-8, which causes issues parsing the schema. Testing whether reading a UTF-16 file without a transcoder fails does not work either, because the characters are not invalid UTF-8 (meaning no error is triggered).
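The second point can be demonstrated in isolation: UTF-16-LE bytes of ASCII text are ASCII characters interleaved with NUL bytes, and NUL bytes are valid UTF-8, so decoding succeeds silently instead of raising. (A BOM, `b'\xff\xfe'`, would be invalid UTF-8, but a BOM-less file triggers no error at all.)

```python
# UTF-16-LE encoding of ASCII CSV data: every byte is < 0x80 or NUL,
# all of which are valid UTF-8, so no UnicodeDecodeError is raised.
data = "a,b\n1,2\n".encode("utf-16-le")
decoded = data.decode("utf-8")  # succeeds silently
assert "\x00" in decoded        # result is garbled with NULs, not an error
```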
Anything I can do to help move this forward? The failure seems unrelated to me.
Benchmark runs are scheduled for baseline = a5ecb0f and contender = cbf0ec0. cbf0ec0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
ARROW-16000: [C++][Python] Dataset: Alternative implementation for adding transcoding function option to CSV scanner (apache#13820)

This is an alternative version of apache#13709, to compare what the best approach is. Instead of extending the C++ `ReadOptions` struct with an `encoding` field, this implementation adds a Python version of the `ReadOptions` object to both `CsvFileFormat` and `CsvFragmentScanOptions`. The reason it is needed in both places is to prevent these kinds of inconsistencies:

```python
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>