
Conversation

@joosthooz (Contributor)

WIP Adding an optional function that wraps all input streams with a user-supplied transcoding function.
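For a rough idea of what such a wrapping function can look like on the Python side, here is an illustrative sketch based on the standard `codecs` module (not necessarily the exact code in this PR):

```python
# Illustrative sketch: a callable that re-encodes raw bytes from src_encoding
# to dest_encoding, suitable for wrapping an input stream chunk by chunk.
import codecs


class Transcoder:
    def __init__(self, decoder, encoder):
        self._decoder = decoder
        self._encoder = encoder

    def __call__(self, buf):
        # An empty buffer signals end-of-stream, so flush both codecs.
        final = len(buf) == 0
        return self._encoder.encode(self._decoder.decode(buf, final), final)


def make_transcoder(src_encoding, dest_encoding='utf8'):
    return Transcoder(codecs.getincrementaldecoder(src_encoding)(),
                      codecs.getincrementalencoder(dest_encoding)())
```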

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


```python
# from io.pxi
class Transcoder:
```
Member

Hmm, do we really want to jump back to python here instead of using C++ utilities for decoding? (I'm not sure if there are any good standard utilities so maybe the answer is yes).

Member

We mostly discussed this in the JIRA - you'd have to pull in a library like ICU if you want to do it on the C++ side, and also Python (at least) has 'special' encodings like 'unicodereplace' that users may or may not expect to be able to use.

Contributor Author

I hope to add that as a possibility in the future, but for now I wanted to mimic the behavior of read_csv as much as possible. We'll have to see how bad a bottleneck this creates, but for scanning a single file it shouldn't matter, and that is good enough for my use case: I just want to be able to deal with files that are larger than memory (which pyarrow.dataset allows me to do and read_csv does not).
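For context, the larger-than-memory case works because the dataset API scans in record batches instead of materializing a whole table. A minimal sketch of that usage (the file path and `process` function are placeholders):

```python
# Minimal sketch of batch-wise CSV scanning with pyarrow.dataset;
# "large_file.csv" and process() are placeholders.
import pyarrow.dataset as ds

dataset = ds.dataset("large_file.csv", format="csv")
for batch in dataset.to_batches():
    # Each iteration yields one RecordBatch, so the whole file is never
    # held in memory at once.
    process(batch)
```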

@joosthooz changed the title from "Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner" to "ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner" on Jul 28, 2022
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@joosthooz (Contributor Author)

The current state is that it works, but it relies on the workaround of adding an encoding parameter to pyarrow.dataset().
That needs to be dealt with before proceeding.
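A sketch of the shape of that workaround (the `encoding` argument here is the temporary parameter being discussed, not an actual pyarrow.dataset() parameter):

```python
# Hypothetical workaround shape: passing the encoding at dataset() level.
import pyarrow.dataset as ds

dataset = ds.dataset("file.csv", format="csv", encoding="latin-1")
```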

@westonpace (Member) left a comment

Is it possible to add the encoding option in CsvFileFormat? I think that is the entry point to "fragment scan options" for pyarrow datasets and it appears to be a thin wrapper around CsvFileFormatOptions:

```python
l1_csv_format = ds.CsvFileFormat(read_options=..., parse_options=..., convert_options=..., encoding='latin-1')
my_dataset = ds.dataset([my_files], format=l1_csv_format)
```

```
Parameters
----------
src_encoding : str
    The codec to use when reading data data.
```
Member

Suggested change:

```diff
-    The codec to use when reading data data.
+    The codec to use when reading data.
```

```
Create a function that will add a transcoding transformation to a stream.
Data from that stream will be decoded according to ``src_encoding`` and
then re-encoded according to ``dest_encoding``.
The created function can be used to wrap streams once they are created.
```
Member

Suggested change:

```diff
-The created function can be used to wrap streams once they are created.
+The created function can be used to wrap streams.
```

@joosthooz (Contributor Author)

Thank you for the comments and suggestions.
I implemented Weston's suggestion and added an encoding field to CsvFileFormat instead of the added parameter to dataset(). This works, but I dislike it a lot. The field duplicates the one in ReadOptions: when using read_csv, users set the field in read_options, but when using dataset, they set the field in format. They can also set both, in which case one is silently discarded. Here's what that looks like:

```python
>>> fo = ds.CsvFileFormat(default_fragment_scan_options=ds.CsvFragmentScanOptions(read_options=csv.ReadOptions(encoding='iso-8259')), encoding='cp1252')
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
>>> fo.encoding
'cp1252'
```

Instead of duplicating the encoding field in CsvFileFormat, we store the encoding in a private field in CsvFragmentScanOptions.
In that class, the read_options.encoding field gets lost when it is initialized from the C struct (which doesn't have an encoding field).
So when the read_options are read back, we restore it.
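Roughly, the remember-and-restore pattern looks like this (an illustrative sketch; the class and attribute names are stand-ins, not the PR's exact code):

```python
# Illustrative sketch: the wrapped C struct has no encoding field, so the
# Python-only value is remembered separately and re-attached whenever
# read_options is read back.
import pyarrow.csv as csv


class FragmentScanOptionsSketch:
    def __init__(self, read_options=None):
        read_options = read_options or csv.ReadOptions()
        # Remember the Python-only field before the C struct drops it.
        self._encoding = read_options.encoding
        self._read_options = read_options

    @property
    def read_options(self):
        options = self._read_options
        options.encoding = self._encoding  # restore the field on the way out
        return options
```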
@joosthooz (Contributor Author)

joosthooz commented Jul 29, 2022

I pushed an alternative way of passing the encoding in 22eff73. For the user it works the same way as in read_csv: it is a field in read_options. I store the value in CsvFragmentScanOptions, and then restore it.
How do you feel about this?

Edit: Hm, it is not working the way I want it yet. The value still gets lost when creating a CsvFileFormat.
Is it an option to add encoding as a field to the C struct ReadOptions?

@westonpace (Member)

> I store the value in CsvFragmentScanOptions, and then restore it.
> How do you feel about this?

I like this approach if you can get it working. Can you add this to the CsvFileFormat constructor?

```python
        else:
            # default_fragment_scan_options is needed to add a transcoder
            self.default_fragment_scan_options = CsvFragmentScanOptions()
        if read_options is not None:
            self.default_fragment_scan_options.encoding = read_options.encoding
```

> Is it an option to add encoding as a field to the C struct ReadOptions?

That seems undesirable. The C++ CSV reader doesn't have the field because it has no ability to handle encodings, so I'm not sure we want to add a field that is completely ignored.

It needs to be stored in both CsvFileFormat and CsvFragmentScanOptions because if the user holds references to these as separate objects, they would otherwise become inconsistent.
One would report the default 'utf8' (forgetting the user's encoding choice), while the other would still report the requested encoding.
It would be unclear to the user which of these values is eventually used for the transcoding.
@joosthooz (Contributor Author)

(The commit message of 4d819aa should be "Removed encoding from CsvFragmentScanOptions.equals()".)

@joosthooz (Contributor Author)

Ok, I pushed something completely different. I added encoding as a field in the C struct, plus some wrapper code that tries to dlopen the libiconv library. I haven't really tested it beyond seeing that it doesn't crash when I read some data from a dataset. Now the question is how we let the user specify what they want to do, i.e. choose between a Python transcoder or a library on their system. And how do we show what libraries are available? Should we create an example of how people can add their own wrappers?

@lidavidm (Member)

lidavidm commented Aug 1, 2022

I think we're getting a bit far afield… Dynamic linking needs platform-specific code, and we usually configure optional dependencies with build flags.

What if we add the C++-side field, have it error in C++ if not set to the default, and in Python reset the value to the default and configure the transcoder? That leaves us a path to upgrade and should avoid excessive Python-side hacks. If we decide it's valuable to have built-in C++-side transcoding, then we already have the option there.

An alternative would be to have the Python wrappers for these structs no longer actually wrap the C++ structs, so that we aren't limited to the C++ fields. But that would lead to some code duplication/messiness as well.

I'm not sure we can avoid some messiness: the fundamental issue is that we have a Python-only field but are trying to directly wrap the C++ structs. That extra field needs to be mirrored somewhere. Either we do work to pass it around on the Python side or we give in and add it in C++.
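On the Python side, the first option (reset the C++-visible field to its default and configure the transcoder) would look roughly like this (an illustrative sketch; `stream_transform_func` and `make_transcoder` are assumptions based on this discussion, not a finalized API):

```python
# Illustrative sketch: if the user requested a non-UTF-8 encoding, install a
# Python transcoder and reset the C++-visible field to its default, so the
# C++ CSV reader only ever sees UTF-8.
def configure_encoding(scan_options, read_options):
    encoding = read_options.encoding
    if encoding not in (None, 'utf8', 'utf-8'):
        # Hypothetical hook that wraps each input stream, re-encoding it
        # from `encoding` to UTF-8 before the C++ reader consumes it.
        scan_options.stream_transform_func = make_transcoder(encoding, 'utf8')
        read_options.encoding = 'utf8'
```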

@joosthooz (Contributor Author)

That sums it up very nicely. Both alternatives are fine with me. I just pushed an update that aims to do what you suggest, adding an encoding field to the C++ struct. The CSV reader returns an Invalid error when the user has specified an encoding other than UTF-8 but the stream_transform_func is empty.
Is that the right error type, or would an IOError be more suitable?
How do you feel about the name of the set_transcoder function in CsvFragmentScanOptions? Should I add a docstring to it? Should the stream_transform_func be added to the equals() function? (In that case I think I need to add a getter/setter for it too.)

@lidavidm (Member)

lidavidm commented Aug 2, 2022

> How do you feel about the name of the set_transcoder function in CsvFragmentScanOptions? Should I add a docstring to it?

It feels like it shouldn't be publicly accessible? Or else it should mirror the C++ side option name 1:1

> Should the stream_transform_func be added to the equals() function? (In that case I think I need to add a getter/setter for it too.)

I guess we can only get pointer equality, but yes

@joosthooz (Contributor Author)

> Somewhere in the CSV reader itself we should also validate the option

I tried this, but it doesn't work, because in that case we would need to re-set the field back to utf8 when adding a transcoder in Python. Otherwise, the error is triggered even though we are transcoding to utf8. But then we again run into the issue where the ReadOptions object that the user created is changed:

```python
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> dataset = ds.dataset("file.csv", format=fo)
>>> ro.encoding
'utf8'
```

This would be really strange if you ask me. And if we accept this strange behavior, we didn't need to add the encoding field in the first place.
So now the field is basically ignored in the CSV reader; the only check is in the dataset CSV reader, which requires that a wrapping function is set if the encoding is not utf8.

@lidavidm (Member)

lidavidm commented Aug 5, 2022

Ah, thanks for explaining.

Wonder if we should/could pass a copy of the ReadOptions then?

@lidavidm (Member)

lidavidm commented Aug 5, 2022

But it's not a big deal, I think, so long as the field is clearly documented.

@pitrou (Member)

pitrou commented Aug 8, 2022

@joosthooz Do you want reviewing at this point or are you looking to polish this PR first?

@joosthooz (Contributor Author)

Thanks for checking in, @pitrou! The most important thing is to choose which approach to take; to make that easier I opened #13820 to compare against.
After that I need to check why some of the tests are failing (they seem unrelated), maybe polish a bit, and then I'll move it out of the draft state.

@pitrou (Member)

pitrou commented Aug 9, 2022

I would favor #13820, which pushes complexity into Python, over this one, which introduces a dummy option in C++ that has no effect.

@joosthooz (Contributor Author)

Continuing here: #13820

@joosthooz joosthooz closed this Aug 9, 2022
lidavidm pushed a commit that referenced this pull request Sep 6, 2022
…ding transcoding function option to CSV scanner (#13820)

This is an alternative version of #13709, to compare what the best approach is.

Instead of extending the C++ ReadOptions struct with an `encoding` field, this implementation adds a Python version of the ReadOptions object to both `CsvFileFormat` and `CsvFragmentScanOptions`. The reason it is needed in both places is to prevent these kinds of inconsistencies:
```
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro =csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…ding transcoding function option to CSV scanner (apache#13820)
