
Conversation

@joosthooz (Contributor)

WIP Adding an optional function that wraps all input streams with a user-supplied transcoding function.
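For a rough idea of what such a wrapping function can look like on the Python side, here is an illustrative sketch based on the standard `codecs` module (not necessarily the exact code in this PR):

```python
# Illustrative sketch: a callable that re-encodes raw bytes from src_encoding
# to dest_encoding, suitable for wrapping an input stream chunk by chunk.
import codecs


class Transcoder:
    def __init__(self, decoder, encoder):
        self._decoder = decoder
        self._encoder = encoder

    def __call__(self, buf):
        # An empty buffer signals end-of-stream, so flush both codecs.
        final = len(buf) == 0
        return self._encoder.encode(self._decoder.decode(buf, final), final)


def make_transcoder(src_encoding, dest_encoding='utf8'):
    return Transcoder(codecs.getincrementaldecoder(src_encoding)(),
                      codecs.getincrementalencoder(dest_encoding)())
```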

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


```python
# from io.pxi
class Transcoder:
```
Member

Hmm, do we really want to jump back to python here instead of using C++ utilities for decoding? (I'm not sure if there are any good standard utilities so maybe the answer is yes).

Member

We mostly discussed this in the JIRA - you'd have to pull in a library like ICU if you want to do it on the C++ side, and also Python (at least) has 'special' encodings like 'unicodereplace' that users may or may not expect to be able to use.

Contributor Author

I hope to add that as a possibility in the future, but for now I wanted to mimic the behavior of read_csv as much as possible. We'll have to see how bad a bottleneck this creates, but for scanning a single file it shouldn't matter, and that is good enough for my use case: I just want to be able to deal with files that are larger than memory (which pyarrow.dataset allows me to do and read_csv does not).
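For context, the larger-than-memory case works because the dataset API scans in record batches instead of materializing a whole table. A minimal sketch of that usage (the file path and `process` function are placeholders):

```python
# Minimal sketch of batch-wise CSV scanning with pyarrow.dataset;
# "large_file.csv" and process() are placeholders.
import pyarrow.dataset as ds

dataset = ds.dataset("large_file.csv", format="csv")
for batch in dataset.to_batches():
    # Each iteration yields one RecordBatch, so the whole file is never
    # held in memory at once.
    process(batch)
```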

@joosthooz changed the title from "Arrow-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner" to "ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner" on Jul 28, 2022
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@joosthooz (Contributor Author)

The current state is that it works, but it relies on the workaround of adding an encoding parameter to pyarrow.dataset().
That needs to be dealt with before proceeding.
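A sketch of the shape of that workaround (the `encoding` argument here is the temporary parameter being discussed, not an actual pyarrow.dataset() parameter):

```python
# Hypothetical workaround shape: passing the encoding at dataset() level.
import pyarrow.dataset as ds

dataset = ds.dataset("file.csv", format="csv", encoding="latin-1")
```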

@westonpace (Member) left a comment

Is it possible to add the encoding option in CsvFileFormat? I think that is the entry point to "fragment scan options" for pyarrow datasets and it appears to be a thin wrapper around CsvFileFormatOptions:

```python
l1_csv_format = ds.CsvFileFormat(read_options=..., parse_options=..., convert_options=..., encoding='latin-1')
my_dataset = ds.dataset([my_files], format=l1_csv_format)
```

```
Parameters
----------
src_encoding : str
    The codec to use when reading data data.
```
Member

Suggested change:

```diff
-    The codec to use when reading data data.
+    The codec to use when reading data.
```

```
Create a function that will add a transcoding transformation to a stream.
Data from that stream will be decoded according to ``src_encoding`` and
then re-encoded according to ``dest_encoding``.
The created function can be used to wrap streams once they are created.
```
Member

Suggested change:

```diff
-The created function can be used to wrap streams once they are created.
+The created function can be used to wrap streams.
```

@joosthooz (Contributor Author)

Thank you for the comments and suggestions.
I implemented Weston's suggestion and added an encoding field to CsvFileFormat instead of the added parameter to dataset(). This works, but I dislike it a lot. The field duplicates the one in ReadOptions: when using read_csv, users set the field in read_options, but when using dataset, they set the field in format. They can also set both, in which case one is silently discarded. Here's what that looks like:

```python
>>> fo = ds.CsvFileFormat(default_fragment_scan_options=ds.CsvFragmentScanOptions(read_options=csv.ReadOptions(encoding='iso-8259')), encoding='cp1252')
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
>>> fo.encoding
'cp1252'
```

Instead of duplicating the encoding field in CsvFileFormat, we store the encoding in a private field in CsvFragmentScanOptions.
In that class, the read_options.encoding field gets lost when it is initialized from the C struct (which doesn't have an encoding field).
So when the read_options are read back, we restore it.
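Roughly, the remember-and-restore pattern looks like this (an illustrative sketch; the class and attribute names are stand-ins, not the PR's exact code):

```python
# Illustrative sketch: the wrapped C struct has no encoding field, so the
# Python-only value is remembered separately and re-attached whenever
# read_options is read back.
import pyarrow.csv as csv


class FragmentScanOptionsSketch:
    def __init__(self, read_options=None):
        read_options = read_options or csv.ReadOptions()
        # Remember the Python-only field before the C struct drops it.
        self._encoding = read_options.encoding
        self._read_options = read_options

    @property
    def read_options(self):
        options = self._read_options
        options.encoding = self._encoding  # restore the field on the way out
        return options
```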
@joosthooz (Contributor Author)

joosthooz commented Jul 29, 2022

I pushed an alternative way of passing the encoding in 22eff73. For the user it works the same way as in read_csv: it is a field in read_options. I store the value in CsvFragmentScanOptions, and then restore it.
How do you feel about this?

Edit: Hm, it is not working the way I want it yet. The value still gets lost when creating a CsvFileFormat.
Is it an option to add encoding as a field to the C struct ReadOptions?

@westonpace (Member)

> I store the value in CsvFragmentScanOptions, and then restore it.
> How do you feel about this?

I like this approach if you can get it working. Can you add this to the CsvFileFormat constructor?

```python
        else:
            # default_fragment_scan_options is needed to add a transcoder
            self.default_fragment_scan_options = CsvFragmentScanOptions()
        if read_options is not None:
            self.default_fragment_scan_options.encoding = read_options.encoding
```

> Is it an option to add encoding as a field to the C struct ReadOptions?

That seems undesirable. The C++ CSV reader doesn't have the field because it has no ability to handle encodings, so I'm not sure we want to add a field that is completely ignored.

It needs to be stored in both CsvFileFormat and CsvFragmentScanOptions because if the user holds references to these as separate objects, they would otherwise become inconsistent.
One would report the default 'utf8' (forgetting the user's encoding choice), while the other would still report the requested encoding.
It would be unclear to the user which of these values is eventually used for the transcoding.
@joosthooz (Contributor Author)

(The commit message of 4d819aa should be "Removed encoding from CsvFragmentScanOptions.equals()".)

@joosthooz (Contributor Author)

Ok, I pushed something completely different. I added encoding as a field in the C struct, plus some wrapper code that tries to dlopen the libiconv library. I haven't really tested it beyond seeing that it doesn't crash when I read some data from a dataset. Now the question is how we let the user specify what they want to do, i.e. choose between a Python transcoder or a library on their system. And how do we show what libraries are available? Should we create an example of how people can add their own wrappers?

@lidavidm (Member)

lidavidm commented Aug 1, 2022

I think we're getting a bit far afield… Dynamic linking needs platform-specific code, and we usually configure optional dependencies with build flags.

What if we add the C++-side field, have it error in C++ if not set to the default, and in Python reset the value to the default and configure the transcoder? That leaves us a path to upgrade and should avoid excessive Python-side hacks. If we decide it's valuable to have built-in C++-side transcoding, then we already have the option there.

An alternative would be to have the Python wrappers for these structs no longer actually wrap the C++ structs, so that we aren't limited to the C++ fields. But that would lead to some code duplication/messiness as well.

I'm not sure we can avoid some messiness: the fundamental issue is that we have a Python-only field but are trying to directly wrap the C++ structs. That extra field needs to be mirrored somewhere. Either we do work to pass it around on the Python side or we give in and add it in C++.
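On the Python side, the first option (reset the C++-visible field to its default and configure the transcoder) would look roughly like this (an illustrative sketch; `stream_transform_func` and `make_transcoder` are assumptions based on this discussion, not a finalized API):

```python
# Illustrative sketch: if the user requested a non-UTF-8 encoding, install a
# Python transcoder and reset the C++-visible field to its default, so the
# C++ CSV reader only ever sees UTF-8.
def configure_encoding(scan_options, read_options):
    encoding = read_options.encoding
    if encoding not in (None, 'utf8', 'utf-8'):
        # Hypothetical hook that wraps each input stream, re-encoding it
        # from `encoding` to UTF-8 before the C++ reader consumes it.
        scan_options.stream_transform_func = make_transcoder(encoding, 'utf8')
        read_options.encoding = 'utf8'
```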

@joosthooz (Contributor Author)

That sums it up very nicely. Both alternatives are fine with me. I just pushed an update that aims to do what you suggest, adding an encoding field to the C++ struct. The CSV reader returns an Invalid error when the user has specified an encoding other than UTF-8 but the stream_transform_func is empty.
Is that the right error type, or would an IOError be more suitable?
How do you feel about the name of the set_transcoder function in CsvFragmentScanOptions? Should I add a docstring to it? Should the stream_transform_func be added to the equals() function? (In that case I think I need to add a getter/setter for it too.)

@lidavidm (Member)

lidavidm commented Aug 2, 2022

> How do you feel about the name of the set_transcoder function in CsvFragmentScanOptions? Should I add a docstring to it?

It feels like it shouldn't be publicly accessible? Or else it should mirror the C++ side option name 1:1

> Should the stream_transform_func be added to the equals() function? (In that case I think I need to add a getter/setter for it too.)

I guess we can only get pointer equality, but yes

@joosthooz (Contributor Author)

> Somewhere in the CSV reader itself we should also validate the option

I tried this, but it doesn't work, because in that case we would need to re-set the field back to utf8 when adding a transcoder in Python. Otherwise, the error is triggered even though we are transcoding to utf8. But then we again run into the issue where the ReadOptions object that the user created is changed:

```python
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> dataset = ds.dataset("file.csv", format=fo)
>>> ro.encoding
'utf8'
```

This would be really strange if you ask me. And if we accept this strange behavior, we didn't need to add the encoding field in the first place.
So now the field is basically ignored in the CSV reader; the only check is in the dataset CSV reader, which requires that a wrapping function is set if the encoding is not utf8.

@lidavidm (Member)

lidavidm commented Aug 5, 2022

Ah, thanks for explaining.

Wonder if we should/could pass a copy of the ReadOptions then?

@lidavidm (Member)

lidavidm commented Aug 5, 2022

But it's not a big deal, I think, so long as the field is clearly documented.

@pitrou (Member)

pitrou commented Aug 8, 2022

@joosthooz Do you want reviewing at this point or are you looking to polish this PR first?

@joosthooz (Contributor Author)

Thanks for checking in, @pitrou! The most important thing is to choose which approach to take; to make that easier I opened #13820 to compare against.
After that I need to check why some of the tests are failing (they seem unrelated), maybe polish a bit, and then I'll move it out of the draft state.

@pitrou (Member)

pitrou commented Aug 9, 2022

I would favor #13820, which pushes complexity into Python, over this one, which introduces a dummy option in C++ that has no effect.

@joosthooz (Contributor Author)

Continuing here: #13820

@joosthooz joosthooz closed this Aug 9, 2022
lidavidm pushed a commit that referenced this pull request Sep 6, 2022
…ding transcoding function option to CSV scanner (#13820)

This is an alternative version of #13709, to compare what the best approach is.

Instead of extending the C++ ReadOptions struct with an `encoding` field, this implementation adds a Python version of the ReadOptions object to both `CsvFileFormat` and `CsvFragmentScanOptions`. The reason it is needed in both places is to prevent these kinds of inconsistencies:
```
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro =csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…ding transcoding function option to CSV scanner (apache#13820)
