ARROW-16000: [C++][Python] Dataset: Added transcoding function option to CSV scanner #13709
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project. Could you then also rename the pull request title in the following format? … See also: …
python/pyarrow/_dataset.pyx
Outdated
# from io.pxi
class Transcoder:
Hmm, do we really want to jump back to python here instead of using C++ utilities for decoding? (I'm not sure if there are any good standard utilities so maybe the answer is yes).
We mostly discussed this in the JIRA - you'd have to pull in a library like icu if you want to do it on the C++ side, and also Python (at least) has 'special' encodings like 'unicodereplace' that users may or may not expect to be able to use
I hope to add that as a possibility in the future, but for now I wanted to mimic the behavior of read_csv as much as possible. We'll have to see how bad of a bottleneck this will create. But for scanning a single file it shouldn't matter, and that is good enough for my use case because I just want to be able to deal with files that are larger than memory (which pyarrow.dataset will allow me to do and read_csv will not)
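As context for the Python-side approach under discussion, here is a minimal sketch of a transcoding stream wrapper built only from the standard library; the helper name `make_transcoder` is illustrative and is not the PR's actual API:

```python
import codecs


def make_transcoder(src_encoding, dest_encoding="utf-8"):
    """Return a callable that wraps a binary stream so that bytes read from
    it are decoded from src_encoding and re-encoded as dest_encoding."""
    def wrap(raw_stream):
        # codecs.EncodedFile returns a StreamRecoder: bytes read from
        # raw_stream are decoded with file_encoding and re-encoded with
        # data_encoding before being handed to the caller.
        return codecs.EncodedFile(
            raw_stream,
            data_encoding=dest_encoding,
            file_encoding=src_encoding,
        )
    return wrap


# Example: expose a latin-1 file as a UTF-8 byte stream.
# with open("data_latin1.csv", "rb") as f:
#     utf8_stream = make_transcoder("latin-1")(f)
#     data = utf8_stream.read()
```

Doing this in pure Python gives access to every codec and error handler that Python ships, which is the flexibility argument made above; the trade-off is an extra decode/encode pass on the read path.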
The current state is that it works, but it relies on the workaround of adding an …
westonpace left a comment
Is it possible to add the encoding option in CsvFileFormat? I think that is the entry point to "fragment scan options" for pyarrow datasets and it appears to be a thin wrapper around CsvFileFormatOptions:
l1_csv_format = ds.CsvFileFormat(read_options=..., parse_options=..., convert_options=..., encoding='latin-1')
my_dataset = ds.dataset([my_files], format=l1_csv_format)
python/pyarrow/io.pxi
Outdated
Parameters
----------
src_encoding : str
    The codec to use when reading data data.
Suggested change:
- The codec to use when reading data data.
+ The codec to use when reading data.
python/pyarrow/io.pxi
Outdated
Create a function that will add a transcoding transformation to a stream.
Data from that stream will be decoded according to ``src_encoding`` and
then re-encoded according to ``dest_encoding``.
The created function can be used to wrap streams once they are created.
Suggested change:
- The created function can be used to wrap streams once they are created.
+ The created function can be used to wrap streams.
Thank you for the comments and suggestions.
Instead of duplicating the encoding field in the CsvFileFormat, we store the encoding in a private field in the CsvFragmentScanOptions. In that class, the read_options.encoding field gets lost when the object is initialized from the C struct (which doesn't have an encoding field), so we restore it whenever the read_options are read.
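A rough, plain-Python sketch of the pattern described above (the class and attribute names are illustrative, not the PR's actual Cython code): the wrapper remembers the encoding in a private attribute, since the underlying C struct has no such field, and re-attaches it whenever the read options are handed back out:

```python
class CsvFragmentScanOptionsSketch:
    """Illustrative stand-in for the Cython wrapper around the C++ options."""

    def __init__(self, read_options):
        # The C struct has no 'encoding' member, so remember it on the
        # Python side before the options round-trip through C++.
        self._encoding = getattr(read_options, "encoding", "utf8")
        self._read_options = read_options  # stand-in for the wrapped C struct

    @property
    def read_options(self):
        # Restore the encoding that the round-trip through the C struct
        # would otherwise have dropped.
        opts = self._read_options
        opts.encoding = self._encoding
        return opts
```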
I pushed an alternative way of passing the encoding in 22eff73. For the user it works the same way as in … Edit: Hm, it is not working the way I want it yet. The value still gets lost when creating a …
I like this approach if you can get it working. Can you add this to the …
That seems undesirable. The C++ CSV reader doesn't have the field because it has no ability to handle encodings, so I'm not sure we want to add a field that is completely ignored.
It needs to be stored in both CsvFileFormat and CsvFragmentScanOptions because if the user has references to these separate objects, they would otherwise become inconsistent: one would report the default 'utf8' (forgetting the user's encoding choice), while the other would still properly report the requested encoding. To the user it would be unclear which of these values would eventually be used by the transcoding.
(4d819aa should be …)
Ok, I pushed something completely different. I added encoding as a field in the C struct and some wrapper code that tries to …
I think we're getting a bit far afield… Dynamic linking needs platform-specific code, and usually we configure optional dependencies with build flags. What if we add the C++-side field, have it error in C++ if not set to the default, and in Python reset the value to the default and configure the transcoder? That leaves us a path to upgrade and should avoid excessive Python-side hacks. If we decide it's valuable to have built-in C++-side transcoding, then we already have the option there.

An alternative would be to have the Python wrappers for these structs no longer actually wrap the C++ structs, so that we aren't limited to the C++ fields. But that would lead to some code duplication/messiness as well. I'm not sure we can avoid some messiness: the fundamental issue is that we have a Python-only field but are trying to directly wrap the C++ structs. That extra field needs to be mirrored somewhere. Either we do work to pass it around on the Python side or we give in and add it in C++.
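A hedged sketch of the flow proposed above, in plain Python pseudocode (the helper names and the handling of the `encoding` attribute are assumptions for illustration; the real change lives in Cython/C++):

```python
DEFAULT_ENCODING = "utf8"


def prepare_scan(read_options, make_transcoder):
    """Pull the encoding out on the Python side, reset the C++-visible field
    to its default so C++ never sees (and errors on) a non-default value,
    and handle the encoding via a Python stream wrapper instead."""
    encoding = read_options.encoding
    stream_wrapper = None
    if encoding != DEFAULT_ENCODING:
        # Reset before the options cross the language boundary...
        read_options.encoding = DEFAULT_ENCODING
        # ...and transcode the input streams purely in Python.
        stream_wrapper = make_transcoder(encoding, DEFAULT_ENCODING)
    return read_options, stream_wrapper
```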
…scoder was supplied
That sums it up very nicely. Both alternatives are fine with me. I just pushed an update that aims to do what you suggest, which is adding an …
It feels like it shouldn't be publicly accessible? Or else it should mirror the C++ side option name 1:1
I guess we can only get pointer equality, but yes.
I tried this, but it doesn't work, because in that case we would need to re-set the field back to … This would be really strange if you ask me. And if we accept this strange behavior, we wouldn't have needed to add the …
Ah, thanks for explaining. Wonder if we should/could pass a copy of the ReadOptions then?
But it's not a big deal, I think, so long as the field is clearly documented.
@joosthooz Do you want a review at this point, or are you looking to polish this PR first?
I would favor #13820, which pushes complexity into Python, over this one, which introduces a dummy option in C++ that has no effect.
Continuing here: #13820
…ding transcoding function option to CSV scanner (#13820)

This is an alternative version of #13709, to compare which approach is best. Instead of extending the C++ ReadOptions struct with an `encoding` field, this implementation adds a Python version of the ReadOptions object to both `CsvFileFormat` and `CsvFragmentScanOptions`. The reason it is needed in both places is to prevent these kinds of inconsistencies:

```
>>> import pyarrow.dataset as ds
>>> import pyarrow.csv as csv
>>> ro = csv.ReadOptions(encoding='iso8859')
>>> fo = ds.CsvFileFormat(read_options=ro)
>>> fo.default_fragment_scan_options.read_options.encoding
'utf8'
```

Authored-by: Joost Hoozemans <joosthooz@msn.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
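For reference, the user-facing usage enabled by #13820 looks roughly like this (based on the example in the commit message above; the file name is a placeholder):

```python
import pyarrow.dataset as ds
import pyarrow.csv as csv

# Declare the source encoding once in ReadOptions; the dataset layer
# transcodes each file's stream before the C++ CSV parser reads it.
read_options = csv.ReadOptions(encoding="iso8859")
csv_format = ds.CsvFileFormat(read_options=read_options)

dataset = ds.dataset(["my_latin1_file.csv"], format=csv_format)
table = dataset.to_table()  # read the transcoded CSV data into a Table
```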
WIP Adding an optional function that wraps all input streams with a user-supplied transcoding function.