Add large_string_map data file #38
Conversation
The file is too large at 96 MB. Would you mind generating it in your patch instead, or making it much smaller?
I agree with @mapleFU that 96 MB is too large for a test file. What about adding a roundtrip test directly in the arrow repo?
@mapleFU @wgtmac You mean generate it in the test itself? If so, do you know how I can generate the equivalent of the Python script below using the C++ API?
It may help, though it is a bit lengthy compared to the Python code. It's also not a complex struct like a map, so most likely it wouldn't even throw in this case.
The file size looks good. But it does not seem necessary to add a test file, as all files in this repo are for interoperability across different Parquet implementations.
mapleFU left a comment:
I think this file is OK, but you should add a description for this file.
I added a line in the README.
mapleFU left a comment:
This looks OK to me, but I wonder whether this file is required and whether an Arrow-specific description is appropriate in a Parquet repository. @pitrou, would you mind taking a look?
data/README.md
Outdated
| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| chunked_string_map.parquet | Map(String, int32) containing string that won't fit arrow Binary. Asserts arrow LargeBinary can read it. [Issue](https://github.com/apache/arrow/issues/32723) |
Some thoughts:
- The file name could be clearer, e.g. `large_string_map.brotli.parquet`.
- The GitHub issue link may become invalid in the future. What about adding a separate md file that includes the generation script, the file metadata (optional if the script is clear enough), and a more detailed explanation of the issue?
- arrow Binary -> arrow BinaryArray
- arrow LargeBinary -> arrow LargeBinaryArray
- Have you tried other codecs like gzip (at higher levels)? It may help further reduce the file size.
- Will rename the file
- If that's a must, I'll try to do it
- ok
- ok
- I took a quick look at a LinkedIn article that says `BROTLI` has the highest compression rate among all the compression types; see https://www.linkedin.com/pulse/comparison-compression-methods-parquet-file-format-saurav-mohapatra/. In any case, I just tried `GZIP` and it yields a 2085290-byte file, as opposed to `BROTLI`, which produces a 4325-byte file.
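For reference, a minimal, self-contained sketch of that codec comparison, assuming pyarrow; the data here is illustrative, not the actual map file from this PR:

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Highly repetitive string data, similar in spirit to the test file,
# so both codecs compress it aggressively.
tab = pa.table({"s": pa.array(["a" * 1_000_000] * 10)})

for codec in ("GZIP", "BROTLI"):
    path = f"codec_test.{codec.lower()}.parquet"
    pq.write_table(tab, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```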
Regarding #2, there are some other files in this list that point to JIRA issues. This GH issue has an equivalent JIRA issue; maybe I can point to the JIRA one instead?
@arthurpassos Thanks for doing this, and I agree that adding a test file here can be useful for other implementations as well. Why did you create a MAP node, though? Can we just have a regular string column?
@pitrou A regular string column does not suffer from this issue. In a nutshell, it pops up when an arrow ChunkedArray with more than one chunk is produced for columns of complex types like maps. You can find more info at apache/arrow#32723.
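A minimal sketch of how the issue surfaces on the read side, assuming pyarrow and the renamed file from this PR; the exact chunk count depends on the reader's batch handling:

```python
import pyarrow.parquet as pq

# The string child of the map column holds ~3 GiB of data, more than a
# single BinaryArray can address, so the reader must return the column
# as a ChunkedArray with more than one chunk rather than a single array.
table = pq.read_table("large_string_map.brotli.parquet")
print(table.column("arr").num_chunks)
```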
@arthurpassos Makes sense, thanks!
That said... we might go ahead and create several columns here:
The file will remain small anyway thanks to compression. But we can also keep the file as-is if you'd prefer.
We have discussed this in the PR and confirmed that a primitive string column does not have any issues: apache/arrow#35825 (comment). But yes, adding it here may benefit other implementations, like Rust, in verifying their capability.
Can we keep it like this? I would like to get this merged sooner rather than later; I feel this is a bit out of scope and can be addressed in the future.
@arthurpassos No problem, I'll merge then.
Thanks, but let's just wait until this discussion is resolved: apache/arrow#35825 (comment)
LGTM. Thanks @arthurpassos! Sorry, it is a little late in my timezone. I have approved but not merged it in case there is any remaining issue. It would be great if @pitrou could take a final pass.
What's the result of the discussion? Does it mean that a
This version is OK and ready to be merged, I believe.
I've verified: |
@mapleFU Can this be merged then?
I have no permission to merge it, but it seems that it was merged :)
Generated with the Python script below:
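(The script itself was not captured in this thread; the following is a reconstruction, assuming pyarrow, consistent with the Map(String, int32) schema, the multi-chunk layout, and the BROTLI codec discussed above.)

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One row holding a single map entry whose 1 GiB string key is large on
# its own, but still fits a single BinaryArray.
arr = pa.array([[("a" * 2**30, 1)]], type=pa.map_(pa.string(), pa.int32()))
# Three chunks -> ~3 GiB of string data in one column, which cannot be
# represented by a single non-large BinaryArray when read back.
arr = pa.chunked_array([arr, arr, arr])
tab = pa.table({"arr": arr})
# BROTLI compresses the repeated "a" bytes down to a few kilobytes.
pq.write_table(tab, "large_string_map.brotli.parquet", compression="BROTLI")
```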
Required by apache/arrow#35825