
Conversation

@arthurpassos (Contributor) commented May 31, 2023

Generated with the Python script below:

import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({ "arr": arr })

pq.write_table(tab, "test.parquet", compression='BROTLI')

Required by apache/arrow#35825
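
For reference, a quick way to sanity-check the generated file is to read it back. This is a hedged sketch (not part of the original script) and assumes a pyarrow build that includes the fix from apache/arrow#35825:

import pyarrow.parquet as pq

# Read the generated file back; with the fix, the oversized binary data is
# split across chunks instead of raising a capacity error.
table = pq.read_table("test.parquet")
print(table.schema)
print(table.column("arr").num_chunks)  # expected > 1 for the ~2 GiB of keys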

@mapleFU (Member) commented Jun 12, 2023

The file is too large at 96 MB. Would you mind generating it in your patch instead, or making it much smaller?

@wgtmac (Member) commented Jun 12, 2023

I agree with @mapleFU that 96 MB is too large for a test file. What about adding a roundtrip test directly in the arrow repo?

@arthurpassos (Contributor, Author)

@mapleFU @wgtmac Do you mean generating it in the test itself? If so, do you know how I can generate the equivalent of the script below using the C++ API?

import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({ "arr": arr })

pq.write_table(tab, "test.parquet")

@wgtmac (Member) commented Jun 12, 2023

https://github.com/apache/arrow/blob/ae655c5ccb8d4bec1acd0f6d50855a6dea1590c1/cpp/src/arrow/table_test.cc#L294

It may help, though it's a bit lengthy compared to the Python code.

@arthurpassos (Contributor, Author)

> https://github.com/apache/arrow/blob/ae655c5ccb8d4bec1acd0f6d50855a6dea1590c1/cpp/src/arrow/table_test.cc#L294
>
> It may help, though it's a bit lengthy compared to the Python code.

It's also not a complex type like a map, so most likely it wouldn't even throw in this case.

@arthurpassos (Contributor, Author)

@mapleFU @wgtmac I have used BROTLI compression and the file now occupies only 4.22 KB. That should be acceptable, right?

@wgtmac (Member) commented Jun 13, 2023

The file size looks good. But it does not seem necessary to add a test file here, since all files in this repo are meant for interoperability across different parquet implementations.

@mapleFU (Member) left a comment

I think this file is OK, but you should add a description for it.

@arthurpassos (Contributor, Author)

> I think this file is OK, but you should add a description for it.

I added a line to the README.

@mapleFU (Member) left a comment

This looks OK to me, but I wonder whether this file is required and whether an arrow-specific description belongs in the parquet repo. @pitrou, would you mind taking a look?

data/README.md (outdated diff)
| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| chunked_string_map.parquet | Map(String, int32) containing string that won't fit arrow Binary. Asserts arrow LargeBinary can read it [Issue](https://github.com/apache/arrow/issues/32723) |
A reviewer (Member) left a comment

Some thoughts:

  • The file name could be clearer, e.g. large_string_map.brotli.parquet.
  • The GitHub issue link may become invalid in the future. What about adding a separate md file that includes the generation script, the file metadata (optional if the script is clear enough), and a more detailed explanation of the issue?
  • arrow Binary -> arrow BinaryArray
  • arrow LargeBinary -> arrow LargeBinaryArray
  • Have you tried other codecs like gzip (at higher levels)? They may help further reduce the file size.

@arthurpassos (Contributor, Author)

  1. Will rename the file.
  2. If that's a must, I'll try to do it.
  3. OK.
  4. OK.
  5. I took a quick look at a LinkedIn article which says BROTLI has the highest compression rate among all the codecs: https://www.linkedin.com/pulse/comparison-compression-methods-parquet-file-format-saurav-mohapatra/. In any case, I just tried GZIP and it yields a 2,085,290-byte file, as opposed to BROTLI, which produces a 4,325-byte file (see the sketch below).
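
For reference, a minimal sketch of how such a comparison could be reproduced; the output file names are illustrative, the codec names are standard pyarrow ones, and the exact sizes will vary with the pyarrow and codec versions:

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Rebuild the same table as the generation script (needs roughly 1 GiB of RAM).
arr = pa.array([[("a" * 2**30, 1)]], type=pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({"arr": arr})

# Write the table once per codec and compare the resulting file sizes.
for codec in ["GZIP", "BROTLI"]:
    path = f"test_{codec.lower()}.parquet"
    pq.write_table(tab, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")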

@arthurpassos (Contributor, Author)

Regarding #2, some other files in this list point to JIRA issues. This GH issue has an equivalent JIRA issue; maybe I can point to the JIRA one instead?

@pitrou (Member) commented Jun 21, 2023

@arthurpassos Thanks for doing this, and I agree that adding a test file here can be useful for other implementations as well. Why did you create a MAP node, though? Can we just have a regular string column?

@arthurpassos (Contributor, Author)

> @arthurpassos Thanks for doing this, and I agree that adding a test file here can be useful for other implementations as well. Why did you create a MAP node, though? Can we just have a regular string column?

@pitrou A regular string column does not suffer from this issue. In a nutshell, the issue pops up when an arrow ChunkedArray with more than one chunk is produced for columns of complex types like maps. You can find more info at apache/arrow#32723.
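
To illustrate, a hedged sketch, assuming the renamed file and a reader that includes the apache/arrow#35825 fix:

import pyarrow.parquet as pq

# The >2 GiB of string keys cannot fit in a single arrow BinaryArray, so the
# reader has to emit a ChunkedArray with more than one chunk for the map
# column -- the code path apache/arrow#32723 exercises.
col = pq.read_table("large_string_map.brotli.parquet").column("arr")
print(col.num_chunks)  # expected: more than one chunk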

@pitrou changed the title from "Add chunked_string_map data file" to "Add large_string_map data file" on Jun 21, 2023
@pitrou (Member) commented Jun 21, 2023

@arthurpassos Makes sense, thanks!

@pitrou (Member) commented Jun 21, 2023

That said... we might go ahead and create several columns here:

  • a toplevel large_string column
  • a large_string_map column

The file will remain small anyway thanks to compression.

But we can also keep the file as-is if you prefer.

@wgtmac (Member) commented Jun 21, 2023

> That said... we might go ahead and create several columns here:
>
>   • a toplevel large_string column
>   • a large_string_map column
>
> The file will remain small anyway thanks to compression.
>
> But we can also keep the file as-is if you prefer.

We have discussed this in the PR and confirmed that a primitive string column does not have any issues: apache/arrow#35825 (comment). But yes, adding it here may benefit other implementations like Rust by letting them verify this capability.

@arthurpassos (Contributor, Author)

Can we keep it like this? I would like to get this merged sooner rather than later; I feel this is a bit out of scope and can be addressed in the future.

@pitrou (Member) commented Jun 21, 2023

@arthurpassos No problem, I'll merge then.

@arthurpassos (Contributor, Author)

> @arthurpassos No problem, I'll merge then.

Thanks, but let's just wait until this discussion is resolved: apache/arrow#35825 (comment)

@wgtmac (Member) commented Jun 21, 2023

LGTM. Thanks @arthurpassos!

Sorry, it is a little bit late in my timezone. I approved but did not merge in case there are any remaining issues. It would be great if @pitrou could take a final pass.

@mapleFU (Member) commented Jun 21, 2023

> Thanks, but let's just wait until this discussion is resolved: apache/arrow#35825 (comment)

What's the result of the discussion? Does it mean that both RLE_DICTIONARY and PLAIN encodings are required for testing, or is the current version OK?

@arthurpassos (Contributor, Author)

> Thanks, but let's just wait until this discussion is resolved: apache/arrow#35825 (comment)
>
> What's the result of the discussion? Does it mean that both RLE_DICTIONARY and PLAIN encodings are required for testing, or is the current version OK?

This version is OK and ready to be merged, I believe.

@mapleFU (Member) commented Jun 21, 2023

I've verified:

{
  "Version": "2.6",
  "CreatedBy": "parquet-cpp-arrow version 11.0.0",
  "TotalRows": "2",
  "NumberOfRowGroups": "1",
  "NumberOfRealColumns": "1",
  "NumberOfColumns": "2",
  "Columns": [
     { "Id": "0", "Name": "arr.key_value.key", "PhysicalType": "BYTE_ARRAY", "ConvertedType": "UTF8", "LogicalType": {"Type": "String"} },
     { "Id": "1", "Name": "arr.key_value.value", "PhysicalType": "INT32", "ConvertedType": "NONE", "LogicalType": {"Type": "None"} }
  ],
  "RowGroups": [
     {
       "Id": "0",  "TotalBytes": "2147483827",  "TotalCompressedBytes": "3427",  "Rows": "2",
       "ColumnChunks": [
          {"Id": "0", "Values": "2", "StatsSet": "True", "Stats": {"NumNulls": "0" },
           "Compression": "BROTLI", "Encodings": "PLAIN(DICT_PAGE) PLAIN RLE_DICTIONARY", "UncompressedSize": "2147483749", "CompressedSize": "3346" },
          {"Id": "1", "Values": "2", "StatsSet": "True", "Stats": {"NumNulls": "0", "Max": "1", "Min": "1" },
           "Compression": "BROTLI", "Encodings": "PLAIN(DICT_PAGE) RLE_DICTIONARY", "UncompressedSize": "78", "CompressedSize": "81" }
        ]
     }
  ]
}
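
For readers without the C++ tooling, an equivalent inspection can be done from Python; a small sketch, assuming the renamed file name:

import pyarrow.parquet as pq

# Print the same row-group/column-chunk metadata as the dump above.
md = pq.ParquetFile("large_string_map.brotli.parquet").metadata
print(md)                         # file-level metadata: schema, row counts, row groups
print(md.row_group(0).column(0))  # per-column-chunk encodings, compression, sizes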

@arthurpassos (Contributor, Author)

@mapleFU Can this be merged then?

@pitrou merged commit d79a010 into apache:master on Jun 21, 2023
@mapleFU (Member) commented Jun 21, 2023

I have no permission to merge it, but it seems it was merged :)
