How to handle snappy files generated by Trino? #140

@jfNasciment0

Hello,

Since the 0.7.1 release I can't decompress CSV files generated by Trino. I think the issue is related to Hadoop_snappy. Does anyone know how it can be fixed?

import io

from snappy import snappy_formats

csv_file = 'csv_67dba65a.snappy'

def read_file(file_path):
    return open(file_path, 'rb')

decompress_func, read_chunk = snappy_formats.get_decompress_function(
    'auto',
    read_file(csv_file)
)
decompressed_stream = io.BytesIO()
# Decompress the data
decompress_func(
    read_file(csv_file),
    decompressed_stream,
    start_chunk=read_chunk
)
decompressed_stream.seek(0)

print(f"Compressed file: {read_file(csv_file).read()}")
print(f"DeCompressed file: {decompressed_stream.read()}")

This code has different outputs based on the version:

  • 0.7.0
    Compressed file: b'\x00\x00\x00\x04\x00\x00\x00\x06\x04\x0c"a"\n'
    DeCompressed file: b'"a"\n"a"\n'

  • 0.7.1

  .venv/lib/python3.12/site-packages/snappy/snappy_formats.py", line 64, in get_decompress_function
      decompress_func, read_chunk = guess_format_by_header(fin)

  .venv/lib/python3.12/site-packages/snappy/snappy_formats.py", line 59, in guess_format_by_header
      raise UncompressError("Can't detect archive format")
  snappy.snappy.UncompressError: Can't detect archive format
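
For reference, here is a minimal sketch of how the compressed bytes shown under 0.7.0 appear to be laid out. It assumes Hadoop-style Snappy framing: a 4-byte big-endian uncompressed block length, followed by one or more chunks, each prefixed by a 4-byte big-endian compressed length. The helper names `read_varint` and `split_hadoop_snappy` are hypothetical and not part of python-snappy. The sketch only splits the framing and does not decompress anything.

```python
import struct


def read_varint(buf, pos=0):
    """Decode the little-endian base-128 varint that opens a raw Snappy chunk.

    In the raw Snappy format this value is the chunk's uncompressed length.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def split_hadoop_snappy(data):
    """Yield the raw Snappy chunks found in Hadoop-framed bytes (hypothetical helper)."""
    pos = 0
    while pos < len(data):
        # Block header: uncompressed size of the whole block, big-endian.
        (block_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        remaining = block_len
        while remaining > 0:
            # Chunk header: compressed size of the next chunk, big-endian.
            (chunk_len,) = struct.unpack_from(">I", data, pos)
            pos += 4
            chunk = data[pos:pos + chunk_len]
            if len(chunk) != chunk_len:
                raise ValueError("truncated chunk")
            pos += chunk_len
            # Track how much of the block this chunk accounts for.
            remaining -= read_varint(chunk)
            yield chunk


# The compressed bytes from the 0.7.0 output above.
sample = b'\x00\x00\x00\x04\x00\x00\x00\x06\x04\x0c"a"\n'
chunks = list(split_hadoop_snappy(sample))
print(chunks)
```

Each yielded chunk should be a raw (unframed) Snappy block, so a raw decompressor such as `snappy.uncompress` would be the natural next step. That part is an assumption and is untested against 0.7.1.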
