
Parquet files that don't end in ".parquet" are ignored #69

@loisaidasam

Description

Imagine I have a Hive table stored in Parquet format (on GCS in my example), and all of the partitions live in files that look like this:

gs://[my bucket here]/data/part-00000-e71c8691-bf80-4535-af13-25c494ecf119-c000
gs://[my bucket here]/data/part-00001-e71c8691-bf80-4535-af13-25c494ecf119-c000
gs://[my bucket here]/data/part-00002-e71c8691-bf80-4535-af13-25c494ecf119-c000

Because of the way get_file_list filters for acceptable embeddings paths on this line:

glob_pattern = path.rstrip("/") + f"/**/*.{file_format}"

It will only look for files that end in .parquet, totally missing all of my partitions.
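Here's a quick self-contained sketch of the behaviour using fsspec's in-memory filesystem (just an illustration; the exact ** glob semantics depend on the fsspec version):

import fsspec

# Create one extensionless part file, named like my partitions above
fs = fsspec.filesystem("memory")
with fs.open("/data/part-00000-e71c8691-bf80-4535-af13-25c494ecf119-c000", "wb") as f:
    f.write(b"...")

print(fs.glob("/data/**/*.parquet"))  # -> no matches, nothing ends in .parquet
print(fs.glob("/data/**/*"))          # -> includes the part file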

For a test, I modified that line to say:

glob_pattern = path.rstrip("/") + f"/**/*"

and it worked great! But I imagine that's not a viable solution for everyone.

Maybe some extra parameters could be passed so the filtering can happen in a more generic way, something like:

import fsspec

# ... inside get_file_list ...

    glob_pattern = path.rstrip("/") + "/**/*"  # Match all files, regardless of extension
    file_paths = fs.glob(glob_pattern)
    file_paths = [f for f in file_paths if is_valid(f, fs, file_format, check_file_format_ending)]

def is_valid(file_path: str, fs: fsspec.AbstractFileSystem, file_format: str, check_file_format_ending: bool = True) -> bool:
    """Check if a globbed path should be kept as a data file."""
    if not fs.isfile(file_path):
        # The wildcard glob also returns directories; drop them
        return False
    if check_file_format_ending and not file_path.endswith(f".{file_format}"):
        return False
    # TODO: Add other checks here
    return True
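
A caller could then keep extensionless partitions by turning the ending check off. A rough sketch of usage (check_file_format_ending is just my suggested name, and how it gets threaded through get_file_list is up for discussion):

fs, root = fsspec.core.url_to_fs("gs://[my bucket here]/data")
all_paths = fs.glob(root.rstrip("/") + "/**/*")
data_files = [p for p in all_paths if is_valid(p, fs, "parquet", check_file_format_ending=False)]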
