Imagine I have a Hive table stored in Parquet format (on GCS in my example), and all of its partitions live in files that look like this:
gs://[my bucket here]/data/part-00000-e71c8691-bf80-4535-af13-25c494ecf119-c000
gs://[my bucket here]/data/part-00001-e71c8691-bf80-4535-af13-25c494ecf119-c000
gs://[my bucket here]/data/part-00002-e71c8691-bf80-4535-af13-25c494ecf119-c000
get_file_list filters for acceptable embedding paths with this line:
glob_pattern = path.rstrip("/") + f"/**/*.{file_format}"
As a result, it only looks for files that end in .parquet and completely misses all of my partitions, which have no file extension.
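To make the mismatch concrete, here is a minimal sketch that applies the same "*.{file_format}" name test to two of the paths above. It only uses the standard library and does not call get_file_list or fsspec; the matches_extension helper and the example.parquet entry are hypothetical and exist only for contrast.

```python
import fnmatch
import posixpath

file_format = "parquet"

paths = [
    "gs://[my bucket here]/data/part-00000-e71c8691-bf80-4535-af13-25c494ecf119-c000",
    "gs://[my bucket here]/data/part-00001-e71c8691-bf80-4535-af13-25c494ecf119-c000",
    "gs://[my bucket here]/data/example.parquet",  # hypothetical file, added for contrast
]


def matches_extension(path: str, file_format: str) -> bool:
    """Return True if the last path component matches *.{file_format}."""
    name = posixpath.basename(path)
    return fnmatch.fnmatchcase(name, f"*.{file_format}")


# Only the file with an explicit .parquet ending survives the filter.
matched = [p for p in paths if matches_extension(p, file_format)]
print(matched)
```

The two part-* files are dropped because their names carry no extension, which is the behavior described above.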
As a test, I changed that line to:
glob_pattern = path.rstrip("/") + f"/**/*"
and it worked great! But I imagine that's not a viable solution for everyone.
Maybe some extra parameters could be passed and the filtering could happen in a more generic way, something like:
# ... inside get_file_list:
glob_pattern = path.rstrip("/") + "/**/*"  # Match all files
file_paths = fs.glob(glob_pattern)
file_paths = [f for f in file_paths if is_valid(f, fs, file_format, check_file_format_ending)]

def is_valid(file_path: str, fs: fsspec.AbstractFileSystem, file_format: str, check_file_format_ending: bool = True) -> bool:
    """Check if a file is valid."""
    # A recursive glob also returns directories, so keep regular files only.
    if not fs.isfile(file_path):
        return False
    if check_file_format_ending:
        if not file_path.endswith(f".{file_format}"):
            return False
    # TODO: Add other checks here
    return True
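For anyone who wants to try the idea without a bucket, below is a rough, self-contained sketch of the same "match everything, then filter" approach. It is an assumption-heavy stand-in: it uses pathlib on a temporary local directory instead of fsspec, list_files is a hypothetical substitute for get_file_list, and the file names are made up for the demo.

```python
import tempfile
from pathlib import Path


def is_valid(file_path: Path, file_format: str, check_file_format_ending: bool = True) -> bool:
    """Return True for regular files that pass the optional suffix check."""
    if not file_path.is_file():
        return False
    if check_file_format_ending and not file_path.name.endswith(f".{file_format}"):
        return False
    return True


def list_files(root: Path, file_format: str, check_file_format_ending: bool = True) -> list[str]:
    """Hypothetical stand-in for get_file_list: glob everything, then filter."""
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if is_valid(p, file_format, check_file_format_ending)
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    # Extensionless files, similar in shape to the part-* files in this issue.
    (root / "data" / "part-00000-c000").write_bytes(b"")
    (root / "data" / "part-00001-c000").write_bytes(b"")
    (root / "data" / "extra.parquet").write_bytes(b"")

    strict = list_files(root, "parquet", check_file_format_ending=True)
    relaxed = list_files(root, "parquet", check_file_format_ending=False)

print(strict)
print(relaxed)
```

With the check enabled only extra.parquet is returned; with it disabled the part-* files are included too, and the data directory itself is skipped in both cases. Keeping the flag defaulted to True would preserve the current behavior for existing users.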