-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Sort filenames when reading parquet to ensure consistent schema #6629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a5e94a7 to
51ca29f
Compare
|
Thanks @thomas-k-cameron -- given my reading of #5779 it seems like perhaps a more direct solution might be to sort the list of files so they are always processed in the same order Perhaps somehow figure out how to sort the let schemas: Vec<_> = futures::stream::iter(objects)
.map(|object| fetch_schema(store.as_ref(), object, self.metadata_size_hint))
.boxed() // Workaround https://github.com/rust-lang/rust/issues/64552
.buffered(SCHEMA_INFERENCE_CONCURRENCY)
.try_collect()
.await?; |
|
Thank you for your advice. I couldn't figure out what I was getting it wrong. |
51ca29f to
5e0f653
Compare
1cf246b to
2d0f9bb
Compare
|
Let me give you an update. I tried to sort by location, but it hasn't been working. It seems like this is coming from how parquet files for tests are created with random names. I have few other ideas. I will see if it works out. |
938773b to
ba6ab9e
Compare
da1c922 to
ac00cfa
Compare
cdba47b to
03690ab
Compare
|
@alamb I updated the PR comment to reflect my changes. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @thomas-k-cameron -- I think this looks great. The only potential downside of this change I can think of is that this will change behavior for any object store that lists files sorted deterministicially in a different order than sort()
However, I think most object stores sort their locations lexographically so this PR is likely to simply make local filesystems consistent.
Thanks again
|
Thank you very much for your feedback and approval. Change that I just pushed simply applies a |
|
Thanks @thomas-k-cameron -- once the check pass I'll plan to merge this PR. |
|
🚀 |
…he#6629) * update * FIXed * add parquet * update * update * update * try2 * update * update * Add comments * cargo fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he#6629) * update * FIXed * add parquet * update * update * update * try2 * update * update * Add comments * cargo fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…he#6629) * update * FIXed * add parquet * update * update * update * try2 * update * update * Add comments * cargo fmt --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #5779.
Rationale for this change
Sorting is passed to ensure that files returned is in the same order when the OS is different.
What changes are included in this PR?
This PR adds a lines of code to
FileFormat::infer_schematoParquetFormat; It sorts parquet file by it's location.On
test_util::store_parquetis changed to sort temp file by it's path and and write batches in the sorted order.This change was introduced as it made the some tests, such as
datasource::file_format::parquet::tests::read_merged_batchesto fail.It appears that these tests assumed that
infer_schemaprocessed the files in the same order thatstore_parquet's returning value files.Are these changes tested?
Yes. This PR introduces new test case
async fn issue_5779().Are there any user-facing changes?
The order of
Fieldsmay change.