Skip to content

Conversation

@jorisvandenbossche
Copy link
Member

The C++ FileSystemDatasetFactory::Finish method handles the schema inference or validation with two options: InspectOptions::fragments to indicate the number of fragments to use when inferring or validating the schema (default of 1), and the FinishOptions::validate_fragments to indicate whether to validate the specified schema (when not inferred).

For now, I decided to combine this in a single keyword on the Python side (validate_schema). This avoids adding 2 inter-dependent keywords for this, and makes it easier to express some typical use cases (eg validate the specified schema with all fragments is now validate_schema=True instead of validate_schema=True, fragments=-1). On the other hand, it gives a single keyword that accepts both boolean or int (which is not super clean). So this is certainly up for discussion.

@github-actions
Copy link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do datasets guarantee that the first file in alphabetical order is used to infer the schema?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the file paths get sorted:

std::sort(files.begin(), files.end(), fs::FileInfo::ByPath());

(now, whether this should maybe rather be a "natural" sort is another issue ..)

@lidavidm
Copy link
Member

Is this still targeted for 4.0? It needs a rebase if so, but otherwise looks good.

…ctory options through the validate_schema keyword
@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-8221-dataset-factory-validate-schema branch from 0422334 to 1839de0 Compare April 15, 2021 14:43
@jorisvandenbossche
Copy link
Member Author

Rebased now.

@lidavidm I am still a bit in doubt about the exact API.
For example, R has a unify_schemas keyword in their open_dataset function (https://arrow.apache.org/docs/r/reference/open_dataset.html). I find that name a bit more descriptive for the use case of "infer schema by checking all fragments", since then in practice it unifies the schema of all fragments (while with the current PR, you get that by using validate_schema=True).
On the other hand, R doesn't (currently) give a way to validate a specified schema, but not sure how useful that is in practice (since you would get the same error when starting to scan as well, if the specified schema doesn't match).

@lidavidm
Copy link
Member

Naming it unify_schemas or infer_schemas or similar makes sense to me. I see only three main cases:

  • Schema is both given and inferred (and validation should always take place, else why do both?)
  • Schema is given (nothing to validate against, at least until scan time)
  • Schema is inferred (nothing to validate against!)
    So to me it makes more sense to have it be about inference than validation.

@pitrou
Copy link
Member

pitrou commented Jun 23, 2021

@jorisvandenbossche What is the status on this?

@lidavidm
Copy link
Member

@jorisvandenbossche just a gentle nudge :)

@jorisvandenbossche
Copy link
Member Author

Yeah, sorry for the slow follow-up here. It was on my to do list to have a look at today.

Naming it unify_schemas or infer_schemas or similar makes sense to me. I see only three main cases:
...

  • Schema is inferred (nothing to validate against!)

But for this last case, you might still have the options of inferring from the first fragment, or reading the schema of all fragments and unifying them (or erroring when they can't be unified).

So if we have eg a infer_schemas, I suppose we want to keep the different types of arguments what I now did for validate_schema (True meaning infer schema of all fragments, and an integer meaning the number of fragments to check).

@lidavidm
Copy link
Member

That sounds reasonable to me.

@pitrou
Copy link
Member

pitrou commented Nov 22, 2021

@jorisvandenbossche Are you planning to push this forward?

@kszucs
Copy link
Member

kszucs commented Apr 21, 2022

@jorisvandenbossche shall we close this as stale?

@pitrou
Copy link
Member

pitrou commented May 4, 2022

Ping @jorisvandenbossche : can you make a decision on this?

@amol-
Copy link
Member

amol- commented Mar 30, 2023

Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍

@amol- amol- closed this Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants