@fsaintjacques
Contributor

This PR aims to improve the latency of the discovery process. Notably, it selects "fast" defaults over "safe" defaults.

  • Add InspectOptions, which limits the number of fragments inspected to infer the schema; it defaults to one fragment.
  • Add FinishOptions, which toggles validation of the optional schema and controls the number of fragments it is validated against. Validation is disabled by default.

This gives a noticeable speedup when the fragments have a uniform schema.
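
To make the trade-off concrete, here is a small Python model of the inspection step. It is a sketch, not the Arrow API: `InspectOptions` and its `fragments` field mirror the names used in this PR, while `Fragment`, `inspect`, and `INSPECT_ALL` are hypothetical stand-ins. Schemas are plain dicts, and the unification rule (a field name must always map to the same type) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

Schema = Dict[str, str]  # field name -> type name; stand-in for a real schema

INSPECT_ALL = -1  # hypothetical sentinel meaning "inspect every fragment"


@dataclass
class InspectOptions:
    # Number of fragments read to infer the schema; the PR defaults this to one.
    fragments: int = 1


class Fragment:
    """Toy fragment that counts how often its schema is read."""

    def __init__(self, schema: Schema) -> None:
        self._schema = schema
        self.reads = 0

    def read_schema(self) -> Schema:
        self.reads += 1  # in a real dataset this is the expensive file read
        return dict(self._schema)


def inspect(fragments: List[Fragment], options: InspectOptions) -> Schema:
    """Unify the schemas of at most `options.fragments` fragments."""
    if options.fragments == INSPECT_ALL:
        selected = fragments
    else:
        selected = fragments[: options.fragments]
    unified: Schema = {}
    for fragment in selected:
        for name, type_name in fragment.read_schema().items():
            # Keep the first type seen for a field; reject any later conflict.
            if unified.setdefault(name, type_name) != type_name:
                raise ValueError(f"conflicting types for field {name!r}")
    return unified


fragments = [Fragment({"x": "int64", "y": "string"}) for _ in range(100)]
schema = inspect(fragments, InspectOptions())  # default: one fragment
files_read = sum(f.reads for f in fragments)
```

In this model the default reads one file out of 100 when the schema is uniform. The cost of the "fast" default is that a divergent schema in an uninspected fragment goes unnoticed at discovery time.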

@fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from 15be905 to 3e70fb7 on March 23, 2020 15:15
@fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from 3e70fb7 to 9a8dcea on March 23, 2020 16:01
Member

@bkietz bkietz left a comment

General approach looks good, a few comments:

Member

@jorisvandenbossche jorisvandenbossche left a comment

I just looked at the doc comments; the interaction between the different options is not yet fully clear to me.

@fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from ca2a6cb to 4a621de on March 25, 2020 13:09
@fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from 4a621de to 73310eb on March 25, 2020 14:07
@fsaintjacques
Contributor Author

I don't think the c/glib failure is related.

Member

@bkietz bkietz left a comment

LGTM, merging

(the CI failure is https://issues.apache.org/jira/browse/ARROW-8215 )

@bkietz bkietz closed this in 17b9980 Mar 25, 2020
  assert options.partition_base_dir == 'subdir'
  assert options.ignore_prefixes == ['.', '_']
- assert options.exclude_invalid_files is True
+ assert options.exclude_invalid_files is False
Member

the default changed here?

Contributor Author

@fsaintjacques fsaintjacques Mar 26, 2020

Yes, validation requires reading files.

@jorisvandenbossche
Member

This still needs more detailed Python (and R) bindings, right?

@nealrichardson
Member

That was my understanding from when we discussed Monday as well.

@jorisvandenbossche
Member

Opened an issue for that https://issues.apache.org/jira/browse/ARROW-8221 (at least for the Python side, R probably needs a similar one?)

@jorisvandenbossche
Member

Still some questions about this:

  • What happens if you put fragments=0? Then the dataset has no known schema? (or is that only possible when specifying a schema manually?)
  • The number of fragments (the fragments option) determines how many fragments are used both when inferring the schema and when validating it, right? (But this number is only used for validation when validate_fragments is set to True.)

/// Indicate if the given Schema (when specified), should be validated against
/// the fragments' schemas. `inspect_options` will control how many fragments
/// are checked.
bool validate_fragments = false;
Member

Would validate_schema be a better name for this?

@fsaintjacques fsaintjacques deleted the ARROW-8058-optional-discovery-validation branch March 26, 2020 12:07
@fsaintjacques
Contributor Author

> Still some questions about this:
>
>   • What happens if you put fragments=0? Then the dataset has no known schema? (or is that only possible when specifying a schema manually?)

You get no schema, or only the one in the partitioning if there's one.

>   • The number of fragments (the fragments option) determines how many fragments are used both when inferring the schema and when validating it, right? (But this number is only used for validation when validate_fragments is set to True.)

For FinishOptions.inspect_options, yes. When the schema is not specified, there's no need to validate, since Inspect already does some validation while trying to unify the fragments' schemas.
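
The answers above can be summarized in a short Python model. This is only a sketch under stated assumptions, not the Arrow implementation: the option names (`schema`, `inspect_options`, `fragments`, `validate_fragments`) follow this thread, but `finish`, `unify`, the dict-based schemas, and the treatment of negative counts as "all fragments" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Schema = Dict[str, str]  # field name -> type name; stand-in for a real schema


@dataclass
class InspectOptions:
    fragments: int = 1  # how many fragments are inspected


@dataclass
class FinishOptions:
    schema: Optional[Schema] = None  # user-specified schema, if any
    inspect_options: InspectOptions = field(default_factory=InspectOptions)
    validate_fragments: bool = False  # off by default, as in the PR


def unify(schemas: List[Schema]) -> Schema:
    """Merge schemas, raising if one field name maps to two types."""
    unified: Schema = {}
    for schema in schemas:
        for name, type_name in schema.items():
            if unified.setdefault(name, type_name) != type_name:
                raise ValueError(f"conflicting types for field {name!r}")
    return unified


def finish(fragment_schemas: List[Schema], partition_schema: Schema,
           options: FinishOptions) -> Schema:
    n = options.inspect_options.fragments
    # Assumption: a negative count means "inspect everything".
    inspected = fragment_schemas if n < 0 else fragment_schemas[:n]
    if options.schema is None:
        # No schema given: infer one. Unification already rejects
        # conflicts, so no separate validation pass is needed.
        return unify([partition_schema] + inspected)
    if options.validate_fragments:
        # Schema given and validation requested: check the inspected
        # fragments against it; unify raises on a mismatch.
        unify([options.schema] + inspected)
    return options.schema


frags = [{"x": "int64"}, {"x": "int64"}]
partition = {"year": "int32"}

# fragments=0: nothing is read, so only the partitioning fields remain.
no_inspection = finish(
    frags, partition, FinishOptions(inspect_options=InspectOptions(fragments=0))
)
# Defaults: one fragment is inspected and merged with the partitioning.
inferred = finish(frags, partition, FinishOptions())
```

In this model, passing a schema that disagrees with the fragments is accepted silently unless `validate_fragments` is set, which reflects the "fast over safe" default chosen in this PR.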
