ARROW-8058: [Dataset] Relax DatasetFactory discovery validation #6687

fsaintjacques · 2020-03-23T15:10:24Z

This PR aims to improve the latency of the discovery process. Notably, it selects "fast" defaults over "safe" defaults.

Add InspectOptions, which limits the number of fragments inspected to infer the schema. It defaults to one fragment.
Add FinishOptions, which toggles validation of the optional schema and controls the number of fragments used for that validation. Validation is disabled by default.

This gives a noticeable speedup when the fragments have a uniform schema.
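
To illustrate why limiting inspection helps, here is a minimal Python sketch of the idea, not the Arrow implementation. `CountingFragment`, `unify`, and `inspect` are hypothetical names, schemas are modeled as plain dicts, and treating a negative limit as "inspect everything" is an assumption of this sketch.

```python
from typing import Dict, Iterable, List

Schema = Dict[str, str]  # field name -> type name (toy stand-in for a real schema)


class CountingFragment:
    """Toy fragment that counts schema reads (a stand-in for file I/O)."""

    def __init__(self, schema: Schema) -> None:
        self._schema = schema
        self.reads = 0

    def read_schema(self) -> Schema:
        self.reads += 1
        return dict(self._schema)


def unify(schemas: Iterable[Schema]) -> Schema:
    """Merge schemas field by field; raise if a field has two different types."""
    unified: Schema = {}
    for schema in schemas:
        for name, type_name in schema.items():
            if unified.setdefault(name, type_name) != type_name:
                raise TypeError(f"field {name!r}: {unified[name]} vs {type_name}")
    return unified


def inspect(fragments: List[CountingFragment], limit: int = 1) -> Schema:
    """Infer a schema from at most `limit` fragments (negative: all, by assumption)."""
    selected = fragments if limit < 0 else fragments[:limit]
    return unify(fragment.read_schema() for fragment in selected)
```

With five uniform fragments, `inspect(frags)` reads one fragment while `inspect(frags, limit=-1)` reads all five. The inferred schema is the same in both cases, which is the situation where the speedup described above applies.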

github-actions · 2020-03-23T15:16:30Z

https://issues.apache.org/jira/browse/ARROW-8058

bkietz

General approach looks good, a few comments:

cpp/src/arrow/dataset/discovery.h

cpp/src/arrow/dataset/discovery.cc

cpp/src/arrow/dataset/discovery.h

cpp/src/arrow/dataset/discovery.cc

cpp/src/arrow/dataset/discovery_test.cc

cpp/src/arrow/dataset/discovery.cc

jorisvandenbossche

I just looked at the doc comments; the interaction between the different options is not yet fully clear to me.

cpp/src/arrow/dataset/discovery.h

fsaintjacques · 2020-03-25T15:04:58Z

I don't think the c/glib failure is related.

bkietz

LGTM, merging

(the CI failure is https://issues.apache.org/jira/browse/ARROW-8215 )

jorisvandenbossche · 2020-03-25T19:48:19Z

python/pyarrow/tests/test_dataset.py

    assert options.partition_base_dir == 'subdir'
    assert options.ignore_prefixes == ['.', '_']
-    assert options.exclude_invalid_files is True
+    assert options.exclude_invalid_files is False


the default changed here?

fsaintjacques · 2020-03-26

Yes, validation requires reading files.

jorisvandenbossche · 2020-03-25T19:50:05Z

This still needs more detailed Python (and R) bindings, right?

nealrichardson · 2020-03-25T19:53:39Z

That was my understanding from when we discussed Monday as well.

jorisvandenbossche · 2020-03-25T19:57:25Z

Opened an issue for that https://issues.apache.org/jira/browse/ARROW-8221 (at least for the Python side, R probably needs a similar one?)

jorisvandenbossche · 2020-03-25T20:02:46Z

Still some questions about this:

What happens if you put fragments=0? Then the dataset has no known schema? (or is that only possible when specifying a schema manually?)
The number of fragments (fragments option) determines how many fragments are used both when inferring the schema and when validating it, right? (But this number is only used for validation when validate_fragments is set to True.)

jorisvandenbossche · 2020-03-25T19:58:02Z

cpp/src/arrow/dataset/discovery.h

+  /// Indicate if the given Schema (when specified), should be validated against
+  /// the fragments' schemas. `inspect_options` will control how many fragments
+  /// are checked.
+  bool validate_fragments = false;


Would validate_schema be a better name for this?

fsaintjacques · 2020-03-26T12:10:42Z

Still some questions about this:

What happens if you put fragments=0? Then the dataset has no known schema? (or is that only possible when specifying a schema manually?)

You get no schema, or only the one in the partitioning if there's one.

The number of fragments (fragments option) determines how many fragments are used both when inferring the schema and when validating it, right? (But this number is only used for validation when validate_fragments is set to True.)

For FinishOptions.inspect_options, yes. When the schema is not specified, there is no need to validate, since Inspect does some validation while trying to unify the fragment schemas.
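
The answers above can be summarized in a small Python model. This is a sketch under stated assumptions rather than the actual `DatasetFactory::Finish` logic: `finish`, `_merge`, and all parameters are hypothetical, schemas are plain dicts, a negative `fragments` value is assumed to mean "all fragments", and validation is reduced to checking that shared field names have the same type.

```python
from typing import Dict, List, Optional

Schema = Dict[str, str]  # field name -> type name (toy stand-in for a real schema)


def _merge(target: Schema, other: Schema) -> None:
    """Add the fields of `other` to `target`; raise on a conflicting type."""
    for name, type_name in other.items():
        if target.setdefault(name, type_name) != type_name:
            raise TypeError(f"field {name!r}: {target[name]} vs {type_name}")


def finish(
    fragment_schemas: List[Schema],
    schema: Optional[Schema] = None,
    fragments: int = 1,
    validate_fragments: bool = False,
    partition_schema: Optional[Schema] = None,
) -> Schema:
    """Toy model of choosing the dataset schema (hypothetical, not Arrow's API)."""
    if fragments < 0:
        inspected = fragment_schemas
    else:
        inspected = fragment_schemas[:fragments]

    if schema is None:
        # No schema given: infer one. Unifying already rejects conflicts,
        # so no separate validation pass is needed.
        inferred: Schema = dict(partition_schema or {})
        for fragment_schema in inspected:
            _merge(inferred, fragment_schema)
        return inferred

    if validate_fragments:
        # Schema given and validation requested: check the inspected fragments
        # against a throwaway copy so the caller's schema is left unchanged.
        for fragment_schema in inspected:
            _merge(dict(schema), fragment_schema)
    return dict(schema)
```

In this model, `fragments=0` without an explicit schema yields only the partitioning fields (or an empty schema), and a mismatching fragment goes unnoticed when a schema is supplied unless `validate_fragments=True`.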

fsaintjacques requested review from bkietz and jorisvandenbossche March 23, 2020 15:10

fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from 15be905 to 3e70fb7 Compare March 23, 2020 15:15

fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from 3e70fb7 to 9a8dcea Compare March 23, 2020 16:01

bkietz requested changes Mar 23, 2020

jorisvandenbossche reviewed Mar 23, 2020

fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from ca2a6cb to 4a621de Compare March 25, 2020 13:09

fsaintjacques added 4 commits March 25, 2020 10:07

Implement InspectOptions

13ba9bc

Add FinishOptions to DatasetFactory::Finish

af1fb72

Review

ab5bf32

Add python interface

73310eb

fsaintjacques force-pushed the ARROW-8058-optional-discovery-validation branch from 4a621de to 73310eb Compare March 25, 2020 14:07

bkietz approved these changes Mar 25, 2020

bkietz closed this in 17b9980 Mar 25, 2020

jorisvandenbossche reviewed Mar 25, 2020

fsaintjacques deleted the ARROW-8058-optional-discovery-validation branch March 26, 2020 12:07

This was referenced Jun 8, 2020

[C++][Python][Dataset] Provide an option to toggle validation and schema inference in FileSystemDatasetFactoryOptions #24271

Closed

[Python][Dataset] Expose schema inference / validation options in the factory #24418

Closed

AlenkaF mentioned this pull request Apr 29, 2025

[Python] pyarrow.dataset.dataset exclude_invalid_files parameter does not adhere to documented default value #46181

Open
