
Conversation

@adam2392 (Member)

Reference issue

First part of #9736

What does this implement/fix?

Moves the contents of the dataset_checksums file into a dictionary of dictionaries, where each inner dictionary uniquely defines one dataset.
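
As a rough illustration of the layout (all field values below are placeholders; the field names come from the parameter list in the next section, and the real values live in the config module added by this PR):

```python
# Placeholder sketch of one entry in the dict-of-dicts; every value here
# is made up for illustration.
MNE_DATASETS = dict(
    fnirs_motor=dict(
        archive_name='MNE-fNIRS-motor-data.tgz',      # placeholder
        folder_name='MNE-fNIRS-motor-data',           # placeholder
        url='https://example.com/data.tgz',           # placeholder
        hash='md5:0123456789abcdef0123456789abcdef',  # placeholder
        config_key='MNE_DATASETS_FNIRS_MOTOR_PATH',   # placeholder key
    ),
)
```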

Additional information

The next phase would be to:

  • create a public-facing function, mne.datasets.fetch_dataset(), which takes a dictionary of parameters that must define archive_name, folder_name, url, and hash, and can optionally include config_key and token (see the sketch after this list)
  • refactor _data_path to accept these parameters instead
  • refactor the functions that call _data_path internally to leverage MNE_DATASETS when passing the necessary parameters to _data_path.
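
A rough sketch of how the eventual call might look; fetch_dataset() does not exist yet at this point, so the signature is purely illustrative:

```python
from mne.datasets import fetch_dataset  # proposed in this PR; not yet real

# The parameter dict must define archive_name, folder_name, url, and hash,
# and may optionally carry config_key and token.
path = fetch_dataset(MNE_DATASETS['fnirs_motor'])
```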

@adam2392 changed the title from "[WIP] Move files into config to" to "[WIP] Move files into config to a python file" Sep 20, 2021
@adam2392 changed the title from "[WIP] Move files into config to a python file" to "[MRG] Move files into config to a python file" Sep 20, 2021
@drammock (Member) left a comment

Looking good. Mostly I'm suggesting clarifications to the code comments, but one important improvement is to avoid writing the temporary file altogether.

@adam2392 changed the title from "[MRG] Move files into config to a python file" to "[MRG] Move files in config to a python file" Sep 20, 2021
adam2392 and others added 3 commits September 20, 2021 13:45
@adam2392 (Member Author)

OK, I think this is good to go now, @drammock.

FYI: #9763 is the downstream PR that starts generalizing the dataset fetcher. Trying things out, I realized we also need to pull the processor for each dataset out, e.g. the Unzip/Untar/etc.

@drammock (Member) commented Sep 20, 2021

> the downstream PR that starts generalizing the dataset fetcher. Trying things out, I realized we also need to pull the processor for each dataset out, e.g. the Unzip/Untar/etc.

Most of the datasets work with the default values of the pooch processors, so for the generic function I suggest:

processor : None | "zip" | "tar" | instance of pooch.Unzip | instance of pooch.Untar

where None is for files that aren't compressed at all, and the string arguments will use our internally constructed pooch.Unzip or pooch.Untar, similar to how we do it in datasets/utils.py around line 280.
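
That dispatch is easy to centralize; a minimal sketch, assuming the destination directory is already resolved (the helper name _resolve_processor is made up here; pooch.Unzip and pooch.Untar both accept an extract_dir keyword):

```python
import pooch

def _resolve_processor(processor, extract_dir):
    """Map the string shorthand onto a pooch processor (sketch only)."""
    if processor == "zip":
        return pooch.Unzip(extract_dir=extract_dir)
    elif processor == "tar":
        return pooch.Untar(extract_dir=extract_dir)
    # None (an uncompressed file) or an already-built pooch processor
    # passes through unchanged.
    return processor
```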

For this PR, it would probably be good to pull out the processors into the individual dicts that define each dataset. I can push a commit to do that if you're not sure how best to do it.

@adam2392 (Member Author)

> Most of the datasets work with the default values of the pooch processors, so for the generic function I suggest:
>
> processor : None | "zip" | "tar" | instance of pooch.Unzip | instance of pooch.Untar
>
> where None is for files that aren't compressed at all, and the string arguments will use our internally constructed pooch.Unzip or pooch.Untar, similar to how we do it in datasets/utils.py around line 280.

Sounds good; that works for me.

> For this PR, it would probably be good to pull out the processors into the individual dicts that define each dataset. I can push a commit to do that if you're not sure how best to do it.

How would we actually do this, since extract_dir=path relies on the path variable defined internally?

@drammock (Member)

> How would we actually do this, since extract_dir=path relies on the path variable defined internally?

Oops, you're right. I was thinking that when the path variable is passed as a parameter to _data_path() and is already a string (not None), it is left untouched (see the logic in the _get_path() helper function). But we're not doing that: I forgot we're building the path from the config_key for the internal datasets. So never mind that suggestion.

@adam2392 (Member Author)

I think we then need to move the _get_path call into fetch_dataset, before calling _data_path. That way, the path is explicitly defined before _data_path is called, and the processor can be defined outside _data_path with the path it needs, roughly as sketched below.
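
A sketch of that ordering, reusing the _resolve_processor helper sketched earlier; the signatures of _get_path and _data_path here are assumptions for illustration, not the actual internal API:

```python
def fetch_dataset(dataset_params, processor="zip", path=None):
    # Resolve the destination first (via the dataset's config_key), so the
    # processor can be built with a concrete extract_dir before any download.
    path = _get_path(path, dataset_params["config_key"],
                     dataset_params["folder_name"])  # assumed signature
    processor = _resolve_processor(processor, extract_dir=path)
    return _data_path(dataset_params, path=path,
                      processor=processor)  # assumed signature
```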

Probably best to save that for the next PR, since it depends on creating the generic public API function?

If so, this is good to go by me.

@adam2392 (Member Author)

Looks like the Azure pipeline failures are timeouts. I don't think they're due to this PR, at least?

@drammock (Member)

> Looks like the Azure pipeline failures are timeouts. I don't think they're due to this PR, at least?

That's been happening for several days; I haven't had a chance to look into it yet.

@drammock merged commit 9627d09 into mne-tools:main Sep 20, 2021
@adam2392 deleted the dataset branch September 21, 2021 00:05
# of the downloaded dataset (ex: "MNE_DATASETS_EEGBCI_PATH").
# - (optional) token : An authorization token, if one is needed to access the
# download URL.
fnirs_motor = dict(
Member

@adam2392 sorry I was slow to review here. I would suggest you write it as:

MNE_DATASETS = dict()

MNE_DATASETS['fnirs_motor'] = dict(

This way you don't need to replicate the list of datasets below, so you won't forget one.
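
For example (a sketch; only the pattern matters, not the field values):

```python
MNE_DATASETS = dict()
MNE_DATASETS['fnirs_motor'] = dict(
    # ... archive_name, folder_name, url, hash, etc., as above ...
)
# Any list of all datasets can then be derived from the dict itself,
# so adding a dataset is a single edit:
dataset_names = sorted(MNE_DATASETS)
```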

Can you open a new PR?
