Conversation
Codecov Report — Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   88.53%   88.84%   +0.31%
==========================================
  Files          73       78       +5
  Lines        9295     9459     +164
==========================================
+ Hits         8229     8404     +175
+ Misses       1066     1055     -11
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@yarikoptic The test

Suggestions?
@jwodder - this might be a good time to make `dandi ls` not use flatdata but the asset structure, rendered as flat (since it is a dictionary, that should still work, and could come with a default set of filters). That should be a separate PR though. Out of curiosity, since `dandi ls` can work on a remote asset, how does it produce flat metadata for that?
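As a hypothetical sketch of the "asset structure, rendered as flat" idea (the function and key style here are illustrative, not dandi-cli's actual code), a nested metadata dict can be flattened into dotted-path keys:

```python
# Hypothetical sketch: render a nested "schemadata"-style dict as a flat
# mapping, the kind of view a flat `dandi ls` rendering could present.
# Names and key conventions are illustrative, not dandi-cli internals.
def flatten(d, prefix=""):
    """Flatten nested dicts into a single dict with dotted-path keys."""
    flat = {}
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

meta = {"name": "demo", "wasGeneratedBy": {"schemaKey": "Activity", "id": "gen1"}}
print(flatten(meta))
# {'name': 'demo', 'wasGeneratedBy.schemaKey': 'Activity', 'wasGeneratedBy.id': 'gen1'}
```

Since the result is still a plain dict, existing flat-oriented rendering and filtering could keep working on it.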
Thanks @jwodder - in that case it may be time to converge on a single model (the schemadata one).
I guess disable
BTW I liked the terminology, so maybe we could document it somewhere.
Later we might need to make some
yarikoptic left a comment
Initial review/comments.

Overall I think it is a good direction. I did leave my desires for some code (e.g. in zarr) which was merely moved but apparently is not unit-tested at all. If not adding tests here, please create and self-assign dedicated issues for those pieces. I have initiated a separate #1096 where I would like to decide on the get_metadata aspect. Just wanted to submit this portion of the review first.
dandi/files/__init__.py
Outdated
| """ | ||
| .. versionadded:: 0.36.0 | ||
|
|
||
| This module defines functionality for working with local files & directories |
```diff
- This module defines functionality for working with local files & directories
+ This package defines functionality for working with local files & directories
```
to be more precise, in that `files/` is now upgraded to a package from a single-file module?
```python
try:
    df = dandi_file(p, dandiset_path, bids_dataset_description=bidsdd)
except UnknownAssetError:
    if (p / BIDS_DATASET_DESCRIPTION).exists():
```
Please add a comment explaining when such an UnknownAssetError is expected to happen.

Also, this code block is identical (but not covered by tests) to the one above within the prior elif condition for when it is "the top of the dandiset", if I get it right... But then (because it is not clear when the exception is to happen) maybe that condition above could be adjusted so that e.g. we first condition on having BIDS_DATASET_DESCRIPTION, and then something happens "differently" depending on whether it is the top of the dandiset or not?

Also, maybe it would allow us to catch/report a case that is unsupported ATM (AFAIK?): when a BIDS dataset is embedded somewhere within a DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top. ATM with this code we are more lenient and seem to happily associate with nested BIDS datasets (correctly) even if they are not on top of the dandiset.
> may be just that condition above could be adjusted that e.g. we first condition on having BIDS_DATASET_DESCRIPTION and then something happens "differently" depending on either it is a top of dandiset or not?
That would lead to wrong behavior if a Zarr contained a dataset_description.json.
> Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.
If there's no dataset_description.json, how would we know there's a BIDS dataset there?
> Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.
>
> If there's no dataset_description.json, how would we know there's a BIDS dataset there?
I was not clear. By "there is no BIDS_DATASET_DESCRIPTION on top" I meant "on top of the dandiset", i.e. when we have dandiset.yaml and rawdata/dataset_description.json but no top-level dataset_description.json; that is, a BIDS dataset is embedded without the entire dandiset being a BIDS dataset.
In a case like that, the contents of rawdata/ should still be recognized as part of a BIDS dataset.
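To illustrate the intended behavior (the function and set-of-roots representation here are hypothetical, not dandi-cli internals): a file under rawdata/ should associate with the nearest ancestor directory known to hold a dataset_description.json, even when the Dandiset root itself has none.

```python
# Illustrative sketch (names hypothetical, not dandi-cli code) of the
# nested-BIDS case under discussion: find the nearest ancestor directory
# that contains a dataset_description.json.
from pathlib import PurePosixPath

def nearest_bids_root(path, bids_roots):
    # bids_roots: set of directories known to contain dataset_description.json
    for parent in [path, *path.parents]:
        if parent in bids_roots:
            return parent
    return None  # not inside any known BIDS dataset

# Dandiset root has no dataset_description.json; only rawdata/ does:
roots = {PurePosixPath("rawdata")}
print(nearest_bids_root(PurePosixPath("rawdata/sub-01/anat/scan.nii"), roots))
# rawdata
```

A path outside any such root (e.g. a top-level `code/` directory) would return `None` and be treated as a plain asset.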
```python
entries within it.
"""
if self.is_dir():
    return sum(p.size for p in self.iterdir())
```
So this is just the immediate size, i.e. not recursively adding the sizes?!
Please add a unit test for this case with some nested folder.
It adds up the .size properties of the LocalZarrEntry objects inside, so it does end up being recursive.
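A minimal self-contained sketch of that point (with a hypothetical `Entry` class standing in for the real LocalZarrEntry): summing only the immediate children's `.size` is in effect recursive, because each directory child's `.size` property recurses in turn.

```python
# Hypothetical Entry class, not the real LocalZarrEntry; shows why an
# "immediate" sum over children is effectively recursive.
from dataclasses import dataclass, field

@dataclass
class Entry:
    nbytes: int = 0
    children: list = field(default_factory=list)

    def is_dir(self):
        return bool(self.children)

    def iterdir(self):
        return iter(self.children)

    @property
    def size(self):
        if self.is_dir():
            # Sums only immediate children, but each child's .size
            # property itself recurses into its own children.
            return sum(p.size for p in self.iterdir())
        return self.nbytes

tree = Entry(children=[Entry(nbytes=3),
                       Entry(children=[Entry(nbytes=4), Entry(nbytes=5)])])
print(tree.size)  # 12
```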
```python
else:
    return Digest(
        algorithm=DigestType.md5, value=get_digest(self.filepath, "md5")
    )
```
I think we have unit tests for digesting Zarrs; please unit-test this as well.
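A hedged sketch of what such a test could check, using only the standard library's hashlib rather than dandi's own `get_digest` helper (file names here are illustrative):

```python
# Sketch of a unit test for md5-digesting a plain (non-Zarr) file.
# Uses hashlib directly instead of dandi's get_digest helper.
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as fp:
        # Read in chunks so large files do not need to fit in memory
        for chunk in iter(lambda: fp.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp, "sample.txt")
    p.write_bytes(b"hello")
    digest = md5_of(p)

print(digest)  # 5d41402abc4b2a76b9719d911017c592
```

An actual test in the suite would compare the `Digest` returned by the asset against such an independently computed value.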
```python
    files=files,
)

return dirstat(self.filetree)
```
See commit 3c26f38.
@yarikoptic This PR eliminates the need for the special handling of BIDS assets during upload introduced in #1011 and #1080. The latter PR causes an entire upload to fail early if it contains any invalid BIDS datasets, whereas this PR would simply cause invalid BIDS datasets to be treated as normal assets failing validation (i.e., they're not uploaded, and they're marked as "ERROR" in pyout). What behavior should upload have going forward?

I think such behavior (of this PR) is consistent with how we treat other assets, so all good with me!
| Convert "flatdata" [1]_ for an asset into raw [2]_ "schemadata" [3]_ | ||
|
|
||
| .. [1] a flat `dict` mapping strings to strings & other primitive types; | ||
| returned by `get_metadata()` |
I wondered, if we are to unify, whether we should just rename get_metadata into get_flatdata and deprecate get_metadata to make it all consistent... but then I realized that get_flatdata, without the context of "metadata", is somewhat misleading, and get_flatdata_metadata is odd. So maybe we should make it flatmetadata and get_flatmetadata, or better yet adhocmetadata and get_adhocmetadata... let's sleep on that... for now it is good, and we could do such a massive RF later...
```diff
  iter_upload_spy.assert_not_called()
  # Does validation ignoring work?
- bids_dandiset_invalid.upload(existing="forced", validation="ignore")
+ bids_dandiset_invalid.upload(existing="force", validation="ignore")
```
How come all of that worked with a wrong value ("forced" instead of "force")? I would have assumed it was important to specify forcing in this test. Or does it not matter?
I believe `existing` is only checked if there are any local assets being uploaded with the same path as a remote asset. The BIDS sample Dandiset fixtures don't upload anything as part of their setup, so that code was never triggered.
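An illustrative sketch of that failure mode (not the actual dandi-cli upload code; names and semantics here are simplified assumptions): the `existing` option is only consulted on a path collision, so a bogus value slips through when nothing collides.

```python
# Toy model of an upload loop where `existing` is only validated when a
# local path collides with a remote asset. Not dandi-cli's real code.
def upload(local_paths, remote_paths, existing="error"):
    uploaded = []
    for p in local_paths:
        if p in remote_paths:
            if existing == "force":
                pass  # overwrite the remote asset
            elif existing == "error":
                raise ValueError(f"{p} already exists remotely")
            else:
                # Only here would a typo like "forced" ever be noticed
                raise ValueError(f"invalid existing={existing!r}")
        uploaded.append(p)
    return uploaded

# No remote collision, so even existing="forced" is never validated:
print(upload(["a.nwb"], set(), existing="forced"))  # ['a.nwb']
```

Validating the option value up front, before the per-asset loop, would surface such typos regardless of collisions.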
I have a remaining confusion/question above, but overall I would say -- let's proceed! Thank you @jwodder !
Closes #1044.

To do:

- dandi upload #1011 and "User notification if datasets are invalid." #1080