Conversation
Codecov Report — Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   88.53%   88.84%   +0.31%
==========================================
  Files          73       78       +5
  Lines        9295     9459     +164
==========================================
+ Hits         8229     8404     +175
+ Misses       1066     1055     -11
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
@yarikoptic The test

Suggestions?
@jwodder - this might be a good time to make `dandi ls` not use flatdata but the asset structure, rendered as flat (since it is a dictionary, that should still work, and could come with a default set of filters). That should be a separate PR though. Out of curiosity, since `dandi ls` can work on a remote asset, how does it produce flat metadata for that?
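As a hypothetical sketch of the "asset structure, rendered as flat" idea (the function and key style here are illustrative, not dandi-cli's actual code), a nested metadata dict can be flattened into dotted-path keys:

```python
# Hypothetical sketch: render a nested "schemadata"-style dict as a flat
# mapping, the kind of view a flat `dandi ls` rendering could present.
# Names and key conventions are illustrative, not dandi-cli internals.
def flatten(d, prefix=""):
    """Flatten nested dicts into a single dict with dotted-path keys."""
    flat = {}
    for key, value in d.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

meta = {"name": "demo", "wasGeneratedBy": {"schemaKey": "Activity", "id": "gen1"}}
print(flatten(meta))
# {'name': 'demo', 'wasGeneratedBy.schemaKey': 'Activity', 'wasGeneratedBy.id': 'gen1'}
```

Since the result is still a plain dict, existing flat-oriented rendering and filtering could keep working on it.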
Thanks @jwodder - in that case it may be time to converge on a single model (the schemadata one).
I guess disable
BTW I liked the terminology, so maybe we could document it somewhere.
Later we might need to make some
yarikoptic left a comment
Initial review/comments.

Overall I think it is a good direction. I did leave my desires for some code (e.g. in zarr) which was merely moved but apparently is not unit-tested at all. If not adding tests here, please create and self-assign dedicated issues for those pieces. I have initiated a separate #1096 where I would like to decide on the get_metadata aspect. Just wanted to submit this portion of the review first.
dandi/files/__init__.py
Outdated
| """ | ||
| .. versionadded:: 0.36.0 | ||
|
|
||
| This module defines functionality for working with local files & directories |
```diff
- This module defines functionality for working with local files & directories
+ This package defines functionality for working with local files & directories
```
to be more precise, in that `files/` is now upgraded to a package from a single-file module?
```python
try:
    df = dandi_file(p, dandiset_path, bids_dataset_description=bidsdd)
except UnknownAssetError:
    if (p / BIDS_DATASET_DESCRIPTION).exists():
```
Please add a comment explaining when such an UnknownAssetError is expected to happen.

Also, this code block is identical (but not covered by tests) to the one above within the prior elif condition for when it is "the top of the dandiset", if I get it right... But then (because it is not clear when the exception is to happen) maybe that condition above could be adjusted so that e.g. we first condition on having BIDS_DATASET_DESCRIPTION, and then something happens "differently" depending on whether it is the top of the dandiset or not?

Also, maybe it would allow us to catch/report a case that is unsupported ATM (AFAIK?): when a BIDS dataset is embedded somewhere within a DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top. ATM with this code we are more lenient and seem to happily associate with nested BIDS datasets (correctly) even if they are not on top of the dandiset.
> may be just that condition above could be adjusted that e.g. we first condition on having BIDS_DATASET_DESCRIPTION and then something happens "differently" depending on either it is a top of dandiset or not?
That would lead to wrong behavior if a Zarr contained a dataset_description.json.
> Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.
If there's no dataset_description.json, how would we know there's a BIDS dataset there?
> Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.
>
> If there's no dataset_description.json, how would we know there's a BIDS dataset there?
I was not clear. By "there is no BIDS_DATASET_DESCRIPTION on top" I meant "on top of the dandiset", i.e. when we have dandiset.yaml and rawdata/dataset_description.json but no top-level dataset_description.json; that is, a BIDS dataset is embedded without the entire dandiset being a BIDS dataset.
In a case like that, the contents of rawdata/ should still be recognized as part of a BIDS dataset.
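To illustrate the intended behavior (the function and set-of-roots representation here are hypothetical, not dandi-cli internals): a file under rawdata/ should associate with the nearest ancestor directory known to hold a dataset_description.json, even when the Dandiset root itself has none.

```python
# Illustrative sketch (names hypothetical, not dandi-cli code) of the
# nested-BIDS case under discussion: find the nearest ancestor directory
# that contains a dataset_description.json.
from pathlib import PurePosixPath

def nearest_bids_root(path, bids_roots):
    # bids_roots: set of directories known to contain dataset_description.json
    for parent in [path, *path.parents]:
        if parent in bids_roots:
            return parent
    return None  # not inside any known BIDS dataset

# Dandiset root has no dataset_description.json; only rawdata/ does:
roots = {PurePosixPath("rawdata")}
print(nearest_bids_root(PurePosixPath("rawdata/sub-01/anat/scan.nii"), roots))
# rawdata
```

A path outside any such root (e.g. a top-level `code/` directory) would return `None` and be treated as a plain asset.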
```python
entries within it.
"""
if self.is_dir():
    return sum(p.size for p in self.iterdir())
```
So this is just the immediate size, i.e. not recursively adding the sizes?!
Please add a unit test for this case with some nested folder.
It adds up the .size properties of the LocalZarrEntry objects inside, so it does end up being recursive.
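A minimal self-contained sketch of that point (with a hypothetical `Entry` class standing in for the real LocalZarrEntry): summing only the immediate children's `.size` is in effect recursive, because each directory child's `.size` property recurses in turn.

```python
# Hypothetical Entry class, not the real LocalZarrEntry; shows why an
# "immediate" sum over children is effectively recursive.
from dataclasses import dataclass, field

@dataclass
class Entry:
    nbytes: int = 0
    children: list = field(default_factory=list)

    def is_dir(self):
        return bool(self.children)

    def iterdir(self):
        return iter(self.children)

    @property
    def size(self):
        if self.is_dir():
            # Sums only immediate children, but each child's .size
            # property itself recurses into its own children.
            return sum(p.size for p in self.iterdir())
        return self.nbytes

tree = Entry(children=[Entry(nbytes=3),
                       Entry(children=[Entry(nbytes=4), Entry(nbytes=5)])])
print(tree.size)  # 12
```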
```python
else:
    return Digest(
        algorithm=DigestType.md5, value=get_digest(self.filepath, "md5")
    )
```
I think we have unit tests for digesting Zarrs; please unit-test this as well.
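A hedged sketch of what such a test could check, using only the standard library's hashlib rather than dandi's own `get_digest` helper (file names here are illustrative):

```python
# Sketch of a unit test for md5-digesting a plain (non-Zarr) file.
# Uses hashlib directly instead of dandi's get_digest helper.
import hashlib
import tempfile
from pathlib import Path

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as fp:
        # Read in chunks so large files do not need to fit in memory
        for chunk in iter(lambda: fp.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp, "sample.txt")
    p.write_bytes(b"hello")
    digest = md5_of(p)

print(digest)  # 5d41402abc4b2a76b9719d911017c592
```

An actual test in the suite would compare the `Digest` returned by the asset against such an independently computed value.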
```python
    files=files,
)

return dirstat(self.filetree)
```
See commit 3c26f38.
@yarikoptic This PR eliminates the need for the special handling of BIDS assets during upload introduced in #1011 and #1080. The latter PR causes an entire upload to fail early if it contains any invalid BIDS datasets, whereas this PR would simply cause invalid BIDS datasets to be treated as normal assets failing validation (i.e., they're not uploaded, and they're marked as "ERROR" in pyout). What behavior should upload have going forward?

I think such behavior (of this PR) is consistent with how we treat other assets, so all good with me!
| Convert "flatdata" [1]_ for an asset into raw [2]_ "schemadata" [3]_ | ||
|
|
||
| .. [1] a flat `dict` mapping strings to strings & other primitive types; | ||
| returned by `get_metadata()` |
I wondered, if we are to unify, whether we should just rename get_metadata into get_flatdata and deprecate get_metadata to make it all consistent... but then I realized that get_flatdata, without the context of "metadata", is somewhat misleading, and get_flatdata_metadata is odd. So maybe we should make it flatmetadata and get_flatmetadata, or better yet adhocmetadata and get_adhocmetadata... let's sleep on that... for now it is good, and we could do such a massive RF later...
```diff
  iter_upload_spy.assert_not_called()
  # Does validation ignoring work?
- bids_dandiset_invalid.upload(existing="forced", validation="ignore")
+ bids_dandiset_invalid.upload(existing="force", validation="ignore")
```
How come all of that worked with a wrong value ("forced" instead of "force")? I would have assumed it was important to specify forcing in this test. Or does it not matter?
I believe `existing` is only checked if there are any local assets being uploaded with the same path as a remote asset. The BIDS sample Dandiset fixtures don't upload anything as part of their setup, so that code was never triggered.
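An illustrative sketch of that failure mode (not the actual dandi-cli upload code; names and semantics here are simplified assumptions): the `existing` option is only consulted on a path collision, so a bogus value slips through when nothing collides.

```python
# Toy model of an upload loop where `existing` is only validated when a
# local path collides with a remote asset. Not dandi-cli's real code.
def upload(local_paths, remote_paths, existing="error"):
    uploaded = []
    for p in local_paths:
        if p in remote_paths:
            if existing == "force":
                pass  # overwrite the remote asset
            elif existing == "error":
                raise ValueError(f"{p} already exists remotely")
            else:
                # Only here would a typo like "forced" ever be noticed
                raise ValueError(f"invalid existing={existing!r}")
        uploaded.append(p)
    return uploaded

# No remote collision, so even existing="forced" is never validated:
print(upload(["a.nwb"], set(), existing="forced"))  # ['a.nwb']
```

Validating the option value up front, before the per-asset loop, would surface such typos regardless of collisions.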
I have a remaining confusion/question above, but overall I would say -- let's proceed! Thank you @jwodder !
Closes #1044.

To do:

- dandi upload #1011 and "User notification if datasets are invalid." #1080