
Replace the use of bidsschematools with the use of deno-compiled BIDS validator in obtaining validation result#1599

Merged
yarikoptic merged 11 commits into dandi:master from candleindark:bids-validator-deno
May 19, 2025

Conversation

@candleindark
Member

@candleindark candleindark commented Mar 28, 2025

This PR partly addresses #1597. It does not completely replace bidsschematools with the deno-compiled BIDS validator: bidsschematools is still needed to obtain BIDS-related metadata for individual assets. However, BIDS validation results for a dataset are now obtained entirely through the deno-compiled BIDS validator.

Additionally, by modifying the validation error message regarding the dandiset.yaml file in BIDS validation, the PR includes changes to inform the user about dandiset.yaml in BIDS validation in the context of DANDI. This PR thus also closes #1602.
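
The issue closed here asks for a hint to add dandiset.yaml to .bidsignore. A minimal sketch of what following that hint could look like from Python, assuming .bidsignore lists one pattern per line; the helper name `ensure_bidsignore_entry` is hypothetical and is not part of dandi-cli:

```python
from pathlib import Path


def ensure_bidsignore_entry(dataset_root: Path, entry: str = "dandiset.yaml") -> bool:
    """Append `entry` to <dataset_root>/.bidsignore unless it is already listed.

    Returns True if the file was modified. Hypothetical helper, not dandi-cli API.
    """
    bidsignore = dataset_root / ".bidsignore"
    lines = bidsignore.read_text().splitlines() if bidsignore.exists() else []
    if entry in (line.strip() for line in lines):
        return False  # already ignored, nothing to do
    lines.append(entry)
    bidsignore.write_text("\n".join(lines) + "\n")
    return True
```

Calling it twice on the same dataset root leaves a single `dandiset.yaml` line in the file.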

Remaining TODOs:

  • Have the bids_validate() func support the --ignoreNiftiHeaders option in the underlying deno-compiled BIDS validator
  • Have the bids_validate() func support the --recursive option in the underlying deno-compiled BIDS validator
  • Ensure the validation result of the deno-compiled BIDS validator is written to a file before it is extracted. (This keeps other output on stdout from polluting the result.)
  • Have the bids_validate() func support the --config option in the underlying deno-compiled BIDS validator
  • Ensure no validation error in the test of validation against the selected examples in https://github.com/bids-standard/bids-examples
  • Ensure existence of validation errors in the test of validation against the selected examples in https://github.com/bids-standard/bids-error-examples
  • Port the use of BIDS validation by the deno-compiled validator to the codebase.
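
The option-related items above could be sketched roughly as follows. This is not the PR's `bids_validate()` implementation: the helper names are hypothetical, the executable name and the `--json`/`-o` flags are taken from the command quoted later in this conversation, and it is an assumption that `--config` takes a file path and that the validator may exit non-zero when it reports errors.

```python
import json
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def build_validator_cmd(
    dataset: Path,
    out_file: Path,
    *,
    ignore_nifti_headers: bool = False,
    recursive: bool = False,
    config: Optional[Path] = None,
) -> list[str]:
    """Assemble the argv for one validator run (hypothetical helper)."""
    cmd = ["bids-validator-deno", "--json", "-o", str(out_file)]
    if ignore_nifti_headers:
        cmd.append("--ignoreNiftiHeaders")
    if recursive:
        cmd.append("--recursive")
    if config is not None:
        # Assumption: --config takes the path of a configuration file
        cmd += ["--config", str(config)]
    cmd.append(str(dataset))
    return cmd


def run_validator(dataset: Path, **options) -> dict:
    """Run the validator and load its JSON report from a file, not stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "out.json"
        # Assumption: a non-zero exit may just mean "errors found", so the
        # exit code is not checked; a missing report file raises below.
        subprocess.run(build_validator_cmd(dataset, out_file, **options), check=False)
        return json.loads(out_file.read_text())
```

Reading the report from a file rather than stdout is what keeps unrelated console output out of the parsed result.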

Note:

@candleindark candleindark force-pushed the bids-validator-deno branch 4 times, most recently from 7e4d870 to 0ca8b1f Compare April 1, 2025 19:43
@yarikoptic
Member

I wonder if while at this we should add an option which would pretty much list which validators are to be used or not used. This way we could still keep bidsschematools as a choice for a while longer, while defaulting to the bids-validator-deno by default. Could be simply a comma separated list of validator enums you defined in #1514 with allowing for - prefix to exclude, and all to signal all of them.... WDYT?
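
One way such an option could be parsed is sketched below. The validator names are placeholders (the actual enum from #1514 is not shown in this conversation), and the rule that a specification starting with an exclusion begins from the full set is an assumption, not something proposed above.

```python
# Placeholder names; the real enum members from #1514 are not shown here
KNOWN_VALIDATORS = ("bidsschematools", "bids-validator-deno", "nwbinspector")


def parse_validator_spec(spec: str, known=KNOWN_VALIDATORS) -> list[str]:
    """Parse e.g. "all,-bidsschematools" into an ordered list of validators."""
    tokens = [t.strip() for t in spec.split(",") if t.strip()]
    if not tokens:
        raise ValueError("empty validator specification")
    # Assumption: a spec that opens with an exclusion starts from everything
    selected = set(known) if tokens[0].startswith("-") else set()
    for token in tokens:
        if token == "all":
            selected.update(known)
            continue
        exclude = token.startswith("-")
        name = token[1:] if exclude else token
        if name not in known:
            raise ValueError(f"unknown validator: {name!r}")
        if exclude:
            selected.discard(name)
        else:
            selected.add(name)
    # Report in the canonical order of `known`
    return [name for name in known if name in selected]
```

With this sketch, `all,-bidsschematools` and `-bidsschematools` select the same validators.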

@candleindark
Member Author

I wonder if while at this we should add an option which would pretty much list which validators are to be used or not used. This way we could still keep bidsschematools as a choice for a while longer, while defaulting to the bids-validator-deno by default. Could be simply a comma separated list of validator enums you defined in #1514 with allowing for - prefix to exclude, and all to signal all of them.... WDYT?

I am not sure I understand exactly what you are proposing. Are these the enums you are referring to? They are validators for different standards, not just for BIDS, so I don't know what you mean by having them as alternatives to bids-validator-deno.

As far as I know, there is currently only one use of bidsschematools for BIDS validation in the dandi-cli repo, so the need to choose between bidsschematools and bids-validator-deno for BIDS validation may be low at the moment. Additionally, since the validation results returned by bidsschematools and bids-validator-deno differ, if you want to keep both, I will also have to harmonize the results from bidsschematools validation to our standard ValidationResult.

I think we can talk about this more in tomorrow's meeting with some clarifications.

@codecov

codecov bot commented Apr 2, 2025

Codecov Report

❌ Patch coverage is 99.51574% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.78%. Comparing base (a922eb6) to head (e33159f).
⚠️ Report is 77 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dandi/bids_validator_deno/__init__.py | 75.00% | 1 Missing ⚠️ |
| ...i/tests/test_bids_validator_deno/test_validator.py | 99.45% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1599      +/-   ##
==========================================
+ Coverage   88.29%   88.78%   +0.49%     
==========================================
  Files          78       82       +4     
  Lines       11045    11399     +354     
==========================================
+ Hits         9752    10121     +369     
+ Misses       1293     1278      -15     
| Flag | Coverage Δ |
|---|---|
| unittests | 88.78% <99.51%> (+0.49%) ⬆️ |


@candleindark candleindark force-pushed the bids-validator-deno branch 6 times, most recently from f6ebc3e to 0505ffc Compare April 7, 2025 20:00
# end of ad-hoc fix.

- results = validate_bids(self.bids_root)
+ results = bids_validate(self.bids_root)
Member

so here, if input path was not the top of the bids dataset -- issue a lgr.WARNING that we will use bidsschematools validation only per each path, and it would validate only the path names. To validate full bids dataset -- point to the top of the bids dataset.

Member

note: here we might not already have a clue what were those original paths! ...
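
The warning suggested in this thread might look something like the sketch below. It assumes the top of a BIDS dataset can be recognized by the presence of dataset_description.json; the logger name and both function names are hypothetical, and the sketch does not address the open question of recovering the original input paths.

```python
import logging
from pathlib import Path

lgr = logging.getLogger("dandi.sketch")  # hypothetical logger name


def is_bids_dataset_root(path: Path) -> bool:
    """Treat a directory holding dataset_description.json as a dataset root."""
    return (path / "dataset_description.json").is_file()


def choose_bids_validation(path: Path) -> str:
    """Return "full" for a dataset root; otherwise warn and return "paths-only"."""
    if is_bids_dataset_root(path):
        return "full"
    lgr.warning(
        "%s is not the top of a BIDS dataset: only path names will be validated "
        "with bidsschematools. To validate the full BIDS dataset, point to its top.",
        path,
    )
    return "paths-only"
```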

@candleindark candleindark force-pushed the bids-validator-deno branch 7 times, most recently from fb5e0ea to 023ab9c Compare April 14, 2025 04:26
@candleindark candleindark force-pushed the bids-validator-deno branch 5 times, most recently from 6fb2f5b to 23b796c Compare April 20, 2025 21:11
@candleindark
Member Author

@CodyCBakerPhD I have completed the refactoring you suggested. This PR is ready for another look. The tests are failing mostly because https://ontobee.org/ is currently half dead.

@CodyCBakerPhD
Contributor

CodyCBakerPhD commented May 11, 2025

We initially started with the goal of replacing the use of bidsschematools completely, but it turned out that it was not possible since the BIDS validator does not provide the same kind of metadata information, so we ended up keeping dandi.validate.validate_bids(), which provides an interface to bidsschematools.

This is good contextual information; I will keep an eye out for how the metadata forms differ

Might be worth renaming the PR then to focus on the 'new feature addition' since there is, after all, no replacement (hence I was confused where the replacing was 😆)

I will try this hands on tomorrow and come back with anything of note

@CodyCBakerPhD
Contributor

From @candleindark

I think the current naming of the PR is fine. The replacement is in a qualified sense, "in obtaining validation result". bidsschematools is kept to obtain the metadata.

@CodyCBakerPhD
Contributor

Testing this out now on a real life BIDS dataset

Also here is another point in favor of updating installation procedure 😉

image

Comment on lines +81 to 90
  def _get_metadata(self) -> None:
      """
      Get metadata for all assets in the dataset

      This populates `self._asset_metadata`
      """
      with self._lock:
-         if self._dataset_errors is None:
+         if self._asset_metadata is None:
              # Import here to avoid circular import
              from dandi.validate import validate_bids
Contributor

Oh my, OK so to my understanding, _get_metadata still relies on the old dandi.validate.validate_bids?

Is that an apt summary of the 'integration' of the new feature?

Member Author

Oh my, OK so to my understanding, _get_metadata still relies on the old dandi.validate.validate_bids?

Yes. There is no equivalent metadata provided by the deno-compiled validator, so we are keeping the bidsschematools to provide the same kind of metadata.

Is that an apt summary of the 'integration' of the new feature?

Yes, that's the gist of it. I found the current design of validation (and metadata gathering as well) in DANDI complex. This is a method of integration I found that adheres to the current design.

@yarikoptic may have some plan for the validation process in dandi-cli in the future. What I see is that any design change in the validation process would entail significant changes in other parts of the program.

@CodyCBakerPhD
Contributor

CodyCBakerPhD commented May 12, 2025

From PR description

In review, please pay particular attention to the todos in bids_validator_deno/__init__.py. Those todos denote options that may need to be adjusted.

There do not seem to be any left? (good thing)

@CodyCBakerPhD
Contributor

Ran this on the now published Dandiset 001350 to the following effect

Details
[BIDS.SIDECAR_KEY_RECOMMENDED] ...

[BIDS.SIDECAR_KEY_RECOMMENDED] E:\test_bids_validator\001350\sub-375\micr\sub-375_sample-0008_TEM.png — A data file's JSON sidecar is missing a key listed as recommended.
subCode: SampleStaining
issueMessage: Field description: Description(s) of the tissue sample staining (for example: `"Osmium"`).
MAY be an array of strings if different stains are used in each channel of the file
(for example: `["LFB", "PLP"]`).

[BIDS.SIDECAR_KEY_RECOMMENDED] E:\test_bids_validator\001350\sub-375\micr\sub-375_sample-0008_TEM.png — A data file's JSON sidecar is missing a key listed as recommended.
subCode: SamplePrimaryAntibody
issueMessage: Field description: Description(s) of the primary antibody used for immunostaining.
Either an [RRID](https://rrid.site) or the name, supplier and catalog
number of a commercial antibody.
For non-commercial antibodies either an [RRID](https://rrid.site) or the
host-animal and immunogen used (for examples: `"RRID:AB_2122563"` or
`"Rabbit anti-Human HTR5A Polyclonal Antibody, Invitrogen, Catalog # PA1-2453"`).
MAY be an array of strings if different antibodies are used in each channel of the file.

[BIDS.SIDECAR_KEY_RECOMMENDED] E:\test_bids_validator\001350\sub-375\micr\sub-375_sample-0008_TEM.png — A data file's JSON sidecar is missing a key listed as recommended.
subCode: SampleSecondaryAntibody
issueMessage: Field description: Description(s) of the secondary antibody used for immunostaining.
Either an [RRID](https://rrid.site) or the name, supplier and catalog
number of a commercial antibody.
For non-commercial antibodies either an [RRID](https://rrid.site) or the
host-animal and immunogen used (for examples: `"RRID:AB_228322"` or
`"Goat anti-Mouse IgM Secondary Antibody, Invitrogen, Catalog # 31172"`).
MAY be an array of strings if different antibodies are used in each channel of the file.

[BIDS.MISSING_REQUIRED_ENTITY] E:\test_bids_validator\001350\sub-375\micr\sub-375_TEM.json — Missing required entity for files with this suffix.
issueMessage: sample missing from rule rules.files.raw.micr.microscopy

With a ballpark total of many dozen such messages repeated across various assets and none of these repeated recommendations being aggregated (and thus rather hard to read)

Will try again for comparison with last official release of CLI

@candleindark
Member Author

From PR description

In review, please pay particular attention to the todos in bids_validator_deno/__init__.py. Those todos denote options that may need to be adjusted.

There do not seem to be any left? (good thing)

That file should be bids_validator_deno/_validator.py now. There is one todo left.

# TODO: If we want to include these issues, we will have to add a new value
# to the Severity enum.

@yarikoptic What's your opinion on that todo? By default, the BIDS validator itself omits all ignored issues from the regular, human-facing output but includes them in the JSON output. This PR currently skips all ignored issues and does not pack them up as ValidationResult objects. If you want the ignored issues to be packed as ValidationResult objects and included, we will need to introduce an IGNORE = 0 member in

class Severity(IntEnum):
    """Severity levels for validation results"""

    INFO = 10
    """Not an indication of problem but information of status or confirmation"""
    HINT = 20
    """Data is valid but could be improved"""
    WARNING = 30
    """Data is not recognized as valid. Changes are needed to ensure validity"""
    ERROR = 40
    """Data is recognized as invalid"""
    CRITICAL = 50
    """
    A serious invalidity in data.
    E.g., an invalidity that prevents validation of other aspects of the data such
    as when validating against the BIDS standard, the data is without a `BIDSVersion`
    field or has an invalid `BIDSVersion` field.
    """

@candleindark
Member Author

candleindark commented May 12, 2025

Ran this on the now published Dandiset 001350 to the following effect

[BIDS.SIDECAR_KEY_RECOMMENDED] ...

[BIDS.SIDECAR_KEY_RECOMMENDED] E:\test_bids_validator\001350\sub-375\micr\sub-375_sample-0008_TEM.png — A data file's JSON sidecar is missing a key listed as recommended.
subCode: SampleStaining
issueMessage: Field description: Description(s) of the tissue sample staining (for example: `"Osmium"`).
MAY be an array of strings if different stains are used in each channel of the file
(for example: `["LFB", "PLP"]`).

[BIDS.SIDECAR_KEY_RECOMMENDED] E:\test_bids_validator\001350\sub-375\micr\sub-375_sample-0008_TEM.png — A data file's JSON sidecar is missing a key listed as recommended.
subCode: SamplePrimaryAntibody
issueMessage: Field description: Description(s) of the primary antibody used for immunostaining.
Either an [RRID](https://rrid.site) or the name, supplier and catalog
number of a commercial antibody.
For non-commercial antibodies either an [RRID](https://rrid.site) or the
host-animal and immunogen used (for examples: `"RRID:AB_2122563"` or
`"Rabbit anti-Human HTR5A Polyclonal Antibody, Invitrogen, Catalog # PA1-2453"`).
MAY be an array of strings if different antibodies are used in each channel of the file.

[BIDS.SIDECAR_KEY_RECOMMENDED] E:\test_bids_validator\001350\sub-375\micr\sub-375_sample-0008_TEM.png — A data file's JSON sidecar is missing a key listed as recommended.
subCode: SampleSecondaryAntibody
issueMessage: Field description: Description(s) of the secondary antibody used for immunostaining.
Either an [RRID](https://rrid.site) or the name, supplier and catalog
number of a commercial antibody.
For non-commercial antibodies either an [RRID](https://rrid.site) or the
host-animal and immunogen used (for examples: `"RRID:AB_228322"` or
`"Goat anti-Mouse IgM Secondary Antibody, Invitrogen, Catalog # 31172"`).
MAY be an array of strings if different antibodies are used in each channel of the file.

[BIDS.MISSING_REQUIRED_ENTITY] E:\test_bids_validator\001350\sub-375\micr\sub-375_TEM.json — Missing required entity for files with this suffix.
issueMessage: sample missing from rule rules.files.raw.micr.microscopy

With a ballpark total of many dozen such messages repeated across various assets and none of these repeated recommendations being aggregated (and thus rather hard to read)

Will try again for comparison with last official release of CLI

The output looks about right.

Running `bids-validator-deno --ignoreWarnings --json -o out.json 001350` (from tmp_ds, in the dandi-cli environment), I got the following out.json file.

out.json
{
    "issues": {
        "issues": [
            {
                "code": "NOT_INCLUDED",
                "location": "/dandiset.yaml",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-372/micr/sub-372_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-367A/micr/sub-367A_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-375/micr/sub-375_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-374/micr/sub-374_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-373C/micr/sub-373C_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-366A/micr/sub-366A_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-369B/micr/sub-369B_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-368A/micr/sub-368A_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-371/micr/sub-371_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            },
            {
                "code": "MISSING_REQUIRED_ENTITY",
                "location": "/sub-370/micr/sub-370_TEM.json",
                "issueMessage": "sample missing from rule rules.files.raw.micr.microscopy",
                "rule": "rules.files.raw.micr.microscopy",
                "severity": "error"
            }
        ],
        "codeMessages": {
            "NOT_INCLUDED": "Files with such naming scheme are not part of BIDS specification. This error is most commonly caused by typos in file names that make them not BIDS compatible. Please consult the specification and make sure your files are named correctly. If this is not a file naming issue (for example when including files not yet covered by the BIDS specification) you should include a \".bidsignore\" file in your dataset (see https://github.com/bids-standard/bids-validator#bidsignore for details). Please note that derived (processed) data should be placed in /derivatives folder and source data (such as DICOMS or behavioural logs in proprietary formats) should be placed in the /sourcedata folder.",
            "MISSING_REQUIRED_ENTITY": "Missing required entity for files with this suffix."
        }
    },
    "summary": {
        "sessions": [],
        "subjects": [
            "372",
            "367A",
            "375",
            "374",
            "373C",
            "366A",
            "369B",
            "368A",
            "371",
            "370"
        ],
        "subjectMetadata": [
            {
                "participantId": "366A"
            },
            {
                "participantId": "367A"
            },
            {
                "participantId": "368A"
            },
            {
                "participantId": "369B"
            },
            {
                "participantId": "370"
            },
            {
                "participantId": "371"
            },
            {
                "participantId": "372"
            },
            {
                "participantId": "373C"
            },
            {
                "participantId": "374"
            },
            {
                "participantId": "375"
            }
        ],
        "tasks": [],
        "modalities": [
            "micr"
        ],
        "secondaryModalities": [],
        "totalFiles": 103,
        "size": 1033321299,
        "dataProcessed": false,
        "pet": {},
        "dataTypes": [
            "micr"
        ],
        "schemaVersion": "1.0.4"
    }
}

Each error issue is mapped to a validation error, red, in dandi validate output. Warning issues from the BIDS validator are mapped to severity of level HINT. They are blue in the output. The map is located at

_SEVERITY_MAP = {
    BidsSeverity.warning: Severity.HINT,
    BidsSeverity.error: Severity.ERROR,
}
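
To make the mapping concrete, here is a rough sketch of flattening a report shaped like the out.json above into simple records. It is not the PR's conversion code: the function name and the record keys are hypothetical, and plain dicts stand in for `ValidationResult` objects.

```python
# Mirrors the mapping quoted above, using level names instead of enum members
SEVERITY_NAMES = {"warning": "HINT", "error": "ERROR"}


def flatten_issues(report: dict) -> list[dict]:
    """Turn the validator's JSON report into one flat record per issue."""
    issues = report["issues"]["issues"]
    messages = report["issues"].get("codeMessages", {})
    flat = []
    for issue in issues:
        level = SEVERITY_NAMES.get(issue.get("severity"))
        if level is None:
            continue  # unmapped severities (e.g. ignored issues) are skipped
        flat.append(
            {
                "id": f"BIDS.{issue['code']}",
                "path": issue.get("location"),
                "severity": level,
                "message": messages.get(issue["code"], ""),
            }
        )
    return flat
```

Applied to the report above, every entry would come out with severity `ERROR`, since `--ignoreWarnings` left only error issues in the file.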

BIDS validator generates many more errors than bidsschematools in general. We plan to filter those results by type and location (second todo in #1597). We can aggregate the data as well, but I think we should do that in other PR(s) since this one is quite big already.
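
As a rough idea of what such aggregation could look like in a follow-up, the sketch below groups issues by code and sub-code and reports one line per group. The function names are hypothetical, and it is an assumption that issue dicts carry the sub-code under a `subCode` key.

```python
from collections import defaultdict


def aggregate_issues(issues: list[dict]) -> dict[tuple, list[str]]:
    """Group issue locations by (code, subCode)."""
    grouped = defaultdict(list)
    for issue in issues:
        key = (issue["code"], issue.get("subCode"))
        grouped[key].append(issue.get("location", "?"))
    return dict(grouped)


def summarize(issues: list[dict]) -> list[str]:
    """Produce one summary line per group instead of one line per file."""
    lines = []
    for (code, sub_code), locations in aggregate_issues(issues).items():
        label = code if sub_code is None else f"{code} ({sub_code})"
        lines.append(f"[BIDS.{label}] {len(locations)} file(s), e.g. {locations[0]}")
    return lines
```

For the report above, the ten MISSING_REQUIRED_ENTITY entries would collapse into a single summary line.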

@CodyCBakerPhD
Contributor

CodyCBakerPhD commented May 12, 2025

BIDS validator generates many more errors than bidsschematools in general. We plan to filter those results by type and location (second todo in #1597). We can aggregate the data as well, but I think we should do that in other PR(s) since this one is quite big already.

Yes, of course - and agreed - I am merely thinking aloud when I note such things

Upon testing validation on main / the last official release, the dataset indeed comes back clear of errors. Is that to be expected? Is the new validation procedure 'more rigorous' than before?

EDIT: reading

BIDS validator generates many more errors than bidsschematools in general.

Indeed checks out. Anything is greater than zero, after all

@CodyCBakerPhD
Contributor

CodyCBakerPhD commented May 12, 2025

LGTM

Code is very well organized, well written, and well documented! Great job @candleindark

Just need @yarikoptic to weigh in on severity level of ignore

I believe, for comparison, that NWB Inspector would have straight up skipped (just not have run) any checks on things below a specified 'severity' level

@candleindark
Member Author

I believe, for comparison, that NWB Inspector would have straight up skipped (just not have run) any checks on things below a specified 'severity' level

In case you have not noticed yet, dandi validate has the --min-severity option, but that only filters out errors below a certain level.

candleindark and others added 11 commits May 14, 2025 12:26
This commit implements BIDS validation of a
directory through the deno-compiled BIDS
validator. It provides the `bids_validate()`
function, which, when called, invokes the
deno-compiled validator to validate
a directory. The result of the validation
is returned as a list of `ValidationResult`
objects.
`bids_validate()` calls the deno-compiled BIDS validator to do
the actual validation
Annotating these attributes in `BIDSDatasetDescriptionAsset`
with `defaultdict` more precisely describes
their behavior
This retains the original logic to obtain the metadata
This allows BIDS schema version to be recorded in
`ValidationResult` instances

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
BIDS version is actually not always available since
the BIDS validator currently doesn't provide info
regarding BIDS version. See
bids-standard/bids-validator#10 (comment)
for details.
To prevent external calls to private members of
`BIDSDatasetDescriptionAsset`
Of the `ValidationResult` objects returned from validation
by the deno-compiled validator

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Not ignoring `dandiset.yaml` in BIDS validation, and adjust validation
error message in the context of
DANDI to direct the user to specify
a `.bidsignore` file to exclude
the `dandiset.yaml` file from
BIDS validation properly
…nit__.py`

This keeps implementation details private
and controls the definition of a public API
as recommended by @CodyCBakerPhD
@candleindark candleindark force-pushed the bids-validator-deno branch from cde85ab to e33159f Compare May 14, 2025 19:27
@yarikoptic
Member

ok, let's proceed here. Note that I believe (from a trial run) we would still use bidsschematools to "extract" metadata which is a step to validate any asset against dandi schema, and only then we would use deno validator to do even more validation. I guess, let's proceed with this and improve upon if needed. Thank you @candleindark for the PR and @CodyCBakerPhD for the review.

@yarikoptic yarikoptic merged commit ff7ac3a into dandi:master May 19, 2025
46 of 47 checks passed
@github-actions

🚀 PR was released in 0.69.0 🚀

@candleindark candleindark deleted the bids-validator-deno branch May 22, 2025 00:17

Labels

BIDS minor Increment the minor version when merged released

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a hint to add dandiset.yaml into .bidsignore while working on bids dandiset

3 participants