
RF of "files" for explicit notion of DANDI vs BIDS dataset(s) layout #1082

Open
yarikoptic opened this issue Jul 25, 2022 · 2 comments
@yarikoptic (Member)

Extracted from #1044 (comment) into a dedicated issue, following @jwodder's request:

ATM, upload allows skipping "invalid" files while uploading "valid" ones. Originally this was done because validation ran "in-line" with upload -- validating took time, so we did not pre-validate all files in order to declare the entire dataset bad, but instead discovered/reported bad assets in the course of the upload. Moreover, validation of dandiset.yaml was done independently from all other files, although it is really "Dandiset-level validation". For BIDS we could still do the same, since I believe the current validation can take one path at a time, but overall we might want to step away from that practice.
With that in mind, even though in the short term we could associate files with some superficial DatasetDescriptionAsset to signal that it is a BIDS dataset, we would be better off providing a more explicit structure.

I thought I had described this here or elsewhere, but I failed to find it, so I am writing up my thinking here. I think we overall have to work with the following structure. I have not yet thought about how (and whether) it should relate to the existing Dandiset interfaces; perhaps consider everything below Local as in files.py:

  • Dandiset
    • A Dandiset must have dandiset.yaml (DandisetMetadataFile), which we know how to validate; it represents metadata about the entire Dandiset.
    • A Dandiset contains either a DANDIDataset or a BIDSDataset (AFAIK it cannot be both; @satra please confirm our thinking is aligned). I would even say we must not have any "abstract" Dataset -- they should all be DANDIDatasets or BIDSDatasets. The default (if not BIDS) is DANDIDataset.
      • I decided on "contains" instead of just subclassing from Dandiset because a nested BIDSDataset might not "contain" a dandiset.yaml. But maybe those specific ones could be subclasses (thus DANDIDandiset and BIDSDandiset), with a PlainBIDSDataset implementing the Dandiset-independent logic (such as validation), and then BIDSDandiset(Dandiset, PlainBIDSDataset) mixing them in. For now I will continue with the delegated Dataset flavors below:
      • DANDIDataset is the layout that dandi organize renames into; AFAIK we do not yet validate its layout (directory/filenames), although we should (I noted that in my reply in "investigate --validation skip and --allow-any-path decision tree" #1074 (comment)). Errors would be associated with specific paths.
      • BIDSDataset (like 000026) allows for nested BIDSDatasets underneath (e.g. under derivatives/{name}, rawdata/, or somewhere under sourcedata/). Currently the code finds them anywhere (not necessarily in those locations) and validates them independently. Whether it is a BIDSDataset (and not a DANDIDataset) is decided by the presence of dataset_description.json.
    • Each asset should be associated with a specific BIDSDataset or DANDIDataset (not just an abstract Dataset), even if that asset is not "valid".
      • A BIDSDataset might be part of another BIDSDataset; a BIDSDataset.subdatasets: dict could provide the mapping.
      • Each Dataset could have some validation errors associated purely with it, which cannot be related to specific asset(s). An example: an error about a missing required asset (e.g. a README with some extension, required in BIDS, is missing).
      • Some validation errors might relate to groups of assets (e.g. some inconsistency across files). I am not sure yet how/where we should keep those; probably at the Dataset level, as a record containing all the affected paths.
      • We should be able to tell which errors pertain to a given asset.
      • Given all of the above, I feel we should collect errors at the Dataset level and allow querying for the errors associated with a specific sub-path (which might be within a subdataset in BIDS).
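To make the proposed structure concrete, here is a minimal sketch of the hierarchy described above. All class and attribute names beyond those mentioned in the text (ValidationError, errors_for, the toy fields) are hypothetical illustrations, not the actual dandi-cli API:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ValidationError:
    message: str
    # Paths the error relates to: empty for purely dataset-level errors,
    # one path for asset-level errors, several for cross-asset errors.
    paths: list[Path] = field(default_factory=list)


@dataclass
class Dataset:
    path: Path
    # Errors are collected at the Dataset level, as proposed above.
    errors: list[ValidationError] = field(default_factory=list)

    def errors_for(self, path: Path) -> list[ValidationError]:
        """Query errors associated with a specific sub-path."""
        return [e for e in self.errors if path in e.paths]


@dataclass
class DANDIDataset(Dataset):
    """The layout that `dandi organize` renames into."""


@dataclass
class BIDSDataset(Dataset):
    """Detected by the presence of dataset_description.json; may contain
    nested BIDS datasets (e.g. under derivatives/{name})."""
    subdatasets: dict[str, BIDSDataset] = field(default_factory=dict)


@dataclass
class Dandiset:
    path: Path
    dataset: Dataset  # concretely a DANDIDataset or BIDSDataset, never abstract
```

Here "contains" is modeled literally: a Dandiset holds its Dataset as an attribute rather than inheriting from it, matching the reasoning about nested BIDSDatasets that lack a dandiset.yaml.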

So I am thinking that for upload/validation, given the optional paths (limiting what we can find), we first figure out the Dandiset (location of dandiset.yaml) and its Dataset (BIDS or DANDI), along with subdatasets (if paths are provided, only for those paths). If there is no Dataset -- there is nothing to upload. If no paths are given, it is up to the type of the dataset to figure out which paths to pick up (e.g. we pick up .nwb and video files for a DANDIDataset).
Then, before upload, Dandiset.validate would run its DandisetMetadataFile instance's .validate(), followed by its Dataset instance's .validate() (which for DANDIDataset would do nothing ATM; for BIDSDataset it might do something in the future, such as full-dataset validation). Some method like .discover_local_assets(path: list[str]=None) should be available to associate/discover the optionally provided paths with their corresponding Datasets (possibly BIDS- vs DANDI-dependent), thus building up the tree. Dandiset.validate_asset(path: str) could lazily (unless a full validation already ran and that path was marked "ok" or already has results associated) instantiate the corresponding Dataset (if needed, e.g. for a nested BIDSDataset), trigger the corresponding LocalAsset.validate() (depending on the Dataset type and the type of the asset, e.g. NWBAsset), store the validation result within the corresponding Dataset, and return the validity status for upload to use.
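The lazy validation flow above could be sketched roughly as follows. Only validate() and validate_asset() come from the text; the caching dict, the private helper names, and the toy ".nwb only" rule are assumptions for illustration:

```python
from pathlib import Path


class Dandiset:
    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Cache of per-path validation results: path -> list of error messages.
        self._results: dict[Path, list[str]] = {}

    def validate(self) -> list[str]:
        # Dandiset-level validation: dandiset.yaml first, then the Dataset
        # itself (a no-op for DANDIDataset at the moment).
        return self._validate_metadata() + self._validate_dataset()

    def validate_asset(self, path: str) -> bool:
        # Lazily validate a single asset, caching the result so upload can
        # query validity without re-running validation for that path.
        p = Path(path)
        if p not in self._results:
            self._results[p] = self._validate_one(p)
        return not self._results[p]

    # Stand-ins for the real validators:
    def _validate_metadata(self) -> list[str]:
        return []

    def _validate_dataset(self) -> list[str]:
        return []

    def _validate_one(self, p: Path) -> list[str]:
        # Toy rule standing in for per-type validation (NWB, BIDS, ...).
        return [] if p.suffix == ".nwb" else [f"{p}: unexpected extension"]
```

This keeps upload's decision ("skip this asset or not?") as a cheap boolean query, while the detailed errors stay recorded per path for later reporting.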

Relating to the WiP in #1076, find_dandi_files and dandi_file should become methods of that Dandiset (they already take an optional dandiset_path; if we figure out the Dandiset, which must exist, first, they can become methods). There, each should figure out the Dataset corresponding to the path and associate it with that Asset. To keep all validation results in the corresponding Dataset, maybe LocalAsset.validate should request to update its record within its Dataset.
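The last point -- LocalAsset.validate pushing its record back into the owning Dataset -- could look like this minimal sketch. The `results` dict and `_check` helper are hypothetical names, not existing dandi-cli attributes:

```python
class Dataset:
    """Stand-in for DANDIDataset/BIDSDataset: owns per-path results."""

    def __init__(self) -> None:
        # asset path -> list of error messages for that asset
        self.results: dict[str, list[str]] = {}


class LocalAsset:
    def __init__(self, path: str, dataset: Dataset) -> None:
        self.path = path
        self.dataset = dataset  # the specific Dataset this asset belongs to

    def validate(self) -> list[str]:
        errors = self._check()
        # Update this asset's record within its Dataset, so all validation
        # results live in one place and are queryable per path.
        self.dataset.results[self.path] = errors
        return errors

    def _check(self) -> list[str]:
        # Stand-in for the real per-type validation (NWBAsset, BIDS asset, ...).
        return []
```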

@satra (Member) commented Jul 25, 2022

> AFAIK it cannot be both; @satra please confirm our thinking is aligned

confirming.

@yarikoptic (Member, Author)
@jwodder please review the current implementation and see what, if anything, needs to be done to align with the RF desired back then -- there was no follow-up to this issue, but work was done to make validation work for both BIDS and DANDI layouts.
