
RF of "files" for explicit notion of DANDI vs BIDS dataset(s) layout #1082

Open
yarikoptic opened this issue Jul 25, 2022 · 2 comments
@yarikoptic (Member)

Extracted from #1044 (comment) into a dedicated issue, following @jwodder's request:

ATM, upload allows skipping "invalid" files while uploading "valid" ones. Originally this was done because validation ran "in-line" with upload -- validating took time, so we did not pre-validate all files in order to declare the entire dataset bad, but instead discovered/reported bad assets in the course of the upload. Moreover, validation of dandiset.yaml was done independently from all other files, although it is really "Dandiset-level validation". For BIDS we could still do the same, since I believe the current validation can take one path at a time, but overall we might want to step away from that practice.
With that in mind, even though in the short term we could associate files with some superficial DatasetDescriptionAsset to signal that it is a BIDS dataset, we would be better off providing a more explicit structure.

I thought I had described this here or elsewhere, but I failed to find it, so I am writing up my thinking here. I think we overall have to work with the following structure. I have not yet thought about how (and whether) it should relate to the existing Dandiset interfaces; perhaps consider everything below Local as in files.py:

  • Dandiset
    • A Dandiset must have dandiset.yaml (DandisetMetadataFile), which we know how to validate; it represents metadata about the entire Dandiset.
    • A Dandiset contains either a DANDIDataset or a BIDSDataset (AFAIK it cannot be both; @satra please confirm our thinking is aligned). I would even say we must not have any "abstract" Dataset -- they should all be DANDIDatasets or BIDSDatasets. The default (if not BIDS) is DANDIDataset.
      • I decided on "contains" instead of just subclassing from Dandiset because a nested BIDSDataset might not "contain" a dandiset.yaml. But maybe those specific ones could be subclasses (thus DANDIDandiset and BIDSDandiset), with a PlainBIDSDataset implementing the Dandiset-independent logic (such as validation), and then BIDSDandiset(Dandiset, PlainBIDSDataset) mixing them in. For now I will continue with the delegated Dataset flavors below:
      • DANDIDataset is the layout that dandi organize renames into; AFAIK we do not yet validate its layout (directory/filenames), although we should (I noted that in my reply in "investigate --validation skip and --allow-any-path decision tree" #1074 (comment)). Errors would be associated with specific paths.
      • BIDSDataset (like 000026) allows for nested BIDSDatasets underneath (e.g. under derivatives/{name}, rawdata/, or somewhere under sourcedata/). Currently the code finds them anywhere (not necessarily in those locations) and validates them independently. Whether it is a BIDSDataset (and not a DANDIDataset) is decided by the presence of dataset_description.json.
    • Each asset should be associated with a specific BIDSDataset or DANDIDataset (not just an abstract Dataset), even if that asset is not "valid".
      • A BIDSDataset might be part of another BIDSDataset; a BIDSDataset.subdatasets: dict could provide the mapping.
      • Each Dataset could have some validation errors associated purely with it, which cannot be related to specific asset(s). An example: an error about a missing required asset (e.g. a README with some extension, required in BIDS, is missing).
      • Some validation errors might relate to groups of assets (e.g. some inconsistency across files). I am not sure yet how/where we should keep those; probably at the Dataset level, as a record containing all the affected paths.
      • We should be able to tell which errors pertain to a given asset.
      • Given all of the above, I feel we should collect errors at the Dataset level and allow querying for the errors associated with a specific sub-path (which might be within a subdataset in BIDS).
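To make the proposed structure concrete, here is a minimal sketch of the hierarchy described above. All class and attribute names beyond those mentioned in the text (ValidationError, errors_for, the toy fields) are hypothetical illustrations, not the actual dandi-cli API:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ValidationError:
    message: str
    # Paths the error relates to: empty for purely dataset-level errors,
    # one path for asset-level errors, several for cross-asset errors.
    paths: list[Path] = field(default_factory=list)


@dataclass
class Dataset:
    path: Path
    # Errors are collected at the Dataset level, as proposed above.
    errors: list[ValidationError] = field(default_factory=list)

    def errors_for(self, path: Path) -> list[ValidationError]:
        """Query errors associated with a specific sub-path."""
        return [e for e in self.errors if path in e.paths]


@dataclass
class DANDIDataset(Dataset):
    """The layout that `dandi organize` renames into."""


@dataclass
class BIDSDataset(Dataset):
    """Detected by the presence of dataset_description.json; may contain
    nested BIDS datasets (e.g. under derivatives/{name})."""
    subdatasets: dict[str, BIDSDataset] = field(default_factory=dict)


@dataclass
class Dandiset:
    path: Path
    dataset: Dataset  # concretely a DANDIDataset or BIDSDataset, never abstract
```

Here "contains" is modeled literally: a Dandiset holds its Dataset as an attribute rather than inheriting from it, matching the reasoning about nested BIDSDatasets that lack a dandiset.yaml.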

So I am thinking that for upload/validation, given the optional paths (limiting what we can find), we first figure out the Dandiset (location of dandiset.yaml) and its Dataset (BIDS or DANDI), along with subdatasets (if paths are provided, only for those paths). If there is no Dataset -- there is nothing to upload. If no paths are given, it is up to the type of the dataset to figure out which paths to pick up (e.g. we pick up .nwb and video files for a DANDIDataset).
Then, before upload, Dandiset.validate would run its DandisetMetadataFile instance's .validate(), followed by its Dataset instance's .validate() (which for DANDIDataset would do nothing ATM; for BIDSDataset it might do something in the future, such as full-dataset validation). Some method like .discover_local_assets(path: list[str]=None) should be available to associate/discover the optionally provided paths with their corresponding Datasets (possibly BIDS- vs DANDI-dependent), thus building up the tree. Dandiset.validate_asset(path: str) could lazily (unless a full validation already ran and that path was marked "ok" or already has results associated) instantiate the corresponding Dataset (if needed, e.g. for a nested BIDSDataset), trigger the corresponding LocalAsset.validate() (depending on the Dataset type and the type of the asset, e.g. NWBAsset), store the validation result within the corresponding Dataset, and return the validity status for upload to use.
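The lazy validation flow above could be sketched roughly as follows. Only validate() and validate_asset() come from the text; the caching dict, the private helper names, and the toy ".nwb only" rule are assumptions for illustration:

```python
from pathlib import Path


class Dandiset:
    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Cache of per-path validation results: path -> list of error messages.
        self._results: dict[Path, list[str]] = {}

    def validate(self) -> list[str]:
        # Dandiset-level validation: dandiset.yaml first, then the Dataset
        # itself (a no-op for DANDIDataset at the moment).
        return self._validate_metadata() + self._validate_dataset()

    def validate_asset(self, path: str) -> bool:
        # Lazily validate a single asset, caching the result so upload can
        # query validity without re-running validation for that path.
        p = Path(path)
        if p not in self._results:
            self._results[p] = self._validate_one(p)
        return not self._results[p]

    # Stand-ins for the real validators:
    def _validate_metadata(self) -> list[str]:
        return []

    def _validate_dataset(self) -> list[str]:
        return []

    def _validate_one(self, p: Path) -> list[str]:
        # Toy rule standing in for per-type validation (NWB, BIDS, ...).
        return [] if p.suffix == ".nwb" else [f"{p}: unexpected extension"]
```

This keeps upload's decision ("skip this asset or not?") as a cheap boolean query, while the detailed errors stay recorded per path for later reporting.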

Relating to the WiP in #1076, find_dandi_files and dandi_file should become methods of that Dandiset (they already take an optional dandiset_path; if we figure out the Dandiset, which must exist, first, they can become methods). There, each should figure out the Dataset corresponding to the path and associate it with that Asset. To keep all validation results in the corresponding Dataset, maybe LocalAsset.validate should request to update its record within its Dataset.
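The last point -- LocalAsset.validate pushing its record back into the owning Dataset -- could look like this minimal sketch. The `results` dict and `_check` helper are hypothetical names, not existing dandi-cli attributes:

```python
class Dataset:
    """Stand-in for DANDIDataset/BIDSDataset: owns per-path results."""

    def __init__(self) -> None:
        # asset path -> list of error messages for that asset
        self.results: dict[str, list[str]] = {}


class LocalAsset:
    def __init__(self, path: str, dataset: Dataset) -> None:
        self.path = path
        self.dataset = dataset  # the specific Dataset this asset belongs to

    def validate(self) -> list[str]:
        errors = self._check()
        # Update this asset's record within its Dataset, so all validation
        # results live in one place and are queryable per path.
        self.dataset.results[self.path] = errors
        return errors

    def _check(self) -> list[str]:
        # Stand-in for the real per-type validation (NWBAsset, BIDS asset, ...).
        return []
```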

@satra (Member) commented Jul 25, 2022

> AFAIK it cannot be both; @satra please confirm our thinking is aligned

confirming.

@yarikoptic (Member, Author)
@jwodder please review the current implementation and see what, if anything, needs to be done to align with the RF desired back then -- there was no follow-up to this issue, but work was done to make validation work for both BIDS and DANDI layouts.
