Extracted from #1044 (comment) following @jwodder's request as a dedicated issue:

ATM, upload allows skipping "invalid" files while uploading "valid" ones. Originally this was done because validation ran "in-line" with upload -- validation was taking time, so we did not pre-validate all files to declare the entire dataset bad, but discovered/reported bad assets in the course of the upload. Moreover, validation of dandiset.yaml was done independently from all other files, although it is more of a "Dandiset-level validation". For BIDS we could still do the same, since the current validation can, I believe, take an individual path at a time. But overall we might want to step away from that practice.
With that in mind, even though in the short term we could associate a superficial DatasetDescriptionAsset to signal that it is a BIDS dataset, we had better provide a more explicit structure.
I thought I had described this here or elsewhere, but I failed to find it, so here is my thinking. I think we overall have to work with the following structure (a rough sketch in code follows the list). I have not yet thought about how and whether it should relate to the existing Dandiset interfaces; maybe consider them all below Local as in files.py:
Dandiset
- A Dandiset must have dandiset.yaml (DandisetMetadataFile), which we know how to validate; it represents metadata about the entire Dandiset.
- A Dandiset contains either a DANDIDataset or a BIDSDataset (AFAIK it cannot be both; @satra, please confirm that our thinking is aligned). I might even say that we must not have any "abstract" Dataset: they should all be DANDIDataset(s) or BIDSDataset(s). The default (if not BIDS) is DANDIDataset.
- I decided on "contains" instead of just subclassing from Dandiset because a nested BIDSDataset might not "contain" a dandiset.yaml. But maybe those specific ones could be subclasses (thus DANDIDandiset and BIDSDandiset), with a PlainBIDSDataset implementing the Dandiset-independent logic (such as validation) and a BIDSDandiset(Dandiset, PlainBIDSDataset) to mix it in. For now I will continue with the Dataset-delegated flavors below.
- DANDIDataset is the one dandi organize renames into, and whose layout (directory/filenames) we do not yet validate AFAIK, although we should (I noted that in my reply in "investigate --validation skip and --allow-any-path decision tree" #1074 (comment)). Errors will be associated with specific paths.
- BIDSDataset (like 000026) allows for nested BIDSDatasets underneath (e.g. under derivatives/{name}, rawdata/, or somewhere under sourcedata/). Currently the code finds them anywhere (not necessarily in those locations) and validates them independently. Whether something is a BIDSDataset (and not a DANDIDataset) is decided by the presence of dataset_description.json.
- Each asset should be associated with some specific BIDSDataset or DANDIDataset (not just an abstract Dataset), even if that asset is not "valid".
- A BIDSDataset might be a part of another BIDSDataset; BIDSDataset.subdatasets: dict could provide the mapping.
- Each Dataset could have some validation errors associated purely with it, which cannot relate to specific asset(s). An example: an error about a missing required asset (e.g. a README with some extension, required in BIDS, is missing).
- Some validation errors might relate to groups of assets (e.g. some inconsistency across files). I am not sure yet how/where we should keep those; probably at the Dataset level, as a record containing all of those paths.
- We should be able to tell which errors are for a given asset.
- Given all of the above, I feel that we should collect errors at the level of the Dataset and allow querying for the errors associated with a specific sub-path (which might be within a subdataset in BIDS).
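
To make the above more concrete, here is a minimal sketch of that structure. Everything beyond the names already used above (e.g. ValidationRecord, errors_for, the validated cache) is an illustrative assumption, not existing files.py API:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ValidationRecord:
    """One validation error; ``paths`` is empty for purely dataset-level errors
    (e.g. a missing required file) and may list several paths for cross-asset
    inconsistencies."""

    message: str
    paths: list[str] = field(default_factory=list)


@dataclass
class Dataset:
    path: Path
    # all errors, dataset-level and per-asset, collected at the Dataset level
    errors: list[ValidationRecord] = field(default_factory=list)
    # per-asset validation results, filled lazily (see the flow sketched further below)
    validated: dict[str, list[ValidationRecord]] = field(default_factory=dict)

    def dataset_for(self, subpath: str) -> Dataset:
        """The (sub)Dataset a given path belongs to; a plain Dataset owns everything."""
        return self

    def errors_for(self, subpath: str) -> list[ValidationRecord]:
        """Query the errors associated with a specific sub-path."""
        return [
            e
            for e in self.errors
            if any(p == subpath or p.startswith(subpath + "/") for p in e.paths)
        ]


@dataclass
class DANDIDataset(Dataset):
    """Layout produced by `dandi organize`; no layout validation yet."""


@dataclass
class BIDSDataset(Dataset):
    # nested BIDS datasets (e.g. under derivatives/{name}), keyed by path
    # relative to the Dandiset root
    subdatasets: dict[str, BIDSDataset] = field(default_factory=dict)

    def dataset_for(self, subpath: str) -> Dataset:
        for prefix, sub in self.subdatasets.items():
            if subpath == prefix or subpath.startswith(prefix + "/"):
                return sub.dataset_for(subpath)
        return self


@dataclass
class Dandiset:
    path: Path
    metadata_file: Path  # dandiset.yaml (the DandisetMetadataFile)
    dataset: Dataset     # always a concrete DANDIDataset or BIDSDataset, never abstract
```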
So I am thinking that, for upload/validation, given the optional paths (limiting what we can find), we first figure out the Dandiset (the location of dandiset.yaml) and its Dataset (BIDS or DANDI), plus the subdatasets (if paths are provided, for those paths). If there is no Dataset, there is nothing to upload. If no paths are given, it is up to the type of the dataset to figure out which paths to pick up (like we pick up .nwb and video files for a DANDIDataset).
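
A rough sketch of that detection step, assuming that a top-level dataset_description.json is what marks a BIDSDataset (the helper names here are hypothetical):

```python
from pathlib import Path
from typing import Optional


def find_dandiset_root(start: Path) -> Optional[Path]:
    """Walk up from a given path until a directory containing dandiset.yaml is found."""
    for candidate in (start, *start.parents):
        if (candidate / "dandiset.yaml").is_file():
            return candidate
    return None  # no dandiset.yaml -> no Dandiset, nothing to upload


def detect_dataset_type(root: Path) -> str:
    """dataset_description.json at the top marks a BIDSDataset; otherwise DANDIDataset."""
    return "BIDS" if (root / "dataset_description.json").is_file() else "DANDI"
```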
Then, before upload, Dandiset.validate would run its self.DandisetMetadataFile-instance.validate() followed by self.Dataset-instance.validate() (which for DANDIDataset would do nothing ATM; for BIDSDataset it might do something in the future, maybe a full dataset validation). Some method like .discover_local_assets(path: list[str] = None) should be available to associate/discover the optionally provided paths with their corresponding Datasets (possibly BIDS- vs. DANDI-dependent), thus building up the tree. Dandiset.validate_asset(path: str) could lazily (unless a full validation already ran and that path was marked "ok" or already has results associated) instantiate the corresponding Dataset (if needed, e.g. for a nested BIDSDataset), trigger the corresponding LocalAsset.validate() (depending on the Dataset type and the type of the asset, e.g. NWBAsset), store the result of the validation within the corresponding Dataset, and return the validity status for upload to use.
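
Continuing the sketch given after the structure list, that flow could look roughly like this; metadata_file here stands for the DandisetMetadataFile instance, dandi_file() is the asset factory referenced in #1076, and validate()/dataset_for()/validated follow the naming of this proposal rather than the current dandi-cli API:

```python
class Dandiset:
    # ... fields as in the earlier sketch ...

    def validate(self) -> list[ValidationRecord]:
        errors = self.metadata_file.validate()  # dandiset.yaml-level validation
        errors += self.dataset.validate()       # no-op for DANDIDataset ATM;
                                                # could grow into full BIDS validation
        return errors

    def validate_asset(self, path: str) -> bool:
        dataset = self.dataset.dataset_for(path)   # owning (possibly nested) Dataset
        if path not in dataset.validated:          # lazy: reuse any earlier result
            asset = dandi_file(self.path / path)   # e.g. an NWBAsset for a .nwb path
            dataset.validated[path] = asset.validate()
            dataset.errors.extend(dataset.validated[path])  # keep results in the Dataset
        return not dataset.validated[path]         # no errors -> OK to upload
```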
Relating to the WiP in #1076, find_dandi_files and dandi_file should become methods of that Dandiset (they already take an optional dandiset_path; if we figure out the dandiset, which must exist, first, they can become methods). There they should figure out the Dataset corresponding to the path and associate it with that Asset. To keep all validation results in the corresponding Dataset, maybe LocalAsset.validate should request an update of its record within its Dataset.
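
A hypothetical shape of those helpers as Dandiset methods (the signature and the asset.dataset association attribute are assumptions; the actual #1076 code may differ):

```python
from typing import Iterator, List, Optional


class Dandiset:
    # ... as sketched above ...

    def find_dandi_files(self, paths: Optional[List[str]] = None) -> Iterator["LocalAsset"]:
        """Discover assets (optionally limited to paths) and tie each to its owning Dataset."""
        for path in self.dataset.discover_local_assets(paths):  # type-specific defaults if None
            asset = dandi_file(self.path / path, dandiset_path=self.path)  # e.g. -> NWBAsset
            asset.dataset = self.dataset.dataset_for(path)  # record the association
            yield asset
```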
@jwodder please review the current implementation and see what, if anything, needs to be done to align with the RF desired back then -- there was no follow-up to this issue, but work was done to make validation work for both BIDS and DANDI layouts.