Directory structure organization and file naming convention #4
Comments
We can of course (and should) get inspiration from how BIDS does things for human neuroimaging. Here is an example from the BIDS main webpage (on the right-hand side): https://bids.neuroimaging.io/assets/img/dicom-reorganization-transparent-white_1000x477.png . But we obviously might deviate from this! |
I would like to suggest the following hierarchy: |
It would also be good to learn from the BIDS issues so that we do not repeat them. There is a project to describe BIDS with a JSON schema, which I think should be used as a model to avoid the inconsistencies that we have in BIDS. This PR is pointing to the json schema code in BIDS |
thanks for this suggestion! an interesting (and necessary) topic we'll have to discuss will be to agree on what is a "project", what is a "dataset", etc.; from all our discussions with the experimenters (mostly electrophysiologists working on non-human primates and rodents) within our institute, this was not as simple as one could expect ;) |
Yes, I can understand this, we've also had our discussions about this. So a bit of explanation. For us, Project is like the introduction to an article: it covers the background of your research, affiliations, authors, hypotheses, and related work. I've included a distinction between raw and derived because, when you share data, you could decide to share only raw or only derived, and having this separation at the base would make this easier. (I think our aim should be to make data more accessible and shareable.) |
thanks for your explanations!
We're actually very close to what you're suggesting in what we're locally trying to implement... Here is what we're suggesting for now: Interestingly, we did not use the concepts of Project and Dataset, but we used the concept of Experiment, with one less level of hierarchy compared to what you describe... Our specs are here: https://int-nit.github.io/AnDOChecker/ . And we have the equivalent of the BIDS validator running for these specs: https://andocheck.int.univ-amu.fr/ . But we still consider this to be under development (we only have a few beta testers internally for now...) so we're totally up for merging / fusing / adapting / tweaking, which is what we hope can happen with this discussion!!! Some details:
|
pinging @satra @yarikoptic @bendichter: you guys made the choice to go flatter with dandi (instead of using several levels of hierarchy as in BIDS and in the two aforementioned suggestions), right? could you quickly summarize why? |
pinging @tgbugs; could you also tell us what you came up with for sparc? |
pinging @lepmik; also, how do you think such directory structure would interact with your exdir? |
@SylvainTakerkart NWB files have internal structure that accommodates raw data, derived data, and metadata all in one file. The original DANDI layout deferred to NWB files for within-session data organization, and the file organization only handles the super-session info: dandiset and subject. That said, we have been talking on the DANDI team about separating raw from derived data as you have here, and we can discuss changing our format to match yours in that aspect. |
@SylvainTakerkart - thanks for the ping. here are a few thoughts on the issue.
in bids i made a suggestion to consider individual-level data vs data that is generated by aggregating information across individuals (bids-standard/bids-2-devel#43). it makes for a cleaner and more objective separation of the information contained in a folder. with the growing hardware ecosystem in neurophys for freely behaving experiments (openephys + bonsai), we are also likely to see data that involves multiple interacting participants. we would still be able to organize information from the point of view of a participant, but that may include data about other participants.
the words rawdata, derivatives, and metadata are often linked quite directly to acquisition instruments, experimental techniques, pipelines, and implementation strategies. what is a derivative today could become raw tomorrow, as instruments integrate more processing into them. i would suggest avoiding that nomenclature if possible. this was also another reason for the participant level organization.
re: flattening in DANDI: another consideration in DANDI is a world of objects where the data never hits the filesystem, or where the filesystem is an object store. this is increasingly true for larger data, with APIs providing access to information. in such a mode, the organization into a typical folder hierarchy may be a short-term consideration for people working on laptops/desktops with local storage and in more traditional HPC settings. given some of the data coming into DANDI, transferring data in order to work on it would be too expensive (in time and perhaps in cost). therefore accessing pieces of information as necessary for computation is where we are likely headed. this doesn't mean you cannot store information in a filesystem, but we are going to be moving a little closer even to our API for our data search clients and use datalad as the filesystem model.
if you consider pybids, the first thing it does is generate a database index. but unlike bids, where most datasets are a few GB at most, neurophys data is being generated at much larger sizes, easily hitting TBs in many situations. our current thinking in dandi is driven more by supporting a changing landscape in neurophysiology over the next 5 - 10 years than by what people have traditionally been doing. we have been asked multiple times to store hundreds of TBs of data for single datasets. that's a scale at which most people will not be looking at filesystem organization. we can always provide a view of the underlying data through a metadata remapping model. i personally consider bids to be a practical and efficient view of a more complex information model (one that we are capturing in a more structured manner in NIDM for example). |
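To make the "generate a database index first" idea above concrete, here is a minimal sketch in Python. It is not how pybids or DANDI do it: the function name `build_index`, the table name `files`, and its columns are all hypothetical and chosen only for illustration. It assumes a local directory tree and records one row per file, so that later queries go through the index instead of the folder layout.

```python
import os
import sqlite3
from pathlib import Path


def build_index(root):
    """Walk `root` once and record every file in an in-memory SQLite table.

    Hypothetical helper: table and column names are illustrative only.
    """
    root = Path(root)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (relpath TEXT PRIMARY KEY, name TEXT, size INTEGER)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            conn.execute(
                "INSERT INTO files VALUES (?, ?, ?)",
                (full.relative_to(root).as_posix(), name, full.stat().st_size),
            )
    conn.commit()
    return conn
```

Once such an index exists, a query like `SELECT relpath FROM files WHERE name LIKE '%.nwb'` no longer depends on how many directory levels there are, which is the sense in which the folder hierarchy becomes just one possible view of the data.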
FWIW on the aspect of
which exemplifies the suggestion to have filename itself without reflection of metadata encoded in the upper directories names. I originally also was arguing for BIDS to not duplicate |
Some examples of the hierarchy we came up with are in this folder https://github.com/SciCrunch/sparc-curation/tree/master/resources/DatasetTemplate/primary. At the top level we have 3 folders for different stages of data processing.
Within primary the file folder level entities that we identified are
There is a nasty issue with splitting subjects and samples, which is that they really occupy the same location in the data structure, and they are overly narrow in that they are not sufficient to capture higher levels of organization such as populations or subject groups. Therefore I suggest using We don't have any requirement for the folder naming conventions aside from the fact that they should not have spaces in them. We are also suggesting not doing unfriendly things like giving a subject an identifier with a name that evokes a sample, e.g. |
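The one naming rule stated above (folder names should not contain spaces) is easy to check automatically. The following is a small sketch of such a check, not the actual SPARC validator; the function name `folders_with_spaces` is hypothetical, and it assumes the dataset sits in a local directory.

```python
import os


def folders_with_spaces(root):
    """Return the relative paths of folders whose own name contains whitespace.

    Hypothetical helper illustrating the "no spaces in folder names" rule.
    """
    offenders = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for d in dirnames:
            if any(ch.isspace() for ch in d):
                offenders.append(os.path.relpath(os.path.join(dirpath, d), root))
    return sorted(offenders)
```

An empty list means that every folder under `root` satisfies the rule; the name of `root` itself is not checked.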
about project/dataset.
With some colleagues, we are working on a folder template for projects inside the gin-tonic project. We are working on our first report. One interesting thing here is that we got similar feedback: one project is made of different experiments. We plan on creating a different data folder for each experiment/dataset. We have not (yet) gone into the details of defining one experiment or the sub-structure of the data folder, so how an experiment will be defined is still open (example cases: same method but different animals, same animal used but different method, same method and same animal but re-tested at a different age, different animals and different methods but made for the same purpose, ...).
raw versus derived data question.
I like the Source, Primary, Derivative distinction. I have heard that it is a problem with BIDS: researchers are supposed to save and archive the source data, but the BIDS-formatted data is what goes in the Primary folder here (and BIDS has a raw/derived data distinction...). Derivative would be everything that can be trashed, because one could reproduce it, right? But this would mean that this distinction only makes sense if these are higher-order folders, and that this project would define the structure of the Primary folder. This would also mean that one can use the Source folder to feed in data the way the researcher sees fit, and use a data curation process (manual or automatic) to change the source data into the form we want in the Primary folder. In terms of data management, it would mean Source is archived, Primary is the FAIR (open) version of the data, and Derivative is trashed upon project completion. (Note that what I mean by derivative here would still be data files, like pivot tables, summaries or similar files. Figures and analysis should not be saved with the data.)
subject/sample versus session
I still think that for animal studies, having the subject as a high-level folder makes little sense. As noted above by @tgbugs, one file often has data from multiple subjects, animals are often tested in groups (especially in non-rodent research), or data from different subjects are recorded in the same spreadsheet/video. I do not see any data management reason to split data by animal/subject. I would be curious to know why BIDS went for subject at the highest level. I would guess it is because people were used to emphasizing the human subject in early studies? (which would argue against this structure for animal research).
Misc
A basic tip in file naming is to have unique file names: the place where a file is stored should not be necessary to derive information about the file. An exception can be made for readme files. If GUID means this: https://de.wikipedia.org/wiki/Globally_Unique_Identifier, then it would make the file name too long. An internally used, shorter ID (RFID chip number, animal ID used in the animal facility, ...) would make more sense. As long as there is metadata describing the ID used, we should be safe, while keeping some human readability. The same survey about gin-tonic made us think that researchers tend to prefer flat structures. In another project, I also used a strategy where the metadata indicates the file path, so that the structure is irrelevant for the data analysis (where the code reads the metadata to access the data anyway). It is a pretty simple way to analyse data from different sources without having to move or rename files manually... |
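The last strategy above (metadata that records the file path, so that analysis code never depends on the folder layout) can be sketched as follows. This is only an illustration under stated assumptions: the manifest columns (`subject_id`, `session`, `modality`, `path`), the example rows, and the function `select_paths` are all made up for this sketch and do not come from the project mentioned in the comment.

```python
import csv
import io

# Hypothetical manifest: one row per file, with the path stored as metadata.
MANIFEST = """subject_id,session,modality,path
A12,2020-01-15,ephys,anywhere/rec_0001.dat
A12,2020-01-16,video,other/place/cam_17.avi
B07,2020-01-15,ephys,anywhere/rec_0002.dat
"""


def select_paths(manifest_text, **criteria):
    """Return the paths of all rows whose columns match every given criterion."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return [
        row["path"]
        for row in reader
        if all(row[key] == value for key, value in criteria.items())
    ]


# Analysis code asks for data by metadata, not by folder position.
ephys_files = select_paths(MANIFEST, modality="ephys")
```

With this approach, files can live in a flat folder or a deep hierarchy without changing the analysis code, as long as the manifest is kept in sync with the files.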
I have been implementing validation of the SPARC Data Structure directory structure and have some further notes as a result of the process.
|
@SylvainTakerkart - thanks for pushing this along. in addition to @tgbugs's comments, we (in DANDI) have been working on the data/information model representing objects of interest (datasets, individual objects), with serialization to disk as a transform of that model. part of this consideration is driven by the sizes of individual datafiles (in the TBs range and growing) that we are expecting over the next year and our changing needs to support on-the-fly access to data or pieces of data over a network call. we are still tweaking the model, which would be serializable to disk, and will be releasing an updated API server by the end of the year together with datalad datasets. so any discussion of serialization is indeed a good thing to continue, but i wanted to put the object + metadata access consideration in play as well. |
Are there example shared datasets to learn from? For example, if you look at an already shared dataset: what is good about it, and what would you do differently? Or imagine an already shared dataset: what would it look like if you were to reorganize it in the newly proposed structure? Would all data and metadata find a logical place in the new structure, is there (meta)data that does not fit, or is there (meta)data missing from the already shared dataset that you consider crucial? |
Hi, a few thoughts on the discussion:
|
Regarding 3 and 4: you may want to consider the https://en.wikipedia.org/wiki/Pareto_principle which states (or claims) that about 80% of the value comes from 20% of the cases, or 80% of the revenue comes from 20% of the customers, or 20% of goods in a shop make up 80% of the sales, etc. So rather than making a very broad overview of all possible cases, which will subsequently be very difficult to work with, you could try to identify those 20% of the cases that represent the most value. I.e., rather than investing in dealing with (relatively) exceptional cases, first invest in the bulk. |
When it comes to the number of use-cases I suggested one or more from rat ephys in vivo and a few from other domains. So, to keep things manageable, say a total of around five or six use-cases might be enough to start with and each could be selected as the most representative use-case by those with most knowledge of each domain. |
@robertoostenveld - didn't see this earlier. you can look at the datasets on https://dandiarchive.org (there are about 38 dandisets covering 4 species, intra/extracellular and optical recordings) |
Yes, this is an important question!! This dataset (utah array ephys, several sessions, several animals, exhaustive metadata) could also be useful: |
actually, I've opened a new thread to centralize a list of potentially useful datasets: #7 |
This issue is where we'll discuss the directory structure that will contain the data and metadata, the organization of the directory and sub-directories, the number of levels in the hierarchy of the directories, the naming of the directories and sub-directories, the naming of the files...
The different elements that might be included in this hierarchy are: the experiment itself, the subject, the recording session etc.