File formats for data and metadata #2
Comments
A quick comment on Andrew's post.
About format restriction:
For the metadata format, I think what made for the wide adoption of BIDS was the possibility of having metadata in the .tsv format (spreadsheets). Most researchers still think a computer is an advanced typewriter and will just run away when they hear NIX or NWB :)
This may be one of the most important issues for establishing a new standard: how to deal with metadata. Metadata structures are so overwhelmingly large and diverse, owing to the diverse types of experimental research in neuroscience, that I do not think we will come up with one standard to fit all. So in general I would propose that researchers save metadata in their preferred format, but associate it with a script that retrieves this metadata for further use. A Metadataread function could be a formal building block of this new data structure.
An interesting thought: could we think in terms of object-oriented programming? First define a base class that fixes the basic metadata and folder structure, with virtual functions (e.g. to read metadata and to select datasets). This base class could then be extended by ever more specialized child classes that express different types of research, but which always have to implement these base functions and metadata, thus giving users of any dataset access to these basic functions and metadata.
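A minimal Python sketch of that idea (class and method names like `BaseDataset`, `read_metadata`, and `select_datasets` are illustrative, not part of any existing spec):

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseDataset(ABC):
    """Base class fixing the folder structure and the core metadata contract."""

    def __init__(self, root):
        self.root = Path(root)

    @abstractmethod
    def read_metadata(self) -> dict:
        """Return the basic metadata every dataset must expose."""

    @abstractmethod
    def select_datasets(self, **criteria) -> list:
        """Return paths of data files matching the given criteria."""


class EphysDataset(BaseDataset):
    """A child class for one concrete type of research."""

    def read_metadata(self) -> dict:
        # e.g. parse a lab-specific .tsv or .json sitting next to the data
        return {"modality": "ephys", "subject": "sub-01"}

    def select_datasets(self, **criteria) -> list:
        return sorted(self.root.glob("*.nix"))
```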
I guess using a proper schema (like the JSON schema mentioned here: #4 (comment)) would be a first step in the right direction, right?
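For illustration, a toy validation step with the `jsonschema` Python package (the schema here is invented, not the one discussed in #4):

```python
import jsonschema

# A toy schema: every dataset-level metadata file must name a subject and a species.
schema = {
    "type": "object",
    "required": ["subject_id", "species"],
    "properties": {
        "subject_id": {"type": "string"},
        "species": {"type": "string"},
    },
}

metadata = {"subject_id": "sub-01", "species": "Mus musculus"}

# Raises jsonschema.ValidationError if the metadata does not conform.
jsonschema.validate(instance=metadata, schema=schema)
```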
in general, a question that emerges from some of the posts above seems to be "restricting to a few formats vs. leaving this open"; we've had this discussion at our institute as well, and because we'd like to have support for several modalities (ephys, various forms of optical imaging, etc.), we tend towards leaving open the possibility of storing the raw data in a proprietary format... our opinion is that:
About metadata: at some point we will certainly need to discuss probe geometry. Here is a link to a new project that handles this: https://probeinterface.readthedocs.io/en/main/ It could be embedded in our structure.
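For example, a short sketch assuming probeinterface's documented `get_probe` and `write_probeinterface` helpers (see the docs linked above for the authoritative API; the probe model is just an example):

```python
from probeinterface import get_probe, write_probeinterface

# Fetch a known probe layout by manufacturer and model from the probeinterface library.
probe = get_probe(manufacturer='neuronexus', probe_name='A1x32-Poly3-10mm-50-177')
print(probe.get_contact_count(), 'contacts')

# Serialize the geometry to a small JSON file that could sit alongside the recording.
write_probeinterface('probe.json', probe)
```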
Regarding probes (or grids or shafts as they are called in ECoG and sEEG respectively): for iEEG there are already some elements in the specifications for that. Perhaps not as detailed as required for animal electrophysiology, but I recommend to try out how the current BIDS specification would work out for some common animal probes. Being able to look at an example "dataset" (no actual data needed) with the metadata will help to identify where the current version is lacking.
@samuelgarcia and @SylvainTakerkart I have some thoughts on standardizing based on API vs. format. When I started with NWB, I thought every lab would have a different neurophysiology format. In fact, it's even worse: every individual researcher has a different format, and it can be hard for people to collaborate even within a lab! This gets worse still as labs advance from one technology to another, and the knowledge of how to read old data is gradually lost. Trust me, the space of all neurophys data formats is enormous, and causes a lot of friction! There are two possible approaches to wrangling this heterogeneity: standardizing via API and standardizing via data format. The API approach may seem attractive at first, because it requires the least up-front work and does not require copying any data, but it has major downsides that will become huge problems as it scales. Don't get me wrong, Neo is a very useful, important, and high-quality tool. We rely heavily on Neo for our NWB conversions, and it has allowed us to move much faster in supporting conversion for a large variety of proprietary formats. However, working with Neo is a constant development process where we are always working to support more formats and format versions.

The first problem is validation of supported formats. There are always new data formats, and new versions of old formats, and they are often not very well documented (sometimes not documented at all). Therefore, it is imperative that we be able to validate that any contributed file follows one of the supported versions of the supported file formats. Confirming this would require building validators for every allowable data format, which would be impractical for a large number of formats. In contrast, standardizing on a small number of formats would require a manageable number of validators.

Second, if you are standardizing based on an API, you are locking users into a single programming language. According to our surveys, about half of neurophysiologists use MATLAB. You might consider this a forward-thinking initiative that is unconcerned with the MATLAB laggards and wants to push them into Python. I think it is a mistake for this initiative to prescribe data analysis patterns instead of responding to them, but even if you do want this format to push the field forward, you have the problem that you are locking users to Python. What if users want to use Julia, or some other language in the future? Are you going to re-create all of Neo for Julia? In NWB we have run into several applications where users want to access NWB files from outside of our supported APIs, and they have built APIs in C++, C#, and R. They were able to do this because there is a standard file format. I love Python, but I do not want us to feel bound to Python 5-10 years from now.

The third problem is that each of these varied formats carries different metadata. This is really the crux of NWB, which at its core is essentially a metadata dependency tree: if you have electrophysiology voltage traces, you need to say which electrodes they were recorded from, which means you need an electrode table; then each electrode needs to be assigned to an electrode group, and each group to a device, and so on. This is designed to ensure that all of the metadata necessary for re-analysis is in the NWB file. It is built to handle multiple co-occurring streams of data with different time bases.
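To make that dependency tree concrete, here is a minimal PyNWB sketch (paraphrasing standard PyNWB ecephys usage; all names and values are placeholders):

```python
from datetime import datetime

import numpy as np
from pynwb import NWBFile
from pynwb.ecephys import ElectricalSeries

nwbfile = NWBFile(session_description='demo', identifier='id-001',
                  session_start_time=datetime.now().astimezone())

# Voltage traces depend on electrodes, electrodes on a group, the group on a device.
device = nwbfile.create_device(name='array')
group = nwbfile.create_electrode_group(name='shank0', description='demo shank',
                                       location='CA1', device=device)
for _ in range(4):
    nwbfile.add_electrode(x=0.0, y=0.0, z=0.0, imp=float('nan'),
                          location='CA1', filtering='none', group=group)

region = nwbfile.create_electrode_table_region(region=[0, 1, 2, 3],
                                               description='all four electrodes')
traces = ElectricalSeries(name='traces',
                          data=np.zeros((1000, 4), dtype='int16'),
                          electrodes=region, rate=30000.0, starting_time=0.0,
                          conversion=0.195e-6)
nwbfile.add_acquisition(traces)
```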
Proprietary formats, on the other hand, are generally not designed to contain all of the metadata necessary for re-analysis, but rather to report all of the relevant data from a particular acquisition system. The problem is that the gap between this set and the re-analysis set is different for every format. The only way around these problems is to restrict to a closed set of allowable data formats that can be validated. It doesn't have to be constrained to NWB and NIX, but it does need to be constrained to some set. There are also downsides to standardizing on the format: you need to copy the data, you need to convert it, and you may be throwing out some acquisition-system-specific data that could be important. I remember hearing of a compromise where there would be three folders: source (from the acquisition system), raw (converted to some standard), and processed data. I think this would provide the best of both worlds, because we could allow users to store their original data while providing a way to ensure that it is readable in an archive format.
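A hypothetical layout for that compromise (the directory names are illustrative, not part of any agreed spec):

```
sub-01/
├── source/      # untouched acquisition-system output (any format)
├── raw/         # the same recording converted to an allowed archive format (e.g. NWB)
└── processed/   # derived data: spike sorting, filtered traces, ...
```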
NWB is capable of storing data of any standard data type, and most raw data shared in NWB is int16, copied directly from the source file. In addition, NWB can apply chunk-wise lossless compression to datasets, which in our hands has reduced dataset size by up to 66% for Neuropixels electrophysiology voltage traces. This is an HDF5 feature, so it is in theory possible for NIX as well, though I don't know whether their API exposes it.
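As an illustration with plain h5py, which exposes the same HDF5 chunking/compression machinery (parameters are arbitrary; actual savings depend on the data):

```python
import h5py
import numpy as np

traces = np.random.randint(-200, 200, size=(300_000, 384), dtype='int16')

with h5py.File('traces.h5', 'w') as f:
    # Chunked storage lets HDF5 compress each chunk independently,
    # so readers can later decompress only the region they need.
    f.create_dataset('traces', data=traces,
                     chunks=(16_384, 64),
                     compression='gzip', compression_opts=4)
```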
Hi all. As a neo/spikeextractors dev I of course like the API approach a lot; I won't develop my thoughts here. I think the debate here is not format vs. API, but which formats we allow in this BIDS. Here are some pros/cons of adding the "raw format" to the possible formats. CONS:

PROS:
@samuelgarcia, you know there is a need for representing different time bases. That's exactly what HDF5 and NIX are built for! I can't speak to the details of NIX, but NWB can handle multiple streams that start at potentially different times, as well as multiple disjoint segments of recording from the same device. Let me explain some of the features of HDF5 that I prefer over raw binary. HDF5 and PyNWB do support parallel I/O; see the tutorial here. They also support features like chunking and lossless compression by chunk, which can save a lot of space and time for large datasets and are not available for stand-alone binary files. We have seen some Neuropixels datasets shrink by 66% when using these tools! You also don't necessarily need to use the h5 library to read HDF5 files. We are working on a Zarr library that reads HDF5 NWB files without touching h5, here. If they are non-chunked, reading an HDF5 dataset is as simple as passing the offset, shape, and data type to ...
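A sketch of that last point: use h5py once to discover the byte offset of a contiguous (non-chunked, uncompressed) dataset, then read it with a plain NumPy memory map, with no HDF5 library needed at read time (file and dataset names are made up):

```python
import h5py
import numpy as np

with h5py.File('contiguous.h5', 'r') as f:
    dset = f['traces']
    offset = dset.id.get_offset()   # byte offset of the dataset; None if chunked
    shape, dtype = dset.shape, dset.dtype

# Any language that can seek into a file can now read the data directly.
traces = np.memmap('contiguous.h5', mode='r', dtype=dtype,
                   offset=offset, shape=shape)
```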
just referencing another discussion (bids-standard/bep021#1) because there's some interesting overlap with the one here!
I think we agree that it is desirable to have data in a standardized format that allows rich metadata annotation, like NWB. The question is whether we will make more rapid progress if we require use of such a format rather than just encouraging it. The main disadvantage of requiring it is that people who might otherwise have used a standardized directory layout with simple, minimal metadata will decide it is too much work and not share at all, or just dump everything in a zip file. The main disadvantages of not requiring it are that (i) important metadata will be lost to the passage of time (due to data providers forgetting details, leaving the field, moving labs, etc.); (ii) the possibilities for automating subsequent data analysis pipelines are reduced.

We could imagine having a two-tier validator: datasets using a recommended format are Tier 1 / gold; datasets with only the source/raw format are still valid, but Tier 2 / silver.

(As an aside, referencing Ben's comment above, Neo was originally intended to be language-independent. This is why the GitHub repo is called "python-neo". We planned a "matlab-neo", but never had the resources to work on it. A "julia-neo" would also be interesting.)
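For concreteness, here is a hypothetical shape such a two-tier check could take (tier names, file extensions, and the suffix-based heuristic are all invented for illustration; a real validator would inspect file contents):

```python
from pathlib import Path

RECOMMENDED = {'.nwb', '.nix'}   # hypothetical "gold" data formats
METADATA = {'.tsv', '.json'}     # sidecar files, ignored for tier purposes


def dataset_tier(root: str) -> str:
    """Classify a dataset directory as 'gold' (Tier 1) or 'silver' (Tier 2)."""
    data_files = [p for p in Path(root).rglob('*')
                  if p.is_file() and p.suffix not in METADATA]
    if not data_files:
        raise ValueError('no data files found')
    return 'gold' if all(p.suffix in RECOMMENDED for p in data_files) else 'silver'
```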
indeed i think that is the key question. the issue is reusability, and this is one of those FAIRness concepts that is often overlooked by a lab, mostly because the lab's focus is not on use by others but by themselves, and presently there is relatively little reuse of others' data (in comparison to neuroimaging). also, in neuroimaging, the adoption of NIfTI significantly enhanced reusability and is one of the key reasons why a platform like nipype exists, as people could mix and match software knowing that each of the tools could read NIfTI. the situation with neurophysiology is undoubtedly more complex at the moment. however, the role bids and the archives have played is to push towards that common space through validation. i don't think we should allow a random format as a standard, otherwise what kind of a standard is it, and how does an arbitrary consumer of the data read it? so i suspect "raw" in this case is still not raw, but has some structure like dimensions, datatypes, etc. if people are going down this road, i would at least suggest considering zarr as a potential option as ben indicated, as it still provides a bit of a model to structure the data. however zarr presently does not have any matlab bindings, except through using python in matlab :) .
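For reference, a minimal Zarr sketch (zarr v2-style API; the chunking and array sizes are arbitrary):

```python
import numpy as np
import zarr

# A chunked, compressed on-disk array; each chunk is stored separately,
# which plays well with parallel writes and cloud object stores.
z = zarr.open('traces.zarr', mode='w', shape=(300_000, 384),
              chunks=(16_384, 64), dtype='int16')
z[:1000, :] = np.zeros((1000, 384), dtype='int16')
```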
I agree that random formats should not be part of a specification. However, there can be a set of formats that are open or easily converted. Someone could still include data in a random format in a "raw" space as long as the primary data is in a usable format, as Satra mentions.
Also agree, and I remember that some of these same discussions happened with BIDS, and it was eventually decided to go for the common NIfTI format because of the arguments laid out above.
@jgrethe there are a few tools available that can read a wide range of ephys formats, e.g. Neo, SpikeInterface, SpykingCircus, so one option would be to support any format that can be read/converted by at least two different tools (while still recommending open, standard formats like NWB).
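For example, with Neo most supported formats are read through one uniform interface (a sketch assuming `neo.io.get_io`, which picks an IO class from the file extension; the filename is made up):

```python
import neo

# get_io guesses the right IO class (Plexon, Blackrock, Spike2, ...) from the filename.
io = neo.io.get_io('recording.plx')
block = io.read_block()   # a Block holds segments of signals and spike trains
print(block.segments[0].analogsignals)
```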
just want to comment again on re-usability, as I have spent the last 30 minutes trying to open a .nwb file, and have failed so far. I imagine people converting their behavior data into the NWB format with the idea of ensuring re-use, when it would in the end make it more difficult for people to re-use it. While I can see why one should invest time in using these tools, they are not suited to all types of data (while the structure should be).
@jcolomb I share your pain, and FWIW I shared it (if I guess the reasons for the pain right) with the NWB developers, and they are working to address it all
@jcolomb I'm sorry to hear you are having trouble opening an NWB file. I know that can be frustrating. I'd be happy to help if you could give me enough information to diagnose the problem, e.g. what file you are trying to open, what commands you are using, and what error you are getting. I'll send you a message on the NWB slack channel, as I think this is off-topic for the current thread.
thanks @yarikoptic for the laugh this Friday morning, it does help a lot. I will of course discuss my issue in an NWB-specific place (Slack); my point was not the pain of using NWB, but the adoption issue. I think we got a bit lost in ephys data details, and should also consider other data types (live and structural imaging, DNA/RNA experiments, behavior, proteomics, surveys). But maybe we need to be more constructive and start collecting information: what kinds of data should the standard support, what raw data exists, what open standards exist, and what tools can read what? I am starting such a collection for a data management plan I have to write...
hi @jcolomb! yes, I agree, we've collectively (I mean, in this group and at this moment) drifted towards focusing on the ephys case; that goes with the practical progress we've made and our current proposal to extend BIDS for ephys data recorded in animals... but overall, a consensus has emerged that trying to go with BIDS / extend BIDS for whatever modality where it's possible might be a solution worth pursuing! so, fyi, in parallel to our BIDS extension proposal for animal ephys (BEP032), there is another one dedicated to microscopy (BEP031) which should cover most of the imaging needs; between these two BEPs, we share the need to support animal data in BIDS, which is now discussed at the global BIDS level here. for behavior and omics, there are also other BIDS extension proposals, I think... @yarikoptic @satra? all this of course does not mean that BIDS is the only solution, but moving forward in practice with this solution should be beneficial for the community ;)
@SylvainTakerkart: not sure for behaviour, but definitely for omics: led by C. Pernet IIRC
at present behavior could be encoded in NWB (https://pynwb.readthedocs.io/en/latest/overview_nwbfile.html#processing-modules - see behavior) - at least that's what we are suggesting to DANDI users. regarding omics, it's at a completely different scale. BIDS has some support for omics (https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/08-genetic-descriptor.html), which essentially is a simple metadata structure with a pointer to an external resource housing the data. for the BRAIN Initiative, the NeMO data portal is housing transcriptomics and DANDI is housing some proteomics (through immunostaining via BEP032; example dataset here: https://dandiarchive.org/dandiset/000026).
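For concreteness, a minimal PyNWB behavior sketch following the linked tutorial (the `nwbfile` object is assumed to exist, e.g. from the electrode-table sketch above; the data are fake):

```python
import numpy as np
from pynwb.behavior import Position, SpatialSeries

# 'nwbfile' as created in the earlier ecephys sketch;
# positions sampled at 50 Hz in a made-up reference frame.
spatial = SpatialSeries(name='position', data=np.zeros((500, 2)),
                        reference_frame='arena top-left corner',
                        rate=50.0, starting_time=0.0)

behavior_module = nwbfile.create_processing_module(
    name='behavior', description='processed behavioral data')
behavior_module.add(Position(spatial_series=spatial))
```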
for surveys and other kinds of data (e.g., voice recordings) that we can collect online, we have been building ReproSchema via the ReproNim project, a JSON-LD based specification for both the questionnaire side and the response side. and for actigraphy and other data there are some efforts to consolidate in other projects.
I just want to link here to a related discussion that takes place around the BIDS specs... |
We propose to limit the file formats that are allowed to be placed within the directory structure. All allowed formats should have open, non-proprietary specifications.
Some suggestions to get the ball rolling: