Description
I am working on a use case for TileDB-SOMA with proteomics data. The goal is to store the readings from many plates in a single TileDB-SOMA store, so that we can easily query and slice across the full dataset.
Our current thinking:
- The plate barcodes, well positions, and other sample metadata go in the OBS dataframe
- The protein IDs (e.g. UniProt accessions) go in the VAR dataframe
- The different data points per well+protein go in different layers
The challenge comes from different plates having different panels. We can make the VAR dataframe the union of all proteins across all panels, but then we have to decide how to handle the "gaps", where a well from a given plate has no value for a protein because that protein is not on the plate's panel. With a sparse matrix, it seems we cannot tell whether an empty cell means a measured zero or the absence of a measurement.
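To make the ambiguity concrete, here is a minimal pure-Python sketch. It does not use the TileDB-SOMA API; the plate names, protein IDs, readings, and the `read` helper are all made up for illustration, and it assumes a sparse layer that drops zeros on write and fills unset cells with zero on read.

```python
# Hypothetical panels: the proteins (UniProt IDs) measured on each plate.
panels = {
    "plate_A": {"P01308", "P01375"},
    "plate_B": {"P01375", "P05231"},
}

# The var axis is the union of all proteins across all panels.
var_ids = sorted(set().union(*panels.values()))
var_index = {pid: j for j, pid in enumerate(var_ids)}

# The obs axis has one row per (plate, well).
obs = [("plate_A", "A1"), ("plate_B", "A1")]

# Hypothetical raw readings, including a genuine zero for P01375 on plate_A.
raw = {
    (0, "P01308"): 12.5,
    (0, "P01375"): 0.0,
    (1, "P01375"): 3.2,
    (1, "P05231"): 7.9,
}

# Assumption: the sparse layer keeps only non-zero cells, keyed by (obs_idx, var_idx).
sparse_layer = {
    (i, var_index[pid]): value for (i, pid), value in raw.items() if value != 0.0
}


def read(layer, i, j):
    """Read a cell the way a zero-filling sparse reader would."""
    return layer.get((i, j), 0.0)


measured_zero = read(sparse_layer, 0, var_index["P01375"])  # measured, value was 0
not_on_panel = read(sparse_layer, 0, var_index["P05231"])   # never measured on plate_A
print(measured_zero, not_on_panel)
```

Both reads come back identical under these assumptions, which is exactly the gap problem described above.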
Some ideas we considered:
- Initially, I thought that multiple measurements could solve our protein panel problem. On closer study, though, the multi-modal support seems designed for the case where the features vary across the same set of observations, whereas here the features vary across different sets of observations.
- Use a dense matrix rather than a sparse matrix, so "NaN" can represent lack of a value, and zero can mean zero.
- Add a layer of type "string" which, where present, gives contextual information about that position in the matrix. Besides recording whether the observation has a value for the protein, this would let us represent different error conditions, since we see several types of "bad reading" that we might want to distinguish. If this layer is a sparse matrix, it would hopefully be space efficient, because positions with no extra context would store nothing.
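The last two ideas can be sketched in plain Python to compare how they behave. This is only an illustration under assumed data, not TileDB-SOMA code: the matrix shape, the readings, the status strings (`"not_on_panel"`, `"below_detection_limit"`), and the `describe` helper are all hypothetical.

```python
import math

n_obs, n_var = 2, 3

# Dense idea: every cell exists; NaN means "no value", 0.0 means a measured zero.
dense = [[math.nan] * n_var for _ in range(n_obs)]
dense[0][0] = 12.5
dense[0][1] = 0.0   # genuine zero reading
dense[1][1] = 3.2
# dense[0][2], dense[1][0] and dense[1][2] stay NaN (no usable value)

# String-layer idea: a sparse layer holding context only where it is needed.
# Cells with an ordinary reading store nothing here.
status = {
    (0, 2): "not_on_panel",
    (1, 0): "not_on_panel",
    (1, 2): "below_detection_limit",  # hypothetical "bad reading" condition
}


def describe(values, notes, i, j):
    """Combine the value layer and the status layer for one cell."""
    note = notes.get((i, j))
    if note is not None:
        return note
    value = values[i][j]
    if math.isnan(value):
        return "missing (no context recorded)"
    return f"value={value}"


for i in range(n_obs):
    for j in range(n_var):
        print((i, j), describe(dense, status, i, j))
```

In this sketch the dense layer alone separates "zero" from "no value", but only the status layer can say why a value is absent. The cost of the dense layer grows with the full obs × var size, while the status layer grows only with the number of annotated cells.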
I am interested in the experience of others in handling this kind of scenario, and advice on the suitability of the ideas we've considered for handling the issue.