Description
Is your feature request related to a problem? Please describe.
update_obs currently requires the entire obs dataframe of the database (all rows) as input. Passing only a subset of rows fails with "ValueError: update_obs: old and new data must have the same row count;". For large databases, working with the entire obs dataframe is very cumbersome.
Describe the solution you'd like
Change update_obs to accept a subset of the obs dataframe and use the soma_joinid column to identify which rows to update.
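To make the requested semantics concrete, here is a minimal pandas-only sketch of matching rows on soma_joinid. The helper name merge_partial_obs is hypothetical and is not part of tiledbsoma. It still needs the full obs dataframe in memory, so it only illustrates the row matching and is not a fix for the memory problem.

```python
import pandas as pd


def merge_partial_obs(full_obs: pd.DataFrame, partial_obs: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of full_obs with rows overwritten from partial_obs.

    Hypothetical helper (not a tiledbsoma API). Rows are matched on the
    soma_joinid column; columns missing from full_obs are ignored.
    """
    if partial_obs["soma_joinid"].duplicated().any():
        raise ValueError("partial_obs contains duplicate soma_joinid values")

    unknown = set(partial_obs["soma_joinid"]) - set(full_obs["soma_joinid"])
    if unknown:
        raise ValueError(f"soma_joinid values not present in full_obs: {sorted(unknown)}")

    # set_index returns new frames, so the caller's dataframes are not modified.
    merged = full_obs.set_index("soma_joinid")
    updates = partial_obs.set_index("soma_joinid")

    # Only overwrite columns that already exist in the full dataframe.
    shared_cols = [col for col in updates.columns if col in merged.columns]
    merged.loc[updates.index, shared_cols] = updates[shared_cols]

    return merged.reset_index()


# Toy data standing in for the real obs dataframe.
full = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2, 3],
        "organism": ["human", "mouse", "human", "mouse"],
    }
)
subset = full[full["organism"] == "human"].copy()
subset["organism"] = "Homo sapiens"

updated = merge_partial_obs(full, subset)
print(updated)
```

The result has the same row count as the full dataframe, which is what the current update_obs check appears to require. A built-in version would ideally do this matching inside the library without loading every row.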
Describe alternatives you've considered
dask or polars can help with manipulating very large dataframes, but it would definitely be easier to update only certain rows.
I believe that the following is an alternative:
import pyarrow as pa
import tiledbsoma

# db_uri is assumed to be defined elsewhere and to point at the experiment

# query
obs_query = tiledbsoma.AxisQuery(value_filter='organism in ["human"]')

# get the target records
with tiledbsoma.Experiment.open(db_uri) as exp:
    df = (
        exp.axis_query("RNA", obs_query=obs_query)
        .obs()
        .concat()
        .to_pandas()
    )

# update metadata
df["organism"] = "Homo sapiens"

# update the database
with tiledbsoma.Experiment.open(db_uri, "w") as exp:
    exp.obs.write(pa.Table.from_pandas(df))

...but then why have update_obs, if exp.obs.write can work with either a partial or a complete version of the obs dataframe?