
[Feature request] update_obs: enable update of a subset #3715

@nick-youngblut

Is your feature request related to a problem? Please describe.

update_obs currently requires the entire database obs dataframe (all rows) as input. Passing only a subset of rows fails with: ValueError: update_obs: old and new data must have the same row count. For large databases, working with the entire obs dataframe is very cumbersome.

Describe the solution you'd like

Edit update_obs to work with a subset of the obs dataframe, and just use the soma_joinid column for updating the correct rows.
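To make the requested semantics concrete, here is a minimal pandas-only sketch of "update just the rows named by soma_joinid". The helper name merge_obs_subset is hypothetical and is not part of tiledbsoma. It assumes both frames carry a soma_joinid column with unique values.

```python
import pandas as pd


def merge_obs_subset(full_obs: pd.DataFrame, subset: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of full_obs with the rows in subset overwritten.

    Hypothetical helper (not a tiledbsoma API). Rows are matched on the
    soma_joinid column; only the columns present in subset are changed.
    """
    # Refuse join IDs that do not exist in the full frame.
    unknown = set(subset["soma_joinid"]) - set(full_obs["soma_joinid"])
    if unknown:
        raise ValueError(f"soma_joinid values not present in obs: {sorted(unknown)}")

    merged = full_obs.set_index("soma_joinid")  # new frame; full_obs is left untouched
    patch = subset.set_index("soma_joinid")

    # Column by column, overwrite only the rows named in the subset.
    # Series assignment through .loc aligns on the soma_joinid index.
    for col in patch.columns:
        merged.loc[patch.index, col] = patch[col]

    return merged.reset_index()


# Toy data standing in for a real obs dataframe.
full = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2, 3],
        "organism": ["human", "mouse", "human", "mouse"],
        "tissue": ["lung", "liver", "blood", "brain"],
    }
)
subset = pd.DataFrame(
    {"soma_joinid": [2, 0], "organism": ["Homo sapiens", "Homo sapiens"]}
)
updated = merge_obs_subset(full, subset)
```

The result has the same row count as the full frame, which is what the error message above asks for. This still requires loading the full obs dataframe, so it only illustrates the matching logic rather than removing the cost.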

Describe alternatives you've considered

Dask or Polars can help with manipulating very large dataframes, but it would definitely be easier to just update certain rows.

I believe that the following is an alternative:

import pyarrow as pa
import tiledbsoma

# query
obs_query = tiledbsoma.AxisQuery(value_filter='organism in ["human"]')

# get the target records
with tiledbsoma.Experiment.open(db_uri) as exp:
    df = (
        exp.axis_query("RNA", obs_query=obs_query)
        .obs()
        .concat()
        .to_pandas()
    )

# update metadata
df["organism"] = "Homo sapiens"

# update the database
with tiledbsoma.Experiment.open(db_uri, "w") as exp:
    exp.obs.write(pa.Table.from_pandas(df))

...but then why have update_obs, if exp.obs.write can work with either a partial or complete version of the obs dataframe?
