Clarify disk-backed operations and need for assignment #90

Open
berombau opened this issue Feb 6, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@berombau
Member

berombau commented Feb 6, 2025

Right now there are some subtle issues with in-memory or disk-backed operations:

In-memory segmentation works as expected:

import harpy

sdata = harpy.datasets.resolve_example().subset("raw_image")
harpy.im.segment(sdata, img_layer="raw_image")
harpy.pl.plot_labels(sdata, labels_layer="segmentation_mask") # gives labeled output

Disk-backed segmentation like this does not appear to work when plotting:

import harpy

sdata = harpy.datasets.resolve_example().subset("raw_image")
sdata.write("example123.zarr")
harpy.im.segment(sdata, img_layer="raw_image")
harpy.pl.plot_labels(sdata, labels_layer="segmentation_mask") # gives no output even though there is a 'segmentation_mask' element folder on-disk
INFO     The SpatialData object is not self-contained (i.e. it contains some elements that are Dask-backed from    
         locations outside example123.zarr). Please see the documentation of `is_self_contained()` to understand   
         the implications of working with SpatialData objects that are not self-contained.                         
INFO     The Zarr backing store has been changed from None the new file path: example123.zarr                      
2025-02-06 10:56:11,981 - harpy.image.segmentation._segmentation - INFO - Linking labels across chunks.
2025-02-06 10:56:11,982 - harpy.image._manager - WARNING - No dims parameter specified. Assuming order of dimension of provided array is ((c), (z), y, x)
2025-02-06 10:56:11,987 - harpy.image._manager - INFO - Writing results to layer 'segmentation_mask'
2025-02-06 10:56:12,271 - harpy.shape._manager - INFO - Finished vectorizing. Dissolving shapes at the border of the chunks. This can take a couple minutes if input mask contains a lot of chunks.
2025-02-06 10:56:12,293 - harpy.shape._manager - INFO - Dissolve is done.

But they do work when writing after segmentation:

import harpy

sdata = harpy.datasets.resolve_example().subset("raw_image")
harpy.im.segment(sdata, img_layer="raw_image")
sdata.write("example456.zarr")
harpy.pl.plot_labels(sdata, labels_layer="segmentation_mask") # gives labeled output

or reassigning the sdata variable:

sdata = harpy.datasets.resolve_example().subset("raw_image")
sdata.write("example789.zarr")
sdata = harpy.im.segment(sdata, img_layer="raw_image")
harpy.pl.plot_labels(sdata, labels_layer="segmentation_mask") # gives output

Is this reproducible @ArneDefauw? We've encountered this need for reassignment before. It's an inconsistency and a footgun for users that we should try to fix. I think it would be nicer if sdata.write had no influence on later operations and users didn't have to reassign with sdata = everywhere. But is this a limitation we can't fix, or something to fix in the segment function, the plotting function, or SpatialData?

@berombau berombau added the bug Something isn't working label Feb 6, 2025
@ArneDefauw
Collaborator

Hi @berombau, I am aware of this behaviour, and it could probably be improved, but I am not sure what the best way to fix it is.

The root cause of the issue is that if we write a spatial element of a SpatialData object, the sdata object is not updated in place. The Dask task graph is executed, but sdata.write_element(...) does not trigger an in-place reload of sdata. Small dummy example:

import dask.array as da
from spatialdata import SpatialData, read_zarr
from spatialdata.models import Image2DModel

sdata = SpatialData()
sdata.write("/Users/arnedf/VIB/DATA/test_data/sdata_test.zarr")
sdata = read_zarr("/Users/arnedf/VIB/DATA/test_data/sdata_test.zarr")

random_image = da.random.random((1, 1000, 2000), chunks=(1, 500, 500))
random_image_2 = da.random.random((1, 2000, 1000), chunks=(1, 500, 500))

# Lazy result: nothing is computed until the task graph is executed.
arr = da.matmul(random_image, random_image_2)

sdata["image"] = Image2DModel.parse(arr, dims=("c", "y", "x"))

If we now do sdata.write_element("image"), the computation is triggered and the results are written to the zarr store. But the sdata object still holds the element "image" for which no computation has been done yet.
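This reload-after-write requirement can be illustrated with a toy model in plain Python (this is not the real SpatialData API; Lazy, write_element, and read_store below are stand-ins for the Dask-backed element, the zarr write, and read_zarr):

```python
class Lazy:
    """Toy stand-in for a Dask-backed element: holds an unexecuted thunk."""
    def __init__(self, thunk):
        self.thunk = thunk

store = {}  # toy stand-in for the zarr store

def write_element(store, name, element):
    # Writing executes the task graph and persists the result...
    store[name] = element.thunk()
    # ...but the in-memory handle is NOT swapped for the on-disk copy.

def read_store(store):
    # Reloading returns elements that reference the persisted data.
    return dict(store)

sdata = {"image": Lazy(lambda: [[1, 2], [3, 4]])}
write_element(store, "image", sdata["image"])

# The object we kept still holds only the unexecuted thunk:
assert isinstance(sdata["image"], Lazy)

# Only reassigning after a reload (the sdata = read_zarr(sdata.path)
# pattern) yields the materialized, disk-backed data:
sdata = read_store(store)
assert sdata["image"] == [[1, 2], [3, 4]]
```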

This is why I always do

sdata.write_element(...)
sdata = read_zarr(sdata.path)

in the harpy code.

This is also why the sdata that is returned by harpy functions is a different object.

The reason why you do not see labels plotted when doing

harpy.pl.plot_labels(sdata, labels_layer="segmentation_mask")

when sdata is disk-backed, is that the sdata object produced here:
harpy.im.segment(sdata, img_layer="raw_image")
holds a reference to a temporary zarr store that has already been removed during computation, see

zarr_path = os.path.join(temp_path, f"labels_{uuid.uuid4()}.zarr")

There are some possible solutions:

  1. Add a parameter lazy to every harpy function. However, this is only a theoretical solution, because most harpy functions optimize the Dask task graph (persisting in memory where necessary, or writing to an intermediate zarr store to make the task graph smaller), and it would be difficult to make them lazy.
    In other words, calling a harpy function always means executing the task graph.

  2. Give sdata.write_element(...) an option inplace or reload, so that sdata is written to the zarr store and then reloaded.

  3. Monkey patch sdata and take care of this reloading inside harpy, so the object that goes into a harpy function is the same as the one that comes out.
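Option 3 could be realized roughly like the following decorator sketch (hypothetical code, not existing harpy internals; the reload() call stands in for read_zarr(result.path)):

```python
import functools

def reload_after_write(func):
    """Hypothetical wrapper for option 3: make the object that comes out
    of a harpy function consistent with its backing store, so the user
    does not need a manual reassignment."""
    @functools.wraps(func)
    def wrapper(sdata, *args, **kwargs):
        result = func(sdata, *args, **kwargs)
        # If the object is disk-backed, reload it from its store; in the
        # real code this would be read_zarr(result.path).
        if getattr(result, "path", None) is not None:
            result = result.reload()  # hypothetical reload hook
        return result
    return wrapper
```

In-memory objects (path is None) would pass through unchanged, so the wrapper only changes behaviour for disk-backed objects.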

@berombau
Member Author

berombau commented Feb 6, 2025

Thanks for the clear explanation!

For the solutions:

  1. Also not a fan of this solution.
  2. This sounds like a fix that would prevent inconsistent SpatialData objects from being created in the first place, so maybe this is an issue we could raise in SpatialData.
  3. Not a big fan of this solution either, but I guess every function should at least check that the input SpatialData object is self-contained (SpatialData.is_self_contained()) and consistent with the zarr store (SpatialData._symmetric_difference_with_zarr_store()).
    But instead of fixing it in place, maybe we can just log a warning for the user: Inconsistencies in input SpatialData object detected. Remove them by using an assignment after writing functions, e.g. sdata = harpy.{operation}(sdata, ...).
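The warning idea in point 3 could be sketched as a small input guard (check_consistency is a hypothetical helper; is_self_contained() is the real SpatialData method referenced above):

```python
import logging

logger = logging.getLogger("harpy")

def check_consistency(sdata) -> None:
    """Hypothetical guard run at the start of every harpy function:
    warn about inconsistencies instead of fixing the object in place."""
    # Only disk-backed objects can be out of sync with their store.
    if getattr(sdata, "path", None) is None:
        return
    if not sdata.is_self_contained():
        logger.warning(
            "Inconsistencies in input SpatialData object detected. "
            "Remove them by using an assignment after writing functions, "
            "e.g. sdata = harpy.{operation}(sdata, ...)."
        )
```

The guard stays purely advisory: it never mutates or reloads sdata, it only tells the user how to restore consistency themselves.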
