Create sample dataset #72

Open · davidackerman opened this issue Feb 13, 2024 · 18 comments

@davidackerman (Collaborator)
We need a sample dataset in a predefined format for testing and demoing

@rhoadesScholar (Member) commented Feb 14, 2024

  • Locally generated small datasets, e.g. raw volumes with a large "X" or two drawn in (in pixels), with the GT being an inverted version of the raw
  • @d-v-b: pulling a couple of training crops from s3://hela-2, plus the raw data around those crops with some generous padding, and storing them locally for convenient use
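
The first bullet could be sketched roughly like this (illustrative only; none of these names are part of dacapo's API, and numpy is assumed):

```python
import numpy as np

def make_toy_dataset(shape=(64, 64, 64), seed=0):
    """Generate a tiny synthetic raw volume and a matching GT volume.

    Illustrative sketch: the GT is simply the inverted raw,
    as described in the first bullet above.
    """
    rng = np.random.default_rng(seed)
    raw = rng.integers(0, 256, size=shape, dtype=np.uint8)
    gt = 255 - raw  # the "inverted version" of the raw data
    return raw, gt

raw, gt = make_toy_dataset()
```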

@d-v-b (Contributor) commented Feb 14, 2024

Which crops, how much padding, and where should it be saved?

@rhoadesScholar (Member) commented Feb 15, 2024

So far, the small datasets to pull from s3:// via a script:

  • raw: data/jrc_hela-2/jrc_hela-2.zarr/recon-1/em/fibsem-uint8 (with separate crops for each GT cube)
  • validation (converted to separate arrays):
    • data/jrc_hela-2/staging/groundtruth.zarr/crop113/all
    • data/jrc_hela-2/staging/groundtruth.zarr/crop155/all
  • train (converted to separate arrays):
    • ...

@avweigel (Member)

@d-v-b @yuriyzubov
The datasets that we want available on s3:// are currently all on our nrs. We want to include jrc_hela-2.zarr and the crops that are in /staging/groundtruth.zarr.

The general format should follow our schema:

em data: jrc_hela-2.zarr/recon-1/em/...
crop data: jrc_hela-2.zarr/recon-1/labels/groundtruth/...

Explicit list of crops to be included:
crop1
crop3
crop4
crop6
crop7
crop8
crop9
crop13
crop14
crop15
crop16
crop18
crop19
crop23
crop28
crop54
crop55
crop56
crop57
crop58
crop59
crop94
crop95
crop96
crop113
crop155
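
For reference, the s3 URIs for these crops can be generated from the list above. The bucket and path layout follow the s3 paths quoted elsewhere in this thread; the helper name is made up for illustration:

```python
# Sketch: build the groundtruth URI for each crop listed above.
CROPS = [
    "crop1", "crop3", "crop4", "crop6", "crop7", "crop8", "crop9",
    "crop13", "crop14", "crop15", "crop16", "crop18", "crop19",
    "crop23", "crop28", "crop54", "crop55", "crop56", "crop57",
    "crop58", "crop59", "crop94", "crop95", "crop96", "crop113",
    "crop155",
]

BASE = "s3://janelia-cosem-datasets/jrc_hela-2/jrc_hela-2.zarr"

def crop_uri(crop: str) -> str:
    """Groundtruth URI for one crop, following the layout above."""
    return f"{BASE}/recon-1/labels/groundtruth/{crop}"

uris = [crop_uri(c) for c in CROPS]
```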

@d-v-b (Contributor) commented Feb 15, 2024

  • Upload crops to s3 from hela2/staging
  • A script will download from s3 a skeleton raw volume, with EM data present only in the regions covered by the crops plus 256 voxels of padding on each dimension
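
The "crop + 256 on each dimension" region could be computed along these lines (a sketch; the function and argument names are made up, not taken from the actual script):

```python
def padded_slices(offset, shape, pad=256, full_shape=None):
    """Index slices covering a crop plus `pad` voxels on every side,
    clipped to the bounds of the full raw volume if given."""
    slices = []
    for axis, (start, size) in enumerate(zip(offset, shape)):
        lo = max(start - pad, 0)          # clip at the lower bound
        hi = start + size + pad
        if full_shape is not None:
            hi = min(hi, full_shape[axis])  # clip at the upper bound
        slices.append(slice(lo, hi))
    return tuple(slices)
```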

@d-v-b (Contributor) commented Feb 15, 2024

Also, replace the current jrc_hela-2.zarr on s3.

@d-v-b (Contributor) commented Feb 16, 2024

The data on s3 is now correct (i.e., s3://janelia-cosem-datasets/jrc_hela-2/jrc_hela-2.zarr/recon-1/em/fibsem-uint8 and s3://janelia-cosem-datasets/jrc_hela-2/jrc_hela-2.zarr/recon-1/labels/groundtruth are populated). I started a command-line tool for copying the right data locally; I uploaded it as a gist, which you can find here: https://gist.github.com/d-v-b/6dc1ae079b664711061490ba4b866c6c.

Obviously this will eventually need to a) do all the things it's supposed to do, and b) be integrated into dacapo, but I don't think I can do either of those things today. @yuriyzubov (or anyone else), if you want to hack on this script, feel free; just let me know so that we can avoid duplicated effort. Specifically, if it becomes part of dacapo, please link that PR or commit to this issue so I know about it. Otherwise I can finish it up over the weekend.

@d-v-b (Contributor) commented Feb 19, 2024

@avweigel: two of the crops in this list overlap (crop6 and crop113). Is that OK?

@d-v-b (Contributor) commented Feb 19, 2024

I updated the gist with a fully functioning script. It's pretty slow (running it took several hours on my workstation), but it does work. If the poor performance is a problem, we can explore some performance optimizations. I am already doing some parallelism, but it's pretty coarse-grained and could surely benefit from better tooling.
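
For reference, the kind of coarse-grained parallelism described here could look like the sketch below (an assumed shape, not the actual gist code): copying chunk keys between dict-like stores with a thread pool. Zarr stores expose a mapping interface, so the same pattern applies when the source is a remote store and the destination is local.

```python
from concurrent.futures import ThreadPoolExecutor

def copy_keys(src, dst, keys, max_workers=8):
    """Copy `keys` from one dict-like chunk store to another in parallel.

    Sketch only; finer-grained scheduling (e.g. per-chunk futures on a
    shared client) is where further optimization could come from.
    """
    def copy_one(key):
        dst[key] = src[key]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all copies to complete and re-raises any errors
        list(pool.map(copy_one, keys))
```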

@rhoadesScholar if I wanted this script to be added to dacapo, where would we put it in the source tree?

@rhoadesScholar rhoadesScholar moved this from Todo to In Progress in DaCapo Hackathon 2024 Feb 21, 2024
@rhoadesScholar (Member) commented Mar 13, 2024

@d-v-b The idea was to put it in the examples folder. But perhaps this should be done with a more minimal list of crops to speed things up. I imagine users might start getting frustrated after 5+ minutes if they're just trying to run an example notebook. 😬

Are you downloading the whole scale pyramids? That would explain a lot of the slowness, and they could be safely omitted for simple example cases, imo.

@d-v-b (Contributor) commented Mar 14, 2024

I will see how things run with a reduced number of crops + only downloading s0, and I will open a PR that actually adds the script to dacapo in the examples folder.

@mzouink (Member) commented Mar 14, 2024

Sorry for jumping in late. I think the goal is to have something similar to TensorFlow and PyTorch.

TensorFlow:

import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train', shuffle_files=True)

PyTorch:

from torchvision import datasets
from torchvision.transforms import ToTensor  # needed for the transform below

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

DaCapo Example:

import dacapo_datasets as dds
datasplit = dds.HeLaCell(
    path="/path")

@rhoadesScholar (Member)

I think dacapo.datasets.HelaCell would be an awesome entry point.

@mzouink (Member) commented Mar 14, 2024

If we want to give them the best experience, I would recommend making the hello-world example finetune the setup04 model.
To do this, I would recommend pulling GT from s1 (8 nm) and raw from s2 (16 nm, because we are using an upsampling UNet).
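
For orientation, these scale levels follow the usual power-of-two multiscale convention; with s1 = 8 nm and s2 = 16 nm as above, the base (s0) resolution works out to 4 nm. A tiny helper, illustrative only:

```python
def scale_to_nm(level: int, base_nm: int = 4) -> int:
    """Voxel size in nm at multiscale level `level` (s0, s1, ...).

    Assumes s0 is 4 nm and each level downsamples by 2, which is
    inferred from the s1 = 8 nm / s2 = 16 nm figures above.
    """
    return base_nm * 2 ** level
```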

@rhoadesScholar (Member) commented Mar 14, 2024

from dacapo import datasets
training_data = datasets.HelaCell(
    download=True, # download data (instead of training from the cloud, which isn't implemented yet)
    root="data", # download to folder "./data/"
    raw_scale=8, # download the 8nm raw data
    gt_scale=4, # download the 4nm GT data
)

@d-v-b (Contributor) commented Mar 14, 2024

What's the type of training_data here?

@mzouink (Member) commented Mar 14, 2024

@rhoadesScholar (Member)

Revisiting this, @d-v-b and @yuriyzubov: we need a script that essentially does this for the segmentation challenge as well.

Status: Backlog · No branches or pull requests · 6 participants