
Exporting a Graphene (PyChunkGraph) Volume

William Silversmith edited this page Mar 19, 2024 · 3 revisions

To create archival flat segmentations of a dynamic proofreading volume (recognizable by a cloud path starting with graphene://; these volumes are backed by the PyChunkGraph), all you need is Igneous!

Graphene is not well integrated into the CLI, so for the initial transfer, you should use the scripting interface.

Creating the initial materialization

First create the bucket where the data will be stored and, if needed, an SQS queue to hold the tasks. Then run the following script with your parameter choices plugged in. Below we use minnie65 as an example. Pick a timepoint to render and express that datetime in UNIX seconds. Select a chunk size and a compression mode (options: raw, compressed_segmentation, compresso, and crackle). Shard parameters are computed automatically based on the default memory target of 3.5 GB uncompressed (this is configurable).
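If it helps, a UTC datetime can be converted to UNIX seconds with GNU date (the example datetime matches the TIMESTAMP used in the script below):

```shell
# Convert a UTC datetime to UNIX seconds (GNU date, as on a typical Linux box)
date -u -d "2024-01-22T08:10:01" +%s   # prints 1705911001
```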

There are two functions below: xfer65 and xfer65_l2. The former renders the root IDs (the current global state of proofreading). The latter specifies stop_layer=2 and renders the L2 aggregates (a grid of small cubes of proofread data).

Most of the time you'll want to pick compressed_segmentation for the encoding, as it is backwards compatible with all Neuroglancer viewers. compresso is compatible with Neuroglancer viewers updated since mid-2021 and is much smaller in file size. crackle is smaller still (and has other features too), but viewer support is more experimental.

import igneous.task_creation as tc 
from taskqueue import TaskQueue

# 2024-01-22T08:10:01.497934 UTC
TIMESTAMP = 1705911001

def xfer65():
  return tc.create_image_shard_transfer_tasks(
    src_layer_path="graphene://https://MY_URL", 
    dst_layer_path="gs://MY_BUCKET/", 
    chunk_size=(128, 128, 32), 
    fill_missing=False, 
    translate=(0,0,0), 
    bounds=None, 
    mip=0,
    encoding="compressed_segmentation", 
    agglomerate=True,
    timestamp=TIMESTAMP, 
    clean_info=True,
  )

def xfer65_l2():
  return tc.create_image_shard_transfer_tasks(
    src_layer_path="graphene://https://MY_URL", 
    dst_layer_path="gs://MY_BUCKET/l2/", 
    chunk_size=(128, 128, 32), 
    fill_missing=False, 
    translate=(0,0,0), 
    bounds=None, 
    mip=0,
    encoding="compresso", 
    agglomerate=True,
    timestamp=TIMESTAMP, 
    clean_info=True,
    stop_layer=2,
  )

tasks = xfer65()
tq = TaskQueue("sqs://my-igneous-queue")
tq.insert(tasks)

Once you've populated the queue, run Igneous, either locally or in the cloud, to pull tasks from the queue and transfer image shards to the destination bucket.

# example for local execution
igneous --parallel 8 execute -x sqs://my-igneous-queue

This process can take a while. The limiting factor is how quickly the PyChunkGraph can render the correct labels for a given chunk.

Downsampling

Currently, sharded downsampling only produces 1 mip at a time and so is somewhat tedious since you have to run it 9-12 times for large volumes to get a sufficient number of mip levels. One possibility is to run it 5 or 6 times and then generate 5 unsharded mip levels at once. The reason for this constraint is memory usage. Hopefully in the future, we will be able to generate 2-3 sharded mips at once.

igneous image downsample gs://MY_BUCKET --mip 0 --num-mips 1 --sharded --encoding cseg --fill-missing --chunk-size 128,128,32 --queue sqs://my-igneous-queue
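Since each sharded pass produces one mip, the repeated runs can be scripted. A sketch, assuming the same flags as the command above and that the queue is fully drained between passes (the queue name is an example):

```shell
# Sketch: one sharded downsample pass per mip, draining the queue each time.
# Each pass reads mip N and writes mip N+1.
for mip in 0 1 2 3 4 5; do
  igneous image downsample gs://MY_BUCKET --mip $mip --num-mips 1 --sharded \
    --encoding cseg --fill-missing --chunk-size 128,128,32 \
    --queue sqs://my-igneous-queue
  igneous --parallel 8 execute -x sqs://my-igneous-queue
done
```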

At high mip levels, raw+gzip encoding becomes the most efficient as the downsampled segmentation approaches white noise.
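For those highest mips, the same downsample command can be run with raw encoding instead (the flag value is an assumption based on the encodings listed earlier; raw chunks are typically gzip-compressed on upload by default):

```shell
# Sketch: a high-mip pass using raw encoding instead of cseg (mip number illustrative)
igneous image downsample gs://MY_BUCKET --mip 6 --num-mips 1 --sharded \
  --encoding raw --fill-missing --queue sqs://my-igneous-queue
```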

Multi-Resolution Meshing

The next step is to generate multi-resolution meshes. There are four steps (two of them optional).

  1. (optional) Compute the global voxel counts per object.
  2. Calculate the base meshes per grid space.
  3. (optional) Next, for images larger than pinky100, load the spatial index into a temporary SSD-backed MySQL database with at least 100 GB of disk space, configured to accept connections from the internet.
  4. Generate and process multi-resolution sharded mesh tasks that query the MySQL database to determine which mesh fragments to fetch.

(Optional) Compute Global Voxel Counts

When computing the base meshes, there are usually many tiny labels that slow down the process tremendously. If that doesn't matter, you can set the dust threshold to 0 and skip this step. If you set a dust threshold, since base meshes are calculated per cutout, you may end up "chipping" meshes that only intersect a small portion of a task. To avoid chipping, use the global voxel counts.

igneous image voxels count --mip 3 --queue sqs://my-igneous-queue 

Once you've processed that queue, run the following command locally to aggregate the results. It can take a while. For very large images, it's important to set the compression to zstd, otherwise the IntMap file may be several gigabytes (it will be pulled by many processes later). zstd will get it down to a few hundred MB in a reasonable amount of time.

igneous image voxels sum --mip 3 --compress zstd

Initial Meshing

Select a near-isotropic mip level. Set a dust threshold. If you set a non-zero dust threshold without --dust-global, some meshes will get chipped. If you set a zero threshold, a lot of time will be wasted on tiny insignificant objects.

If you didn't compute the global voxel counts, set the dust threshold to 0:

igneous mesh forge gs://MY_BUCKET --mip 3 --dust-threshold 0 --fill-missing --sharded --queue sqs://my-igneous-queue

If you did compute the global voxel counts, you can pick any dust threshold safely:

igneous mesh forge gs://MY_BUCKET --mip 3 --dust-threshold 1000 --dust-global --sharded --fill-missing --queue sqs://my-igneous-queue

(Highly Advised) Setup a MySQL Spatial Index

Once you are done processing the initial mesh fragment files, a JSON spatial index file will have been generated for each task. This index is necessary to process sharded meshes since otherwise you won't be able to determine which fragment files contain the mesh pieces you need to complete a merge.

Igneous can use these JSON files as a spatial index on its own, but beyond a certain volume size (larger than pinky100 is a good rule of thumb) it becomes unwieldy, as each mesh merge task queries all spatial index files. This is costly both in additional requests (Class B request charges that grow quadratically) and in processing time (potentially many minutes or hours per task).

Thus, it makes sense to process the index once and serve efficient queries from a temporary MySQL server. (NOTE: If you are running everything on a local cluster, you can probably use a SQLite database, which requires less setup. In that case, skip to the command below and run it with sqlite://spatial-index.db.)
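For that local-cluster case, a sketch of the SQLite variant of the index load (same command shape as the MySQL version, with the database path swapped):

```shell
# Sketch: build the spatial index in a local SQLite file (no server setup required)
igneous mesh spatial-index db gs://MY_BUCKET sqlite://spatial-index.db
```

The later merge step would then take --spatial-index-db sqlite://spatial-index.db instead of the MySQL URL.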

On Google Cloud Platform, go to "Cloud SQL" and pick a server with at least two cores and 100 GB of SSD storage. The server will likely stall during upload if HDD is selected. Configure it with a password and a public IP, and allow connections from the internet (0.0.0.0/0).

Then run the following Igneous command, which may take a while (20 minutes to an hour). Insert your password and host IP in the appropriate parts of the database path.

igneous mesh spatial-index db gs://MY_BUCKET mysql://root:{pwd}@{host_IP}/spatial-index

Create Multi-Resolution Sharded Meshes

Now create the mesh merging tasks.

igneous mesh merge-sharded --nlod 5 gs://MY_BUCKET --spatial-index-db mysql://root:{pwd}@{host_IP}/spatial-index --queue sqs://my-igneous-queue
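As with the earlier stages, once the merge tasks are in the queue, run workers to execute them (the queue name is the example used throughout):

```shell
# Process the mesh merge queue, locally or in the cloud, same as before
igneous --parallel 8 execute -x sqs://my-igneous-queue
```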