Proposal: uncompressed input size #141

Open
natefoo opened this issue Nov 7, 2024 · 7 comments

@natefoo
Member

natefoo commented Nov 7, 2024

Currently input_size is the size of the raw input, which can be either compressed or uncompressed. When scaling memory based on input size you probably only care about the uncompressed size. But gzip does store the uncompressed size (modulo 2^32, in the trailing ISIZE field), which we could read into a separate uncompressed_input_size variable. Since it's stored in the last 4 bytes, this seems to work for me:

#!/usr/bin/env python3
import os
import sys

path = sys.argv[1]

with open(path, 'rb') as f:
    # The gzip trailer ends with ISIZE: the uncompressed size modulo 2**32,
    # stored as a little-endian 32-bit integer in the file's last 4 bytes.
    f.seek(-4, os.SEEK_END)
    size = int.from_bytes(f.read(4), 'little')
    print(size)

The uncompressed size also isn't always set properly. BAM, for example, is block gzip (BGZF), so the last 4 bytes are the ISIZE of the empty EOF block, which is 0:

nate@pdp-11% gzip -l /home/nate/work/galaxy/test-data/1.bam
         compressed        uncompressed  ratio uncompressed_name
               3592                   0   0.0% /home/nate/work/galaxy/test-data/1.bam

So we should have a default: the actual size, or the actual size times some constant factor.
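
A minimal sketch of that fallback, combining the ISIZE read with a plausibility check; the 4x expansion factor and the check itself are illustrative assumptions, not measured values:

import os
import sys

FALLBACK_FACTOR = 4  # hypothetical constant expansion factor

def uncompressed_input_size(path):
    compressed_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        isize = int.from_bytes(f.read(4), 'little')
    # ISIZE is modulo 2**32 and can be wrong or 0 (e.g. BGZF's empty EOF
    # block), so only trust it if it implies the file actually shrank.
    if isize >= compressed_size:
        return isize
    return compressed_size * FALLBACK_FACTOR

if __name__ == '__main__':
    print(uncompressed_input_size(sys.argv[1]))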

@mvdbeek
Member

mvdbeek commented Dec 10, 2024

Currently input_size is the size of the raw input,

That seems like a bug and something to be fixed in that helper? Would that close galaxyproject/galaxy#19280?

@natefoo
Member Author

natefoo commented Dec 10, 2024

Sure - you think this should be implemented in TPV though, not in Galaxy? I am not thrilled that TPV would need to pull the data into the cache if it's on object storage.

@mvdbeek
Member

mvdbeek commented Dec 10, 2024

Ah, I would be the last person to do that ;). I'm saying either we know which converted dataset we use, or we fix that.

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Dec 10, 2024
I believe this was always the intention and is how it works if the
dataset already exists (i.e. on a re-run). The converter records the
dataset that the user chose, so there's no gap in the provenance either.

This somewhat addresses
galaxyproject/total-perspective-vortex#141
so you can (reliably) differentiate your rules on the input datatype
and filesize combination.
@natefoo
Member Author

natefoo commented Dec 10, 2024

We discussed this out of band, but for everyone else: I am not talking about the converted dataset here. I mean having access to the uncompressed size of natively compressed (e.g. fastqsanger.gz) inputs, so that you can scale memory based on the size of the uncompressed reads regardless of whether the input is compressed.

As a stopgap, or maybe as an alternative if we don't want to store this in Galaxy, we also discussed that you could instead query and calculate stats on memory usage by input, separated into compressed and uncompressed (I need to write a query for this), and then use a rule construct such as the following:

- if: |
    datasets = [jtid.dataset for jtid in job.input_datasets if jtid.dataset]
    input_is_compressed = any(d.ext.endswith((".gz", ".bz2")) for d in datasets)
    (input_is_compressed and 0.1 < input_size < 2.0) or (not input_is_compressed and 0.5 < input_size < 4.0)
  cores: 4
  mem: 28

Or, without the input_size helper, computing the size directly from the inputs (a rough sketch; assuming Galaxy's DatasetInstance.get_size() returns bytes):
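
- if: |
    datasets = [jtid.dataset for jtid in job.input_datasets if jtid.dataset]
    input_is_compressed = any(d.ext.endswith((".gz", ".bz2")) for d in datasets)
    # hypothetical inline replacement for the input_size helper, in GB
    input_size_gb = sum(d.get_size() for d in datasets) / 1024**3
    (input_is_compressed and 0.1 < input_size_gb < 2.0) or (not input_is_compressed and 0.5 < input_size_gb < 4.0)
  cores: 4
  mem: 28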

@mvdbeek
Member

mvdbeek commented Dec 11, 2024

That rule is not good for performance (the ORM overhead is likely significant here); if you do that, I would suggest writing a helper that builds a single SQLAlchemy core statement. That helper might also estimate the uncompressed size so you don't need the two or'd conditions.
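
A rough sketch of what such a helper could look like, using one SQLAlchemy core-style statement instead of per-row ORM loads. The model classes and columns (JobToInputDatasetAssociation, HistoryDatasetAssociation.extension, Dataset.file_size) are assumptions based on Galaxy's schema and should be verified against your Galaxy version:

from sqlalchemy import case, func, or_, select

from galaxy.model import Dataset, HistoryDatasetAssociation, JobToInputDatasetAssociation

def job_input_stats(session, job_id):
    # 1 if a given input's extension marks it as compressed, else 0.
    is_compressed = case(
        (
            or_(
                HistoryDatasetAssociation.extension.like("%.gz"),
                HistoryDatasetAssociation.extension.like("%.bz2"),
            ),
            1,
        ),
        else_=0,
    )
    # Sum the input sizes and flag compression in a single round trip,
    # instead of lazy-loading each jtid -> HDA -> dataset via the ORM.
    stmt = (
        select(func.sum(Dataset.file_size), func.max(is_compressed))
        .select_from(JobToInputDatasetAssociation)
        .join(
            HistoryDatasetAssociation,
            JobToInputDatasetAssociation.dataset_id == HistoryDatasetAssociation.id,
        )
        .join(Dataset, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .where(JobToInputDatasetAssociation.job_id == job_id)
    )
    total_size, any_compressed = session.execute(stmt).one()
    return (total_size or 0), bool(any_compressed)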

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Dec 11, 2024
which means we collect and store the converted dataset as the job input.

We don't need to wait for the galaxy.json collection, we know the exact
target type already. That fixes the retrieval in
`get_converted_files_by_type`.

I believe this was always the intention and is how it works if the
dataset already exists (i.e. on a re-run), and for all converters that
don't use galaxy.json. The converter records the dataset that the user
chose, so there's no gap in the provenance either.

This somewhat addresses
galaxyproject/total-perspective-vortex#141
so you can (reliably) differentiate your rules on the input datatype
and filesize combination.
@natefoo
Member Author

natefoo commented Dec 11, 2024

What's the main penalty? Do we lazy load each jtid -> (h)da or something?

@mvdbeek
Member

mvdbeek commented Dec 11, 2024

Yes, plus the ORM overhead. We've optimized this code so much on the Galaxy side that it's painful to tear it back down in TPV :(
