Proposal: uncompressed input size #141

Open
natefoo opened this issue Nov 7, 2024 · 7 comments

@natefoo
Member

natefoo commented Nov 7, 2024

Currently input_size is the size of the raw input, which can be either compressed or uncompressed. When scaling memory based on input size you probably only care about the uncompressed size. But gzip does store the uncompressed size (modulo 2^32, in the trailing ISIZE field), which we could read into a separate uncompressed_input_size variable. Since it's stored in the last 4 bytes, this seems to work for me:

#!/usr/bin/env python3
import os
import sys

path = sys.argv[1]

with open(path, 'rb') as f:
    # The gzip trailer ends with ISIZE: the uncompressed size modulo 2**32,
    # stored as a little-endian 32-bit integer in the file's last 4 bytes.
    f.seek(-4, os.SEEK_END)
    size = int.from_bytes(f.read(4), 'little')
    print(size)

The uncompressed size also isn't always set properly. BAM, for example, is block gzip (BGZF), so the last 4 bytes are the ISIZE of the empty EOF block, which is 0:

nate@pdp-11% gzip -l /home/nate/work/galaxy/test-data/1.bam
         compressed        uncompressed  ratio uncompressed_name
               3592                   0   0.0% /home/nate/work/galaxy/test-data/1.bam

So we should have a default: the actual size, or the actual size times some constant factor.
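
A minimal sketch of that fallback, combining the ISIZE read with a plausibility check; the 4x expansion factor and the check itself are illustrative assumptions, not measured values:

import os
import sys

FALLBACK_FACTOR = 4  # hypothetical constant expansion factor

def uncompressed_input_size(path):
    compressed_size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)
        isize = int.from_bytes(f.read(4), 'little')
    # ISIZE is modulo 2**32 and can be wrong or 0 (e.g. BGZF's empty EOF
    # block), so only trust it if it implies the file actually shrank.
    if isize >= compressed_size:
        return isize
    return compressed_size * FALLBACK_FACTOR

if __name__ == '__main__':
    print(uncompressed_input_size(sys.argv[1]))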

@mvdbeek
Member

mvdbeek commented Dec 10, 2024

Currently input_size is the size of the raw input,

That seems like a bug and something to be fixed in that helper? Would that close galaxyproject/galaxy#19280?

@natefoo
Member Author

natefoo commented Dec 10, 2024

Sure - you think this should be implemented in TPV though, not in Galaxy? I am not thrilled that TPV would need to pull the data into the cache if it's on object storage.

@mvdbeek
Member

mvdbeek commented Dec 10, 2024

Ah, I would be the last person to do that ;). I'm saying either we know which converted dataset we use, or we fix that.

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Dec 10, 2024
I believe this was always the intention and is how it works if the
dataset already exists (i.e. on a re-run). The converter records the
dataset that the user chose, so there's no gap in the provenance either.

This somewhat addresses
galaxyproject/total-perspective-vortex#141
so you can (reliably) differentiate your rules on the input datatype
and filesize combination.
@natefoo
Member Author

natefoo commented Dec 10, 2024

We discussed this out of band, but for everyone else: I am not talking about the converted dataset here. I mean having access to the uncompressed size of natively compressed (e.g. fastqsanger.gz) inputs, so that you can scale memory based on the size of the uncompressed reads regardless of whether the input is compressed.

As a stopgap, or maybe as an alternative if we don't want to store this in Galaxy, we also discussed that you could instead query and calculate stats on memory usage by input, separated into compressed and uncompressed (I need to write a query for this), and then use a rule construct such as the following:

- if: |
    datasets = [jtid.dataset for jtid in job.input_datasets if jtid.dataset]
    input_is_compressed = any(d.ext.endswith((".gz", ".bz2")) for d in datasets)
    (input_is_compressed and 0.1 < input_size < 2.0) or (not input_is_compressed and 0.5 < input_size < 4.0)
  cores: 4
  mem: 28

Or, without the input_size helper, computing the size directly from the inputs (a rough sketch; assuming Galaxy's DatasetInstance.get_size() returns bytes):
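
- if: |
    datasets = [jtid.dataset for jtid in job.input_datasets if jtid.dataset]
    input_is_compressed = any(d.ext.endswith((".gz", ".bz2")) for d in datasets)
    # hypothetical inline replacement for the input_size helper, in GB
    input_size_gb = sum(d.get_size() for d in datasets) / 1024**3
    (input_is_compressed and 0.1 < input_size_gb < 2.0) or (not input_is_compressed and 0.5 < input_size_gb < 4.0)
  cores: 4
  mem: 28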

@mvdbeek
Member

mvdbeek commented Dec 11, 2024

That rule is not good for performance (the ORM overhead is likely significant here); if you do that, I would suggest writing a helper that builds a single SQLAlchemy core statement. That helper might also estimate the uncompressed size so you don't need the two or'd conditions.
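
A rough sketch of what such a helper could look like, using one SQLAlchemy core-style statement instead of per-row ORM loads. The model classes and columns (JobToInputDatasetAssociation, HistoryDatasetAssociation.extension, Dataset.file_size) are assumptions based on Galaxy's schema and should be verified against your Galaxy version:

from sqlalchemy import case, func, or_, select

from galaxy.model import Dataset, HistoryDatasetAssociation, JobToInputDatasetAssociation

def job_input_stats(session, job_id):
    # 1 if a given input's extension marks it as compressed, else 0.
    is_compressed = case(
        (
            or_(
                HistoryDatasetAssociation.extension.like("%.gz"),
                HistoryDatasetAssociation.extension.like("%.bz2"),
            ),
            1,
        ),
        else_=0,
    )
    # Sum the input sizes and flag compression in a single round trip,
    # instead of lazy-loading each jtid -> HDA -> dataset via the ORM.
    stmt = (
        select(func.sum(Dataset.file_size), func.max(is_compressed))
        .select_from(JobToInputDatasetAssociation)
        .join(
            HistoryDatasetAssociation,
            JobToInputDatasetAssociation.dataset_id == HistoryDatasetAssociation.id,
        )
        .join(Dataset, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .where(JobToInputDatasetAssociation.job_id == job_id)
    )
    total_size, any_compressed = session.execute(stmt).one()
    return (total_size or 0), bool(any_compressed)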

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Dec 11, 2024
which means we collect and store the converted dataset as the job input.

We don't need to wait for the galaxy.json collection, we know the exact
target type already. That fixes the retrieval in
`get_converted_files_by_type`.

I believe this was always the intention and is how it works if the
dataset already exists (i.e. on a re-run), and for all converters that
don't use galaxy.json. The converter records the dataset that the user
chose, so there's no gap in the provenance either.

This somewhat addresses
galaxyproject/total-perspective-vortex#141
so you can (reliably) differentiate your rules on the input datatype
and filesize combination.
@natefoo
Member Author

natefoo commented Dec 11, 2024

What's the main penalty? Do we lazy load each jtid -> (h)da or something?

@mvdbeek
Member

mvdbeek commented Dec 11, 2024

Yes, plus the ORM overhead. We've optimized this code so much on the Galaxy side that it's painful to tear it back down in TPV :(
