Proposal: uncompressed input size #141
That seems like a bug and something to be fixed in that helper? Would that close galaxyproject/galaxy#19280?
Sure - you think this should be implemented in TPV though, not in Galaxy? I am not thrilled that TPV would need to pull the data into the cache if it's on object storage.
Ah, I would be the last person to do that ;). I'm saying either we know which converted dataset we use, or we fix that.
I believe this was always the intention and is how it works if the dataset already exists (i.e. on a re-run). The converter records the dataset that the user chose, so there's no gap in the provenance either. This somewhat addresses galaxyproject/total-perspective-vortex#141 so you can (reliably) differentiate your rules on the input datatype and filesize combination.
We discussed this out of band, but for everyone else: I am not talking here about the converted dataset, I mean having access to the uncompressed size of natively compressed (e.g. gzipped) inputs. As a stopgap, or maybe an alternative if we don't want to store this in Galaxy, we did discuss that you could instead query and calculate stats on memory usage by input, separated by compressed and uncompressed (I need to write a query for this), and then use a rule construct such as the following:

```yaml
- if: |
    datasets = [jtid.dataset.dataset for jtid in job.input_datasets if jtid.dataset]
    input_is_compressed = any([d.ext.endswith(".gz") or d.ext.endswith(".bz2") for d in datasets])
    (input_is_compressed and 0.1 < input_size < 2.0) or (not input_is_compressed and 0.5 < input_size < 4.0)
  cores: 4
  mem: 28
```

Or without the helper:
That rule is not good for performance (the ORM overhead is likely significant here); if you do that, I would suggest you write a helper that builds a SQLAlchemy core statement. That helper might also estimate the uncompressed size so you don't need the two `or` branches.
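For illustration, a hypothetical sketch of such a helper using a single core SELECT; the Galaxy model classes and the join/column names here are assumptions, not a confirmed API:

```python
# Sketch: fetch each input's extension and size in one core SELECT
# instead of lazy-loading every association object through the ORM.
# Model and column names are assumptions based on Galaxy's schema.
from sqlalchemy import select

from galaxy.model import (
    Dataset,
    HistoryDatasetAssociation,
    JobToInputDatasetAssociation,
)


def input_exts_and_sizes(sa_session, job_id):
    stmt = (
        select(HistoryDatasetAssociation.extension, Dataset.file_size)
        .join(Dataset, HistoryDatasetAssociation.dataset_id == Dataset.id)
        .join(
            JobToInputDatasetAssociation,
            JobToInputDatasetAssociation.dataset_id == HistoryDatasetAssociation.id,
        )
        .where(JobToInputDatasetAssociation.job_id == job_id)
    )
    return sa_session.execute(stmt).all()
```

A helper along these lines would issue one query per job rather than one lazy load per input, and the same statement could be extended to estimate uncompressed sizes (e.g. scaling rows whose extension ends in `.gz`).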
which means we collect and store the converted dataset as the job input. We don't need to wait for the galaxy.json collection, we know the exact target type already. That fixes the retrieval in `get_converted_files_by_type`. I believe this was always the intention and is how it works if the dataset already exists (i.e. on a re-run), and for all converters that don't use galaxy.json. The converter records the dataset that the user chose, so there's no gap in the provenance either. This somewhat addresses galaxyproject/total-perspective-vortex#141 so you can (reliably) differentiate your rules on the input datatype and filesize combination.
What's the main penalty? Do we lazy load each jtid -> (h)da or something?
Yes, plus the ORM overhead. We've optimized this code so much on the Galaxy side that it's painful to tear it back down in TPV :(
Currently `input_size` is the size of the raw input, which can be either compressed or uncompressed. When scaling memory based on input size you probably only care about the uncompressed size. But gzip does store the uncompressed size, which we could read into a separate `uncompressed_input_size` variable. The uncompressed size is stored in the last 4 bytes; this seems to work for me:
The uncompressed size also isn't always set properly. So we should have a default: the actual size, or the actual size times some constant factor.
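One possible shape for that default, reusing the `gzip_uncompressed_size` sketch above; the factor of 4 is a placeholder assumption, not a measured ratio:

```python
def uncompressed_input_size(path, actual_size, fallback_factor=4):
    # An ISIZE of 0, or one below the on-disk size, is implausible for
    # gzip'd text data, so fall back to a scaled actual size instead.
    isize = gzip_uncompressed_size(path)
    if isize == 0 or isize < actual_size:
        return actual_size * fallback_factor
    return isize
```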