Blog post: Moving workflows from single files to collections - a case study #2050
base: master
Conversation
were based on processing single files to use collections as well.
However, when applying this idea to our latest metagenomics workflow, [Foodborne Pathogen detection](https://training.galaxyproject.org/training-material/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html), we encountered some problems
that arise when switching from single files to collections.
In the following we would like to present some of those issues and how we solved them, in the hope that these strategies can help
If you also want to write these up as FAQs in the GTN, that could be useful, just sort of "how to do X" type FAQs; then we can easily link users to them when they encounter those issues later.
When the post is finalized, we will condense it into an FAQ. Thanks for the idea!
However, if this logic needs to be applied to multiple collections, or repeated across multiple steps, the workflow becomes even more unreasonably large and complex.
You can also run the workflow again with the job cache enabled; this might be the easiest and cleanest solution.
@mvdbeek do you mean to run the workflow again without the failed elements? Then, if another collection fails, run the complete workflow again, and so on? How do you enable the job cache, and what does it do?
No, just run the whole workflow again; everything that has already run successfully will be skipped. If enabled by the admin, you can turn it on under User > Preferences > Manage Information. There were some additional fixes in 23.1, and usegalaxy.eu isn't running that release yet; if you want to test this, you can try it on usegalaxy.org.
Thank you for this info. Cached jobs are indeed really cool, but I don't quite see how they can solve the current issue:
the same elements will still fail again. Only in the case of random failures could this help ... which should not happen in any case.
Therefore, an additional step would still be required, by:
- Filtering the elements in the collection (-> reindexing problem; see the sketch below)
- Running the workflow again without the datasets that fail (-> laborious)
- Working on the tool to overcome the issue

Or am I missing something?
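For illustration, the failed or empty elements could be listed programmatically before filtering. This is a minimal sketch, assuming bioblend is installed; the URL, API key, and IDs are placeholders, and the exact dictionary keys may vary slightly between Galaxy versions:

```python
# Minimal sketch: list failed or empty elements of a collection via bioblend.
# The URL, key, and IDs below are placeholders, not values from the post.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.eu", key="<your-api-key>")

collection = gi.histories.show_dataset_collection("<history-id>", "<collection-id>")
for element in collection["elements"]:
    dataset = gi.datasets.show_dataset(element["object"]["id"])
    # 'error' marks failed jobs; a file_size of 0 marks empty outputs
    if dataset["state"] == "error" or dataset.get("file_size") == 0:
        print(element["element_identifier"], dataset["state"], dataset.get("file_size"))
```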
Your example said:

> Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly in the case of Kraken2, since it requires large amounts of memory (>70 GB), which was not assigned to every run of the tool by the server.

so it should help with that?
Ah, OK, sorry, I thought you were referring to the complete paragraph. For that particular example your solution can help, although since the failure is random, one could get some failed elements in the new run as well and would ultimately need to repeat the workflow multiple times, which is also not ideal.
No, those will be picked up by the cache. And it's generally a good thing if you have a single complete invocation without partially failed items.
# Case 3 - Collection workflow logic does not fully comply with single file logic
`TODO explain unzip problem, how it was solved`
That sounds like a bug; I would really welcome a reproduction here.
We think so too, and we double-checked. We will write an issue, but I think it's still a good finding for the blog post, and we can link the issue here. The idea is to motivate users to write issues for such cases.
We will write a PR for that, but for the blog post I think it's enough to state that it is being investigated.
…ion workflows and explaining the problem of the decompressing
I have added some parts in this PR.
Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly in the case of Kraken2, since
it requires large amounts of memory (>70 GB), which was not assigned to every run of the tool by the server. That issue was solved by increasing the minimum memory required by the tool on the EU server. But there are various other scenarios where the failure of the tool can be attributed, for example, to specific input data. In other cases only a few elements of a collection are empty (e.g. if an assembly cannot be made due to non-overlapping reads).
If an element of a collection has failed or is empty, the entire downstream processing is stopped, which can be rather annoying if one wants to process a large amount of data and gets stuck due to a few elements. Two solutions are proposed to handle such cases.
This is not correct, generally speaking. As long as you're not reducing the elements, the workflow will proceed until the end.
### Rerun the workflow with rerun activated
In some cases a rerun of the workflow can also solve the issue, for example in our use case, where the failure was rather random.
For this strategy it is beneficial to activate the rerun option of Galaxy (User/Preferences/Manage Information) on instances with version > 23.0.
Is this meant to refer to the job cache? Re-running single elements has been possible for many years.
Is this not the cache option: "Do you want to be able to re-use equivalent jobs?" That's what I found on .org ... or did you mean another option?
> Do you want to be able to re-use equivalent jobs

-> this is the job cache. I don't see how that is related to "Rerun the workflow with rerun activated"?
Some of the tools used in the *Foodborne pathogen detection workflow using collections* failed when we ran the workflow but succeeded when run alone outside a workflow, for example `Krakentools: Extract Kraken Reads By ID` and `Filter Sequence by ID`.
After a bit of investigation we noticed that when these tools run alone (without being in a workflow) or in a workflow with single files, Galaxy performs an implicit decompression of the zipped input files, which then channels the correct file to the underlying tool. However, when the same tools run on a collection in a workflow, this implicit decompression does not take place, which causes the run to fail.
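To help with a reproduction, one way to check this is to print the datatype each collection element actually carries and see whether an implicitly decompressed copy was created. This is a hedged sketch assuming bioblend, with placeholder IDs; field names may differ between Galaxy versions:

```python
# Hedged sketch: print the datatype of each collection element to check
# whether the tool received the compressed (.gz) datatype or a converted copy.
# URL, key, and IDs are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.eu", key="<your-api-key>")
collection = gi.histories.show_dataset_collection("<history-id>", "<collection-id>")
for element in collection["elements"]:
    dataset = gi.datasets.show_dataset(element["object"]["id"])
    print(element["element_identifier"], dataset.get("extension"))
```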
It's unclear what you're doing there. Implicit converters certainly run as part of workflows, with collections or not. Overall, I don't think this is a statement that should be published.
Removed, but it is still an issue; I will raise it next week.
However, this initial solution is not optimal, since running the workflow increases the data size in the user's history (due to file decompression). Therefore, we have proposed another solution: updating the tool wrappers themselves to handle the decompression internally, without the need for the `Convert compressed file to uncompressed` tool (https://github.com/galaxyproject/tools-iuc/pull/5360). In fact, the tool did not require an internal decompression step, since it can indeed work with zipped files; it only needed
to know that the input is zipped.
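To illustrate the underlying idea in a generic way (this is not the actual wrapper code from the tools-iuc PR, and the file name is a placeholder): a tool only needs to detect that its input is gzip-compressed, for example via the gzip magic bytes, and open it accordingly instead of requiring a separate decompression step.

```python
# Generic illustration, not the actual wrapper change: detect gzip input by
# its magic bytes and open it transparently, so no separate decompression
# step (and no extra copy in the history) is needed.
import gzip

def open_maybe_gzipped(path):
    """Open a text file transparently, whether it is gzip-compressed or not."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":          # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")

# "reads.fastq.gz" is a placeholder input file
with open_maybe_gzipped("reads.fastq.gz") as handle:
    first_line = handle.readline()
```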
Although the problem could again be solved at the tool wrapper level, the general problem of inconsistent conversion logic between single files and collections is most likely a bug, which is currently being investigated for upcoming Galaxy releases.
Please, please open issues for this when you encounter them, and before writing a blog post. We do have test cases that exercise this, so this is rather surprising.
I will remove that part, and we will do the PR.
If an element of a collection has failed or is empty, the entire downstream processing is stopped, which can be rather annoying if one wants to process a large amount of data and gets stuck due to a few elements. Two solutions are proposed to handle such cases.
## Intermediate workflow-specific solution
I do want to point out though that you could have taken your single-file workflow and mapped it over a collection, and I think that would have solved all of the issues mentioned in this section. The workarounds here are nice, but they make the problem appear much more complicated than it should be. I would also appreciate a qualifier saying that you can ask this sort of question in the IWC channel.
Furthermore, it needs to be considered that filtering empty or failed elements from a collection
can hide the root cause of the problem. The question of why some tool runs fail or produce empty output should always be investigated further.
### Rerun the workflow with rerun activated
Suggested change: `### Rerun the workflow with the re-use equivalent jobs option`. Is that what you meant? I read this as the normal "replace element in collection" option.
In some cases a rerun of the workflow can also solve the issue, for example in our use case, where the failure was rather random.
For this strategy it is beneficial to activate the rerun option of Galaxy (User/Preferences/Manage Information) on instances with version > 23.0.
This allows only the failed elements to be rerun, which saves computing resources and time.
That's also true if you click on rerun and replace the collection element. I think I would say:

> This creates a new workflow run where successful outputs are copied from the previous execution instead of being run again, and only failed jobs and jobs depending on the failed outputs will be submitted again.
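As a rough illustration of what such a re-invocation looks like programmatically (a sketch assuming bioblend, placeholder IDs, and that the re-use equivalent jobs preference is enabled), invoking the workflow again with the same inputs is all that is needed; the cache handles the rest server-side:

```python
# Sketch: re-invoke a workflow with the same inputs via bioblend. With the
# "re-use equivalent jobs" preference enabled, Galaxy copies previous
# successful outputs and only re-runs failed jobs and their dependents.
# URL, key, and IDs are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="<your-api-key>")
invocation = gi.workflows.invoke_workflow(
    "<workflow-id>",
    inputs={"0": {"src": "hdca", "id": "<input-collection-id>"}},
    history_id="<history-id>",
)
print(invocation["id"], invocation["state"])
```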
I have created an issue on tools-iuc for the implicit dataset conversion that does not occur when using the tool in a workflow with an input collection of zipped files. We can now link it to the blog post.
@paulzierep is it ready to be merged?
Collections in workflows: pitfalls and strategies from foodborne pathogen detection