Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wrappers for ppanggolin all and ppanggolin msa #6645

Merged
merged 13 commits into from
Jan 22, 2025

Conversation

thomas-lacroix
Copy link
Contributor

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful. Added a few comments.

homepage_url: https://ppanggolin.readthedocs.io/en/latest/
long_description: |
Depicting microbial species diversity via a Partitioned PanGenome Graph Of Linked Neighbors.
remote_repository_url: https://github.com/labgem/PPanGGOLiN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should point to the IUC repo. Use this one for homepage_url. The readthedocs URL might be cool to mention in the tool's help section.

Copy link
Contributor

@bernt-matthias bernt-matthias Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
remote_repository_url: https://github.com/labgem/PPanGGOLiN
remote_repository_url: https://github.com/galaxyproject/tools-iuc/tree/main/tools/ppanggolin

tools/ppanggolin/.shed.yml Show resolved Hide resolved
tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_msa.xml Outdated Show resolved Hide resolved
tools/ppanggolin/ppanggolin_msa.xml Outdated Show resolved Hide resolved
@thomas-lacroix
Copy link
Contributor Author

Thanks @bernt-matthias , I implemented your suggestions, does it look good to you ? I couldn't get the validator to work for checking the consistency of the input files format, I didn't find a way to loop through multiple datasets or dataset collection and didn't find an example of such a validator. I implemented this check in the section by raising an exception. Best, Thomas

#set extension_input_files = $file.ext
#else:
#if $file.ext != $extension_input_files:
#raise Exception("All the genome files must be of similar format, either all genbank files or all fasta files.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if raising an exception is a good idea. Have you tried what happens in such a case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception appears in the history for each output files that have failed, see screenshot enclosed.
Capture d’écran du 2025-01-14 13-50-40

#raise Exception("All the genome files must be of similar format, either all genbank files or all fasta files.")
#end if
#end if
#set base_name = str($file.element_identifier).split('.')[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if users have . in the identifiers. Maybe ".".join(str($file.element_identifier).split('.')[:-1])

tools/ppanggolin/ppanggolin_all.xml Outdated Show resolved Hide resolved
@thomas-lacroix
Copy link
Contributor Author

Hi @bernt-matthias
I am wondering, is there any more changes to do so this PR can be merged ? My todo list for this PR is empty but I might have missed something.
Best,
Thomas

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay and thanks for the ping. I gave the tool a final look and have a few minor suggestions and requests.

Comment on lines 10 to 11
#import os

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#import os

mkdir -p "./tmp_ppanggolin/organism_list" &&
mkdir -p "./tmp_ppanggolin/ln_input_genomes" &&

echo -n > "./tmp_ppanggolin/organism_list/organism.list" &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo -n > "./tmp_ppanggolin/organism_list/organism.list" &&
touch "./tmp_ppanggolin/organism_list/organism.list" &&

#set extension_input_files = $file.ext
#else:
#if $file.ext != $extension_input_files:
#raise Exception("All the genome files must be of similar format, either all genbank files or all fasta files.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#raise Exception("All the genome files must be of similar format, either all genbank files or all fasta files.")
#raise Exception("All the genome files must be of the same datatype, either all genbank files or all fasta files.")

Comment on lines 27 to 28
#set base_name = ".".join(str($file.element_identifier).split('.')[:-1])
echo -e '${base_name}\t${file}' >> "./tmp_ppanggolin/organism_list/organism.list" &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just assume that the user prepared the collection such that the element_identifiers are the desired sample names and do not make any changes. On collection creation users can choose to remove extensions.

Suggested change
#set base_name = ".".join(str($file.element_identifier).split('.')[:-1])
echo -e '${base_name}\t${file}' >> "./tmp_ppanggolin/organism_list/organism.list" &&
echo -e '${file.element_identifier}\t${file}' >> "./tmp_ppanggolin/organism_list/organism.list" &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bernt-matthias , I implemented all your recommendations except this one. This change uncovered a problem with PPanGGOLiN that happens for files that have a space in their name or for compressed files using Galaxy as the suffix " uncompressed" (the first char is a space) is sometimes added when choosing gzip files as input. My solution was to use :

            #set base_name = str($file.element_identifier).replace(" ", "_")
            echo -e '${base_name}\t${file}' >> "./tmp_ppanggolin/organism_list/organism.list" &&

What do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine. Maybe document this in the parameter/tool help.

Comment on lines 165 to 190
<section name="advanced_pangenome_output_files" title="Advanced pangenome output files" expanded="False">
<param name="advanced_pangenome_optional_files" type="select" label="Add the following pangenome output files in the Galaxy history" multiple="true" optional="true" display="checkboxes" >
<!-- Basic files -->
<option value="output_gene_presence_absence" selected="true">Gene presence absence</option>
<option value="output_gene_families" selected="true">Gene families</option>
<option value="output_mean_persistent_duplication" selected="true">Mean persistent duplication</option>
<option value="output_matrix" selected="true">Matrix</option>
<!-- plot files -->
<option value="output_Ushaped_plot" selected="true">Ushaped plot</option>
<option value="output_tile_plot" selected="true">Tile plot</option>
<!-- graph files -->
<option value="output_pangenomeGraph_light_gexf" selected="true">PanGenome Graph Light (Gexf)</option>
<option value="output_pangenomeGraph_gexf" selected="true">PanGenome Graph (Gexf)</option>
<option value="output_pangenomeGraph_json" selected="true">PanGenome Graph (Json)</option>
<!-- advanced files -->
<option value="output_summarize_spots" selected="true">Summarize spots</option>
<option value="output_spots" selected="true">Spots</option>
<option value="output_spot_borders" selected="true">Spot borders</option>
<option value="output_border_protein_genes" selected="true">Border protein genes</option>
<option value="output_modules_summary" selected="true">Modules summary</option>
<option value="output_modules_in_genomes" selected="true">Modules in genomes</option>
<option value="output_modules_spots" selected="true">Modules spots</option>
<option value="output_modules_RGP_lists" selected="true">Modules RGP lists</option>
<option value="output_functional_modules" selected="true">Functional modules</option>
</param>
</section>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<section name="advanced_pangenome_output_files" title="Advanced pangenome output files" expanded="False">
<param name="advanced_pangenome_optional_files" type="select" label="Add the following pangenome output files in the Galaxy history" multiple="true" optional="true" display="checkboxes" >
<!-- Basic files -->
<option value="output_gene_presence_absence" selected="true">Gene presence absence</option>
<option value="output_gene_families" selected="true">Gene families</option>
<option value="output_mean_persistent_duplication" selected="true">Mean persistent duplication</option>
<option value="output_matrix" selected="true">Matrix</option>
<!-- plot files -->
<option value="output_Ushaped_plot" selected="true">Ushaped plot</option>
<option value="output_tile_plot" selected="true">Tile plot</option>
<!-- graph files -->
<option value="output_pangenomeGraph_light_gexf" selected="true">PanGenome Graph Light (Gexf)</option>
<option value="output_pangenomeGraph_gexf" selected="true">PanGenome Graph (Gexf)</option>
<option value="output_pangenomeGraph_json" selected="true">PanGenome Graph (Json)</option>
<!-- advanced files -->
<option value="output_summarize_spots" selected="true">Summarize spots</option>
<option value="output_spots" selected="true">Spots</option>
<option value="output_spot_borders" selected="true">Spot borders</option>
<option value="output_border_protein_genes" selected="true">Border protein genes</option>
<option value="output_modules_summary" selected="true">Modules summary</option>
<option value="output_modules_in_genomes" selected="true">Modules in genomes</option>
<option value="output_modules_spots" selected="true">Modules spots</option>
<option value="output_modules_RGP_lists" selected="true">Modules RGP lists</option>
<option value="output_functional_modules" selected="true">Functional modules</option>
</param>
</section>
<param name="advanced_pangenome_optional_files" type="select" label="Add the following pangenome output files in the Galaxy history" multiple="true" optional="true" display="checkboxes" >
<!-- Basic files -->
<option value="output_gene_presence_absence" selected="true">Gene presence absence</option>
<option value="output_gene_families" selected="true">Gene families</option>
<option value="output_mean_persistent_duplication" selected="true">Mean persistent duplication</option>
<option value="output_matrix" selected="true">Matrix</option>
<!-- plot files -->
<option value="output_Ushaped_plot" selected="true">Ushaped plot</option>
<option value="output_tile_plot" selected="true">Tile plot</option>
<!-- graph files -->
<option value="output_pangenomeGraph_light_gexf" selected="true">PanGenome Graph Light (Gexf)</option>
<option value="output_pangenomeGraph_gexf" selected="true">PanGenome Graph (Gexf)</option>
<option value="output_pangenomeGraph_json" selected="true">PanGenome Graph (Json)</option>
<!-- advanced files -->
<option value="output_summarize_spots" selected="true">Summarize spots</option>
<option value="output_spots" selected="true">Spots</option>
<option value="output_spot_borders" selected="true">Spot borders</option>
<option value="output_border_protein_genes" selected="true">Border protein genes</option>
<option value="output_modules_summary" selected="true">Modules summary</option>
<option value="output_modules_in_genomes" selected="true">Modules in genomes</option>
<option value="output_modules_spots" selected="true">Modules spots</option>
<option value="output_modules_RGP_lists" selected="true">Modules RGP lists</option>
<option value="output_functional_modules" selected="true">Functional modules</option>
</param>

Du you think that the typical user really needs all outputs (all seem to be selected). If yes, fine for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed the topic of the many output files with the creator of PPanGGOLiN and he thinks it is better to leave the default output files are they are now.

Comment on lines 1 to 2
<tool id="ppanggolin_msa" name="PPanGGOLiN msa" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="23.0">
<description>computes msa of pangenome's gene families</description>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we capitalize MSA in the name and description.

@@ -0,0 +1,11 @@
name: ppanggolin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you modify the shed.yml file to a suite? Like here: https://github.com/galaxyproject/tools-iuc/blob/main/tools/ampvis2/.shed.yml

@thomas-lacroix
Copy link
Contributor Author

Hi @bernt-matthias ,
My todo list for this PR is now empty, let me know if you see other areas to improve.
Best,
Thomas

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go?

@thomas-lacroix
Copy link
Contributor Author

yes, good to go :)

@bernt-matthias bernt-matthias merged commit f51871d into galaxyproject:main Jan 22, 2025
12 checks passed
@thomas-lacroix thomas-lacroix deleted the ppanggolin branch January 22, 2025 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants