HSDSnake: A pipeline for comprehensive analysis of gene duplicates in eukaryotic genomes

Introduction

HSDSnake is a Snakemake pipeline for comprehensive analysis of highly similar duplicates (HSDs) in genomes. The tools it uses are shown in the Pipeline Flowchart, and their references are listed in Citations.md.

Pipeline Flowchart

%%{init: {
    'theme': 'base',
    'themeVariables': {
    'fontSize': '18px',
    'primaryColor': '#9A6421',
    'primaryTextColor': '#ffffff',
    'primaryBorderColor': '#9A6421',
    'lineColor': '#B180A8',
    'secondaryColor': '#455C58',
    'tertiaryColor': '#ffffff'
  }
}}%%
flowchart TD
  
  PREPARE((PREPARE)) ==> preprocess_fasta[preprocess_fasta]
  PREPARE ==> diamond_db[diamond_db]
  PREPARE ==> diamond[diamond]
  PREPARE ==> KEGG[KEGG]

  preprocess_fasta ==> DETECT
  diamond_db ==> DETECT
  diamond ==> DETECT
  KEGG ==> DETECT

  DETECT((DETECT)) ==> HSDFinder_preprocess[HSDFinder_preprocess]
  DETECT ==> HSDFinder[HSDFinder]
  

  HSDFinder_preprocess ==> CURATE
  HSDFinder ==> CURATE
  
  subgraph Automatically_combined
   CURATE((CURATE)) ==> HSDecipher_batch_run[HSDecipher_batch_run]
  end

  HSDecipher_batch_run ==> STATISTICS

  STATISTICS((STATISTICS)) ==> HSDecipher_statistics[HSDecipher_statistics]
  STATISTICS ==> HSDecipher_category[HSDecipher_category]
  STATISTICS ==> merge_statistics[merge_statistics]

  HSDecipher_statistics ==> VISUALIZE_and_COMPARE
  HSDecipher_category ==> VISUALIZE_and_COMPARE
  merge_statistics ==> VISUALIZE_and_COMPARE

  VISUALIZE_and_COMPARE ==> HSDecipher_heatmap_inter_species_prepare[heatmap_inter_species]
  VISUALIZE_and_COMPARE ==> HSDecipher_heatmap_intra_species[heatmap_intra_species]
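
The stage order in the flowchart above can also be written down as a small lookup table, which is handy when deciding which outputs to expect from each stage. The following is a minimal sketch, not part of HSDSnake itself: the step names are the labels transcribed from the flowchart, the rule names in the actual workflow files may differ, and the helper functions `stage_of` and `upstream_stages` are hypothetical.

```python
# Stage -> step labels, transcribed from the flowchart.
# Python dicts preserve insertion order, so the key order is the stage order.
# NOTE: these are flowchart labels; the real Snakemake rule names may differ.
PIPELINE_STAGES = {
    "PREPARE": ["preprocess_fasta", "diamond_db", "diamond", "KEGG"],
    "DETECT": ["HSDFinder_preprocess", "HSDFinder"],
    "CURATE": ["HSDecipher_batch_run"],
    "STATISTICS": ["HSDecipher_statistics", "HSDecipher_category", "merge_statistics"],
    "VISUALIZE_and_COMPARE": ["heatmap_inter_species", "heatmap_intra_species"],
}


def stage_of(step):
    """Return the stage that contains the given step label, or None."""
    for stage, steps in PIPELINE_STAGES.items():
        if step in steps:
            return stage
    return None


def upstream_stages(stage):
    """Return the stages that come before `stage`, in flowchart order."""
    order = list(PIPELINE_STAGES)
    return order[: order.index(stage)]


if __name__ == "__main__":
    print(stage_of("HSDFinder"))
    print(upstream_stages("STATISTICS"))
```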

Usage

Refer to the Usage documents for details.

Note

If you are new to Snakemake, please refer to this page on how to set up Snakemake. Make sure to test the sample data below before running the workflow on actual data.

# Test whether Snakemake has been installed successfully
mamba activate snakemake
snakemake --help

Prepare a config.yaml file with the following entries, which list the input files for HSDSnake. Substitute only the species names with your own and keep the input file naming format, for example Arabidopsis_thaliana.fa, Arabidopsis_thaliana.interproscan.tsv, and Arabidopsis_thaliana.ko.txt.

samples:
  - Arabidopsis_thaliana
  - Chlamydomonas_reinhardtii
 
genomes:
  Arabidopsis_thaliana:
    proteins: "data/Arabidopsis_thaliana.fa"
    interproscan: "data/Arabidopsis_thaliana.interproscan.tsv"
    KEGG: "data/Arabidopsis_thaliana.ko.txt"

  Chlamydomonas_reinhardtii:
    proteins: "data/Chlamydomonas_reinhardtii.fa"
    interproscan: "data/Chlamydomonas_reinhardtii.interproscan.tsv"
    KEGG: "data/Chlamydomonas_reinhardtii.ko.txt"
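
Because the pipeline relies on the file naming convention described above, it can help to check a config before running anything. The following is a minimal sketch under stated assumptions: it takes the config as an already loaded Python dict (for example from PyYAML's `yaml.safe_load`), and the function `check_config` and the `REQUIRED_KEYS` table are hypothetical helpers, not part of HSDSnake.

```python
from pathlib import PurePosixPath

# Required key -> file suffix, following the naming shown in the example config.
REQUIRED_KEYS = {
    "proteins": ".fa",
    "interproscan": ".interproscan.tsv",
    "KEGG": ".ko.txt",
}


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    genomes = config.get("genomes") or {}
    for sample in config.get("samples") or []:
        entry = genomes.get(sample)
        if entry is None:
            problems.append(f"{sample}: no entry under 'genomes'")
            continue
        for key, suffix in REQUIRED_KEYS.items():
            path = entry.get(key)
            if path is None:
                problems.append(f"{sample}: missing '{key}'")
            elif PurePosixPath(path).name != f"{sample}{suffix}":
                # The file name must be the species name plus the fixed suffix.
                problems.append(f"{sample}: '{key}' should be named {sample}{suffix}")
    return problems


if __name__ == "__main__":
    example = {
        "samples": ["Arabidopsis_thaliana"],
        "genomes": {
            "Arabidopsis_thaliana": {
                "proteins": "data/Arabidopsis_thaliana.fa",
                "interproscan": "data/Arabidopsis_thaliana.interproscan.tsv",
                "KEGG": "data/Arabidopsis_thaliana.ko.txt",
            }
        },
    }
    print(check_config(example) or "config looks consistent")
```

An empty list means every sample has all three entries and each file name matches its species name.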

Now, you can run the pipeline using the following commands:

# Download the package
git clone https://github.com/zx0223winner/HSDSnake.git

# Enter the working directory
cd HSDSnake

Note

Because the sample files are large, please download the test data (HSDSnake_data.tar.gz) through the Google Drive link.

# Then decompress HSDSnake_data.tar.gz inside the HSDSnake directory.
# This creates a data folder containing the test files.
tar -xvzf HSDSnake_data.tar.gz

# Then do a dry run with the following command.
snakemake --use-conda --cores all -n

# If everything is OK, test the pipeline by running:
snakemake --use-conda --cores all
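
Before the dry run, it may be worth confirming that every input file named in config.yaml actually exists after decompressing the test data. The following is a minimal sketch, assuming the config has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`) and that paths are relative to the HSDSnake directory; `missing_inputs` is a hypothetical helper, not part of HSDSnake.

```python
from pathlib import Path


def missing_inputs(config, root="."):
    """Return the configured input paths that do not exist under `root`."""
    missing = []
    genomes = config.get("genomes") or {}
    for sample in config.get("samples") or []:
        # Each genome entry maps a key (proteins, interproscan, KEGG) to a path.
        for path in (genomes.get(sample) or {}).values():
            if not (Path(root) / path).is_file():
                missing.append(path)
    return missing


if __name__ == "__main__":
    example = {
        "samples": ["Arabidopsis_thaliana"],
        "genomes": {
            "Arabidopsis_thaliana": {
                "proteins": "data/Arabidopsis_thaliana.fa",
                "interproscan": "data/Arabidopsis_thaliana.interproscan.tsv",
                "KEGG": "data/Arabidopsis_thaliana.ko.txt",
            }
        },
    }
    for path in missing_inputs(example):
        print(f"missing input file: {path}")
```

If the function returns an empty list, all configured inputs are in place and the dry run can proceed.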

Citations

HSDecipher protocol, HSDatabase, HSDFinder tool, HSD review, HSD examples:

Xi Zhang, Yining Hu, Zhenyu Cheng, John M. Archibald (2023). HSDecipher: A pipeline for comparative genomic analysis of highly similar duplicate genes in eukaryotic genomes. STAR Protocols. doi: https://doi.org/10.1016/j.xpro.2022.102014
Xi Zhang, Yining Hu, David Roy Smith (2022). HSDatabase - a database of highly similar duplicate genes from plants, animals, and algae. Database. doi: http://doi.org/10.1093/database/baac086
Xi Zhang, David Roy Smith (2022). An overview of online resources for intra-species detection of gene duplications. Frontiers in Genetics. doi: http://doi.org/10.3389/fgene.2022.1012788
Xi Zhang, Yining Hu, David Roy Smith (2021). HSDFinder: a BLAST-based strategy to search for highly similar duplicated genes in eukaryotic genomes. Frontiers in Bioinformatics. doi: http://doi.org/10.3389/fbinf.2021.803176
Xi Zhang, Yining Hu, David Roy Smith (2021). Protocol for HSDFinder: Identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes. STAR Protocols. doi: https://doi.org/10.1016/j.xpro.2021.100619
Xi Zhang, et al., David Roy Smith (2021). Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241. iScience. doi: https://doi.org/10.1016/j.isci.2021.102084

Links to InterProScan and KEGG

Pfam 37.0 (Sep 2024, 21,979 entries): https://pfam.xfam.org
InterPro 101.0 (Jul 2024, 45,899 entries): http://www.ebi.ac.uk/interpro/
KEGG Orthology Database: https://www.genome.jp/kegg/ko.html
InterProScan: https://github.com/ebi-pf-team/interproscan
KEGG: https://www.kegg.jp/kegg/
Diamond: https://github.com/bbuchfink/diamond
