This is a Snakemake wrapper for running Trycycler. The sub-workflows are divided into three steps, following the original author's instructions: https://github.com/rrwick/Trycycler/wiki. See step 4 for how to run the sub-workflows.
Clone this repository to your local system, into the place where you want to perform the data analysis.
git clone git@github.com:matinnuhamunada/trycycler_snakemake_wrapper.git
cd trycycler_snakemake_wrapper
mkdir -p data/raw/GCF_000012125
wget -O data/raw/GCF_000012125/23754659.tar.gz https://bridges.monash.edu/ndownloader/files/23754659
(cd data/raw/GCF_000012125 && tar -xvzf 23754659.tar.gz)
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution, and `samples.tsv` to specify the strains to assemble. The file `units.tsv` contains the locations of the paired Illumina and Nanopore reads for each strain.
`samples.tsv` example:
strain | description |
---|---|
GCF_000012125 | Example |
`units.tsv` example:
strain | unit | illumina_reads | nanopore_reads |
---|---|---|---|
GCF_000012125 | 1 | data/raw/GCF_000012125.1 |
Further formatting rules will be defined in the `workflow/schemas/` folder.
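Before starting a run, it can help to check that the two tables agree with each other. The following is a minimal sketch, not part of the wrapper: it assumes both files are tab-separated with the header names shown above, and the helper names (`read_tsv`, `strains_missing_units`, `units_without_reads`) and the `path/to/nanopore_reads` value are placeholders chosen for illustration.

```python
import csv
import io


def read_tsv(handle):
    """Parse a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def strains_missing_units(samples, units):
    """Return strains listed in the samples table that have no row in the units table."""
    with_units = {row["strain"] for row in units}
    return [row["strain"] for row in samples if row["strain"] not in with_units]


def units_without_reads(units):
    """Return (strain, unit) pairs where both read columns are empty or absent."""
    empty = []
    for row in units:
        illumina = (row.get("illumina_reads") or "").strip()
        nanopore = (row.get("nanopore_reads") or "").strip()
        if not illumina and not nanopore:
            empty.append((row["strain"], row["unit"]))
    return empty


# In-memory stand-ins for config/samples.tsv and config/units.tsv.
# The read path is a placeholder, not a real location in the repository.
samples = read_tsv(io.StringIO("strain\tdescription\nGCF_000012125\tExample\n"))
units = read_tsv(io.StringIO(
    "strain\tunit\tillumina_reads\tnanopore_reads\n"
    "GCF_000012125\t1\t\tpath/to/nanopore_reads\n"
))

print(strains_missing_units(samples, units))
print(units_without_reads(units))
```

To check the real files, open `config/samples.tsv` and `config/units.tsv` and pass the file handles to `read_tsv` instead of the `io.StringIO` objects.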
Installing Snakemake using Mamba is advised. In case you don’t use Mambaforge you can always install Mamba into any other Conda-based Python distribution with:
conda install -n base -c conda-forge mamba
Then install Snakemake with:
mamba create -c conda-forge -c bioconda -c panoptes-organization -n snakemake snakemake panoptes-ui
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:
conda activate snakemake
Run panoptes to monitor jobs:
tmux new-session -A -s panoptes \; send -t panoptes "conda activate snakemake && panoptes" ENTER \; detach -s panoptes # run panoptes in background at http://127.0.0.1:5000
Do a dry-run:
snakemake --snakefile workflow/Snakefile-assembly --use-conda --cores 8 --wms-monitor http://127.0.0.1:5000 -n
We can then open http://127.0.0.1:5000 to monitor our jobs.
See the Snakemake documentation for further snakemake CLI details.
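The three sub-workflows are invoked with the same flags and differ only in the Snakefile name, so the command line can be assembled programmatically. The sketch below is an illustration under that assumption: `snakemake_command` and `STEPS` are hypothetical names, and only the flags that already appear in this README are used.

```python
import shlex

# Sub-workflow names, taken from the Snakefile suffixes used in this README.
STEPS = ("assembly", "cluster", "consensus")


def snakemake_command(step, cores, monitor_url=None, dry_run=False):
    """Build the argument list for one sub-workflow without running it."""
    if step not in STEPS:
        raise ValueError(f"unknown step {step!r}; expected one of {STEPS}")
    cmd = [
        "snakemake",
        "--snakefile", f"workflow/Snakefile-{step}",
        "--use-conda",
        "--cores", str(cores),
    ]
    if monitor_url:
        # Optional: report job status to a running panoptes instance.
        cmd += ["--wms-monitor", monitor_url]
    if dry_run:
        cmd.append("-n")
    return cmd


# Print the dry-run command for the assembly step.
print(shlex.join(snakemake_command("assembly", 8, "http://127.0.0.1:5000", dry_run=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. The steps are deliberately not chained, because the cluster file should be inspected and curated by hand between the clustering and consensus steps.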
The results can be found in the `data` folder and are separated into three stages: `raw`, `interim`, and `processed`. The results are further split across the three steps.
This step generates multiple assemblies as described in: https://github.com/rrwick/Trycycler/wiki/Generating-assemblies
snakemake --snakefile workflow/Snakefile-assembly --use-conda --cores <n_cores> --wms-monitor http://127.0.0.1:5000
The first step in Trycycler is to generate assemblies from the subsampled data. Here, the subsets were assembled with three different assemblers. We can see that all assemblers agree on a circular chromosome and a plasmid.
Figure: data/processed/GCF_000012125/01_trycycler_assembly/GCF_000012125_graphs.png
This step clusters the assemblies into per-replicon groups as described in: https://github.com/rrwick/Trycycler/wiki/Clustering-contigs
snakemake --snakefile workflow/Snakefile-cluster --use-conda --cores <n_cores> --wms-monitor http://127.0.0.1:5000
This step also generates `data/interim/02_trycycler_cluster/cluster.yaml`, which should be copied to the config folder in order to proceed to the next step.
cp data/interim/02_trycycler_cluster/cluster.yaml config/cluster.yaml
NOTE: You can select or drop contigs or whole clusters before running the next step. See below for how to evaluate the clusters.
The second step in Trycycler is to cluster the contigs. Here we can see how the chromosome and the plasmid are grouped into different clusters. The contigs within each cluster should have similar lengths and read depths.
Figure: data/processed/GCF_000012125/02_trycycler_cluster/GCF_000012125_cluster.png
As the figure shows, the plasmid contig sizes are inconsistent after assembly: some contigs are about 7,500 bp, while others are 15 kb and 22 kb. Open `data/interim/02_trycycler_cluster/cluster.yaml` and remove the contigs in cluster 2 that are longer than 8 kb.
GCF_000012125:
  cluster_001:
    - A_contig_1
    - B_utg000001c
    - C_Utg422
    - D_contig_1
    - E_utg000001c
    - F_Utg418
    - G_contig_1
    - H_utg000001c
    - I_Utg428
    - J_contig_1
    - K_utg000002c
    - L_Utg448
  cluster_002:
    - B_utg000002c
    - E_utg000002c
    - H_utg000002c
    - I_Utg430
    - K_utg000001c
    - L_Utg450
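The same curation can be expressed as a small filter over the cluster mapping. This is a sketch only: the contig names and lengths below are invented for illustration and are not taken from the example dataset, `drop_long_contigs` and `to_yaml` are hypothetical helpers, and the two-space indentation of the rendered output is an assumption about the file layout.

```python
def drop_long_contigs(clusters, lengths, cluster_name, max_length):
    """Return a copy of `clusters` with over-length contigs removed from one cluster."""
    curated = {name: list(contigs) for name, contigs in clusters.items()}
    curated[cluster_name] = [
        contig for contig in clusters[cluster_name]
        if lengths[contig] <= max_length
    ]
    return curated


def to_yaml(strain, clusters):
    """Render the strain -> cluster -> contig list mapping as YAML text."""
    lines = [f"{strain}:"]
    for name, contigs in clusters.items():
        lines.append(f"  {name}:")
        lines.extend(f"    - {contig}" for contig in contigs)
    return "\n".join(lines) + "\n"


# Invented contig names and lengths (in bp), for illustration only.
clusters = {
    "cluster_001": ["example_contig_x"],
    "cluster_002": ["example_contig_a", "example_contig_b", "example_contig_c"],
}
lengths = {
    "example_contig_x": 1_000_000,
    "example_contig_a": 7_500,
    "example_contig_b": 15_000,
    "example_contig_c": 22_000,
}

# Keep only the contigs in cluster_002 that are at most 8 kb long.
curated = drop_long_contigs(clusters, lengths, "cluster_002", max_length=8_000)
print(to_yaml("GCF_000012125", curated))
```

In practice the lengths would come from the contig FASTA files, and the result should still be reviewed by eye before it is used as `config/cluster.yaml`.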
This step covers steps 3, 4, 5, and 6 in the Trycycler wiki and generates the consensus contig sequence as described in: https://github.com/rrwick/Trycycler/wiki/Generating-a-consensus
snakemake --snakefile workflow/Snakefile-consensus --use-conda --cores <n_cores> --wms-monitor http://127.0.0.1:5000
The final assembly can be found in data/processed/GCF_000012125/03_trycycler_consensus/GCF_000012125.fna
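One quick sanity check on the final assembly is to list the length of each sequence and compare it with what the clustering step suggested. The sketch below uses a plain-Python FASTA reader; `fasta_lengths` is a hypothetical helper, and the toy records are made up so the example runs without any data files.

```python
def fasta_lengths(lines):
    """Map each FASTA record name to its sequence length.

    `lines` can be any iterable of text lines, such as an open file handle.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy records for illustration only; real sequences are far longer.
toy_fasta = [
    ">record_1 example header\n",
    "ACGT\n",
    "ACGT\n",
    ">record_2\n",
    "ACGT\n",
]
print(fasta_lengths(toy_fasta))
```

For the real output, open the `.fna` file and pass the handle to `fasta_lengths`.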
TO DO