1. Pipeline overview

This is an overview of the perSVade pipeline, where each module can be run independently:

trim_reads_and_QC

Trimming (with trimmomatic) and quality control (with fastqc) of the reads. Make sure to check the fastqc output before using the trimmed reads in other analyses.

align_reads

Align reads, mark duplicates and calculate coverage per window.
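To illustrate what coverage per window means, the sketch below averages a per-base depth list over fixed-size windows. The function name `window_coverage`, the window size and the toy depth values are all hypothetical. This is not how perSVade computes coverage, only a minimal example of the concept.

```python
def window_coverage(depth, window_size):
    """Return (start, end, mean_depth) tuples for consecutive windows.

    `depth` is a list with one read-depth value per base (toy input).
    The last window may be shorter than `window_size`.
    """
    windows = []
    for start in range(0, len(depth), window_size):
        chunk = depth[start:start + window_size]
        windows.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    return windows


# Toy per-base depth for a 10 bp region (made-up values).
depth = [10, 12, 11, 9, 0, 0, 1, 0, 20, 22]
result = window_coverage(depth, 4)
```

In this toy example the second window has a much lower mean depth than its neighbours, which is the kind of signal that read-depth-based analyses start from.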

call_SVs

Call structural variants (SVs). This module uses gridss to infer a list of breakpoints from discordant read pairs, split reads and de novo assembly signatures. A breakpoint is a pair of genomic regions (two breakends) that are joined in the sample of interest but not in the reference genome. The breakpoints are summarized into SVs with clove. You can find the optimal filtering parameters with the module 'optimize_parameters'.

optimize_parameters

Find optimal parameters for SV calling through simulations. To find them, perSVade generates two simulated genomes with 50 SVs of each type and tries many combinations (>13,000,000,000) of filters for the gridss and clove outputs. It selects the filters that yield the highest F-value (harmonic mean of precision and recall) for each simulated genome and SV type. To reduce overfitting, perSVade then selects a final set of "best parameters" that work well for all simulations and SV types. This set of best parameters can be passed to the 'call_SVs' module for SV calling on the real data. See "Output" and this FAQ for more details.
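The F-value and the idea of choosing filters that generalize can be sketched as follows. The names `f_value` and `pick_robust_filters`, the filter labels and the scores are hypothetical. The selection rule used here (maximize the worst F-value across simulations and SV types) is an assumption made for illustration; the text above only states that the final parameters should work well for all of them.

```python
def f_value(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (0.0 if there are no true positives)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def pick_robust_filters(f_by_filter):
    """Return the filter set whose lowest F-value is the highest.

    `f_by_filter` maps a filter-set label to its F-values, one per
    (simulation, SV type) combination. This max-min rule is an
    illustrative assumption, not perSVade's documented criterion.
    """
    return max(f_by_filter, key=lambda name: min(f_by_filter[name]))


# 40 of 50 simulated SVs recovered, with 10 false calls (made-up numbers).
score = f_value(true_positives=40, false_positives=10, false_negatives=10)

# Hypothetical F-values of two filter sets across three simulations.
f_by_filter = {
    "filters_A": [0.95, 0.40, 0.90],  # excellent on two, poor on one
    "filters_B": [0.85, 0.80, 0.82],  # consistently good
}
best = pick_robust_filters(f_by_filter)
```

Here `filters_A` achieves the single highest F-value but fails on one simulation, so the more consistent `filters_B` is preferred, which mirrors the motivation of reducing overfitting.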

find_homologous_regions

By default, the simulations in 'optimize_parameters' are placed randomly across the genome. However, SVs often appear around repetitive elements or regions of the genome with high similarity (e.g. transposable element insertions). This means that random simulations may not be realistic, potentially leading to overestimated calling accuracy and a parameter selection that does not work well for real SVs. perSVade can also generate more realistic simulations around regions with known SVs (i.e. regions with SVs called with perSVade) or homologous regions (inferred from BLAST). This module finds regions with pairwise homology in a genome, which can be input to 'optimize_parameters' to perform realistic simulations around homologous regions.

find_knownSVs_regions

Find regions with perSVade-inferred SVs. These can be input to 'optimize_parameters' to perform realistic simulations around regions with previously-known SVs.

infer_repeats

Find repeats in a genome, which can be used for the modules 'call_SVs', 'find_knownSVs_regions', 'integrate_SV_CNV_calls', 'optimize_parameters' and 'call_small_variants'.

call_CNVs

Copy Number Variants (CNVs) are a type of SV in which the amount of genomic content is altered (deletions or duplications). The 'call_SVs' module identifies some CNVs (insertions, tandem duplications, deletions and complex inverted SVs) but it can miss others (e.g. whole-chromosome duplications or regions with unknown types of rearrangements yielding CNVs).

As an alternative, this 'call_CNVs' module calls CNVs from read-depth alterations. For example, regions with 0x or 2x read depth relative to the mean of the genome can be called deletions or duplications, respectively. A straightforward implementation of this concept is challenging because many genomic features drive variability in read depth independently of CNVs. To address this, the module calculates the relative coverage for bins of the genome and corrects for the effects of GC content, mappability and distance to the telomere (using non-parametric regression as in this paper). This corrected coverage is used by CONY, AneuFinder and/or HMMcopy to call CNVs across the genome. The module then generates consensus CNV calls from the three programs, always taking the most conservative copy number for each bin of the genome. For example, if the programs disagree on the copy number of a region, the value closest to 1 is taken as the best estimate.
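The "closest to 1" consensus rule described above can be sketched as follows. The function name `consensus_copy_number` and the per-bin values are hypothetical, and the copy numbers are assumed to be relative to the genome mean (so 1 means no change). The dictionary keys only reuse the program names as labels; the real output formats of these tools are not represented here.

```python
def consensus_copy_number(calls_per_program):
    """For each bin, keep the relative copy number closest to 1.

    `calls_per_program` maps a program label to a list of relative copy
    numbers, one per genome bin; all lists must have the same length.
    On ties, the value from the first program in the dict is kept.
    """
    n_bins = len(next(iter(calls_per_program.values())))
    consensus = []
    for i in range(n_bins):
        candidates = [calls[i] for calls in calls_per_program.values()]
        consensus.append(min(candidates, key=lambda cn: abs(cn - 1)))
    return consensus


# Made-up relative copy numbers for four bins.
calls = {
    "CONY": [1.0, 2.0, 0.0, 1.0],
    "AneuFinder": [1.0, 1.5, 0.5, 1.0],
    "HMMcopy": [1.0, 2.0, 0.0, 2.0],
}
consensus = consensus_copy_number(calls)
```

Taking the value closest to 1 means a bin is only reported as strongly altered when every program supports a strong alteration, which is why the rule is conservative.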

integrate_SV_CNV_calls

Integrate the variant calls of 'call_SVs' and 'call_CNVs' into a single .vcf file. This file focuses on showing how SVs alter specific genomic regions (see the section "Output" for more details). The module also removes redundant calls between the CNVs identified by 'call_SVs' and those derived from 'call_CNVs'.

annotate_SVs

Annotate the functional impact of the variants from 'integrate_SV_CNV_calls'.

call_small_variants

Call SNPs and small indels. It runs any combination of freebayes, GATK HaplotypeCaller and bcftools call for small variant calling and integrates the results into .tab and .vcf files.

annotate_small_vars

Annotate the functional impact of the variants from 'call_small_variants'.

get_cov_genes

Calculate the coverage for each gene of the genome.

run_several_modules

A workflow to run several of the modules above.

get_stats_optimization

A module to calculate coverage, insert size and read length for a list of samples. This module clusters together samples with similar values for these sequencing parameters and defines one representative sample for each cluster. A cost-effective strategy is to run 'optimize_parameters' only on these representative samples.
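The idea of grouping samples by sequencing parameters and keeping one representative per group can be sketched as follows. The clustering method used by this module is not described here, so the greedy approach, the 10% relative tolerance, the function names and the sample values below are all assumptions made purely for illustration.

```python
def is_similar(a, b, tolerance):
    """True if every parameter of `a` and `b` differs by at most `tolerance` (relative)."""
    return all(abs(a[k] - b[k]) <= tolerance * max(a[k], b[k]) for k in a)


def cluster_samples(samples, tolerance=0.1):
    """Greedy clustering: map each representative sample to its cluster members.

    A sample joins the first cluster whose representative is similar to it;
    otherwise it starts a new cluster and becomes its representative.
    """
    clusters = {}
    for name, params in samples.items():
        for representative in clusters:
            if is_similar(samples[representative], params, tolerance):
                clusters[representative].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Made-up sequencing parameters for three samples.
samples = {
    "s1": {"coverage": 50, "insert_size": 300, "read_length": 150},
    "s2": {"coverage": 52, "insert_size": 310, "read_length": 150},
    "s3": {"coverage": 20, "insert_size": 500, "read_length": 100},
}
clusters = cluster_samples(samples)
```

With these toy values, `s1` and `s2` fall into one cluster represented by `s1`, while `s3` forms its own cluster, so parameter optimization would only be needed for two of the three samples.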

integrate_several_samples

A module to integrate the runs of several samples and to compare SVs between close samples.
