1. Pipeline overview
This is an overview of the perSVade pipeline, where each module can be run independently:
Trimming (with trimmomatic) and quality control (with fastqc) of the reads. Make sure that you check the output of fastqc before using the trimmed reads for other analyses.
Align reads, mark duplicates and calculate coverage per window.
Call structural variants (SVs). This module uses gridss to infer a list of breakpoints from discordant read pairs, split reads and de novo assembly signatures. A breakpoint is a pair of genomic regions (two breakends) that are joined in the sample of interest but not in the reference genome. The breakpoints are summarized into SVs with clove. You can find the optimal filtering parameters with the module 'optimize_parameters'.
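As a rough illustration of what "coverage per window" means, the sketch below averages per-base read depth over fixed-size windows. It assumes the depth of one chromosome is already available as a plain list of numbers; the function name `mean_coverage_per_window` and its arguments are hypothetical and do not describe perSVade's internal implementation.

```python
def mean_coverage_per_window(depths, window_size):
    """Return (start, end, mean_depth) for consecutive windows.

    depths: per-base read depth for one chromosome (assumed input format).
    window_size: number of bases per window (hypothetical parameter).
    The last window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = []
    for start in range(0, len(depths), window_size):
        window = depths[start:start + window_size]
        # 0-based, half-open coordinates: [start, end)
        windows.append((start, start + len(window), sum(window) / len(window)))
    return windows


if __name__ == "__main__":
    for start, end, mean_depth in mean_coverage_per_window([10, 10, 20, 20, 0], 2):
        print(start, end, mean_depth)
```

Windows with a mean depth far from the genome-wide mean are the kind of signal that read-depth based analyses build on.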
Find optimal parameters for SV calling through simulations. To find these, perSVade generates two simulated genomes with 50 SVs of each type and tries many combinations (>13,000,000,000) of filters for the gridss and clove outputs. It selects the filters that have the highest F-value (harmonic mean of precision and recall) for each simulated genome and SV type. To reduce overfitting, perSVade then selects a final set of "best parameters" that work well across all simulations and SV types. This set of best parameters can be used as input to the 'call_SVs' module for SV calling on the real data. See "Output" and this FAQ for more details.
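The F-value used to rank filter sets can be sketched as follows. The first function is the standard harmonic mean of precision and recall. The second function is only one plausible way to pick a filter set that "works well for all simulations and SV types": it keeps the set with the best worst-case F-value. That criterion, and the names `f_value` and `pick_robust_filter_set`, are assumptions for illustration, not perSVade's documented procedure.

```python
def f_value(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    called = true_positives + false_positives      # SVs reported by the caller
    expected = true_positives + false_negatives    # SVs present in the simulation
    if called == 0 or expected == 0:
        return 0.0
    precision = true_positives / called
    recall = true_positives / expected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pick_robust_filter_set(scores):
    """Pick the filter set whose lowest F-value is highest.

    scores: {filter_set_name: [F-value for each (simulation, SV type)]}.
    Maximizing the minimum is an ASSUMED selection rule, used here only
    to illustrate choosing parameters that do not overfit one simulation.
    """
    return max(scores, key=lambda name: min(scores[name]))


if __name__ == "__main__":
    print(f_value(40, 10, 10))
    print(pick_robust_filter_set({
        "strict": [0.90, 0.40],    # great on one simulation, poor on the other
        "balanced": [0.80, 0.75],  # consistently good
    }))
```

Under this rule, a filter set that is excellent for one simulation but poor for another loses to a set that is consistently good.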
By default, the simulations in 'optimize_parameters' are placed randomly across the genome. However, SVs often appear around repetitive elements or regions of the genome with high similarity (e.g. transposable element insertions). This means that random simulations may not be realistic, potentially leading to overestimated calling accuracy and a parameter selection that does not work well for real SVs. perSVade can also generate more realistic simulations around regions with known SVs (i.e. regions with SVs called with perSVade) or homologous regions (inferred from BLAST). This module finds regions with pairwise homology in a genome, which can be input to 'optimize_parameters' to perform realistic simulations around homologous regions.
Find regions with perSVade-inferred SVs. These can be input to 'optimize_parameters' to perform realistic simulations around regions with previously-known SVs.
Find repeats in a genome, which can be used for the modules 'call_SVs', 'find_knownSVs_regions', 'integrate_SV_CNV_calls', 'optimize_parameters' and 'call_small_variants'.
Copy Number Variants (CNVs) are one type of SV in which there is an alteration in the genomic content (deletions or duplications). The 'call_SVs' module identifies some CNVs (insertions, tandem duplications, deletions and complex inverted SVs) but it can miss others (e.g. whole-chromosome duplications or regions with unknown types of rearrangements yielding CNVs).
As an alternative, this 'call_CNVs' module calls CNVs from read-depth alterations. For example, regions with 0x or 2x read depth as compared to the mean of the genome can be called deletions or duplications, respectively. A straightforward implementation of this concept is challenging because many genomic features drive variability in read depth independently of CNVs. To solve this, the module calculates the relative coverage for bins of the genome and corrects for the effect of GC content, mappability and distance to the telomere (using non-parametric regression as in this paper). This corrected coverage is used by CONY, AneuFinder and/or HMMcopy to call CNVs across the genome. The module generates consensus CNV calls from the programs that were run, always taking the most conservative copy number for each bin of the genome. For example, if the programs disagree on the copy number of a region, the value closest to 1 is taken as the best estimate.
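The "most conservative copy number" rule can be sketched as below. The sketch assumes each program reports one relative copy number per bin (1.0 meaning no change), with all programs using the same bins. The function name `consensus_copy_number` and the input layout are hypothetical, and how ties are resolved in perSVade is not stated, so the tie behaviour here is an assumption.

```python
def consensus_copy_number(calls_per_program):
    """Per-bin consensus: keep the copy number closest to 1.

    calls_per_program: one list per CNV caller, each holding the relative
    copy number of every bin (assumed layout; same bins for all callers).
    On a tie (e.g. 0.5 vs 1.5) the value from the earlier caller is kept,
    which is an arbitrary choice made for this sketch.
    """
    if not calls_per_program:
        return []
    n_bins = len(calls_per_program[0])
    if any(len(calls) != n_bins for calls in calls_per_program):
        raise ValueError("all callers must report the same number of bins")
    return [
        min(bin_calls, key=lambda copy_number: abs(copy_number - 1.0))
        for bin_calls in zip(*calls_per_program)
    ]


if __name__ == "__main__":
    caller_a = [1.0, 2.0, 0.0]
    caller_b = [1.0, 1.5, 0.5]
    caller_c = [2.0, 2.0, 0.0]
    print(consensus_copy_number([caller_a, caller_b, caller_c]))
```

With this rule a bin is only reported as a strong deletion or duplication when every program agrees on it, which trades sensitivity for fewer false positives.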
Integrate the variant calls of 'call_SVs' and 'call_CNVs' into a single .vcf file. This is a file that is focused on showing the alteration of SVs on specific genomic regions (see the section "Output" for more details). It also removes redundant calls between the CNVs identified by 'call_SVs' and those derived from 'call_CNVs'.
Annotate the functional impact of the variants from 'integrate_SV_CNV_calls'.
Call SNPs and small IN/DELs. It runs any of freebayes, GATK HaplotypeCaller and/or bcftools call for small variant calling and integrates the results into .tab and .vcf files.
Annotate the functional impact of the variants from 'call_small_variants'.
Calculate the coverage for each gene of the genome.
A workflow to run various modules from above.
A module to calculate coverage, insert size and read length for a list of samples. This module clusters together samples that have similar sequencing parameters, defining one representative sample for each cluster. A cost-effective strategy is to run 'optimize_parameters' only on these representative samples.
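The idea of grouping samples by sequencing parameters can be illustrated with a simple greedy clustering. This is a minimal sketch, not the module's actual algorithm: the similarity rule (every parameter within a relative tolerance of the cluster representative), the `tolerance` value and the function names are all assumptions made for illustration.

```python
def is_similar(params_a, params_b, tolerance):
    """True if every parameter differs by at most `tolerance` (relative)."""
    return all(
        abs(a - b) <= tolerance * max(a, b)
        for a, b in zip(params_a, params_b)
    )


def cluster_samples(samples, tolerance=0.2):
    """Greedily group samples with similar sequencing parameters.

    samples: {sample_name: (coverage, insert_size, read_length)}.
    Returns {representative_name: [member names]}. The first sample of
    each cluster (in sorted name order) acts as its representative.
    Both the rule and the default tolerance are ASSUMED, not perSVade's.
    """
    clusters = {}
    for name in sorted(samples):
        for representative in clusters:
            if is_similar(samples[name], samples[representative], tolerance):
                clusters[representative].append(name)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    samples = {
        "s1": (100, 500, 150),
        "s2": (95, 510, 150),
        "s3": (30, 300, 100),
    }
    print(cluster_samples(samples))
```

In this example only the representatives (`s1` and `s3`) would need a parameter optimization run, and `s2` could reuse the parameters found for `s1`.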
A module to integrate the runs of several samples and to compare SVs between close samples.