-
Notifications
You must be signed in to change notification settings - Fork 5
7. Output
Before checking the output make sure that there is a file in the output directory called perSVade_finished_file.txt, which indicates that the pipeline worked and includes a report of the timing. Depending on how you ran perSVade you are interested in the following sections:
Each module writes files into an output directory, specified through --outdir, which contains all the commands ran in all_cmds.txt. Below is the content of the outdir in each of the modules.
-
trimmed_reads<readID>.fastq.gz: Trimmed reads after running TRIMMOMATIC with FASTQC-inferred adapters. -
trimmed_reads<readID>_fastqcReport.html: Html report of the quality of the trimmed reads, according to FASTQC.
-
aligned_reads.bam.sorted: Aligned reads with marked duplicates (unless--skip_marking_duplicatesis specified) with bwa mem and sorted by genomic coordinate. -
aligned_reads.bam.sorted.bai: The index file of the aligned reads. -
coverage_windows_<window_length>bp.tab: A table with the coverage for windows of the genome that are as long as 5% of the median chromosome length (<window_length>). This is the output of mosdepth. This table contatins the following fields:-
#chromis the chromosome name. -
startis the 0-based start coordinates of the region -
endis the 1-based end coordinate of the region -
lengthis the length of the region -
mediancov_1is the median read depth in the region -
percentcovered_1is the perecentage of the region that is covered with 1 read or more -
nocoveragebp_1is the number of bases that have no coverage
-
-
RepeatModeler_families.fasta: The repeat families identified byRepeatModeler. -
RepeatMasker_RepeatModelerFamilies.out: The repeats annotated byRepeatMaskerusingRepeatModeler_families.fastaas query librarary. -
RepeatMasker_DefaultFamilies.out: The repeats annotated byRepeatMaskerusing the default repeat databases as query librarary. -
combined_repeats.tab: The combined output ofRepeatMasker_RepeatModelerFamilies.outandRepeatMasker_DefaultFamilies.outin a .tab format. This file can be used as an input for the modulesSV_calling,coverageCNV_calling,small_variant_callingandSV_parameter_optimization.
-
breakpoints_aroundHomRegions_wsize=<query_window_size>bp_maxEval=<max_evalue>_minQcovS=<min_pct_overlap>.bedpe: A .bedpe file with the pairs of homologous regions in the provided genome.
-
knownSVs_breakpoints.bedpe: A .bedpe file with the regions that have SVs as called by perSVade in several short-read sequencing datasets.
-
optimized_parameters.json: A .json file with the optimized parameters, which can be an input to the modulecall_SVs. -
optimized_parameters_accuracy.tab: A table with the accuracy (precision, recall, Fvalue) of the optimized parameters when tested on the different simulations and SVs. Only SV types that have at least 5 SVs simulated are shown. These are the fields:-
simulation_genome,simulation_ploidyindicate the testing simulation and type of SV. For example, if you ran parameter optimisation for 2 simulations you expectsimulation_genometo have "simulation_1" and "simulation_2". If you ran for only one ploidy (i.e.: haploid) you expectsimulation_ploidyto be always "haploid". -
FN,FP,TP,neventsare the numbers of false negatives, false positives, true positives and number of all real positives, respectively. -
precision,recallandFvalueare the resulting accuracy.
-
-
SVfiles: Files of the simulated SVs. They are formatted as explained in the FAQ How are the SVs encoded into single files. -
plots/cross_accuracy_heatmaps/: Heatmaps that show how the best parameters for one simulation and SV type (training parameters in the rows) work for the other simulations and SV types (testing simulations in the columns). There is one plot for each accuracy measurement (recall, precision and Fvalue). Only testing SV types that have at least 5 SVs simulated are shown. Theoptimized_parameters.json(training) are those that work best across all the testing simulations and SVs. The heatmap shows the accuracy (as an annotation on the plot) for the chose set ofoptimized_parameters.json. These plots are useful to infer the SV calling accuracy in the input sample and assess the overfitting associated to parameter optimization.
-
deletions.tab,inversions.tab,tandemDuplications.tab,insertions.tab,translocations.tab,unclassified_SVs.tab: These are the called SVs. Each SV type is saved into a different files. Check the FAQ How are the SVs encoded into single files for more information on these files. -
gridss_output.raw.vcfandgridss_output.filt.vcfare the raw and filtered outputs ofgridss, respectively. These are a set of breakpoints encoded in a .vcf file. Check the gridss documentation for more details. -
gridss_output.filt.bedpeis a representation in the .bedpe format of the breakpoints fromgridss_output.filt.vcf. The column names are "chrom1", "start1", "end1", "chrom2", "start2", "end2", "breakpoint ID", "QUAL", "strand 1", "strand 2". Each row indicates a breakpoint between two breakends (1 and 2). Each breakend has a "start" and "end" (for example "start1" and "start2") because the exact position is not always resolved (and hence the range indicates a region where the breakend is found). -
clove_output.vcfis the output of clove without any coverage-based filtering of tandem duplications and deletions. Check clove for more details.
-
final_CNVcalling.tab: The regions with a copy number different from 1. These are the relevant fields (some of them depend on the chosen CNV calling algorithms specified with the option--cnv_calling_algs):-
chromosome,startandendindicate the region under CNV. -
median_relative_CN_AneuFinder,median_relative_CN_HMMcopyandmedian_relative_CN_CONYindicate the predicted copy number by each of the three CNV callers. This value is relative to 1.0 (absence of CNV). In a diploid genome, a complete deletion would be 0.0, a monosomy 0.5, a trisomy 1.5 and a tetraploidy 2.0. -
merged_relative_CNis the output of mergingmedian_relative_CN_AneuFinder,median_relative_CN_HMMcopyandmedian_relative_CN_CONY. This merging is done in a conservative way, so that the value that is closer to 1.0 from the three programs (no CNV) is always taken. -
median_coverageis the median read depth of this region. -
median_coverage_correctedis the coverage corrected by mappability, GC content and distance to the telomere. This is the coverage that is used to run CNV calling. Note that the correction is only performed if a LOWESS fit on the data using a given set of predictors yields and r2>0.1
-
-
final_df_coverage.tab: A table with the raw and corrected coverage per windows of the genome, which can be used for debugging the output. -
calculating_corrected_coverage: A folder with some files reporting how the coverage correction went. The most important one iscalculating_corrected_coverage/plots/coverage_interactive.html, which shows the raw and corrected coverage in an interactive way. -
plots: A folder with graphical visualizations of the CNV calling. The fileplots/CNcalling_interactive.htmlshows the corrected coverage with the results of each of the programs. It can be used to filter the outputs.
-
SV_and_CNV_variant_calling.vcf: Merged .vcf file with all SVs and CNVs. Check the FAQ What is in SV_and_CNV_variant_calling.vcf? for more information. Note that this file is NOT sorted by genomic coordinates, as it is sorted by SV identity. This may give problems when integrated with other software. If your downstream software requires a file sorted by genomic coordinates you can runsort -k1,1 -k1n,2 <vcf file>to get the sorted vcf by genomic coordinate.
-
annotated_variants_before_GeneticCode_correction.html: The graphical representation of the functional annotation withVEPbefore correcting the annotations according to the genetic codes. This file is useful to have an overview of the types of variants. However, it is based on the assumption that the genetic code is standard for all protein-coding genes. This means that some variants may be incorrectly annotated. The fileannotated_variants.tabcontains the corrected variants. -
annotated_variants.tab: The final, corrected functional annotation withVEP, where each line corresponds to a particular alteration in a given gene. This file is the 'Default VEP outout' (tabular) format from https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output.
-
bcftools_ploidy<ploidy>_out,HaplotypeCaller_ploidy<ploidy>_outandfreebayes_ploidy<ploidy>_out: Folders that contain the raw and filtered vcf files of each fo the programs. -
merged_vcfs_allVars_ploidy<ploidy>.vcf: A vcf file with the merged output of the algorithms used in the variant calling. You can check the header to understand what are all the tags in theINFOfield. Here, the multiallelic loci are split into different lines to ease the analysis. -
variant_calling_ploidy<ploidy>.tab: A tabular version ofmerged_vcfs_allVars_ploidy<ploidy>.vcf. Each column contains information parsed from the vcf file. This file is easier to manage because it has all the info in tabular format. -
variants_atLeast<n_PASS_programs>PASS_ploidy<ploidy>.vcf: The variants that PASS the filters in at least n_PASS_programs algorithms and also have a fraction of reads covering them above--min_AF. The INFO and FORMAT fields are simplified to take less space. It may be useful to take, for example, variants that PASS de filters by at least 2 algorithms.
-
annotated_variants.tab: The final, corrected functional annotation withVEP, where each line corresponds to a particular alteration in a given gene. This file is the 'Default VEP outout' (tabular) format from https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#output. In addition, this file also contains information about whether the annotation refers to a SNP (in the 'is_snp' field) and whether it is a protein-altering annotation (through the 'is_protein_altering' field).
-
genes_and_regions_coverage.tab: A table that contains the coverage of all the genes in the provided gff. These are the fields:-
chromosome,start,endandlengthare the coordinates of the gene -
median_reads_per_geneis the median read depth for this gene -
nocoveragebp_1is the number of bases that have no coverage -
percentcovered_1is the perecentage of the region that is covered with 1 read or more -
IDis the gene ID in the gff (parsed from the attributes). If there are some duplicated IDs with parts it corresponds to the union ofIDandpart. -
fraction_covered_by_MoreThan1readis equivalent topercentcovered_1, but from 0 to 1. -
relative_coverageis themedian_reads_per_genedivided by the median of allmedian_reads_per_gene. - All the fields with a
_+-10kb_regionsufix are equivalent to the ones that do not have it, but for a region that starts at -10kb of the gene start and ends at +10kb of the gene end. This can be useful for CNV analysis.
-
You can also use the script scripts/perSVade.py to execute multiple modules of perSVade with a single command. However, this script is less user friendly and error prone, and it will be no longer maintained after version 1.02.3. This file contains the description of the output of the one-liner version.