Fusion detection overlap merges fusion output data from Arriba, CICERO, FusionMap, FusionCatcher, JAFFA, MapSplice, SOAPfuse, STAR-Fusion, TopHat-Fusion, and DRAGEN RNA-Seq.
Download the overlap code from GitHub. Use git
to clone the repo or navigate to the nch-igm-ensemble-fusion-detection GitHub page, download, and unzip the code.
git clone https://github.com/nch-igm/nch-igm-ensemble-fusion-detection.git
The main dependencies are R (R Setup) and a Postgres database (Postgres Database Setup).
Install R (version 3.5 is preffered since the code is tested on this version). Visit the R website for installation instructions or use your preferred install tool.
The following packages are required for fusion detection overlap.
Start an R session.
R
Run these commands in the R session.
install.packages('devtools', repos='http://cran.us.r-project.org/')
install.packages('optparse', repos='http://cran.us.r-project.org/')
install.packages('dplyr', repos='http://cran.us.r-project.org/')
install.packages('readr', repos='http://cran.us.r-project.org/')
install.packages('DBI', repos='http://cran.us.r-project.org/')
install.packages('RPostgres', repos='http://cran.us.r-project.org/')
install.packages('dbplyr', repos='http://cran.us.r-project.org/')
install.packages('magrittr', repos='http://cran.us.r-project.org/')
install.packages('purrr', repos='http://cran.us.r-project.org/')
Overlap interacts with a database of fusions. The database is used to store previously called fusions. The overlap code queries the database for the frequency of reported fusions. Fusion calls with a high frequency are regarded as artifacts or unlikely to cause disease.
On macOS, you can use Homebrew to install Postgres.
brew install postgres
On Linux or Windows, you will need to find alternative instructions.
On macOS, Homebrew automatically initializes the space where the databases will be written on your filesystem at /usr/local/var/postgres
. Users may experience different results on different systems. You may need to initialize the database at /usr/local/var/postgres
or another location or you may not need to initialize the database at all.
initdb /usr/local/var/postgres
Start the database.
pg_ctl -D /usr/local/var/postgres start
Create the fusions database.
createdb fusions
Create the postgres user if it doesn't already exist.
createuser -s postgres
Import the schema and test data for the fusions database. You may need to modify the host.
cd nch-igm-ensemble-fusion-detection
psql -h localhost fusions < sql/fusions_example_db_dump.sql
Modify the read-only and read-write users' passwords in the create_users.sql file.
vim sql/create_users.sql # set passwords
Import the read-only and read-write users.
psql -h localhost fusions < sql/create_users.sql
R/.dbconfig.R
sets the configuration variables that the overlap code uses to connect to the database. R variables dbname
, host
, port
, user
, and passwd
must be set in this file so that the overlap code can connect to the database.
echo 'dbname <- "fusions"
host <- "localhost"
port <- 5432
user <- "fusion_user_rw"
passwd <- "PASSWORD_RW_HERE"' > R/.dbconfig.R
In order for the overlap script to run properly, there is an expected file structure hierarchy, please see
Run 'kickoff_overlap.sh' to run the necessary R scripts (described in detail below)
overlap/kickoff_overlap.sh -h
The result:
[USAGE]: This script kicks off a script that merges fusion detection results from many tools.
-h * [None] Print this help message.
-p * [path1,path2,...] paths where fusion results are where the script will be kicked off.
-s * [samples_file1,samples_file2,...] Files paths that contain the lists of samples to kickoff scripts for. The number of samples files provided must be equal to the number of paths, and they must be listed in the same respective order. If not specifically added, the script will search through sample names that are listed in the samples file within the path (-p) given[Default: samples,samples,...].
Note that -s is not required, and if not added the script will automatically run on all samples listed in your samples file
overlap/kickoff_overlap.sh -p /test_data
Results (will show you which callers it found output for):
Test
WARNING: ignoring environment value of R_HOME
Assembling a list of files to merge...
[1] "starFusion" "fusionMap" "fusionCatcher" "jaffa"
[5] "mapSplice" "dragen"
Run assemble_results.R
to merge results and generate the overlap report. See the help message for assemble_results.R
by running
R/assemble_results.R -h
or
Rscript R/assemble_results.R -h
The result
Usage: assemble_results.R --sample s1
assemble_results.R reads a set of fusion detection result files and produces an aggregated report with the overlapping fusion events predicted in the input files. The program searches recursively from the current directory to find known files. The input files can also be specified manually.
Options:
--sample=SAMPLE
Name of the sample. (required)
--baseDir=BASEDIR
Base directory to search for input files. [default = .]
--outReport=OUTREPORT
Location of the output report. [default = overlap_$sample.tsv]
--collapseoutReport=COLLAPSEOUTREPORT
Location of the output report. [default = collapsed_3callers_$sample.tsv]
--foutReport=FOUTREPORT
Location of the output report. [default = filtered_overlap_2callers_$sample.tsv]
--foutReport3=FOUTREPORT3
Location of the output report. [default = filtered_overlap_3callers_$sample.tsv]
--outSingleton=OUTSINGLETON
Location of the Singleton output report. [default = Singleton_KnownFusions_$sample.tsv]
--outBreakpoints=OUTBREAKPOINTS
Location of the breakpoint report. [default = breakpoints_$sample.tsv]
--dragen=DRAGEN
Path to the dragen results file. [default = search for file named like 'DRAGEN.fusion_candidates.final']
--fusionCatcher=FUSIONCATCHER
Path to the fusionCatcher results file. [default = search for file named like 'final-list_candidate-fusion-genes.txt']
--fusionMap=FUSIONMAP
Path to the fusionMap results file. [default = search for file named like 'FusionDetection.FusionReport.Table.txt']
--jaffa=JAFFA
Path to the jaffa results file. [default = search for file named like 'jaffa_results.csv']
--mapSplice=MAPSPLICE
Path to the mapSplice results file. [default = search for file named like 'fusions_well_annotated.txt']
--soapFuse=SOAPFUSE
Path to the soapFuse results file. [default = search for file named like '.final.Fusion.specific.for.genes']
--starFusion=STARFUSION
Path to the starFusion results file. [default = search for file named like 'star-fusion.fusion_predictions.abridged.tsv']
--tophatFusion=TOPHATFUSION
Path to the tophatFusion results file. [default = search for file named like 'result.txt']. potential_fusion.txt must be in the same directory as result.txt for the import to work.
--arriba=ARRIBA
Specific location of the arriba results file (fusions.tsv).
--cicero=CICERO
Specific location of the cicero results file (annotated.fusion.txt).
-h, --help
Show this help message and exit```
### `--sample`
`--sample` is the only required argument for `assemble_results.R`.
```bash
R/assemble_results.R --samples s1
--sample
is used to insert a comment into the top of the output files
# Sample: s1
# NumToolsAggregated: 5
# - starFusionCalls = 20
# - fusionCatcherCalls = 552
# - jaffaCalls = 948
# - mapSpliceCalls = 23
# - dragenCalls = 9
and name the output files
--outReport = overlap_$sample.tsv
--collapsedoutReport = collapsed_3callers_$sample.tsv
--foutReport = filtered_overlap_2callers_$sample.tsv
--foutReport3 = filtered_overlap_3callers_$sample.tsv
--outSingleton = Singleton_KnownFusions_$sample.tsv
--outBreakpoints = breakpoints_$sample.tsv
You can override the default output file names with the --*out*
parameters.
The --baseDir
argument sets the location of the input data. By default, --baseDir
is set to the current directory .
, but you can override the default location
R/assemble_results.R --sample s1 --baseDir fusion_results
kickoff_overlap.R
recursively searches all directories under the baseDir
for fusion detection results files. The default search looks for
--arriba = fusions.tsv
--cicero = annotated.fusion.txt
--dragen = DRAGEN.fusion_candidates.final
--fusionCatcher = final-list_candidate-fusion-genes.txt
--fusionMap = FusionDetection.FusionReport.Table.txt
--jaffa = jaffa_results.csv
--mapSplice = fusions_well_annotated.txt
--soapFuse = .final.Fusion.specific.for.genes
--starFusion = star-fusion.fusion_predictions.abridged.tsv
--tophatFusion = result.txt (potential_fusion.txt must be in the same directory as result.txt)
For each tool, overlap uses the first occurence of a file that matches the pattern associated with that tool. If there are no files that match for the tool, then overlap assumes that the tool was not run.
Further info: kickoff_overlap.R
uses the R function list.files
with the pattern
parameter set to the pattern. The pattern is used as a regular expression used to match file names. The soapFuse
pattern .final.Fusion.specific.for.genes
starts with a period because the actual output file is named [sample].final.Fusion.specific.for.genes
. Our pattern will match soapFuse
output files with different samples names, like s1.final.Fusion.specific.for.genes
or random_sample_name.final.Fusion.specific.for.genes
.
Given the following directory structure
/data/fusion_output_1/
final-list_candidate-fusion-genes.txt
FusionDetection.FusionReport.Table.txt
star-fusion.fusion_predictions.abridged.tsv
DRAGEN.fusion_candidates.final
run the command
cd /data/fusion_output_1
~/igm-ensemble-fusion-detection/R/assemble_results.R --sample s1
or
cd ~/igm-ensemble-fusion-detection
R/assemble_results.R --sample s1 --baseDir /data/fusion_output_1
and overlap will be able to find fusion output files for fusionCatcher
, fusionMap
, starFusion
, and dragen
. Overlap will be able to find the correct files even if they are under subdirectories since the search is recursive.
/data/fusion_output_1/
fusioncatcher/
final-list_candidate-fusion-genes.txt
fusionmap/
results/
FusionDetection.FusionReport.Table.txt
starfusion/
star-fusion.fusion_predictions.abridged.tsv
dragen/
DRAGEN.fusion_candidates.final
You can override the default fusion output file search with direct fusion output file names.
R/assemble_results.R --samples s1 --starFusion path/to/custom-star-fusion-output.tsv
This command will recursively search under the current directory .
for the dragen
, fusionCatcher
, fusionMap
, etc., default files, but it will not perform the same search for the starFusion
file. It will use the exact location to the starFusion
file as provided by the user path/to/custom-star-fusion-output.tsv
.
The overlap code only knows how to parse the data from the output files in the default list. Make sure your custom locations point to files that hold the same data. There may be multiple output files from each fusion caller. Our overlap code is only parsing certain files for certain data to generate the overlap report.
Run kickoff_upload.sh
to upload fusion results to the database.
overlap/kickoff_upload.sh -h
The result
[USAGE]: Use this script to upload fusion detection results to the fusion detection database.
-h * [None] Print this help message.
-d * [results_dir1,results_dir2,...] List of directories (comma-separated) where the fusion results are located. Directories must be full paths. Ex: /path/to/fusion_results1,/path/to/fusion_results2.
-p * [patient1,patient2,...] List of patient IDs (comma-separated). The number of patient IDs must be equal to the number of results directories and they must be listed in the same respective order. Ex: patient1,patient2. (Also referred to as the 'subject ID'.)
-s * [sample1,sample2,...] List of sample IDs (comma-separated). The number of sample IDs must be equal to the number of results directories and they must be listed in the same respective order. Ex: sample1,sample2.
-r * [reads1,reads2,...] List of read pairs numbers (comma-separated). The number of read pairs numbers must be equal to the number of results directories and they must be listed in the same respective order. Optional. Ex: 40000000,52000000.
[Example 1]: kickoff_upload.sh -d /path/to/fusion_results1 -p p1 -s s1
[Example 2]: kickoff_upload.sh -d /path/to/fusion_results1,/path/to/fusion_results2 -p p1,p2 -s s1,s2
[Example 3]: kickoff_upload.sh -d /path/to/fusion_results1 -p p1 -s s1 -r 42543879
The -d
results directory must contain fusion detection results files. The results directory usually is the top-level directory for a sample where multiple fusion caller results are stored in sub-directories. The upload code recursively searches all directories under the results directory for fusion detection results files. See the example above for an example results directory layout. kickoff_upload.sh
can take a single results directory as input (for results from multiple tools on a single sample) or multiple results directories as input (to make the upload process faster for multiple samples).
The -p
patient ID and -s
sample ID are necessary identifiers to upload the data to the database. The -r
number of read pairs is optional, depending on whether you want to store that information.
The database uses the term 'analysis' to refer to a fusion caller execution on a single sample. A STAR-Fusion
run on sample1
is considered an analysis, say analysis1
. A JAFFA
run on sample1
is considered a different analysis, say analysis2
.
When uploading data to the database, the upload script uses the concept of an analysis to decide whether the data already exists in the database. If the analysis already exists in the database, then the upload script will skip the upload for that analysis. If there is already STAR-Fusion
data for sample1
in the database (analysis1
), then any additional attempt to upload or replace STAR-Fusion
data for sample1
will fail.
- Upside - Since an analysis will not be uploaded twice, you can re-run the upload script on the same results directory and not have to worry about overwriting data. This can be useful if, for example, you run
STAR-Fusion
andJAFFA
onsample1
and upload the results. At a later time, you decide to runFusionMap
on the same sample. When you want to add theFusionMap
results to the database, you can run the upload script on thesample1
results directory and it will only add the newFusionMap
analysis to the database. - Downside - You cannot replace analysis data using the upload script. If, for example, you run a new version of
STAR-Fusion
onsample1
, you might want to replace the oldSTAR-Fusion
results in the database. The upload script does not handle this scenario and you will have to perform a manual replacement.
Use upload_fusion_results.R
for a greater level of control over the database upload (or if you don't have bash
on your system to run kickoff_upload.sh
).
R/upload_fusion_results.R -h
The result
Usage: R/upload_fusion_results.R [options]
This script acts as an importer for fusion detection results to the fusion detection database.
Options:
--subject=SUBJECT
Subject ID, e.g. subject1. (required)
--sample=SAMPLE
Sample ID, e.g. sample1. (required)
--phenotype=PHENOTYPE
Phenotype of the subject. (optional)
--storage=STORAGE
Storage method for the sample, such as 'Frozen' or 'FFPE'. (optional)
--reads=READS
Number of read pairs for the sample. (optional)
--tool=TOOL
Custom tool for which the results are being entered, possible values are 'dragen', 'FusionCatcher', 'FusionMap', 'Jaffa', 'MapSplice', 'SoapFuse', 'StarFusion', and 'TophatFusion'. (optional)
--results=RESULTS
Custom location of the results file for the associated tool, e.g. 'star-fusion.fusion_predictions.abridged.tsv'. (optional)
-h, --help
Show this help message and exit
upload_fusion_results.R
must be run from the results directory. It searches for results files under .
. Alternatively, you can upload results one-by-one by providing the --tool
and --results
parameters.