A publicly available WDL workflow made by Shalek Lab for Kallisto and Bustools wrapped within kb_python.
By jgatter [at] broadinstitute.org, created November 2019.
Jointly maintained by jgatter [at] broadinstitute.org and the Cumulus Team.
FULL DISCLOSURE: many optional parameters remain untested. Please post bug reports and other issues on this repo.
Kallisto and Bustools software developed by Pachter Lab. Documentation.
The kallisto-bustools workflow can be found here. This main workflow calls two subworkflows which can be run individually: kallisto-bustools_reference and kallisto-bustools_count.
The workflow takes a sample sheet as input: a tab-delimited text file with a header row. Here is an example:
Sample R1_Path R2_Path
CGP gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/CGP_R1.fastq.gz gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/CGP_R2.fastq.gz
DMSO gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/DMSO_R1.fastq.gz gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/DMSO_R2.fastq.gz
LGD gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LGD_R1.fastq.gz gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LGD_R2.fastq.gz
LKS_CGP gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LKS_CGP_R1.fastq.gz gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LKS_CGP_R2.fastq.gz
You must have the columns Sample, R1_Path, and R2_Path. If your samples do not use paired-end FASTQs, contact James.
Sample can be any string, but R1_Path and R2_Path must be the gsURI leading to the respective FASTQ on the Google bucket.
After the reference is built, each row of this file will be scattered, or parallelized, as a shard of kallisto-bustools_count.
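Before uploading, it can help to check the sample sheet locally. The following is a minimal sketch, not part of the workflow: it assumes only the rules stated above (tab-delimited, a header row with Sample, R1_Path, and R2_Path, and gs:// paths). The function name read_sample_sheet and the gs://your-bucket-id/ paths are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("Sample", "R1_Path", "R2_Path")


def read_sample_sheet(handle):
    """Parse a tab-delimited sample sheet and return one dict per sample row."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))

    rows = []
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        cleaned = {c: (row.get(c) or "").strip() for c in REQUIRED_COLUMNS}
        if not cleaned["Sample"]:
            raise ValueError(f"line {line_number}: Sample is empty")
        for column in ("R1_Path", "R2_Path"):
            if not cleaned[column].startswith("gs://"):
                raise ValueError(
                    f"line {line_number}: {column} is not a gsURI: {cleaned[column]!r}"
                )
        rows.append(cleaned)
    return rows


# Hypothetical two-sample sheet; replace the bucket with your own.
sheet = (
    "Sample\tR1_Path\tR2_Path\n"
    "CGP\tgs://your-bucket-id/CGP_R1.fastq.gz\tgs://your-bucket-id/CGP_R2.fastq.gz\n"
    "DMSO\tgs://your-bucket-id/DMSO_R1.fastq.gz\tgs://your-bucket-id/DMSO_R2.fastq.gz\n"
)
samples = read_sample_sheet(io.StringIO(sheet))
print(len(samples), "rows, one per shard of kallisto-bustools_count")
```

Each returned row corresponds to one shard of the count subworkflow. The check only looks at the shape of the paths; it does not confirm that the files exist in the bucket.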
Due to the strange command line options for kb-python, the configuration of this workflow is not straightforward. Note that this workflow has two tasks: build_reference and count.
- Export the workflow to your Terra workspace.
- Upload your FASTQs, write and upload your sample sheet, then input your sample sheet for sample_sheet.
- Set bucket as the gsURI to your Google bucket: gs://your-bucket-id/. Set output_path as the path, from the root of your bucket, where you want your output files sent, e.g. 20200124_YourSampleSet/kb/.
- If you want files that are suited for RNA Velocity, set lamanno=true, otherwise false.
- If you do not have a reference, you must download or build one. Set run_build_reference=true; otherwise set it to false and skip to the step below for an already-built reference.
  - Decide whether you want to build your own reference index or download a pre-existing one. I believe the downloadable ones are NOT suited for RNA Velocity, so if you set lamanno=true, tough cookies, you'll have to build one: obtain a GTF and a genomic FASTA and input them for reference_gtf and genomic_fasta, respectively. For downloading, simply set download_index equal to one of the following: human, mouse, or linnarsson.
- If you already had your reference built and set run_build_reference=false, you will have to set the following:
  - preexisting_cDNA_transcripts_to_capture = gs://bucket/path/to/cDNA_transcripts_to_capture.txt
  - preexisting_index = gs://bucket/path/to/index.idx
  - preexisting_intron_transcripts_to_capture = gs://bucket/path/to/intron_transcripts_to_capture.txt
  - preexisting_T2G_mapping = gs://bucket/path/to/transcripts_to_genes.txt
- (Continued) Regardless of whether you set run_build_reference to true or false, set technology to the technology you used to process the samples. Use DROPSEQ for Seq-Well and your whitelist will be autogenerated. If your technology comes with a whitelist of barcodes, set it through barcode_whitelist. Here's the full table:
name       whitelist provided  barcode (file #, start, stop)      umi (file #, start, stop)  read file #
---------  ------------------  ---------------------------------  -------------------------  -----------
10XV1      yes                 (2, 0, 0)                          (1, 0, 0)                  0
10XV2      yes                 (0, 0, 16)                         (0, 16, 26)                1
10XV3      yes                 (0, 0, 16)                         (0, 16, 28)                1
CELSEQ                         (0, 0, 8)                          (0, 8, 12)                 1
CELSEQ2                        (0, 6, 12)                         (0, 0, 6)                  1
DROPSEQ                        (0, 0, 12)                         (0, 12, 20)                1
INDROPSV1                      (0, 0, 11) (0, 30, 38)             (0, 42, 48)                1
INDROPSV2                      (1, 0, 11) (1, 30, 38)             (1, 42, 48)                0
INDROPSV3  yes                 (0, 0, 8) (1, 0, 8)                (1, 8, 14)                 2
SCRUBSEQ                       (0, 0, 6)                          (0, 6, 16)                 1
SURECELL                       (0, 0, 6) (0, 21, 27) (0, 42, 48)  (0, 51, 59)                1
- If you want your matrices output in loom or h5ad format, set loom or h5ad to true, whichever you want. I don't want to imagine what happens if you set both to true...
- Configure any other optional parameters as you desire. Increasing number_cpu_threads might be worth it if your GCP zone has a machine with more than 32 cores.
- Up top, set the workflow to "Process single workflow from files" and then hit "Run Analysis". Monitor your job! If it crashes and, for the life of you, you can't figure out what's gone wrong, post an issue on this GitHub repository.
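The conditional rules in the steps above are easy to get wrong, so here is a minimal sketch of a local sanity check you could run on your planned inputs before launching. It is not part of the workflow and only encodes what the steps above say. The function name check_inputs is hypothetical, the dictionary keys are the bare input names (Terra may display them with a workflow prefix), and the example file names are placeholders.

```python
VALID_DOWNLOADS = ("human", "mouse", "linnarsson")

# Inputs the steps above list for an already-built reference.
PREEXISTING_KEYS = (
    "preexisting_cDNA_transcripts_to_capture",
    "preexisting_index",
    "preexisting_intron_transcripts_to_capture",
    "preexisting_T2G_mapping",
)


def check_inputs(inputs):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    for key in ("sample_sheet", "bucket", "output_path", "technology"):
        if not inputs.get(key):
            problems.append(f"{key} must be set")

    if inputs.get("run_build_reference"):
        download = inputs.get("download_index")
        if download:
            if download not in VALID_DOWNLOADS:
                problems.append("download_index must be human, mouse, or linnarsson")
            if inputs.get("lamanno"):
                # The README's author believes downloaded indices do not suit RNA Velocity.
                problems.append("lamanno=true: build from reference_gtf and genomic_fasta instead")
        elif not (inputs.get("reference_gtf") and inputs.get("genomic_fasta")):
            problems.append("set download_index, or both reference_gtf and genomic_fasta")
    else:
        for key in PREEXISTING_KEYS:
            if not inputs.get(key):
                problems.append(f"{key} must be set when run_build_reference=false")

    if inputs.get("loom") and inputs.get("h5ad"):
        problems.append("set only one of loom and h5ad to true")

    return problems


# Hypothetical inputs: RNA Velocity requested, but a downloaded index selected.
example = {
    "sample_sheet": "gs://your-bucket-id/sample_sheet.tsv",
    "bucket": "gs://your-bucket-id/",
    "output_path": "20200124_YourSampleSet/kb/",
    "technology": "DROPSEQ",
    "lamanno": True,
    "run_build_reference": True,
    "download_index": "human",
}
for problem in check_inputs(example):
    print(problem)
```

Returning a list rather than raising on the first problem lets you see every issue in one pass, which matters when each failed launch costs a round trip through Terra.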
TODO: Tables for all inputs and outputs.
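To make the technology table above more concrete, here is a minimal sketch of how a (file #, start, stop) entry can be read. It assumes the convention used by kallisto's custom technology strings: file numbers and positions are 0-based, start is inclusive, stop is exclusive, and a stop of 0 means "through the end of the read". The names TECHNOLOGIES, extract, and split_read are hypothetical, and only three rows of the table are copied in.

```python
# Rows copied from the technology table: lists of (file #, start, stop).
TECHNOLOGIES = {
    "10XV2": {"barcode": [(0, 0, 16)], "umi": [(0, 16, 26)]},
    "DROPSEQ": {"barcode": [(0, 0, 12)], "umi": [(0, 12, 20)]},
    "INDROPSV1": {"barcode": [(0, 0, 11), (0, 30, 38)], "umi": [(0, 42, 48)]},
}


def extract(reads, spans):
    """Concatenate the slices named by spans, where reads[i] is the read from file i."""
    parts = []
    for file_index, start, stop in spans:
        read = reads[file_index]
        # Assumed convention: stop == 0 means the slice runs to the end of the read.
        end = len(read) if stop == 0 else stop
        parts.append(read[start:end])
    return "".join(parts)


def split_read(technology, reads):
    """Return (barcode, umi) for one read group under the given technology."""
    layout = TECHNOLOGIES[technology]
    return extract(reads, layout["barcode"]), extract(reads, layout["umi"])


# Made-up DROPSEQ read pair: file 0 carries a 12 bp barcode then an 8 bp UMI.
r1 = "ACGTACGTACGT" + "TTTTGGGG"
r2 = "GATTACAGATTACA"
barcode, umi = split_read("DROPSEQ", [r1, r2])
print(barcode, umi)
```

For technologies with several barcode entries, such as INDROPSV1, the sketch joins the slices in table order.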