Kallisto-Bustools Workflow

A publicly available WDL workflow made by Shalek Lab for Kallisto and Bustools wrapped within kb_python.
By jgatter [at] broadinstitute.org, created November 2019.
Jointly maintained by jgatter [at] broadinstitute.org and the Cumulus Team.
FULL DISCLOSURE: many optional parameters remain untested. Please report bugs and other problems by posting issues on this repo.
Kallisto and Bustools software developed by Pachter Lab. Documentation.

About

The kallisto-bustools workflow can be found here. This main workflow calls two subworkflows which can be run individually: kallisto-bustools_reference and kallisto-bustools_count.

Writing your sample sheet

This sample sheet is a tab-delimited text file with a header row. Here is an example:

Sample	R1_Path	R2_Path
CGP	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/CGP_R1.fastq.gz	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/CGP_R2.fastq.gz
DMSO	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/DMSO_R1.fastq.gz	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/DMSO_R2.fastq.gz
LGD	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LGD_R1.fastq.gz	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LGD_R2.fastq.gz
LKS_CGP	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LKS_CGP_R1.fastq.gz	gs://fc-secure-06a1d647-0ffd-410c-af2e-8a38acfede12/IRA_FASTQs/LKS_CGP_R2.fastq.gz

The sheet must contain the columns Sample, R1_Path, and R2_Path. If your samples do not use paired-end FASTQs, contact James.

Sample can be any string, but R1_Path and R2_Path must be the gsURI leading to the respective FASTQ on the Google bucket.

After the reference is built, each row of this file will be scattered, or parallelized, as a shard of kallisto-bustools_count.
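
Before uploading, it can help to check the sheet locally. The following is a minimal sketch, not part of the workflow itself: `validate_sample_sheet` is a hypothetical helper that only checks the rules stated above (a tab-delimited header with Sample, R1_Path, and R2_Path, and gs:// paths in both path columns). It does not check that the files actually exist in the bucket.

```python
import csv
import io

# Column names required by the sample sheet, as described above.
REQUIRED_COLUMNS = ("Sample", "R1_Path", "R2_Path")


def validate_sample_sheet(text):
    """Return a list of problems found in a tab-delimited sample sheet.

    Hypothetical helper for illustration; an empty list means no problems
    were found by these basic checks.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    for row_number, row in enumerate(reader, start=1):
        if not (row["Sample"] or "").strip():
            problems.append(f"data row {row_number}: empty Sample")
        for column in ("R1_Path", "R2_Path"):
            value = (row[column] or "").strip()
            if not value.startswith("gs://"):
                problems.append(
                    f"data row {row_number}: {column} is not a gs:// URI: {value!r}"
                )
    return problems


if __name__ == "__main__":
    # Placeholder bucket and file names, assumed for the example only.
    sheet = (
        "Sample\tR1_Path\tR2_Path\n"
        "CGP\tgs://your-bucket-id/CGP_R1.fastq.gz\tgs://your-bucket-id/CGP_R2.fastq.gz\n"
    )
    for problem in validate_sample_sheet(sheet):
        print(problem)
```

To check a real file, read it with `open(path, newline="").read()` and pass the text to the function.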

Configuring the workflow

Because of kb-python's unusual command-line options, configuring this workflow is not straightforward. Keep in mind that the workflow has two tasks: build_reference and count.

  1. Export the workflow to your Terra workspace.
  2. Upload your FASTQs, write and upload your sample sheet, then input your sample sheet for sample_sheet.
  3. Set bucket as the gsURI to your Google bucket: gs://your-bucket-id/. Set output_path as the path, relative to the root of your bucket, where you want your output files sent, e.g. 20200124_YourSampleSet/kb/.
  4. If you want files that are suited for RNA Velocity, set lamanno=true, otherwise false.
  5. If you do not have a reference, you must download or build one: set run_build_reference=true. Otherwise, set it to false and skip to step 7.
  6. Decide whether you want to build your own reference index or download a pre-existing one. I believe the downloadable ones are NOT suited for RNA Velocity, so if you set lamanno=true, tough cookies, you'll have to build one. To build, obtain a GTF and a genomic FASTA and input them for reference_gtf and genomic_fasta, respectively. To download, simply set download_index to one of the following: human, mouse, or linnarsson.
  7. If you already have a built reference and set run_build_reference=false, you will have to set the following:

preexisting_cDNA_transcripts_to_capture = gs://bucket/path/to/cDNA_transcripts_to_capture.txt
preexisting_index = gs://bucket/path/to/index.idx
preexisting_intron_transcripts_to_capture = gs://bucket/path/to/intron_transcripts_to_capture.txt
preexisting_T2G_mapping = gs://bucket/path/to/transcripts_to_genes.txt

  8. Regardless of whether you set run_build_reference to true or false, set technology to the technology you used to process the samples. Use DROPSEQ for Seq-Well and your whitelist will be autogenerated. If your technology comes with a whitelist of barcodes, set it through barcode_whitelist. Here's the full table:
name         whitelist provided    barcode (file #, start, stop)        umi (file #, start, stop)    read file #    
---------    ------------------    ---------------------------------    -------------------------    -----------    
10XV1        yes                   (2, 0, 0)                            (1, 0, 0)                    0              
10XV2        yes                   (0, 0, 16)                           (0, 16, 26)                  1              
10XV3        yes                   (0, 0, 16)                           (0, 16, 28)                  1              
CELSEQ                             (0, 0, 8)                            (0, 8, 12)                   1              
CELSEQ2                            (0, 6, 12)                           (0, 0, 6)                    1              
DROPSEQ                            (0, 0, 12)                           (0, 12, 20)                  1              
INDROPSV1                          (0, 0, 11) (0, 30, 38)               (0, 42, 48)                  1              
INDROPSV2                          (1, 0, 11) (1, 30, 38)               (1, 42, 48)                  0              
INDROPSV3    yes                   (0, 0, 8) (1, 0, 8)                  (1, 8, 14)                   2              
SCRUBSEQ                           (0, 0, 6)                            (0, 6, 16)                   1              
SURECELL                           (0, 0, 6) (0, 21, 27) (0, 42, 48)    (0, 51, 59)                  1 
  9. If you want your matrices output in loom or h5ad format, set either loom or h5ad to true for the one you desire. I don't want to imagine what happens if you set both to true...
  10. Configure any other optional parameters as you desire. Increasing number_cpu_threads might be worth it if your GCP zone has a machine with more than 32 cores.
  11. Up top, set the workflow to "Process single workflow from files" and then hit "Run Analysis". Monitor your job! If it crashes and for the life of you you can't figure out what's gone wrong, post an issue on this GitHub repository.
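
The dependencies between these settings can be sketched as a small pre-flight check. This is only an illustration under stated assumptions: `check_inputs` is a hypothetical helper, the dictionary keys are the bare input names used in the steps above (the fully qualified names shown in Terra may differ), and the rules are a reading of this README rather than anything the workflow itself enforces.

```python
# Technology names copied from the table above.
KNOWN_TECHNOLOGIES = {
    "10XV1", "10XV2", "10XV3", "CELSEQ", "CELSEQ2", "DROPSEQ",
    "INDROPSV1", "INDROPSV2", "INDROPSV3", "SCRUBSEQ", "SURECELL",
}
DOWNLOADABLE_INDICES = {"human", "mouse", "linnarsson"}
PREEXISTING_KEYS = (
    "preexisting_cDNA_transcripts_to_capture",
    "preexisting_index",
    "preexisting_intron_transcripts_to_capture",
    "preexisting_T2G_mapping",
)


def check_inputs(inputs):
    """Return a list of problems with a dict of workflow inputs.

    Hypothetical helper: it mirrors the configuration steps in this README
    and is not part of the workflow.
    """
    problems = []
    for key in ("sample_sheet", "bucket", "output_path", "technology"):
        if not inputs.get(key):
            problems.append(f"{key} is not set")

    bucket = inputs.get("bucket", "")
    if bucket and not bucket.startswith("gs://"):
        problems.append("bucket should be a gsURI such as gs://your-bucket-id/")

    technology = inputs.get("technology")
    if technology and technology not in KNOWN_TECHNOLOGIES:
        problems.append(f"unknown technology: {technology}")

    if inputs.get("loom") and inputs.get("h5ad"):
        problems.append("set only one of loom or h5ad to true")

    if inputs.get("run_build_reference"):
        download = inputs.get("download_index")
        if download:
            if download not in DOWNLOADABLE_INDICES:
                problems.append(f"unknown download_index: {download}")
            if inputs.get("lamanno"):
                problems.append(
                    "downloadable indices may not suit RNA Velocity; "
                    "build from reference_gtf and genomic_fasta instead"
                )
        else:
            for key in ("reference_gtf", "genomic_fasta"):
                if not inputs.get(key):
                    problems.append(f"{key} is required to build a reference")
    else:
        # Step 7 lists all four files, so all four are treated as required.
        for key in PREEXISTING_KEYS:
            if not inputs.get(key):
                problems.append(f"{key} is required when run_build_reference=false")
    return problems


if __name__ == "__main__":
    example = {
        "sample_sheet": "gs://your-bucket-id/sample_sheet.tsv",  # placeholder path
        "bucket": "gs://your-bucket-id/",
        "output_path": "20200124_YourSampleSet/kb/",
        "technology": "DROPSEQ",
        "lamanno": False,
        "run_build_reference": True,
        "download_index": "mouse",
    }
    print(check_inputs(example) or "no problems found")
```

Returning a list of messages instead of raising on the first error lets you see every misconfiguration in one pass before submitting the job.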

Inputs / Outputs

TODO: Tables for all inputs and outputs.
