This repository contains code, scripts, reasonably sized data associated with scMPRA: Biorxiv link.
The code is sufficient to generate processed single-cell core promoter library expression data underlying the reproducibility plot, DE analysis.
All dependencies are described in conda environment mascot.yml.
A stand alone Go executable program (build for linux, amd64 architecture) is build for fast extraction of the barcodes. Briefly, the first 28 bps of Read1 are directly parsed as Cell BC and UMI. Sequences around the cBC and rBC were aligned to Read2 against sequencing and PCR errors, and cBC and rBC with correct length are kept. The program returns a tsv with columns of cellBC, UMI, cBC, rBC, count. First, make the program executable:
chmod -x ./scMPRA_parsing
./scMPRA_parsing Read1.fastq.gz Read2.fastq.gz > somefile.tsv
The code should finish within an hour using a single GPU with 8 Gb of memory.
If building for different architecture is desired,the source code for the program can be found on (https://github.com/szhao045/scMPRA_parsing).
To run Cell Ranger 6.0.1, first download cell ranger follow the instruction from 10X genomics: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/installation
To process the transcriptome data associated with scMPRA data, run
bash ./run_cell_ranger.sh
To further process the scMPRA barcode data, only the cells that have sufficient transcriptome information to be kept.
The cell barcode filter script error corrects the cell barcodes to the list of 10X filtered cell barcodes. The sript takes in the cell barcode file from processed expression matrix, which can be found in
cellranger_job_name/outs/filtered_features_bc_matrix/barcodes.tsv.gz
as well as the parsed barcodes file.
To run the the cell barcode filtering script, modify the following script:
sh cell_bcs_filtering.sh
To extract single-cell CRS expression, modify and fun:
sh sc_crs_exp.sh
The script takes in several inputs:
--quint : the directory to the cell barcode filtered barcode file (5 columns hence quint)
--promlib: the directory with the promoter id and cBC pair.
--out: output file name.
--exp: prefix to the output file name.
The script produces two outputs:
- The normalized expression for each cBC per cell.
- The expression mean, variance, auc for each CRS.