Genetic multiplexing of barcoded single cell RNA-seq
demuxlet is a software tool to deconvolute sample identity and identify multiplets when multiple samples are pooled by barcoded single cell sequencing. demuxlet takes (1) a SAM/BAM/CRAM file produced by the standard 10x sequencing platform, or any other barcoded single cell RNA-seq (with proper --tag-UMI and --tag-group) options (2) a VCF/BCF file containing the genotype (GT), posterior probability (GP), or genotype likelihood (GL) to assign each barcode to a specific sample (or a pair of samples) in the VCF file.
Before installing demuxlet, you need to install htslib. If you do not install this to standard system paths, you'll need to pass the directory for header files and compiled library file to configure. For example, if you downloaded and compiled htslib in /opt/htslib
, run ./configure CPPFLAGS=-I/opt/htslib LDFLAGS=-L/opt/htslib
.
$ git clone https://github.com/statgen/demuxlet.git $ cd demuxlet $ autoreconf -vfi $ ./configure (with additional options such as --prefix) $ make $ make install (may require root privilege)
demuxlet uses a self-documentation utility. You can run each utility with -man or -help option to see the command line usages.
$ ./demuxlet (for short usage) $ ./demuxlet -help (for detailed usage)
The detailed usage is also pasted below.
Options for input SAM/BAM/CRAM --sam [STR: ] : Input SAM/BAM/CRAM file. Must be sorted by coordinates and indexed --tag-group [STR: CB] : Tag representing readgroup or cell barcodes, in the case to partition the BAM file into multiple groups. For 10x genomics, use CB --tag-UMI [STR: UB] : Tag representing UMIs. For 10x genomiucs, use UB Options for input VCF/BCF --vcf [STR: ] : Input VCF/BCF file, containing the individual genotypes (GT), posterior probability (GP), or genotype likelihood (PL) --field [STR: GP] : FORMAT field to extract the genotype, likelihood, or posterior from --geno-error [FLT: 0.01] : Genotype error rate (must be used with --field GT) --min-mac [INT: 1] : Minimum minor allele frequency --min-callrate [FLT: 0.50] : Minimum call rate --sm [V_STR: ] : List of sample IDs to compare to (default: use all) --sm-list [STR: ] : File containing the list of sample IDs to compare Output Options --out [STR: ] : Output file prefix --alpha [V_FLT: ] : Grid of alpha to search for (default is 0.1, 0.2, 0.3, 0.4, 0.5) --write-pair [FLG: OFF] : Writing the (HUGE) pair file --doublet-prior [FLT: 0.50] : Prior of doublet --sam-verbose [INT: 1000000] : Verbose message frequency for SAM/BAM/CRAM --vcf-verbose [INT: 10000] : Verbose message frequency for VCF/BCF Read filtering Options --cap-BQ [INT: 40] : Maximum base quality (higher BQ will be capped) --min-BQ [INT: 13] : Minimum base quality to consider (lower BQ will be skipped) --min-MQ [INT: 20] : Minimum mapping quality to consider (lower MQ will be ignored) --min-TD [INT: 0] : Minimum distance to the tail (lower will be ignored) --excl-flag [INT: 3844] : SAM/BAM FLAGs to be excluded Cell/droplet filtering options --group-list [STR: ] : List of tag readgroup/cell barcode to consider in this run. All other barcodes will be ignored. This is useful for parallelized run --min-total [INT: 0] : Minimum number of total reads for a droplet/cell to be considered --min-uniq [INT: 0] : Minimum number of unique reads (determined by UMI/SNP pair) for a droplet/cell to be considered --min-snp [INT: 0] : Minimum number of SNPs with coverage for a droplet/cell to be considered
demuxlet generates multiple output file, such as [prefix].best
, [prefix].sing
, [prefix].sing2
, and optionally [prefix].pair
(with --write-pair
argument). Each file contains the following information
- The
[prefix].best
file contains the best guess of the sample identity, with detailed statistics to reach to the best guess - The
[prefix].sing
file contains the statistics for matching each cell with each possible sample. - The
[prefix].sing2
file contains the statistics similar information to the previous one, but generated for sanity checking of the[prefix].pair
results. - The
[prefix].pair
file contains the statistics for matching each cell with each possible configuration of doublet.
The [prefix].best
file contains the following 22 columns.
- BARCODE - Cell barcode for the cell that is being assigned in this row
- RD.TOTL - The total number of reads overlapping with variant sites for each droplet.
- RD.PASS - The total number of reads that passed the quality threshold, such as mapping quality, base quality.
- RD.UNIQ - The total number of UMIs that passed the quality threshold. If a UMI is observed in a single variant multiple times, it won't be counted more. If a UMI is observed across multiple variants, it will be counted as different.
- N.SNP - The total number of variants overlapping with any read in the droplet.
- BEST - The best assignment for sample ID.
- For singlets, SNG-
- For doublets, DBL---
- For ambiguous droplets, , AMB---<doublet ID1/ID2>)
- SNG.1ST - The best singlet assignment for sample ID
- SNG.LLK1 - The log(likelihood that the ID from SNG.1ST is the correct assignment)
- SNG.2ND - The next best singlet assignment for sample ID
- SNG.LLK2 - The log(likelihood that the ID from SNG.2ND is the correct assignment)
- SNG.LLK0 - The log-likelihood from allele frequencies only
- DBL.1ST - The sample ID that is most likely included if the assignment is a doublet
- DBL.2ND - The sample ID that is next most likely included ifthe assignment is a doublet
- ALPHA - % Mixture Proportion
- LLK12 - The log(likelihood that the ID is a doublet)
- LLK1 - The log(likelihood that the ID from DBL.1ST is the correct singlet assignment)
- LLK2 - The log(likelihood that the ID from DBL.2ND is the correct singlet assignment)
- LLK10 - The log(likelihood that the ID from DBL.1ST is one of the doublet, and the other doublet identity is calculated from allele frequencies only)
- LLK20 - The log(likelihood that the ID from DBL.2ND is one of the doublet, and the other doublet identity is calculated from allele frequencies only)
- LLK00 - The log(likelihood that the droplet is doublet, but both identities are calculated from allele frequencies only)
- PRB.DBL - Posterior probability of the doublet assignment
- PRB.SNG1 - Posterior probability of the singlet assignment