Single- and multi-guide exact matching (with gaps) for single- and paired-end sequencing data, supporting both combinatorial and non-combinatorial setups.
This tool is primarily concerned with supporting the widest possible range of experimental designs.
Goals:
- support for BAM and CRAM inputs
- memory requirements independent of input size (read streaming)
Non-goals:
- fuzzy matching: it may be attempted downstream to rescue the mismatching reads, which are optionally collected in a separate BAM file for this purpose
- count normalisation
- support for redundant or overlapping libraries
- support for alternate formats for the input files (e.g., libraries)
- Locus-based Universal CRISPR Aligner (LUCA)
The tool requires at least a BAM or CRAM file containing the reads to process, and a description of the experimental design as configuration. The default options in the configuration may be overridden via the corresponding command line options.
Input files:
- reads (BAM/CRAM)
- experiment metadata (YAML)
- libraries (TSV)
Output files:
- experiment metadata (JSON)
- counts (TSV)
- experiment stats (JSON)
- library-independent stats (JSON)
- library-dependent stats (JSON)
- match QC table (optional; TSV)
- all/mismatching read counts (optional; TSV)
- mismatch QC table (optional; TSV)
- swap QC table (optional; TSV)
- fully/partially unmatched reads (optional; BAM)
- combination counts (optional; TSV)
To run it, e.g.:
luca count -o out_dir -l lib_dir exp.yaml reads.bam
The output directory is expected to exist and to be writeable.
The library directory is expected to exist and to contain the library files, each named as the identifiers listed in the libraries
section of the experiment configuration.
Usage: luca count [OPTIONS] EXPERIMENT QUERIES
Count reads or regions, optionally matching libraries.
EXPERIMENT: Experiment configuration (YAML or JSON)
QUERIES: Query sequence file (BAM or CRAM)
Options:
-o, --output DIRECTORY Output directory [required]
--rc-reverse Whether to apply the reverse complement to
reverse reads before matching (currently for
single-end only)
-m, --input-manifest FILE Manifest mapping identifiers used in the
experiment configuration to input file paths
Input sample metadata: Options adding information to the input
-s, --sample TEXT Sample name, interrogate header for others
when not defined
-r, --reference FILE Required for CRAM
Library-dependent: Options specific to library-dependent
counting
-l, --library-dir DIRECTORY Expanded library definition TSV file with
optional headers (common format for
single/dual/other)
--count-mm-reads Whether to count the mismatching reads (as
in library-independent counting)
--sort-mm-read-counts Whether to sort mismatching read counts (in
descending order), if calculated
--compress-mm-read-counts Whether to compress the counts of the
mismatching reads, if calculated
--full-match-info Whether to generate the full match QC table
(vs. only for mismatching reads)
--out-mm-match-info Whether to generate the match QC table for
mismatching reads
--test-swaps Whether to test swap libraries, if provided
--out-mm-reads Whether to write the mismatching reads to an
output BAM file
--profile
Performance: Options to tune the performance
-c, --cpus INTEGER RANGE CPUs to use (0 to detect) [default: 1;
x>=0]
Debug: Options specific to troubleshooting, testing
and debugging
--limit INTEGER Maximum number of reads (or read pairs) to
process
--loglevel [WARNING|INFO|DEBUG]
Set logging verbosity [default: INFO]
--help Show this message and exit.
The following command line options can override the defaults set in the experiment configuration (see the options reference documentation).
CLI option | Experiment default option |
---|---|
full-match-info |
full_match_info |
count-mm-reads |
count_mm_reads |
sort-mm-read-counts |
sort_mm_read_counts |
compress-mm-read-counts |
compress_mm_read_counts |
out-mm-match-info |
out_mm_match_info |
test-swaps |
test_swaps |
out-mm-reads |
out_mm_reads |
rc-reverse |
rc_reverse |
To merge the raw counts and generate aggregate statistics, list the manifest files generated by the quantification in their respective output directories. The other output files are expected to be in the same directories as the manifests.
To merge the results for two different lanes of the same sample, e.g.:
luca merge -o out_dir lane_1/manifest.json lane_2/manifest.json
Usage: luca merge [OPTIONS] MANIFEST...
Merge counts across different runs and regenerate the statistics.
MANIFEST: JSON manifest generated by the `count` subcommand
Options:
-o, --output DIRECTORY Output directory [required]
--validate-targets Validate the target sequences when merging
--skip-mm-read-counts Do not merge mismatching read library-
independent counts
Debug: Options specific to troubleshooting, testing
and debugging
--loglevel [WARNING|INFO|DEBUG]
Set logging verbosity [default: INFO]
--help Show this message and exit.
The following changes may be observed when comparing the original with the aggregate counts files:
- dynamic targets may be assigned different identifiers, and the order in which the manifests are provided affects the result (such identifiers are only required to decode the optional match QC table);
- combinations may appear in a different order, also depending on the order of the manifests.
The instructions that follow apply to Linux and macOS.
Please take care to read errors during the dependency installation step carefully. HTSlib (pysam) has system dependencies and will highlight the packages that need to be installed.
Requirements:
- Python 3.11 or above
To install in a virtual environment:
# Initialise the virtual environment
python -m venv .venv
# Activate the virtual environment
source .venv/bin/activate
# Install the luca package
pip install .
To build the Docker container:
docker build -t luca .
For all character-delimited input files:
- the separator is tab (TSV)
- the header line, containing only field names, is mandatory
- comments are not allowed
- the order of the fields is free
- additional fields are allowed and ignored
For a library, e.g.:
id sequence extra
ID1 ACGTTATC <IGNORED>
For all character-delimited output files (e.g., library counts):
- the separator is tab (TSV)
- no header is present (to facilitate concatenation when merging results)
All statistics are reported in JSON format.
Format: YAML, JSON
Description of the experiment. If no libraries are listed, library-independent counting will be performed.
Property | Type | Description |
---|---|---|
sequencing_type |
single_end |paired_end |
Sequencing type. |
libraries |
library array | Libraries. |
read_groups |
read group map | Read group properties. |
read_templates |
read template array | Read templates. |
read_group_templates |
read group to template map | Association between read groups and templates. |
combinations |
combination array | (Optional) Combinations to evaluate. |
default_options |
options object | (Optional) Default options. |
The sequencing type determines which read groups may be used:
single_end
:default
paired_end
:read_1
,read_2
The sequences of read_2
reads are expected to be reverse complements, and the regions of the corresponding read templates will be tested in reverse order against the reverse complement sequences of the targets.
Each read group has a default orientation that may be overridden via read_groups
:
Read group | Default orientation |
---|---|
default |
forward |
read_1 |
forward |
read_2 |
reverse |
Paired-end data is expected to be collated (e.g., via samtool collate
).
Set of DNA sequences with unique identifiers.
If the list of targets is not provided, the target sequences are expected to be found in a TSV file (on a path based on the library identifier).
Property | Type | Default | Description |
---|---|---|---|
id |
string | Library ID. | |
reverse_on |
reverse_group |always |never |
reverse_group |
Condition upon which to apply the reverse complement to the targets before matching. |
values |
DNA string array | null |
(Optional) List of targets. |
When setting the reverse condition to reverse_group
, the reverse complement is only applied to the targets when the library is being queried on a read group marked as reverse (by default, read_2
).
Property | Type | Default | Description |
---|---|---|---|
is_reverse |
Boolean | Whether the reads in the group are expected to be in the reverse orientation. |
Sequence of regions with a DNA sequence that are expected to match a target sequence.
Property | Type | Default | Description |
---|---|---|---|
id |
string | Read template ID. | |
anchor |
auto |left |right |
auto |
Direction in which the regions are expected to be found. |
regions |
region array | Regions. |
When setting anchor
to auto
, the regions are matched from the left in reads belonging to a forward group, and from the right in reads belonging to a reverse group. Currently, the read sequence is considered as stored in the BAM/CRAM file unless rc_reverse
is set (only valid in single-end mode); mapping may affect the orientation of the query sequence.
A region is defined as a portion of a DNA sequence that is expected to start with a known sequence (usually out of a set of candidate targets, the library). Several bases may be ignored at the start for the purposes of the match. The length of a region is not defined, and so is therefore the start position of any region after the first. E.g., the second region in a read template will start after the last position of the first region that matched the target. If no target matches any given regions, none of the regions downstream of it will be tested, as their start positions would not be definable.
Alternate offsets are always tested in ascending order, from zero to the maximum.
If none of the expected targets (from the collections listed in libraries
) matches a region, alternate targets may be tested (swap_libraries
); if any of the latter is a match, the read is still considered a mismatch, but is also counted as a potential swap event (known sequence found in a read or region it was not expected to appear in).
Property | Type | Default | Description |
---|---|---|---|
id |
Region ID. | ||
libraries |
Library ID array | Identifiers of the libraries to be tested. | |
swap_libraries |
Library ID array | empty array | Identifiers of the libraries to be tested as a fallback to detect swaps. |
skip |
integer | 0 | Number of bases to skip before matching. |
max_offset |
integer | 0 | Maximum number of bases that may be skipped before matching (after the amount expressed in the skip property). |
length |
integer | null |
Exact length, mutually exclusive with maximum offset. |
If skip is 2 and the maximum offset is 3, e.g.:
AACGATCAGTAC
2 **
2 + 0 ->
2 + 1 ->
2 + 2 ->
2 + 3 ->
The target sequences from the expected libraries will be compared with the start of the following subsequences within the read, until one matches:
CGATCAGTAC
GATCAGTAC
ATCAGTAC
TCAGTAC
If none of the above is a match, the read is flagged as mismatching.
When the length
property is set, but no libraries are provided, any sequence may be captured at the region start offset. Capturing e.g. six bases after skipping three:
AACGATCAGTAC
3 ***
3 + 0 ----->
If neither libraries nor a length are provided, the region will capture all remaining bases; only one such region is allowed in any given template, and it must be the last. This may be used to model library-independent quantification of paired-end data (see example configuration).
Mapping of read group identifiers to arrays of read template identifiers.
The valid read group identifiers depend on the sequencing type:
- only
default
for single-end - both
read_1
andread_2
for paired-end
Regions whose combined matches should be counted. The valid combinations for a subset of such regions (marked with the filter
flag) may be restricted via inclusion lists, whose identifiers are to be listed in the filters
property.
In paired-end dual-guide experiments, these regions would typically be the guides from read 1 and 2.
Property | Type | Default | Description |
---|---|---|---|
id |
string | Combination ID. | |
regions |
combination region array | Regions to evaluate. | |
filters |
string array | null |
(Optional) Identifiers of the combination inclusion list files. |
Property | Type | Default | Description |
---|---|---|---|
id |
string | Region ID. | |
read_group |
default |read_1 |read_2 |
default |
Read group ID. |
filter |
Boolean | false |
Whether it is expected to be included in the filter (if any). |
Combinations may be filtered by passing inclusion lists via files (filters
property). The expected field names should match the pattern [<READ GROUP ID>.]<REGION ID>
(unless the read group is default
, in which case it should match just the region ID), e.g.:
tag
(the read group is assumed to bedefault
)read_1.guide
read_2.guide
Property | Type | Default | Description |
---|---|---|---|
test_swaps |
Boolean | false |
Whether to test swap libraries, if provided. |
full_match_info |
Boolean | false |
Whether to generate the full match QC table (vs. only for mismatching reads). |
out_mm_match_info |
Boolean | false |
Whether to generate the match QC table for mismatching reads. |
out_mm_reads |
Boolean | false |
Whether to write the mismatching reads to an output BAM file. |
count_mm_reads |
Boolean | false |
Whether to count the mismatching reads (as in library-independent counting). |
sort_mm_read_counts |
Boolean | false |
Whether to sort mismatching read counts (in descending order), if calculated. |
compress_mm_read_counts |
Boolean | false |
Whether to compress the counts of the mismatching reads, if calculated. |
rc_reverse |
Boolean | false |
Whether to apply the reverse complement to reverse reads before matching (currently for single-end only). |
Format: JSON
File name: exp.stats.json
Property | Type | Description |
---|---|---|
read_match |
integer | Total matching reads. |
read_swap |
integer | Total mismatching reads with at least one swap event. |
read_mismatch |
integer | Total reads with at least one mismatching region. |
pair_one_end_match |
integer | (Paired-end only) Total read pairs in which only one read is matching. |
Format: JSON
File name: stats.json
Field | Format | Description |
---|---|---|
sample_name |
string | Name of the sample. |
input_reads |
integer | Total input reads. |
total_reads |
integer | Total reads passed on to counting. |
discarded_reads |
integer | Total reads discarded before counting. |
vendor_failed_reads |
integer | Total reads with the QCFAIL flag. |
length_excluded_reads |
integer | Total reads discarded because shorter than a user-defined threshold. |
ambiguous_nt_reads |
integer | Total reads with ambiguous nucleotides. |
masked_reads |
integer | Total soft-masked reads. |
zero_length_reads |
integer | Total zero-length reads. |
Format: TSV
File name: matches.<INDEX>.tsv
One file per read group, one row per read, in input order.
Index | Type | Description |
---|---|---|
1 | integer | Read index. |
2 | integer | Matching library index. |
3 | integer | Matching target index. |
4 | integer | Match status. |
Fields two to four appear as many times as the regions in the template.
The match status may represent one of the following:
- expected match (
1
) - unexpected match (prospective swap,
2
) - mismatch (
-1
)
If any region is a mismatch, those that would follow are not evaluated and are considered mismatches, because their starting positions are relative to the immediately preceding regions, which do not have an intrinsic length (the matching target, if any, would determine it).
Rows from the match QC table representing mismatch and/or swap events are duplicated into separate files for ease of retrieval by downstream processes (e.g., alignment-based rescue of mismatching reads).
The swap category is a subset of the mismatch category, and therefore its reads also appear in the mismatch file.
Category | File name |
---|---|
Mismatch | mm.matches.<INDEX>.tsv |
Swap | swap.matches.<INDEX>.tsv |
Format: TSV
File name (library-independent): read.counts.tsv
File name (library-dependent): mm.read.counts.tsv
Count of the unique read sequences; all of them in library-independent counting, or just the mismatching ones in library-dependent counting. The latter may optionally be sorted by count to facilitate the identification of the most common mismatching reads.
If the rc_reverse
option is set, the reverse reads won't be listed, and their counts will be added to those of their reverse complements.
Index | Type | Description |
---|---|---|
1 | string | Unique DNA sequence of the read. |
2 | integer | Total number of occurrences. |
Format: TSV
File name: combination.<INDEX>.counts.tsv
One file per entry in the combinations section of the experiment configuration.
Index | Type | Description |
---|---|---|
1 to n | string | Unique DNA sequence. |
n + 1 | integer | Total number of matches. |
Format: JSON
File name: manifest.json
(default)
Optional manifest to map identifiers to input files. File paths may be relative or absolute. Relative paths will use the library directory as their root, if provided. If a manifest is not explicitly provided, it will be looked for in the library directory (based on its default name).
Index | Type | Description |
---|---|---|
libraries |
map | Map of library identifiers to file paths. |
combination_filters |
map | Map of combination filter identifiers to file paths. |
E.g.:
{
"libraries": {
"lib1": "hgi_1.tsv",
"lib2": "hgi_2.tsv"
},
"combination_filters": {
"ft1": "/absolute/path/to/filter.tsv"
}
}
Format: JSON
File name: manifest.json
Manifest required to inform the merge process.
Index | Type | Description |
---|---|---|
sample_name |
string | Sample name. |
library_count |
integer | Number of libraries. |
total_library_templates |
integer | Total number of targets in the libraries. |
library_file_paths |
Per-library relative output file paths. | |
library_independent_count_file_paths |
Library-independent quantification relative output file paths. | |
total_dynamic_targets |
integer | Total number of unique regions captured as they were. |
dynamic_target_file_paths |
Unique region relative output file paths. | |
combination_file_paths |
Per-combination relative output file paths. |
For a library-depedent quantification, e.g.:
{
"sample_name": "SAMPLE1",
"library_count": 4,
"total_library_templates": 10,
"library_file_paths": {
"0": {
"counts": "lib.0.counts.tsv",
"stats": "lib.0.stats.json"
},
"1": {
"counts": "lib.1.counts.tsv",
"stats": "lib.1.stats.json"
},
"2": {
"counts": "lib.2.counts.tsv",
"stats": "lib.2.stats.json"
},
"3": {
"counts": "lib.3.counts.tsv",
"stats": "lib.3.stats.json"
}
},
"library_independent_count_file_paths": null,
"total_dynamic_targets": 0,
"dynamic_target_file_paths": null,
"combination_file_paths": [
{
"counts": "combination.0.counts.tsv",
"n": 2
}
]
}
One file per library.
Format: TSV
File name: same as the ID in the experiment metadata
Field | Type | Description |
---|---|---|
id |
integer | Unique identifier. |
sequence |
DNA string | Unambiguous unique DNA string. |
Format: TSV
File name: lib.<INDEX>.counts.tsv
Targets are reported in the same order they appear in the library of origin.
Index | Type | Description |
---|---|---|
1 | integer | Experiment-wide target index. |
2 | string | Unique DNA sequence. |
3 | integer | Total number of exact matches. |
The indices may be used to decode the matching targets in the match QC tables.
The number of matches includes any deriving from fallback attempts at matching a region, e.g. to determine whether a swap has occurred (swap_libraries
).
Format: JSON
File name: lib.<INDEX>.stats.json
Field | Format | Description |
---|---|---|
mapped_to_template_reads |
integer | Total reads mapping to the library. |
mean_count_per_template |
decimal | Mean reads per template. |
median_count_per_template |
decimal | Median reads per template. |
total_templates |
integer | Total number of templates. |
swap_matching_reads |
integer | Total number of swap matches. |
zero_count_templates |
integer | Total number of unique templates with no reads mapping to them. |
low_count_templates_lt_15 |
integer | Total number of unique templates with less than 15 reads mapping to them. |
low_count_templates_lt_30 |
integer | Total number of unique templates with less than 30 reads mapping to them. |
gini_coefficient |
decimal | Gini coefficient of the mapping read counts. |