Tools and pipelines for various off-target detection assays:
- CHANGE-Seq
- Cryptic-Seq
- Integration site mapping assay from Durant et al. 2022
- tbChaSIn
It is important that poetry is not installed in the same environment as your package dependencies. We recommend using the "official" installer:
curl -sSL https://install.python-poetry.org | python3 -
If you already have poetry installed, make sure it is version 1.7.0 or greater:
poetry --version
We recommend using the miniforge installer. Download the installer for your operating system and run it. For example:
chmod +x Miniforge3-MacOSX-arm64.sh
./Miniforge3-MacOSX-arm64.sh
If you already have mamba installed, make sure it is version 1.3.0 or greater:
mamba --version
- Get a local copy of the tbChaSIn repo
git clone [email protected]:tomebio/tbChaSIn.git
cd tbChaSIn
- Create and activate the environment
bash scripts/mamba.sh -A [-m micromamba]
You can also activate the environment manually:
mamba activate tbChaSIn
- Install pytomebio (developer mode)
cd src/python && poetry install
- If you plan to run the Snakemake workflows, ensure realpath is available
On OSX:
brew install coreutils
On Ubuntu 16.04 or higher:
sudo apt-get install coreutils
- To be able to use any of the tools that interact with the Benchling warehouse, you need to configure SSL/TLS as described here.
To run code checks, execute:
bash scripts/precommit.sh [-f] [-l]
This will run:
- Unit tests for Python code and Snakemake plumbing (with pytest)
- Linting of Python code (with ruff)
- Code style checking of Python code (with ruff)
- Type checking of Python code (with mypy)
- Code style checking of shell code (with shellcheck)
- Code style checking of Snakemake code (with snakefmt)
If the optional -f flag is specified, then the Python and Snakemake files will be automatically formatted prior to applying the checks. Similarly, use the -l flag to fix any lints that can be fixed automatically.
The Nextflow workflows are designed to run either locally or using AWS Batch. To facilitate this, we must consider that input files (e.g., FASTQs and references) may be stored remotely (e.g., in AWS S3) and "staged" (i.e., Nextflow manages the automatic downloading of remote inputs to the local system) at runtime.
The minimal information to run the pipeline is:
- The CTB ID of the experiment, which is used to look up the experiment metadata in Benchling
- The root location of the references used in the experiment
- The root location of the FASTQ files for the experiment
The workflow runs the following steps:
- Retrieve the metadata from Benchling for the given CTB ID, specified by the --ctb_id option, and create a "metasheet" (aka a sample sheet).
  - If you already have a metasheet in the correct format, you can specify it using --metasheet rather than using the --ctb_id option.
- For each sample in the metasheet:
  - If the genome build is not known, then use the mapping between species and genome build (specified by the --genomes_json option) to determine it, or fall back to the default genome (specified by the --default_genome option).
  - Map the genome build to a reference using the mapping specified by the --references_json option, otherwise assume the reference has the same name as the genome build, or fall back to the default reference (specified by the --default_reference option) if the genome build is not known.
  - If the reference is specified as a relative path, then resolve it to an absolute path using the root folder specified by the --references_dir option. Note that --references_dir may specify a URI, such as an S3 bucket.
  - Determine the names of the FASTQ files using the pattern string specified by --fastq_name_pattern. This pattern string can contain variables of the form ${var}, which are replaced with the value from the corresponding column in the metasheet. There is also a special ${read} variable with the read number for the FASTQ (1 or 2).
  - If --fastq_name_pattern specifies a relative path, then resolve it to an absolute path using the root folder specified by --fastq_dir, which may also specify a URI, such as an S3 bucket.
- Load each sample in the metasheet into a native Groovy object.
- Run all of the sample-specific steps in the pipeline in parallel over all the samples.
- Aggregate the per-sample results and run the aggregate steps in the pipeline.
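The reference and FASTQ resolution rules described above can be illustrated with a short Python sketch. This is not the workflow's own Groovy code; the function names and the example metasheet row are hypothetical, and the sketch assumes that a path counts as absolute when it is rooted at "/" or carries a URI scheme such as s3://.

```python
from string import Template
from urllib.parse import urlparse


def is_absolute(path: str) -> bool:
    """Treat URIs (e.g. s3://bucket/key) and rooted paths as absolute."""
    return bool(urlparse(path).scheme) or path.startswith("/")


def resolve(path: str, root: str) -> str:
    """Resolve a relative path against a root folder or URI."""
    if is_absolute(path):
        return path
    return root.rstrip("/") + "/" + path


def resolve_reference(genome_build, references: dict, default_reference: str) -> str:
    """Map a genome build to a reference name or path.

    Falls back to the build name itself when it is not in the mapping,
    and to the default reference when the build is unknown (None).
    """
    if genome_build is None:
        return default_reference
    return references.get(genome_build, genome_build)


def fastq_paths(pattern: str, row: dict, fastq_dir: str) -> list:
    """Expand ${var} placeholders from a metasheet row for reads 1 and 2."""
    template = Template(pattern)
    return [
        resolve(template.substitute(row, read=read), fastq_dir)
        for read in (1, 2)
    ]


# Hypothetical metasheet row and locations, for illustration only.
row = {"sample_name": "S1", "lane": "L001"}
print(fastq_paths("${sample_name}_${lane}_R${read}.fastq.gz", row, "s3://my-bucket/fastqs"))
print(resolve(resolve_reference("hg38", {"hg38": "GRCh38.p14"}, "default_ref"), "/data/references"))
```

Python's string.Template happens to use the same ${var} syntax as the pattern string, which is why it is used here; the actual workflow may substitute variables differently.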
TBD
For Integrase samples, a reference genome (e.g. GRCh38.p14) is modified to append:
- The full length attB sequence
- The full length attP sequence
- The full length attB-containing plasmid sequence
The build_reference.sh script will build the reference for you:
bash scripts/build_reference.sh \
[-i <reference folder URI> | -g <genome_name>] \
[-n <reference name>] \
[-D <root dir> | -G <genomes_dir> -R <references_dir>] \
[-A] \
attB.fasta attP.fasta plasmid.fasta
where:
- -i specifies the folder from which the genome files can be downloaded. Right now, this is assumed to be an NCBI URI, and defaults to https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14. The folder must contain a <name>_genomic.fna.gz file with the genome sequence, and a <name>_assembly_report.txt file with the contig names, where <name> is the folder name (e.g. GCF_000001405.40_GRCh38.p14 in the default URI).
- -g specifies the genome name. Use this instead of -i if the genome files already exist in the genomes folder. Alternatively, the script will try to infer the NCBI URI from the genome name.
- -n specifies the name of the reference to create. This defaults to the genome name.
- -D specifies a root directory where both genomes and references will be stored, in the <root>/genomes/ and <root>/references directories, respectively. This defaults to the system temp dir and is not required if you specify both -G and -R.
- -G specifies the directory where genomes are downloaded.
- -R specifies the directory where references are created.
- -A specifies that alt contigs should not be included in the final reference.
All the files for a reference are created in a directory with the reference name, and with all files prefixed with the reference name. For example, with reference name GRCh38.p14_PL2312, the reference directory will contain:
- GRCh38.p14_PL2312.dict
- GRCh38.p14_PL2312.gtf
- GRCh38.p14_PL2312.fasta.gz
- GRCh38.p14_PL2312.fasta.gz.amb
- GRCh38.p14_PL2312.fasta.gz.ann
- GRCh38.p14_PL2312.fasta.gz.bwt
- GRCh38.p14_PL2312.fasta.gz.fai
- GRCh38.p14_PL2312.fasta.gz.gzi
- GRCh38.p14_PL2312.fasta.gz.pac
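As a quick sanity check after building a reference, the expected layout can be verified with a few lines of Python. This is a minimal sketch: the suffix list is copied from the listing above, and the helper names are hypothetical rather than part of pytomebio.

```python
from pathlib import Path

# Suffixes taken from the file listing above.
SUFFIXES = (
    ".dict",
    ".gtf",
    ".fasta.gz",
    ".fasta.gz.amb",
    ".fasta.gz.ann",
    ".fasta.gz.bwt",
    ".fasta.gz.fai",
    ".fasta.gz.gzi",
    ".fasta.gz.pac",
)


def expected_reference_files(references_dir: str, name: str) -> list:
    """List the files expected in <references_dir>/<name>/, each prefixed with <name>."""
    ref_dir = Path(references_dir) / name
    return [ref_dir / f"{name}{suffix}" for suffix in SUFFIXES]


def missing_reference_files(references_dir: str, name: str) -> list:
    """Return the expected files that do not exist on the local file system."""
    return [p for p in expected_reference_files(references_dir, name) if not p.exists()]


for path in expected_reference_files("/path/to/references", "GRCh38.p14_PL2312"):
    print(path)
```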
The Cryptic-Seq pipeline has been ported to Nextflow, and this is now the preferred method of execution. For the other pipelines, use Snakemake. For regression testing, the Nextflow workflow can run a wrapped version of the original Snakemake workflow using the -profile snakemake option.
Each process uses a single Docker image, but multiple processes can use the same image. Each environment configuration file in mamba/ corresponds to a Docker image. To build all the Docker images, run:
bash scripts/docker.sh [-m] [-v VERSION]
The -m option builds images that are compatible with an ARM Mac. The -v option specifies the version with which to tag the images, and must match the docker_image_version parameter in nextflow.config.
To build a specific image, run the following command, where <target> is the name of the environment configuration file without the .yml extension:
bash scripts/docker.sh [-m] [-v VERSION] <target>
To be able to run the Snakemake version of the workflow, you'll need to build a single "monolithic" Docker image instead. Run the following from the root directory of the project:
docker build \
-f docker/Dockerfile.snakemake \
-t tomebio/cryptic-seq:1.0 .
On an ARM-based Mac, some additional options are required:
docker build \
-f docker/Dockerfile.snakemake \
-t tomebio/cryptic-seq:1.0 \
--platform linux/amd64 \
--load .
If you plan to use CTB IDs to fetch metadata from Benchling, then you must configure secrets for the Benchling Warehouse credentials:
nextflow secrets set WAREHOUSE_USERNAME 'username'
nextflow secrets set WAREHOUSE_PASSWORD 'password'
nextflow secrets set WAREHOUSE_HOST 'postgres-warehouse.tome.benchling.com'
nextflow secrets set WAREHOUSE_PORT '5432'
nextflow secrets set WAREHOUSE_DBNAME 'warehouse'
nextflow secrets set WAREHOUSE_SSLMODE 'verify-ca'
Note: make sure to enclose the secret in single-quotes.
nextflow run \
src/nextflow/cryptic-seq \
[--metasheet <metasheet.[xlsx|txt]> | --ctb_id <ctb_id>] \
--references_json <references.json> \
--reference_dir <ref_dir> \
--annotation_reference <name_or_path> \
--fastq_dir <fastq_dir> \
[--fastq_name_pattern <pattern>] \
[--output_dir <output_dir>] \
[--prefix <output_prefix>] \
[-with-report report.html] \
-profile local[,rosetta|linux][,snakemake] \
[--global_config <config_yml>]
Profiles are combined in a single comma-separated -profile option:
- local configures the pipeline for running locally
- rosetta is only required on an ARM Mac
- linux is only required on Linux systems where Docker must be run as root
- snakemake causes the Snakemake workflow to be run; in that case, --global_config provides the global config for the Snakemake workflow
The most common options are:
- --metasheet specifies the sample sheet. Alternatively, --ctb_id can be used to specify a CTB ID in Benchling from which to create the metasheet. This may require specifying --genomes_json, which is a mapping between species and reference name.
- --references_json specifies a file that contains mappings between genome build and path to the folder that contains the reference (FASTA file and BWA index).
- --reference_dir specifies the root folder that contains references. This is only necessary if --references_json contains relative paths.
- --annotation_reference specifies the name of, or path to, the reference to use in the annotate_sites step when searching for exact site sequence hits in the genome.
- --fastq_dir is the root directory where FASTQ files live. The FASTQ file names may be specified in the metasheet fq1 and fq2 columns as relative paths under the fastq_dir, or the --fastq_name_pattern option may be specified with a glob expression that can contain placeholders for any of the columns in the metasheet, as well as the special read placeholder, which has a value of 1 for read 1 files and 2 for read 2 files, e.g. **/{sample_name}*/*_R{read}_*.gz.
- --output_dir is the directory where pipeline outputs will be published. It defaults to the directory where the workflow is launched.
- --output_prefix is the name of the subdirectory within output_dir where pipeline outputs will be published, and is also used to name the run-level outputs. It defaults to the name of the metasheet (without extension).
To see the full list of options that can be set, run:
nextflow run src/nextflow/cryptic-seq -h
Note that the options that are specific to the nextflow command start with only one dash (e.g., -with-report) while options that override workflow parameters start with two dashes (e.g., --metasheet).
Instead of (or in addition to) setting options on the command line, you can also put them in a configuration file. Options specified on the command line override those in config files.
example.config
metasheet = "my_samples.txt"
references_json = "my_references.json"
trim_Tn5 = true
nextflow run \
src/nextflow/cryptic-seq \
-c example.config \
--fastq_dir my_fastq_dir \
--output_dir my_output_dir \
-profile local
Execute the following command to run the pipeline, where <pipeline> is the name of one of the pipelines in the src/snakemake folder. Note that the -d argument is only required if you are executing the script from somewhere other than the root folder of the tbChaSIn project.
bash scripts/run_snakemake.sh \
[-d /path/to/snakemake/dir] \
-p <pipeline> \
-c /path/to/config.yml \
[-g /path/to/global_config.yml] \
-o /path/to/output \
-t /path/to/large/temp/directory
See Reference Preparation for how to prepare the reference genome.
There are two configuration files: run and global.
The run config is required and provides metadata for the samples to be processed. The run config is organized at three levels: run, group, and sample level. The run level contains configuration that applies to all groups and samples, for example tool-level parameters. The group level contains configuration that applies to related samples, for example the guide for specific CRISPR samples, or tool-specific parameters recommended for integrase samples. The sample level contains configuration that applies to each sample, for example the name, replicate number, and paths to the input FASTQs.
Config Key | Description | Level | Required | Default |
---|---|---|---|---|
name | The name of the group | Group | Yes | NA |
ref_fasta | The absolute path to the reference FASTA, with accompanying BWA index files and FASTA index | Group | Yes | NA |
attachment_sites | The list of attachment sites (<name>:<left-seq>:<overhang>:<right-seq>) | Group | Yes | NA |
name | The name of the sample | Sample | Yes | NA |
replicate | The replicate number (e.g. 1, 2, 3) | Sample | Yes | NA |
fq1 | The absolute path to the FASTQ for read 1 (R1) | Sample | Yes | NA |
fq2 | The absolute path to the FASTQ for read 2 (R2) | Sample | Yes | NA |
The global config is optional and overrides default values for parameters that apply to all samples. The global parameters are specific to each workflow.
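A run config can be checked against the required keys in the table above before launching a pipeline. The following is a minimal sketch, not part of pytomebio: it assumes the config has already been parsed into a Python dict (for example with a YAML parser) and that groups live under a top-level settings key with samples nested under samples, as in the example configs in this document. The function name is hypothetical.

```python
# Required keys per level, taken from the table above.
GROUP_KEYS = ("name", "ref_fasta", "attachment_sites")
SAMPLE_KEYS = ("name", "replicate", "fq1", "fq2")


def validate_run_config(config: dict) -> list:
    """Return a list of human-readable errors; an empty list means the config passed."""
    errors = []
    for g_idx, group in enumerate(config.get("settings", [])):
        g_name = group.get("name", f"group[{g_idx}]")
        for key in GROUP_KEYS:
            if key not in group:
                errors.append(f"{g_name}: missing group key '{key}'")
        for s_idx, sample in enumerate(group.get("samples", [])):
            s_name = sample.get("name", f"sample[{s_idx}]")
            for key in SAMPLE_KEYS:
                if key not in sample:
                    errors.append(f"{g_name}/{s_name}: missing sample key '{key}'")
    return errors


# Illustrative config with one deliberately incomplete sample.
config = {
    "settings": [
        {
            "name": "bxb1",
            "ref_fasta": "/path/to/ref.fasta",
            "attachment_sites": ["attB:ACGT:GT:ACGT"],
            "samples": [
                {"name": "s1", "replicate": 1, "fq1": "/path/to/s1_R1.fastq.gz"},
            ],
        }
    ]
}
print(validate_run_config(config))
```

Note that some pipelines use additional or different keys (the Durant et al. example config has no attachment_sites, for instance), so the key lists would need to be adjusted per workflow.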
The following steps are performed in the CHANGE-Seq pipeline:
- Convert FASTQ to BAM, trim and annotate the leading attachment sites. Keep only read pairs where the left/right side of the same attachment site are found.
- Search for the Tn5 mosaic end in the reads, trimming it and all subsequent bases (due to short inserts). All reads are kept.
- Align the reads.
- Mark duplicates (no UMIs).
- Collect various sequencing quality control metrics.
- Collate putative integration sites per sample.
- Collate putative integration sites across samples.
An example config.yml for CHANGE-Seq is shown below:
settings:
  - name: bxb1
    ref_fasta: /path/to/GRCh38.p14.full/GRCh38.p14.full.fasta
    attachment_sites:
      - attB:CACCACGCGTGGCCGGCTTGTCGACGACGGCG:GT:CTCCGTCGTCAGGATCATCCGGGGATCCCGGG
      - attP:GCCGCTAGCGGTGGTTTGTCTGGTCAACCACCGCG:GT:CTCAGTGGTGTACGGTACAAACCCAGCTACCGGTC
    samples:
      - name: BxbI_alone_rep1_S10
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/BxbI_alone_rep1_S10/BxbI_alone_rep1_S10_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_alone_rep1_S10/BxbI_alone_rep1_S10_L001_R2_001.fastq.gz
      - name: BxbI_alone_rep2_S11
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/BxbI_alone_rep2_S11/BxbI_alone_rep2_S11_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_alone_rep2_S11/BxbI_alone_rep2_S11_L001_R2_001.fastq.gz
      - name: BxbI_alone_rep3_S12
        replicate: 3
        fq1: /path/to/cryptic-seq/fastqs/BxbI_alone_rep3_S12/BxbI_alone_rep3_S12_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_alone_rep3_S12/BxbI_alone_rep3_S12_L001_R2_001.fastq.gz
      - name: BxbI_attB_rep1_S16
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/BxbI_attB_rep1_S16/BxbI_attB_rep1_S16_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_attB_rep1_S16/BxbI_attB_rep1_S16_L001_R2_001.fastq.gz
      - name: BxbI_attB_rep2_S17
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/BxbI_attB_rep2_S17/BxbI_attB_rep2_S17_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_attB_rep2_S17/BxbI_attB_rep2_S17_L001_R2_001.fastq.gz
      - name: BxbI_attB_rep3_S18
        replicate: 3
        fq1: /path/to/cryptic-seq/fastqs/BxbI_attB_rep3_S18/BxbI_attB_rep3_S18_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_attB_rep3_S18/BxbI_attB_rep3_S18_L001_R2_001.fastq.gz
      - name: BxbI_attP_rep1_S13
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/BxbI_attP_rep1_S13/BxbI_attP_rep1_S13_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_attP_rep1_S13/BxbI_attP_rep1_S13_L001_R2_001.fastq.gz
      - name: BxbI_attP_rep2_S14
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/BxbI_attP_rep2_S14/BxbI_attP_rep2_S14_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_attP_rep2_S14/BxbI_attP_rep2_S14_L001_R2_001.fastq.gz
      - name: BxbI_attP_rep3_S15
        replicate: 3
        fq1: /path/to/cryptic-seq/fastqs/BxbI_attP_rep3_S15/BxbI_attP_rep3_S15_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/BxbI_attP_rep3_S15/BxbI_attP_rep3_S15_L001_R2_001.fastq.gz
      - name: HEK293c12_BxbI_attP_rep1_S25
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/HEK293c12_BxbI_attP_rep1_S25/HEK293c12_BxbI_attP_rep1_S25_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/HEK293c12_BxbI_attP_rep1_S25/HEK293c12_BxbI_attP_rep1_S25_L001_R2_001.fastq.gz
      - name: HEK293c12_BxbI_attP_rep2_S26
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/HEK293c12_BxbI_attP_rep2_S26/HEK293c12_BxbI_attP_rep2_S26_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/HEK293c12_BxbI_attP_rep2_S26/HEK293c12_BxbI_attP_rep2_S26_L001_R2_001.fastq.gz
      - name: HEK293c12_BxbI_attP_rep3_S27
        replicate: 3
        fq1: /path/to/cryptic-seq/fastqs/HEK293c12_BxbI_attP_rep3_S27/HEK293c12_BxbI_attP_rep3_S27_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/HEK293c12_BxbI_attP_rep3_S27/HEK293c12_BxbI_attP_rep3_S27_L001_R2_001.fastq.gz
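Each attachment_sites entry follows the <name>:<left-seq>:<overhang>:<right-seq> format. The sketch below shows one way such a string could be split and checked in Python. It is illustrative only: the class and function names are hypothetical, and the restriction to the A/C/G/T/N alphabet is an assumption rather than documented pipeline behavior.

```python
from typing import NamedTuple


class AttachmentSite(NamedTuple):
    name: str
    left: str
    overhang: str
    right: str

    @property
    def sequence(self) -> str:
        """Full site sequence: left half, overhang, then right half."""
        return self.left + self.overhang + self.right


def parse_attachment_site(spec: str) -> AttachmentSite:
    """Parse '<name>:<left-seq>:<overhang>:<right-seq>' into its four fields."""
    fields = spec.split(":")
    if len(fields) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {len(fields)}: {spec!r}")
    name, left, overhang, right = fields
    for seq in (left, overhang, right):
        # Assumption: sequences are non-empty DNA strings.
        if not seq or set(seq.upper()) - set("ACGTN"):
            raise ValueError(f"invalid sequence {seq!r} in {spec!r}")
    return AttachmentSite(name, left, overhang, right)


site = parse_attachment_site("attB:GGCCGGCTTGTCGACGACGGCG:GT:CTCCGTCGTCAGGATCATCCGG")
print(site.name, site.overhang, site.sequence)
```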
The following steps are performed in the Cryptic-Seq pipeline:
- Convert FASTQ to BAM, trim and annotate the leading UMIs.
- Trim the Tn5 mosaic end from the start of R1, keeping only read pairs where the Tn5 mosaic end was found in R1.
- Trim the leading attachment site from the start of R2, keeping only read pairs where the attachment site was found in R2.
- Search for the Tn5 mosaic end in R2, trimming it and all subsequent bases (due to short inserts). All reads are kept.
- Align the reads.
- Clip reads in FR pairs that sequence past the far end of their mate.
- Mark duplicates using the UMIs.
- Collect various sequencing quality control metrics.
- Collate putative integration sites per sample.
- Collate putative integration sites across samples.
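The leading-sequence trimming steps above amount to comparing the start of a read against an expected sequence while tolerating a limited number of mismatches. The following Python sketch illustrates the idea with a simple Hamming-distance check. It is not the pytomebio implementation: the function names are hypothetical, base qualities are ignored, and the Tn5 mosaic end sequence shown is the commonly cited 19 bp sequence, included here as an assumption since this document does not state it.

```python
# Assumption: commonly cited Tn5 mosaic end sequence; not taken from this project.
TN5_MOSAIC_END = "AGATGTGTATAAGAGACAG"


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_leading(read: str, expected: str, max_mismatches: int):
    """Remove `expected` from the start of `read`.

    Returns the trimmed read, or None when the read is too short or the
    prefix has more than `max_mismatches` mismatches, in which case the
    read pair would be discarded.
    """
    prefix = read[: len(expected)]
    if len(prefix) < len(expected):
        return None
    if hamming(prefix, expected) > max_mismatches:
        return None
    return read[len(expected):]


read = TN5_MOSAIC_END + "ACGTACGT"
print(trim_leading(read, TN5_MOSAIC_END, max_mismatches=1))
```

A production implementation would also need to trim the matching base qualities and might allow indels, which this sketch does not attempt.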
An example config.yml for Cryptic-Seq is shown below:
settings:
  - name: Bxb1attP
    ref_fasta: /path/to/GRCh38.p14.full/GRCh38.p14.full.fasta
    attachment_sites:
      - attP:GTGGTTTGTCTGGTCAACCACCGCG:GT:CTCAGTGGTGTACGGTACAAACCCA
    samples:
      - name: JM-CS004a-N7-Bxb1attP-rep1-Sid3_S3
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attP-rep1-Sid3_S3_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attP-rep1-Sid3_S3_L001_R2_001.fastq.gz
      - name: JM-CS004a-N7-Bxb1attP-rep2-Sid8_S8
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attP-rep2-Sid8_S8_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attP-rep2-Sid8_S8_L001_R2_001.fastq.gz
  - name: Bxb1attB
    ref_fasta: /path/to/GRCh38.p14.full/GRCh38.p14.full.fasta
    attachment_sites:
      - attB:GGCCGGCTTGTCGACGACGGCG:GT:CTCCGTCGTCAGGATCATCCGG
    samples:
      - name: JM-CS004a-N7-Bxb1attB-rep1-Sid4_S4
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attB-rep1-Sid4_S4_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attB-rep1-Sid4_S4_L001_R2_001.fastq.gz
      - name: JM-CS004a-N7-Bxb1attB-rep2-Sid9_S9
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attB-rep2-Sid9_S9_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-Bxb1attB-rep2-Sid9_S9_L001_R2_001.fastq.gz
  - name: attBalone
    ref_fasta: /path/to/GRCh38.p14.full/GRCh38.p14.full.fasta
    attachment_sites:
      - attB:GGCCGGCTTGTCGACGACGGCG:GT:CTCCGTCGTCAGGATCATCCGG
    samples:
      - name: JM-CS004a-N7-attBalone-rep1-Sid2_S2
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attBalone-rep1-Sid2_S2_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attBalone-rep1-Sid2_S2_L001_R2_001.fastq.gz
      - name: JM-CS004a-N7-attBalone-rep2-Sid7_S7
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attBalone-rep2-Sid7_S7_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attBalone-rep2-Sid7_S7_L001_R2_001.fastq.gz
  - name: attPalone
    ref_fasta: /path/to/GRCh38.p14.full/GRCh38.p14.full.fasta
    attachment_sites:
      - attP:GTGGTTTGTCTGGTCAACCACCGCG:GT:CTCAGTGGTGTACGGTACAAACCCA
    samples:
      - name: JM-CS004a-N7-attPalone-rep1-Sid1_S1
        replicate: 1
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attPalone-rep1-Sid1_S1_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attPalone-rep1-Sid1_S1_L001_R2_001.fastq.gz
      - name: JM-CS004a-N7-attPalone-rep2-Sid6_S6
        replicate: 2
        fq1: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attPalone-rep2-Sid6_S6_L001_R1_001.fastq.gz
        fq2: /path/to/cryptic-seq/fastqs/JM-CS004a-N7-attPalone-rep2-Sid6_S6_L001_R2_001.fastq.gz
The global parameters supported for Cryptic-seq are listed in the following table. A configuration file with the default values is provided.
Config Key | Description | Default |
---|---|---|
read_structure_r1 | R1 read structure for fgbio FastqToBam, made up of '<number><operator>' pairs | 11M+T |
read_structure_r2 | R2 read structure for fgbio FastqToBam, made up of '<number><operator>' pairs | +T |
trim_Tn5 | Whether to trim the Tn5 mosaic end sequence from the start of R1 | True |
trim_Tn5_max_mismatches | Maximum number of mismatches to allow when trimming the Tn5 mosaic end | 1 |
trim_att_max_mismatches | Maximum number of mismatches to allow when trimming the leading attachment site | 4 |
umi_from_read_name | Set to True if the UMI is found in the read name instead of the sequence | False |
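To make the read structure defaults concrete: in fgbio's read-structure syntax, each segment is a length (or + for "all remaining bases", allowed only in the last segment) followed by an operator, where T is template, B is sample barcode, M is molecular barcode (UMI), and S is skip. The default 11M+T for R1 therefore means an 11-base UMI followed by template. The Python sketch below parses such strings for illustration only; it is not fgbio's parser, covers only the four operators named here, and the function name is hypothetical.

```python
import re

# One segment: a positive length or '+', followed by a single operator.
SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure: str) -> list:
    """Parse e.g. '11M+T' into [(11, 'M'), (None, 'T')]; None means 'remaining bases'."""
    segments = []
    pos = 0
    for match in SEGMENT.finditer(structure):
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos} in {structure!r}")
        length, operator = match.groups()
        segments.append((None if length == "+" else int(length), operator))
        pos = match.end()
    if not segments or pos != len(structure):
        raise ValueError(f"invalid read structure: {structure!r}")
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed in the last segment")
    return segments


print(parse_read_structure("11M+T"))  # default for read_structure_r1
print(parse_read_structure("+T"))     # default for read_structure_r2
```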
The following steps are performed in the Durant et al. pipeline:
- uses fastp to trim adapters (r1: custom, r2: nextera)
- uses tomebio-tools durant trim-leading-r2 to keep only reads where R2 starts with the sample-specific stagger and inner donor primer
- aligns to the genome, which includes attB and attD
- keeps only reads that meet the following criteria:
  - R1 and R2 must each have at least one alignment to the genome (non-donor)
  - If R1 has an alignment to the donor, it cannot have too many mapped bases to the donor, otherwise it is assumed to be a linear plasmid template (default: 55bp)
  - R2 must have an alignment to the donor on the forward strand
- re-aligns these reads to the genome without attB and attD
- does not de-duplicate
- uses the tomebio-tools durant find-sites tool to find integration sites. For a given template (read pair), the following must be true to be considered an integration, and only such templates are kept:
  - both R1 and R2 map to the genome
  - the template length is 1kb or smaller (implies R1 and R2 map to the same contig!)
  - R1 has enough mapped bases (default: 25bp)
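The find-sites criteria can be expressed as a simple predicate over a read pair. The sketch below is illustrative and is not the tomebio-tools implementation: the ReadPair fields and function name are hypothetical, "1kb" is taken to mean 1000 bp, and the 25 bp default is the one stated above.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical summary of one aligned template (read pair)."""
    r1_mapped: bool
    r2_mapped: bool
    template_length: int  # signed, as reported by the aligner
    r1_mapped_bases: int


def is_integration(
    pair: ReadPair,
    max_template_length: int = 1000,
    min_r1_mapped_bases: int = 25,
) -> bool:
    """Apply the three criteria listed above to a single template."""
    return (
        pair.r1_mapped
        and pair.r2_mapped
        and abs(pair.template_length) <= max_template_length
        and pair.r1_mapped_bases >= min_r1_mapped_bases
    )


pairs = [
    ReadPair(True, True, 350, 60),    # passes all criteria
    ReadPair(True, True, 5000, 60),   # template too long
    ReadPair(True, True, -400, 20),   # too few mapped bases in R1
    ReadPair(True, False, 0, 60),     # R2 unmapped
]
print([is_integration(p) for p in pairs])
```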
An example config.yml for Durant et al. is shown below:
settings:
  - name: Durant
    ref_fasta: /path/to/GRCh38.p14.full/GRCh38.p14.full.fasta
    genome_fasta: /path/to/GRCh38.p14/GRCh38.p14.fasta
    min_aln_score: 20
    inter_site_slop: 10
    samples:
      - name: SRR21306552
        replicate: 1
        bio_rep: 2
        tech_rep: 2
        stagger: ''
        donor_inner_primer: CAGCGAGTCAGTGAGCGAGG
        umi_length: 0
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306552_1.fastq.gz
        fq2: /path/to/SRR21306552_2.fastq.gz
      - name: SRR21306553
        replicate: 2
        bio_rep: 2
        tech_rep: 1
        stagger: T
        donor_inner_primer: CAGCGAGTCAGTGAGCGAGG
        umi_length: 0
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306553_1.fastq.gz
        fq2: /path/to/SRR21306553_2.fastq.gz
      - name: SRR21306554
        replicate: 3
        bio_rep: 1
        tech_rep: 1
        stagger: ATCGAT
        donor_inner_primer: TCGATCGAGGTTGCATTCGG
        umi_length: 0
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306554_1.fastq.gz
        fq2: /path/to/SRR21306554_2.fastq.gz
      - name: SRR21306560
        replicate: 4
        bio_rep: 5
        tech_rep: 1
        stagger: T
        donor_inner_primer: CAGCGAGTCAGTGAGCGAGG
        umi_length: 0
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306560_1.fastq.gz
        fq2: /path/to/SRR21306560_2.fastq.gz
      - name: SRR21306561
        replicate: 5
        bio_rep: 2
        tech_rep: 3
        stagger: ''
        donor_inner_primer: CAGCGAGTCAGTGAGCGAGG
        umi_length: 0
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306561_1.fastq.gz
        fq2: /path/to/SRR21306561_2.fastq.gz
      - name: SRR21306626
        replicate: 6
        bio_rep: 4
        tech_rep: 1
        stagger: CGAT
        donor_inner_primer: TCGATCGAGGTTGCATTCGG
        umi_length: 12
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306626_1.fastq.gz
        fq2: /path/to/SRR21306626_2.fastq.gz
      - name: SRR21306627
        replicate: 7
        bio_rep: 3
        tech_rep: 1
        stagger: T
        donor_inner_primer: TCGATCGAGGTTGCATTCGG
        umi_length: 12
        r1_adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
        r2_adapter: CTGTCTCTTATACACATCTGACGCTGCCGACGA
        fq1: /path/to/SRR21306627_1.fastq.gz
        fq2: /path/to/SRR21306627_2.fastq.gz
*** Important ***: ref_fasta is the genome with attD, while genome_fasta does not contain attD.
All custom scripts used by these workflows are written in Python and available as a command line toolkit called pytomebio. To add a new tool:
- Create a <tool>.py file in the appropriate package within pytomebio.tools
- Add the tool to the TOOLS dict in pytomebio.__main__.py
- Add any new dependencies to pyproject.toml
We follow a two-step process:
- Wrap the Snakemake pipeline using snk inside a Docker container and call it from a Nextflow process.
- Convert each Snakemake rule to a corresponding Nextflow process and write a "native" Nextflow workflow.
The workflow then has the option of running either the "native" version or the Snakemake version. This is useful to be able to compare the results for regression testing.
To add a new step to the workflow, create a new process in the appropriate main.nf file in one of the subfolders of src/nextflow. The process should have the following directives:
- label corresponding to the Docker image that should be used
- tag "${meta.id}" for processes that run on individual samples
- label 'global' for processes that run on the aggregated results
- ext resource: value to specify resource requirements that are above the base level of 1 CPU, 2 GB memory, and 100 GB of disk space
Each process needs to have an associated Docker image. You can use one of the existing images by giving your process the associated label directive. If you need to create a new image:
- Create a new Mamba configuration file in mamba/
- Create a new target in Dockerfile
- Add the necessary configuration for the new image in src/nextflow/<workflow>/nextflow.config
From src/nextflow/cryptic-seq run:
REF_DIR=<ref_dir> PROFILES=local,test[,rosetta|linux] pytest --git-aware tests/integration/
where ref_dir is the root directory for references.