This pipeline supports several different modes of operation. As such there are various entrypoints into the pipeline, each with its own set of relevant inputs. The entrypoints are described in detail here, and the individual parameters are described here. We recommend first determining which entrypoint you need, then cross-referencing the relevant input descriptions.
You can create an input JSON for running the pipeline end-to-end, i.e. from fastqs to loop and domain calls, on a Hi-C experiment from the ENCODE portal using the provided input JSON generation script. Before running it, install the requirements with `pip install -r requirements-scripts.txt`. To invoke it, you must at a minimum provide the accession of the experiment on the portal. See the script's help text for documentation of usage and options (`python scripts/make_input_json_from_portal.py --help`). Currently the script only supports experiments with `HindIII`, `DpnII`, or `MboI` as the library fragmentation method.
Under each entrypoint the relevant inputs are listed; to run the pipeline using a particular entrypoint you need only specify its required inputs.
Runs the pipeline from the very beginning, starting from `fastq` files. `read_groups` is useful for adding extra metadata into the BAM headers.

Required inputs:

* `fastq`
* `restriction_enzymes`
* `restriction_sites`
* `chrsz`
* `reference_index`
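For illustration, a minimal input JSON for this entrypoint might look like the following sketch; all file paths and names are placeholders, and `fastq` is shown with a single biological and technical replicate (its full structure is described below):

```json
{
  "hic.fastq": [
    [
      {
        "read_1": "/path/to/biorep1_R1.fastq.gz",
        "read_2": "/path/to/biorep1_R2.fastq.gz"
      }
    ]
  ],
  "hic.restriction_enzymes": ["MboI"],
  "hic.restriction_sites": "/path/to/MboI_restriction_sites.txt.gz",
  "hic.chrsz": "/path/to/GRCh38.chrom.sizes.tsv.gz",
  "hic.reference_index": "/path/to/GRCh38_bwa_index.tar.gz"
}
```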
Runs the pipeline starting with a `.hic` file for loop and TAD calling.

Required inputs:

* `input_hic`
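For example, an input JSON for this entrypoint can be as small as the following sketch (the path is a placeholder):

```json
{
  "hic.input_hic": "/path/to/my_experiment.hic"
}
```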
Use the WDL `make_restriction_site_locations.wdl` to generate the restriction sites file for your enzyme and reference fasta.

Required inputs:

* `reference_fasta`
* `restriction_enzyme`
* `assembly_name`
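A sketch of a corresponding input JSON, assuming the workflow defined in `make_restriction_site_locations.wdl` is itself named `make_restriction_site_locations` (check the WDL for the exact input prefix; the fasta path is a placeholder):

```json
{
  "make_restriction_site_locations.reference_fasta": "/path/to/GRCh38.fasta.gz",
  "make_restriction_site_locations.restriction_enzyme": "MboI",
  "make_restriction_site_locations.assembly_name": "GRCh38"
}
```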
`fastq` is a twice-nested array of input fastqs. The outermost level corresponds to the biological replicates, and each biological replicate contains one or more `FastqPair` objects. In the example below there are two biological replicates, so there are two elements in the outermost array. The first biological replicate has two technical replicates, so it contains two `FastqPair`s; the second biological replicate has only one technical replicate, so it contains a single `FastqPair`. A `FastqPair` can optionally contain a `read_group`, which will be added to the aligned BAM via the `bwa mem` `-R` option. Note that the read group must contain the full header line and that backslashes must be escaped in the JSON.
"hic.fastq": [
[
{
"read_1": "biorep1_techrep1_R1.fastq.gz",
"read_2": "biorep1_techrep1_R2.fastq.gz"
},
{
"read_1": "biorep1_techrep2_R1.fastq.gz",
"read_2": "biorep1_techrep2_R2.fastq.gz"
}
],
[
{
"read_1": "biorep2_techrep1_R1.fastq.gz",
"read_2": "biorep2_techrep1_R2.fastq.gz",
"read_group": "@RG\\tID:myid"
}
]
]
`restriction_enzymes` is an array containing the name(s) of the restriction enzyme(s) used to generate the Hi-C libraries. Currently only `MboI`, `HindIII`, `DpnII`, and `none` are supported. `none` is useful for libraries, such as DNase Hi-C, produced using a non-specific cutter.

`ligation_site_regex` is a custom regular expression for counting ligation sites. If it is specified, then the `restriction_sites` file must be specified in the pipeline input. It can be just a single site, e.g. `ATGC`, or several sites wrapped in parentheses and separated by pipes, e.g. `(ATGC|CTAG)` (uses `grep -E` extended regular expression syntax).
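For example, a sketch of the custom-regex case, reusing the example pattern above with a placeholder path for the sites file:

```json
{
  "hic.ligation_site_regex": "(ATGC|CTAG)",
  "hic.restriction_sites": "/path/to/restriction_sites.txt.gz"
}
```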
`restriction_sites` is a gzipped text file containing cut sites for the given restriction enzyme. For supported enzymes you can generate this file using the reference building entrypoint. Note that if you need to generate a sites file for a multiple digest or for an unsupported enzyme, you will need to edit this script and run it yourself: https://github.com/aidenlab/juicer/blob/encode/misc/generate_site_positions.py

`chrsz` is a chromosome sizes file for the desired assembly. It is a gzipped, tab-separated text file whose rows take the form `[chromosome][TAB][size]`. You can find these on the ENCODE portal for some human and mouse assemblies; see reference files.

`reference_index` is a pre-generated BWA index for the desired assembly. Depending on your assembly you may also be able to find these on the ENCODE portal; see reference files.

`input_hic` is an input `.hic` file which will be used to call loops and domains.

`normalization_methods` is an array of normalization methods to use for `.hic` file generation, as per Juicer Tools `pre`. If not specified, the `pre` defaults of `VC`, `VC_SQRT`, `KR`, and `SCALE` will be used. Valid methods are `VC`, `VC_SQRT`, `KR`, `SCALE`, `GW_KR`, `GW_SCALE`, `GW_VC`, `INTER_KR`, `INTER_SCALE`, and `INTER_VC`.

`reference_fasta` is the FASTA file for the genome of interest, used for generating restriction site locations. For the output locations file to have a descriptive filename it is also recommended to specify the `assembly_name`.
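As a sketch, overriding the default normalization methods might look like the following; the particular methods chosen here are only an example drawn from the list above:

```json
{
  "hic.normalization_methods": ["VC", "SCALE"]
}
```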
`no_pairs` is a boolean which, if `true`, skips generating `.pairs` files; defaults to `false`.

`no_call_loops` is a boolean which, if `true`, skips calling loops; defaults to `false`. Since loop calling requires GPUs, it is recommended to set this to `true` if you do not have GPUs available.

`no_call_tads` is a boolean which, if `true`, skips calling domains with arrowhead; defaults to `false`.
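For example, a run on a machine without GPUs might set these flags as follows (the `false` values are simply the defaults, shown for clarity):

```json
{
  "hic.no_pairs": false,
  "hic.no_call_loops": true,
  "hic.no_call_tads": false
}
```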
`align_num_cpus` is the number of threads to use for `bwa` alignment; it is recommended to leave it at the default value.

`create_hic_num_cpus` is the number of threads to use for `.hic` file creation; it is recommended to leave it at the default value. If you hit an out-of-memory (OOM) error in Juicer Tools `pre`, which may occur for large experiments, then supply a small value such as `4`.

`assembly_name` is the name of the assembly and defaults to "unknown". If the assembly is supported by Juicer Tools `pre`, then `.hic` file creation will use Juicer Tools' internal chrom sizes instead of the inputted `chrsz`; see the `pre` documentation for the list of supported values. The pipeline normalizes this value internally, for instance `GRCh38` will be converted into the Juicer Tools-supported `hg38`.
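For instance, a large GRCh38 experiment that runs out of memory during `.hic` creation might add something like:

```json
{
  "hic.assembly_name": "GRCh38",
  "hic.create_hic_num_cpus": 4
}
```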
In order to run the pipeline from the beginning you will need to specify the `bwa` index and chromosome sizes file. We recommend using reference files from the ENCODE portal to ensure comparability of the analysis results. Links to the reference `fasta` files are included in case you need to generate a custom restriction sites file.
reference file description | assembly | ENCODE portal link |
---|---|---|
bwa index | GRCh38 | link |
genome fasta | GRCh38 | link |
chromosome sizes | GRCh38 | link |
bwa index | hg19 | link |
genome fasta | hg19 | link |
chromosome sizes | hg19 | link |
In most cases you will also need a restriction map file appropriate for the restriction enzyme and assembly. `MboI` and `DpnII` share the same restriction map because they have the same recognition site. If you don't see your enzyme here you can generate a custom sites file; see generating restriction site files.
restriction enzymes | assembly | ENCODE portal link |
---|---|---|
DpnII, MboI | GRCh38 | link |
HindIII | GRCh38 | link |
DpnII, MboI | hg19 | link |
HindIII | hg19 | link |
The exact outputs of the pipeline will depend on the combination of inputs. When the pipeline completes, check the Caper/Cromwell metadata JSON file, particularly the top-level key `outputs`, for values that are not `null`.
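As a rough sketch of what to look for (the fully qualified key names assume the workflow is named `hic`, and the paths and values are placeholders, not actual pipeline output):

```json
{
  "outputs": {
    "hic.out_hic_30": "/path/to/outputs/inter_30.hic",
    "hic.out_tads": null
  }
}
```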
Descriptions of the individual outputs are below.
A draft document describing the pipeline outputs and quality control (QC) values is available on the ENCODE portal.
* `alignable_bam` is an array of filtered BAM files, one per biological replicate.
* `out_pairs` is an array of files in `pairs` format, one per biological replicate.
* `out_dedup` is an array of files in Juicer long format, one per biological replicate.
* `library_complexity_stats_json` is an array of library complexity QC statistics in JSON format, one per biological replicate.
* `stats` is an array of library QC statistics in JSON format, one per biological replicate. It includes statistics describing the quantity and nature of the Hi-C contacts.
* `alignment_stats` is an array of arrays of alignment QC statistics in plain text, one per technical replicate.
* `merged_stats_json` is a JSON file containing alignment and library statistics for merged libraries.
* `out_hic_1` is a `.hic` file containing the contact matrix filtered by MAPQ >= 1.
* `out_hic_30` is a `.hic` file containing the contact matrix filtered by MAPQ >= 30.
* `out_tads` contains `arrowhead` domain calls in the Juicer format described here.