Repeat Browser Data Processing (Iteres)

This repo provides a pipeline for processing transposable elements (TEs). It computes alignment statistics for TEs and saves the processed data in a file format suitable for our repeat browser.

The pipeline consists of a modified iteres pipeline and a Python script that converts the analysis results to zarr format. The output zarr files can be uploaded to our repeat browser for visualization.

Prerequisites

1. Compile iteres and download related files

In the top directory, run

git clone https://github.com/Jiawei-Shen/Repeat-Browser_data_processing.git
cd Repeat-Browser_data_processing
make

Here are some related files you may need:

Repeat size file: here (length of consensus sequence of repeat subfamily)

hg38 Repeat annotation: download from UCSC

hg38 Chromosome size file: Full Lite (without supercontigs)
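
As a rough illustration of how the "Full" and "Lite" chromosome size files relate, the sketch below reads a chromosome size file and optionally drops supercontigs. It assumes the usual two-column layout (chromosome name, then length) and that supercontig names contain an underscore, as in UCSC hg38 names such as chr1_KI270706v1_random. The function name read_chrom_sizes and the sample rows are hypothetical and are not part of this repo.

```python
import io


def read_chrom_sizes(handle, drop_supercontigs=True):
    """Parse a two-column chromosome size file into a {name: length} dict.

    Assumption: columns are whitespace-separated (name, length), and
    supercontigs are recognised by an underscore in the name.
    """
    sizes = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, size = line.split()[:2]
        if drop_supercontigs and "_" in name:
            continue
        sizes[name] = int(size)
    return sizes


# Illustrative rows only; a real run would pass an open file handle.
example = "chr1\t248956422\nchr1_KI270706v1_random\t175055\nchrM\t16569\n"
lite_sizes = read_chrom_sizes(io.StringIO(example))
full_sizes = read_chrom_sizes(io.StringIO(example), drop_supercontigs=False)
print(sorted(lite_sizes))
print(sorted(full_sizes))
```

With a real file, the same function could be called as read_chrom_sizes(open("/path/to/chrom_size_file")).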

2. Install the required Python packages

Python (version 3.6.13 recommended)
pip (Python package installer)

# Install dependencies from requirements.txt
pip install -r requirements.txt

Implementation

Step 0. Align the reads

If you already have a bam file, this step can be omitted. If not, here are some tutorials to align the reads in different scenarios.

(1). ChIP-Seq data

We recommend BWA for aligning ChIP-Seq data.

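# run from the bwa folder, or adjust the path to the bwa executable
# build the index for the reference, then align the single-end reads
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz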

(2). CAGE-Seq data

We recommend STAR for aligning CAGE-Seq data. Because we focus on multireads, the settings below differ from the STAR defaults.

# also supply --genomeDir and --readFilesIn for your own index and reads
STAR --chimSegmentMin 100 \
    --outFilterMultimapNmax 100 \
    --winAnchorMultimapNmax 100 \
    --alignEndsType EndToEnd \
    --alignEndsProtrude 100 DiscordantPair \
    --outFilterScoreMinOverLread 0.4 \
    --outFilterMatchNminOverLread 0.4 \
    --outSAMtype BAM Unsorted \
    --outSAMattributes All \
    --outSAMstrandField intronMotif \
    --outSAMattrIHstart 0 \
    --readFilesCommand zcat \
    --chimOutType WithinBAM SoftClip

The STAR parameters for handling CAGE-Seq multireads follow the guidance in the SQuIRE repository.

Step 1. Run the pipeline

You can run the whole pipeline with the bash script run.sh.

Here are some sample files for human that you may need for the following arguments:

--chrom_size
--subfam_size 
--rmsk_path

Repeat size file: here (length of consensus sequence of repeat subfamily)

hg19: Repeat annotation: download from UCSC

hg19: Chromosome size file: Full Lite (without supercontigs)

hg38: Repeat annotation: download from UCSC

hg38: Chromosome size file: Full Lite (without supercontigs)

For the bash file:

(1). The default scenario

In this scenario, you have a single bam file to process, and it is not from CAGE-Seq.

bash run.sh --bam_file /path/to/your/bam_file \
            --output_path /path/to/output \
            --chrom_size /path/to/chrom_size_file \
            --subfam_size /path/to/subfam_size_file \
            --rmsk_path /path/to/rmsk_file

(2). The data is from ChIP-Seq

In this case, there are two bam files: the signal bam file and the IgG control bam file.

bash run.sh --signal_bam_file /path/to/signal.bam \
            --control_bam_file /path/to/control.bam \
            --output_path /path/to/output \
            --chrom_size /path/to/chrom_size_file \
            --subfam_size /path/to/subfam_size_file \
            --rmsk_path /path/to/rmsk_file

(3). The data is from CAGE-Seq

In this case, you need to set cage_window, the number of base pairs taken on each side of the 5' end during processing.

The default value of cage_window is 20, which means the selected segment runs from 20 bp before the 5' end to 20 bp after it.

bash run.sh --bam_file /path/to/your/bam_file \
            --output_path /path/to/output \
            --chrom_size /path/to/chrom_size_file \
            --subfam_size /path/to/subfam_size_file \
            --rmsk_path /path/to/rmsk_file \
            --cage_window 20
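
To make the meaning of cage_window concrete, here is a minimal sketch of how a window around the 5' end of an aligned read could be computed. It is not taken from the pipeline's source. The function name cage_window_interval, the 0-based half-open coordinates, and the choice to include the 5' base itself (giving 2 * cage_window + 1 bp) are all assumptions made for illustration.

```python
def cage_window_interval(start, end, strand, cage_window=20):
    """Return a half-open (lo, hi) interval centred on the read's 5' end.

    Assumptions: start and end are 0-based, half-open alignment
    coordinates, and the 5' base itself is included in the window.
    """
    if strand == "+":
        five_prime = start        # 5' end is the leftmost aligned base
    elif strand == "-":
        five_prime = end - 1      # 5' end is the rightmost aligned base
    else:
        raise ValueError("strand must be '+' or '-'")
    lo = max(0, five_prime - cage_window)  # clamp at the chromosome start
    hi = five_prime + cage_window + 1
    return lo, hi


print(cage_window_interval(100, 150, "+"))
print(cage_window_interval(100, 150, "-"))
```

A real implementation would also clamp the upper bound to the chromosome length taken from the chromosome size file.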

We provide some sample files in the Prerequisites section.
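
The three scenarios differ only in which bam flags are passed and whether --cage_window is set. The sketch below assembles the run.sh command line from Python, using only the flags shown in this README. The helper name build_run_command and its validation rules are hypothetical, and whether run.sh accepts --cage_window together with the signal/control flags is not stated here, so treat that combination as untested.

```python
import shlex


def build_run_command(output_path, chrom_size, subfam_size, rmsk_path,
                      bam_file=None, signal_bam_file=None,
                      control_bam_file=None, cage_window=None):
    """Build the argument list for run.sh (hypothetical helper)."""
    paired = signal_bam_file is not None and control_bam_file is not None
    if bam_file is not None and (signal_bam_file or control_bam_file):
        raise ValueError("use either bam_file or signal/control bam files")
    if bam_file is None and not paired:
        raise ValueError("provide bam_file, or both signal and control")

    cmd = ["bash", "run.sh"]
    if bam_file is not None:
        cmd += ["--bam_file", bam_file]
    else:
        cmd += ["--signal_bam_file", signal_bam_file,
                "--control_bam_file", control_bam_file]
    cmd += ["--output_path", output_path,
            "--chrom_size", chrom_size,
            "--subfam_size", subfam_size,
            "--rmsk_path", rmsk_path]
    # The README shows --cage_window only together with --bam_file.
    if cage_window is not None:
        cmd += ["--cage_window", str(cage_window)]
    return cmd


cage_cmd = build_run_command("/path/to/output", "/path/to/chrom_size_file",
                             "/path/to/subfam_size_file", "/path/to/rmsk_file",
                             bam_file="/path/to/your/bam_file", cage_window=20)
# Print a shell-quoted version for inspection (works on Python 3.6).
print(" ".join(shlex.quote(arg) for arg in cage_cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository directory once the paths point at real files.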
