This repository provides a pipeline for processing transposable elements (TEs). It computes alignment statistics for TEs and saves the processed data in a file format suitable for our repeat browser.
The pipeline consists of a modified iteres pipeline and a Python script that converts the analysis results into zarr format. The resulting zarr files can be uploaded to our repeat browser for visualization.
In the top directory, run
git clone https://github.com/Jiawei-Shen/Repeat-Browser_data_processing
cd Repeat-Browser_data_processing
make
Here are some related files you may need:
Repeat size file: here (length of consensus sequence of repeat subfamily)
hg38 Repeat annotation: download from UCSC
hg38 Chromosome size file: Full Lite (without supercontigs)
# Install dependencies from requirements.txt
pip install -r requirements.txt
If you already have a BAM file, you can skip this step. If not, the tutorials below show how to align reads in different scenarios.
We recommend using BWA to align ChIP-Seq data.
# change the path to the bwa folder
./bwa index ref.fa
./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
We recommend using STAR to align CAGE-Seq data. Because the pipeline focuses on multireads, the settings below differ from STAR's defaults.
STAR --chimSegmentMin 100 \
  --outFilterMultimapNmax 100 \
  --winAnchorMultimapNmax 100 \
  --alignEndsType EndToEnd \
  --alignEndsProtrude 100 DiscordantPair \
  --outFilterScoreMinOverLread 0.4 \
  --outFilterMatchNminOverLread 0.4 \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMstrandField intronMotif \
  --outSAMattrIHstart 0 \
  --readFilesCommand zcat \
  --chimOutType WithinBAM SoftClip
We consulted the SQuIRE repository for guidance on the STAR parameters related to handling CAGE-Seq multireads.
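Since the settings above keep multireads and `--outSAMattributes All` makes STAR write the `NH` tag (number of reported alignments for the read), a quick way to check how many multireads an alignment contains is to look at that tag. The following is a minimal sketch on SAM text lines; the function names and the example records are made up for illustration, and this is not necessarily how the pipeline itself identifies multireads.

```python
def hit_count(sam_line):
    """Return the NH:i value of one SAM alignment line, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):])
    return None


def is_multiread(sam_line):
    """A read is treated as a multiread when it has more than one reported alignment."""
    n = hit_count(sam_line)
    return n is not None and n > 1


# Hypothetical records, only to exercise the functions.
unique = "r1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\tHI:i:0"
multi = "r2\t0\tchr2\t500\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:3\tHI:i:0"

for line in (unique, multi):
    print(line.split("\t")[0], hit_count(line), is_multiread(line))
```

To apply the same check to a BAM file, the records would first have to be converted to SAM text (for example with `samtools view`).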
You can run the whole pipeline with the bash script run.sh.
Here are some sample files for human data that you may need for the following arguments:
--chrom_size
--subfam_size
--rmsk_path
Repeat size file: here (length of consensus sequence of repeat subfamily)
hg19: Repeat annotation: download from UCSC
hg19: Chromosome size file: Full Lite (without supercontigs)
hg38: Repeat annotation: download from UCSC
hg38: Chromosome size file: Full Lite (without supercontigs)
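Before running the pipeline it can help to sanity-check the chromosome size file. The sketch below assumes the file follows the usual two-column UCSC chrom.sizes layout (chromosome name, then length, separated by whitespace), and it assumes that supercontigs and alternate scaffolds can be recognized by an underscore in their name. Both points are assumptions about the input, not something the pipeline documents, and `read_chrom_sizes` is a hypothetical helper.

```python
import io


def read_chrom_sizes(handle, drop_supercontigs=False):
    """Parse 'name<whitespace>length' lines into a dict of chromosome lengths."""
    sizes = {}
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, length = line.split()[:2]
        # Assumption: names such as chrUn_... or chr1_..._random mark supercontigs.
        if drop_supercontigs and "_" in name:
            continue
        sizes[name] = int(length)
    return sizes


# In practice, pass an open file instead: open("/path/to/chrom_size_file")
example = io.StringIO("chr1\t248956422\nchrUn_KI270302v1\t2274\nchrM\t16569\n")
sizes = read_chrom_sizes(example, drop_supercontigs=True)
print(sizes)
```

With `drop_supercontigs=True` the result corresponds to a "Lite" style file; with the default it keeps every entry, as in a "Full" file.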
For the bash file:
In this scenario, you have a single BAM file to process, and it is not from CAGE-Seq.
bash run.sh --bam_file /path/to/your/bam_file \
  --output_path /path/to/output \
  --chrom_size /path/to/chrom_size_file \
  --subfam_size /path/to/subfam_size_file \
  --rmsk_path /path/to/rmsk_file
In this scenario, you have two BAM files: a signal BAM file and an IgG control BAM file.
bash run.sh --signal_bam_file /path/to/signal.bam \
  --control_bam_file /path/to/control.bam \
  --output_path /path/to/output \
  --chrom_size /path/to/chrom_size_file \
  --subfam_size /path/to/subfam_size_file \
  --rmsk_path /path/to/rmsk_file
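When processing many samples, it may be convenient to assemble these run.sh invocations from a script. The sketch below only builds the argument list, using the flags shown above. The helper name `build_run_command` and the rule that the single-file and signal/control modes are mutually exclusive are assumptions of this example, not documented behavior of run.sh.

```python
import shlex


def build_run_command(output_path, chrom_size, subfam_size, rmsk_path,
                      bam_file=None, signal_bam_file=None, control_bam_file=None):
    """Return the run.sh argument list for a single BAM or a signal/control pair."""
    has_pair_arg = signal_bam_file is not None or control_bam_file is not None
    cmd = ["bash", "run.sh"]
    if bam_file is not None and not has_pair_arg:
        cmd += ["--bam_file", bam_file]
    elif bam_file is None and signal_bam_file is not None and control_bam_file is not None:
        cmd += ["--signal_bam_file", signal_bam_file,
                "--control_bam_file", control_bam_file]
    else:
        raise ValueError("give either bam_file, or both signal_bam_file and control_bam_file")
    cmd += ["--output_path", output_path,
            "--chrom_size", chrom_size,
            "--subfam_size", subfam_size,
            "--rmsk_path", rmsk_path]
    return cmd


cmd = build_run_command("/path/to/output", "/path/to/chrom_size_file",
                        "/path/to/subfam_size_file", "/path/to/rmsk_file",
                        signal_bam_file="/path/to/signal.bam",
                        control_bam_file="/path/to/control.bam")
print(shlex.join(cmd))  # the list can also be passed to subprocess.run(cmd, check=True)
```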
For CAGE-Seq data, you also need to set cage_window, the number of base pairs kept on each side of the 5' end of a read during processing.
The default value of cage_window is 20, which means the selected segment runs from 20 bp upstream of the 5' end to 20 bp downstream of it.
bash run.sh --bam_file /path/to/your/bam_file \
  --output_path /path/to/output \
  --chrom_size /path/to/chrom_size_file \
  --subfam_size /path/to/subfam_size_file \
  --rmsk_path /path/to/rmsk_file \
  --cage_window 20
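To make the effect of cage_window concrete, the sketch below computes the segment around the 5' end of an aligned read. It is an illustration under stated assumptions, not the pipeline's own code: coordinates are 0-based and half-open, the 5' base itself is included in the segment (the text above does not say whether it is counted), and the segment is clipped at the chromosome ends. The function names are hypothetical.

```python
def five_prime_end(start, end, strand):
    """5' position of a read aligned to the 0-based half-open interval [start, end)."""
    # On the minus strand the 5' end is the rightmost aligned base.
    return start if strand == "+" else end - 1


def cage_segment(five_prime_pos, chrom_length, cage_window=20):
    """Half-open interval covering cage_window bp on each side of the 5' end."""
    seg_start = max(0, five_prime_pos - cage_window)
    seg_end = min(chrom_length, five_prime_pos + cage_window + 1)
    return seg_start, seg_end


print(cage_segment(five_prime_end(100, 150, "+"), 1000))  # (80, 121)
print(cage_segment(five_prime_end(100, 150, "-"), 1000))  # (129, 170)
print(cage_segment(5, 1000))                              # (0, 26), clipped at the start
```

With the default window of 20, an unclipped segment spans 41 bp under these assumptions: 20 bp on each side plus the 5' base.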
We provide some sample files in the Prerequisites section.