Bastilevirinae introns

Scripts used to analyse the data in the "Self-splicing introns in genes of Bastillevirinae bacteriophages" paper. The analysis consists of four main steps:

Finding potential group I and group II introns in the genomes using Infernal matches to custom intron CM databases and resolving potential intron-exon boundaries based on alignment to known protein families.
Confirmation of introns splicing using cDNA reads and finding potential splicing sites.
Analysis of introns evolutionary relationships using cmalign and IQ-tree.
Preliminary prediction of secondary structure using LinearTurboFold and R2DT.

REQUIREMENTS

Linux system (tested on Ubuntu 22.04 LTS), Conda or compatible package manager (dependencies are listed in the phage_introns_env.yml and installed automatically when the environment is created)

INSTALLATION

Clone the repository and enter the main directory:

git clone https://github.com/zwmuam/Bastilevirinae_introns
cd Bastilevirinae_introns

Create the environment:

conda env create -f phage_introns_env.yml

OR

mamba env create -f phage_introns_env.yml

Activate the environment:

conda activate phage_introns

Run the download_databases.sh script:

./download_databases.sh

By default the script downloads the database versions used in the paper to default location (./databases/).

USAGE

Locate initial group I and group II matches:

./find_introns.py -f genomes.fasta -o genomes_intron_pred

This command assumes that default databases are installed. Custom HMMs and CMs can be used "-h" and "-c" arguments. Optionally, xlsx table with taxonomy can be added as "-w" argument to perform distribution analysis.

If you have spliced cDNA reads (see the paper) pass them to "confirm_introns" script to map transcripts to genome and identify splicing sites:

./confirm_introns.py -ca custom_adapters -cb custom_barcodes -fq reads.fastq -rd reference_genomes_dir -o output_dir -s '__' -mr 1000

The script uses modified Porechop to demultiplex reads and remove adapters. Clean reads are mapped to reference genomes with minimap2 and analysed by spliced_bam2gff to find splicing sites. The "-s" argument is used to specify the separator between the barcode and the read name in the fastq file. The "-mr" specifies the number of reads in bam files used for downstream./alignment_statistics.py -u -f redundant_sequences.fasta -o redundant_dataset_stats.xlsx visualiation

To align all analysed introns to the reference model from RFAM:

cmalign -o cmalign_output.sto --matchonly --outformat Stockholm reference.cm non_redundant_seqences.fasta

We used InfeRNAl cmalign with "--matchonly" option and RF00028 model to exclude variable insertion sequences and focus on conserved RNA structures. In case of our machine we also added "--mxsize 5140" "--cpu 20" parameters to increase the maximum matrix size as well as number of threads and improve the performance. Some tools (e.g. tree construction programs) may also require DNA-like output triggered by "--dnaout" option.

Convert stockholm ouput of cmalign to aligned fasta:

./sto2fasta.py -s cmalign_output.sto

Columns with more than 50% gaps can be masked using one of commonly used sequence analysis software suites. To prune poorly aligned sequences run:

./prune_alignment.py -f cmalign_output.fasta -l 50

To re-annotate introns (useful for intron sequences imported from external databases like GISSD) and generate R2DT-compatible annotations use:

./reannotate_introns.py -f non_redundant_seqences.fasta -o Annotated_nr_sequences

Generate statistics for all sequence datasets and alignments:

./alignment_statistics.py -u -f redundant_sequences.fasta
./alignment_statistics.py -u -f non_redundant_seqences.fasta
./alignment_statistics.py -f cmalign_output.fasta -r 251
./alignment_statistics.py -f cmalign_output.pruned.fasta -r

Construct a phylogenetic tree:

iqtree -nt AUTO -s cmalign_output.pruned.fasta -alrt 1000 -bb 1000

This command uses IQ-tree with automatic model selection, 1000 ultrafast bootstrap replicates and 1000 SH-aLRT replicates.

Predict the secondary structure of introns from different clades observed in the phylogenetic tree:

for clade in clade_*.fna; do ./linearturbofold -i ${clade} -o ${clade}_ltf --pf -v --bpp; done

LinearTurboFold is run to simultaneously align and fold the sequences from each clade ("-v" prints alignment, folding and runtime information, "--pf" saves partition function and "--bpp" writes base pair probabilities). This step is not included in the environment. To get the software, please visit the original repository by Li and colleagues.

Use structures, tsv annotations and json colour code (provided by reannotate_introns.py) to generate structure visualisations using R2DT:

# In the Docker container:
r2dt.py templatefree ${intron_structure}.fasta ${intron_structure}
python3 enrich_json.py --input-json ${intron_structure}.json --input-data ${intron_structure}.annot --output ${intron_structure}.enriched.json
json2svg.py -p colour_dict.json -i ${intron_structure}.enriched.json -o ${intron_structure}.enriched.svg

The visualisation is not included in the repository because it requires a fully configured R2DT pipeline. Run it within the R2DT Docker container based on the image from Docker Hub.

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
examples		examples
LICENSE		LICENSE
README.md		README.md
adapters_custom.py		adapters_custom.py
alignment_statistics.py		alignment_statistics.py
annotations.py		annotations.py
confirm_introns.py		confirm_introns.py
download_databases.sh		download_databases.sh
find_introns.py		find_introns.py
intron_statistics.py		intron_statistics.py
phage_introns_env.yml		phage_introns_env.yml
porechop_custom.py		porechop_custom.py
postprocessing.py		postprocessing.py
prune_alignment.py		prune_alignment.py
reannotate_introns.py		reannotate_introns.py
sto2fasta.py		sto2fasta.py
tweaks.py		tweaks.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bastilevirinae introns

REQUIREMENTS

INSTALLATION

USAGE

About

Releases 1

Packages

Contributors 2

Languages

License

zwmuam/Bastilevirinae_introns

Folders and files

Latest commit

History

Repository files navigation

Bastilevirinae introns

REQUIREMENTS

INSTALLATION

USAGE

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages