This pipeline generates a conservation landscape based on kmer frequency

INMEGEN/conservationLandscape

conservationLandscape

This repository hosts a pipeline that generates a conservation landscape based on the frequency of reference kmers across a set of target genomes. The pipeline has been applied to select regions suitable for PCR design.

RATIONALE

A reference genome is split into subsequences of size k (kmers). The frequency of each reference kmer across a set of target genomes is then obtained. The conservation landscape is the graphical representation of this frequency map.
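The idea can be illustrated with a minimal Python sketch. It is not the pipeline's implementation (the pipeline relies on Jellyfish for counting). The function names `reference_kmers` and `kmer_frequencies` are hypothetical, and the sketch assumes overlapping kmers, 1-based start positions, and forward-strand counting only.

```python
from collections import Counter


def reference_kmers(reference, k):
    """Yield (start, kmer) for every overlapping kmer of the reference.

    Start positions are 1-based (an assumption of this sketch).
    """
    for i in range(len(reference) - k + 1):
        yield i + 1, reference[i:i + k]


def kmer_frequencies(reference, targets, k):
    """Count how often each reference kmer occurs across the target sequences."""
    counts = Counter()
    for seq in targets:
        # Tally every kmer of every target (forward strand only).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Look up each reference kmer; kmers never seen get a count of 0.
    return [(start, counts[kmer]) for start, kmer in reference_kmers(reference, k)]


# Toy data: two targets identical to the reference, one with a substitution.
reference = "ACGTACGGT"
targets = ["ACGTACGGT", "ACGTACGGT", "ACGTTCGGT"]
landscape = kmer_frequencies(reference, targets, k=4)
for start, freq in landscape:
    print(start, freq)
```

Plotting the second value against the first gives a toy version of the landscape: kmers overlapping the substituted base are found in only two of the three targets.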

Frequently mutated positions produce sharp drops in frequency, while highly conserved regions appear as steady stretches of high frequency.
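To see why a mutation shows up as a drop rather than a single low point, consider the following sketch. It assumes overlapping kmers and exact matching; the function name `exact_match_profile` and the toy sequences are hypothetical. A single substitution in a target removes every reference kmer that overlaps the mutated base, so up to k consecutive start positions lose that target's contribution.

```python
def exact_match_profile(reference, target, k):
    """Return 1 or 0 per reference kmer start: does the kmer occur in the target?"""
    target_kmers = {target[i:i + k] for i in range(len(target) - k + 1)}
    return [
        int(reference[i:i + k] in target_kmers)
        for i in range(len(reference) - k + 1)
    ]


reference = "ACGTTGCAAGCTTAGGCTCA"
# Hypothetical target carrying one substitution (C to T) at 0-based position 10.
mutated = reference[:10] + "T" + reference[11:]

profile = exact_match_profile(reference, mutated, k=5)
print(profile)
# The five kmers starting at 0-based positions 6 to 10 overlap the mutated base.
print("kmers lost:", profile.count(0))
```

Summing such profiles over many targets yields the frequency map, which is why positions that mutate in many genomes appear as pronounced dips about k positions wide.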

INSTALLATION

DEPENDENCIES

Before running this pipeline, the following dependencies must be installed and available in your $PATH:

HOW TO INSTALL THIS PIPELINE

  1. Clone this repository
git clone https://github.com/INMEGEN/conservationLandscape.git
  2. Move to the repository directory
cd conservationLandscape
  3. Execute the example command line:
nextflow run landscape.nf --input data/sequences.fasta --db ref/MN908947.3.coronavirus.Wuhan-1.fasta --out results/ --kmer 20 --pipeline $PATH_TO_REPO

You should see a results folder that contains a counts file and a conservation landscape plot.

HOW TO RUN THIS PIPELINE

REQUIRED ARGUMENTS

Argument Description
--input The path to the input FASTA file
--db The path to the reference genome FASTA file
--out The path to the output directory
--kmer The size of the kmer to be analyzed
--pipeline The path to the repository directory

OPTIONAL ARGUMENTS

Argument Description
--s The size of the hash stored in memory by Jellyfish (DEFAULT:100M)
--t Number of threads used by Jellyfish (DEFAULT:1)
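As a convenience, the arguments above can be assembled programmatically. The sketch below only builds the command line and prints it; it does not run Nextflow. The helper name `build_command`, the placeholder repository path, and the values `500M` and `4` are illustrative assumptions, while the flags themselves are the ones listed in the tables above.

```python
import shlex


def build_command(input_fasta, reference, out_dir, kmer, pipeline_dir,
                  hash_size=None, threads=None):
    """Assemble the nextflow command as a list of arguments."""
    cmd = [
        "nextflow", "run", "landscape.nf",
        "--input", input_fasta,
        "--db", reference,
        "--out", out_dir,
        "--kmer", str(kmer),
        "--pipeline", pipeline_dir,
    ]
    # Optional arguments are appended only when given, so the pipeline
    # defaults (100M hash size, 1 thread) apply otherwise.
    if hash_size is not None:
        cmd += ["--s", str(hash_size)]
    if threads is not None:
        cmd += ["--t", str(threads)]
    return cmd


cmd = build_command(
    "data/sequences.fasta",
    "ref/MN908947.3.coronavirus.Wuhan-1.fasta",
    "results/",
    kmer=20,
    pipeline_dir="/path/to/conservationLandscape",  # placeholder path
    hash_size="500M",  # illustrative value
    threads=4,         # illustrative value
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the repository directory once the dependencies are installed.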

OUTPUT

The output directory will contain two files:

  • kmer_frequency.txt. This is a text file with the frequency information for each reference kmer. Column 1: Start position of the reference kmer. Column 2: Frequency of the reference kmer. Column 3: Frequency of the reference kmer, forward strand only. Column 4: Frequency of the reference kmer, reverse strand only.
  • conservationLandscape.pdf. The graphical display of the conservation landscape. The X-axis represents the start position of each reference kmer and the Y-axis represents its frequency.
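The counts file can be post-processed to shortlist conserved stretches, for example as candidate regions for PCR design. The sketch below assumes that kmer_frequency.txt is whitespace-delimited with four integer columns in the order described above; the exact delimiter and the presence of a header are not documented here, so adjust accordingly. The function names, thresholds, and toy data are hypothetical.

```python
import io


def read_frequencies(handle):
    """Parse lines of 'start total forward reverse' into tuples of ints."""
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip blank lines or a possible header line
        rows.append(tuple(int(x) for x in fields[:4]))
    return rows


def conserved_regions(rows, min_freq, min_len):
    """Return (first_start, last_start) for runs of consecutive kmers
    whose total frequency is at least min_freq."""
    regions, run = [], []
    for start, total, _fwd, _rev in rows:
        if total >= min_freq and (not run or start == run[-1] + 1):
            run.append(start)
        else:
            # The run ended: keep it if long enough, then start over.
            if len(run) >= min_len:
                regions.append((run[0], run[-1]))
            run = [start] if total >= min_freq else []
    if len(run) >= min_len:
        regions.append((run[0], run[-1]))
    return regions


# Toy data standing in for kmer_frequency.txt (values are made up).
toy_file = io.StringIO(
    "1 10 6 4\n"
    "2 10 6 4\n"
    "3 10 5 5\n"
    "4 2 1 1\n"
    "5 9 5 4\n"
    "6 10 6 4\n"
)
rows = read_frequencies(toy_file)
regions = conserved_regions(rows, min_freq=9, min_len=2)
print(regions)
```

With a real file, replace the `io.StringIO` object with `open("results/kmer_frequency.txt")` and choose `min_freq` relative to the number of target genomes analyzed.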

HOW TO CITE THIS PIPELINE

If you use this pipeline as part of your work, please cite:
