evanhackstadt/dudleya

Conservation genomics pipeline for Dudleya setchellii

Dudleya Genomics Pipeline

Dr. Justen Whittall, Evan Hackstadt, Karina Martinez, Dante Cable

Instructions last updated: 1/14/26

Pipeline Architecture

  • Snakefile — contains the bulk of the pipeline: the set of rules (terminal commands / scripts) that process the data.
  • The Snakefile depends on two config files that tell it what to do:
    • config/profile.yaml — parameters telling Snakemake to use SLURM and specifying default resources. We do NOT normally change this.
    • config/samples.yaml — paths to the reference genome, the ancestral genome, and a dictionary mapping sample names to the paths of each raw read file (R1 and R2). We MAY change this.
  • scripts/update_sample_config.py — utility script providing an easy way to update the samples config file.
  • submit_snakemake.sh — batch script that lets us run the pipeline in the background (as a master SLURM job).
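For orientation, config/samples.yaml follows roughly this shape. The key names and paths below are illustrative guesses, not the repo's actual contents — check the real file for the exact schema:

```yaml
# Hypothetical sketch of config/samples.yaml.
# Key names and paths are illustrative only.
ref_genome: /path/to/reference.fasta
anc_genome: /path/to/ancestral.fasta
samples:
  DS01:
    R1: data/DS01_R1.fastq.gz
    R2: data/DS01_R2.fastq.gz
  DS02:
    R1: data/DS02_R1.fastq.gz
    R2: data/DS02_R2.fastq.gz
```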

Running the Pipeline

Update sample config

Let's say we have new samples in a folder data/ that we want to process. First we need to update config/samples.yaml to point to our samples.

You can use the utility script, update_sample_config.py, to do this easily without editing the file directly.

The script takes a few arguments and has optional flags. To see options, run:

python scripts/update_sample_config.py --help

Example usage:

python scripts/update_sample_config.py data/ snakemake/samples_config/

The script should print what it's doing to the terminal. Note that this script CANNOT change the paths to the reference and ancestral genomes, so these must be edited manually if needed.
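The script's job of pairing R1/R2 files by sample name can be sketched in plain shell. This sketch assumes files are named `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz`; the exact pattern the real script expects may differ:

```shell
# List sample names and their R1/R2 read-file pairs from a data directory.
# Assumes the naming convention <sample>_R1.fastq.gz / <sample>_R2.fastq.gz;
# samples missing an R2 mate are skipped.
list_pairs() {
  dir="$1"
  for r1 in "$dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue            # glob matched nothing
    sample=$(basename "$r1" _R1.fastq.gz)
    r2="$dir/${sample}_R2.fastq.gz"
    [ -e "$r2" ] && echo "$sample: $r1 $r2"
  done
}
```

For example, `list_pairs data/` prints one `sample: R1 R2` line per complete pair.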

Run Snakemake

Now we can run the pipeline.

To run Snakemake in the background (as a SLURM job):

sbatch snakemake/submit_snakemake.sh

And that's it! To check status, use squeue, tail <output_file>, or ls <dirs_being_created>. The pipeline creates a master results/ directory containing subdirectories for the relevant outputs from each step.
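Beyond squeue and tail, a tiny helper can summarize progress by counting files under results/ as the steps finish. A sketch (the helper is not part of the repo; the results/ layout is as described above):

```shell
# Print "<file count><TAB><subdir>" for each subdirectory of the given
# results directory -- a cheap way to watch outputs accumulate per step.
progress() {
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    n=$(find "$d" -type f | wc -l)
    printf '%s\t%s\n' "$((n))" "$d"
  done
}
```

Usage: `progress snakemake/results/`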

To run Snakemake directly (not recommended since this requires you to stay logged in until it finishes):

snakemake --profile config/profile.yaml

Batching

For large datasets, Snakemake allows us to compute batches of input files for a specified rule by adding the --batch myrule=1/n flag. The flag must either be passed manually on the command line or added to profile.yaml. The batched rule should be an aggregation step; in this pipeline, that is create_bam_list. Example:

snakemake --profile config/profile.yaml --batch create_bam_list=1/3
snakemake --profile config/profile.yaml --batch create_bam_list=2/3
snakemake --profile config/profile.yaml --batch create_bam_list=3/3

After running the final batch (3/3), the rest of the pipeline will run.
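The invocations above follow an obvious pattern, so for larger batch counts it can help to generate them with a loop. A sketch (this only prints the commands; run them yourself, or pipe the output to bash, to actually execute the batches):

```shell
# Print the snakemake invocation for each of N batches of the
# aggregation rule (create_bam_list, per the pipeline above).
batch_cmds() {
  n="$1"
  for i in $(seq 1 "$n"); do
    echo "snakemake --profile config/profile.yaml --batch create_bam_list=${i}/${n}"
  done
}

batch_cmds 3
```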

Visualization

Currently, visualization (PCA plotting) must be done manually. We hope to integrate it into the pipeline soon.

Once Snakemake finishes, you should have results/pca/ containing population.cov and population.info files. We need to pass these to pcangsd_visualize.py to create a plot. This script requires the visualization conda env.

Example usage:

conda activate /WAVE/projects/whittalllab/conda_envs/visualization
python scripts/pcangsd_visualize.py snakemake/results/pca/

The PCA plot should be saved to the directory provided. You can then scp it onto your personal computer to view.
