Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename output.gd file with the name of the fastq file or part of it #275

Open
loukesio opened this issue Jul 23, 2021 · 3 comments
Open

Comments

@loukesio
Copy link

Dear all,

Does anybody have an idea on how to rename hundreds output.gd files with the name of the fastq file they are coming from?
So far I am doing it manually.

Thank you for your time

@danieldeatherage
Copy link
Contributor

danieldeatherage commented Jul 23, 2021

I often use bash for loops to rename the 'output.gd' files based on what I set the output fold as when I started the run. So for the breseq commands:
breseq -r Reference.gbk -o Output_folder/Sample1 fastq1.R1.fastq fastq1.R2.fastq
breseq -r Reference.gbk -o Output_folder/Sample2 fastq2.R1.fastq fastq2.R2.fastq

After the breseq runs are complete, i would execute the following command inside the Output_folder:
for sample in *; do cp $sample/output/output.gd $sample.gd; done

I actually tend to get more elaborate putting this as part of a small bash script i have named "genome_diff_collection.sh"

#! /bin/bash

for sample in *; do cp $sample/output/output.gd $sample.gd; cp $sample/output/evidence/annotated.gd $sample.annotated.gd; done
mkdir all_gds_annotated
mv *.annotated.gd all_gds_annotated
mkdir all_gds
mv *.gd all_gds

If you aren't using unique output directories at the breseq command line, I have some ideas of how you could use grep to capture the "#=READSEQ" lines and edit them in a text editor or excel to give you a list of copy commands that you would paste onto the prompt to rename everything for you.

@loukesio
Copy link
Author

Thank you Daniel for sharing your code. I think thats the way to go

@jeffreybarrick
Copy link
Contributor

To add a couple of other ways of accomplishing workflows that involve many samples to the thread...

  1. Use gdtools utility commands with GenomeDiff files describing the input/output

Start by creating GenomeDiff files with #=TITLE, #=READSEQ and #=REFSEQ information (and optionally #=POPULATION, #=TREATMENT, and #=TIME ). For example,....

#=GENOME_DIFF 1.0
#=TITLE	Ara-1_2000gen_1164B
#=REFSEQ	https://raw.githubusercontent.com/barricklab/LTEE/7da91974eafac0c5a8f903ae57275795d4395af2/reference/REL606.gbk
#=READSEQ	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR051/ERR051713/ERR051713.fastq.gz
#=TIME	2000
#=POPULATION	Ara-1
#=TREATMENT	LTEE
#=CLONE	B

Then, name these like sample1.gd, 'sample2.gd, etc. and put them in a folder named 01_Datain your directory you will use for running _breseq_. Plop all of the read and reference sequences into a folder called02_Downloadsin the same directory. (If you use URLs in the location of these files as in the example, you can usegdtools DOWNLOAD` to retrieve all of the files to that directory. But, it is also OK to copy them there yourself. As long as the filenames at the end of the paths match what is in the GD file at the end of the path, it will detect and us them fine.)

Now, you can use gdtools RUNFILE to generate a file that has breseq commands with the READ and REFERENCE files set correctly for each sample. Use the option --options <your_breseq_options> to change the options that are added to the breseq commands. You can run this script, or farm it out to a cluster.

After the output is present for all samples, commands like this should now work for comparing everything, because each output.gd file inherits the metadata (title) from the original GD file you made:

gdtools COMPARE -r 02_Downloads/<your_references> 03_Output/*/output/output.gd

With this method, you still have to create these initial GenomeDiff files that describe each sample. I think @danieldeatherage has a script for creating these from a folder of FASTQ and some metadata.

  1. Use the utility script batch_run.pl to automate commands being run on sets of directories

This is a script I created for the common situation of needing to do something once for each directory inside of the current directory or once for each file inside of the current directory. It accomplishes what @danieldeatherage's bash script does, but with some other functionality (by invoking a Perl script).

Download this and put it in your $PATH
https://github.com/barricklab/barricklab/blob/master/batch_run.pl

To get the full help, use:

batch_run.pl --man

I always recommend running things in test mode first to print what commands it will run if you are not in test mode first (so you don't do something crazy by mistake).

batch_run.pl  -t  ...

Now, if you are in a folder that contains many folders of breseq output for different samples named sample1, sample2, sample3, etc., then you can use this command to copy all output.gd files and rename them to sample1.gd, sample2.gd, sample3.gd, etc.

mkdir gd
batch_run.pl "cp output/output.gd ../gd"

What's happening is the script enters each directory and runs this command, replacing #d with the name of the directory. The ../gd copies back up a level to the gd directory we made outside of directory it is in.

You can use the -0 flag to operate on all files in the current directory (instead of on all directories). In this case you can use #e in the command to substitute the name of the file with the extension removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants