Download_From_SRA
Obtain a list of run accession numbers from the SRA. This can be done by searching for the SRA or SRP number in the SRA database (or by searching for the PRJ number in the BioProject database). See the NCBI's documentation for more information about what all the accession numbers mean and how they're linked. For example, searching for the SRA accession number SRA045646 yields 145 metagenomic experiments.
Once you have the search results you want, you can collect the run IDs for the experiments:
- one run ID per line, by selecting "Send to" -> File and choosing "Accession List" for the format. This method works if each sample has only one associated run.
- a table of run IDs, with additional metadata. This may be required if you have samples which required multiple sequencing runs (e.g. the sequencing was split across multiple lanes). Some extra effort will be required to merge the resulting FASTQs in this case (not covered in this tutorial).
More information about this is here: https://www.ncbi.nlm.nih.gov/books/NBK158899/
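If you saved the table form, the run IDs can be pulled out of it to produce the one-per-line list used in the rest of this tutorial. The following is a minimal sketch, assuming the table was saved as comma-separated text with a header column literally named "Run"; the function name extract_runs and the file name SraRunInfo.csv are placeholders, not part of any NCBI tool.

```shell
# Hypothetical helper: print the "Run" column of a comma-separated metadata table.
# Assumes a header row with a column named "Run" and no quoted fields
# containing commas before that column.
extract_runs() {
    awk -F',' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        col && $col != "" { print $col }
    ' "$1"
}

# Example usage (file names are placeholders):
# extract_runs SraRunInfo.csv | sort -u > SraAccList.txt
```

If the header has no "Run" column, the function prints nothing, so it is worth checking that the output is not empty before continuing.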
This step is technically optional, since fastq-dump can download and dump FASTQs in one go, but it's a simple way to guard against network issues when trying to, for example, concatenate many runs belonging to a single sample. There are at least two ways to download the files.
NCBI's SRA Toolkit comes with a command named prefetch that takes a run accession as an argument and stores the run in a user folder (~/ncbi/public/sra/). To use prefetch to download all the files, wrap it in a shell script loop or use parallel:
parallel -j 1 prefetch {} ::: $(cat SraAccList.txt)
- The -j 1 option sets the number of jobs that parallel runs at once. Using 1 limits downloading to one file at a time (simultaneous downloads may be faster, depending on your computer and network).
- The ::: $(cat SraAccList.txt) part passes the contents of SraAccList.txt as arguments to the parallel command. This assumes that the SRR IDs are all in the file SraAccList.txt that was downloaded in Step 1.
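For the shell loop alternative mentioned above, a minimal sketch might look like the following. The function name prefetch_list and the log file name prefetch_failed.txt are made up for illustration. The only SRA Toolkit call used is prefetch with a single accession, as in the parallel command.

```shell
# Download every run listed in a file, one accession per line.
# Accessions that fail are appended to a log file so they can be retried later.
prefetch_list() {
    list=$1
    failed=${2:-prefetch_failed.txt}
    while IFS= read -r acc; do
        # Skip blank lines.
        [ -n "$acc" ] || continue
        # Redirect stdin so the command cannot consume the rest of the list.
        if ! prefetch "$acc" < /dev/null; then
            echo "$acc" >> "$failed"
        fi
    done < "$list"
}

# Example usage:
# prefetch_list SraAccList.txt
```

Recording failures rather than stopping at the first one means a single network hiccup does not interrupt a long download session.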
Download the files using wget. You can form the URL for each file like so (note that the three-letter prefix, and then the prefix plus the first three digits of the identifier, are used as subdirectories):
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra
Since we don't want to do that manually for each file, we can get parallel to help:
parallel -j 1 \
wget -P sra ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/{='$_=substr($_,0,3)'=}/{='$_=substr($_,0,6)'=}/{}/{}.sra \
::: $(cat SraAccList.txt)
- The -P sra option specifies that the downloaded files should be placed in the directory 'sra'.
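The substr expressions in the parallel command build the subdirectory names from each accession. The same construction can be sketched in plain shell, which may be easier to read or debug. The function name sra_url and the file name urls.txt are hypothetical; the URL pattern is copied from the example above and is assumed to still be served.

```shell
# Hypothetical helper: print the download URL for one run accession,
# following the directory pattern used in the wget example.
sra_url() {
    acc=$1
    prefix=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRR
    subdir=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR304
    printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$subdir" "$acc" "$acc"
}

# Example usage: write all URLs to a file, then let wget read them with -i.
# while IFS= read -r acc; do sra_url "$acc"; done < SraAccList.txt > urls.txt
# wget -P sra -i urls.txt
```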
Here, we use the SRA Toolkit's fastq-dump command. If you used prefetch above, or if you did not download the SRRs, the command will be:
parallel -j 1 fastq-dump --skip-technical -F --split-files -O fastq {} ::: $(cat SraAccList.txt)
- -j 1 specifies the number of jobs to run at once. Increase this number to process several files in parallel.
- -F specifies that the original IDs be used (instead of those changed by the SRA).
- --skip-technical skips technical reads; some sequencing technologies produce other reads besides forward and reverse.
- --split-files splits the output into separate files for forward and reverse reads.
- -O fastq specifies the directory in which to place the converted FASTQ files.
- --gzip can be added if you would like the FASTQ files to be gzipped (this saves space, but makes the conversion take much longer).
Otherwise, if you used wget, the command will be similar:
parallel -j 1 fastq-dump --skip-technical -F --split-files -O fastq {} ::: sra/*.sra
- ::: sra/*.sra passes the .sra files downloaded in step 2 to parallel as arguments for processing.
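After the conversion, it can be useful to confirm that every run produced its FASTQ files. The sketch below assumes the common paired-end case, where --split-files writes ACCESSION_1.fastq and ACCESSION_2.fastq into the output directory, and that --gzip was not used. The function name check_fastqs is hypothetical, and runs with a different read layout will need a different check.

```shell
# Hypothetical helper: report accessions whose paired FASTQ files are missing or empty.
# Assumes paired-end runs dumped with --split-files and without --gzip
# (change the suffix to .fastq.gz if --gzip was used).
check_fastqs() {
    list=$1
    dir=${2:-fastq}
    while IFS= read -r acc; do
        # Skip blank lines.
        [ -n "$acc" ] || continue
        for mate in 1 2; do
            # -s is true only if the file exists and is not empty.
            [ -s "$dir/${acc}_${mate}.fastq" ] || echo "$acc: missing ${acc}_${mate}.fastq"
        done
    done < "$list"
}

# Example usage:
# check_fastqs SraAccList.txt fastq
```

No output means every listed run has both files. Any accession that is reported can be re-run through fastq-dump individually.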
A nice explanation of other fastq-dump options is provided by Rob Edwards's group: https://edwards.sdsu.edu/research/fastq-dump/
- Please feel free to post a question on the Microbiome Helper Google group if you have any issues.
- General comments or inquiries about Microbiome Helper can be sent to [email protected].