
Downloading SRA/FASTQ files using SRAdb

Introduction

The Sequence Read Archive (SRA) is a bioinformatics database that hosts short-read DNA sequences generated by high-throughput sequencing. Researchers make the sequences publicly available as part of the publication process. SRA is a collaboration between three institutes, each running its own archive: NCBI (SRA), EBI (ENA) and DDBJ (DRA).

SRA organises sequences in a hierarchical structure. There are six types of SRA submission objects:

  • SRA = Submission; a virtual container for the other objects (below).
  • SRP = Study/Project; describes the project’s metadata.
  • SRS = Sample; describes the sample that was sequenced.
  • SRX = Experiment; describes the library, platform, and processing parameters.
  • SRR = Run; contains the sequencing files.
  • SRZ = Analysis; contains BAM files.
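
Since the object type is encoded in the first three letters of an accession, it can be handy to classify accessions programmatically. The following is a minimal sketch in base R; `sra_object_type()` is a hypothetical helper written for this post, not an SRAdb function, and it only knows the six prefixes listed above (accessions issued by ENA or DDBJ use other prefixes and would come back as NA here).

```r
# Hypothetical helper (not part of SRAdb): map an accession prefix to its object type
sra_object_type <- function(accessions) {
  # prefix-to-type table taken from the list above
  types <- c(SRA = "submission", SRP = "study", SRS = "sample",
             SRX = "experiment", SRR = "run", SRZ = "analysis")
  prefixes <- toupper(substr(accessions, 1, 3))
  # unknown prefixes yield NA rather than an error
  unname(types[prefixes])
}

sra_object_type(c("SRP042080", "SRR1291260"))
```

A lookup table keeps the mapping readable and makes it easy to extend with further prefixes if needed.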

Why SRAdb?

  • SRAdb lets you interact with the SRA database from R, including:
    • Querying SRA data and metadata
    • Checking the availability and size of sequence files
    • Downloading files in bulk using the FTP or FASP protocol

Let’s get started.

Installing and configuring SRAdb

# install the required libraries from Bioconductor
# (biocLite() has been retired; BiocManager is its replacement)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("SRAdb", "Rgraphviz"))
# install extra libraries
install.packages("dplyr")
# load the required libraries
library(SRAdb)
library(Rgraphviz)
# load extra libraries
library(dplyr)

SRAdb relies on an SQLite file which is updated regularly to reflect changes in the SRA database.

I find it inefficient to rely on a huge SQLite file that has to be downloaded locally again with every update. I hope to see a successful implementation of an application programming interface (API) in future releases.

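# download the SRA SQLite database only if it does not already exist locally
sqlfile <- path.expand('~/SRAmetadb.sqlite')
if (!file.exists(sqlfile)) {
  # getSRAdbFile() saves to the working directory by default,
  # so point it at the home directory that is checked above
  sqlfile <- getSRAdbFile(destdir = path.expand('~'))
}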

Once the SQLite file is downloaded (~2.3GB) and extracted (~37GB), it is time to establish a connection from SRAdb to the SQLite database.

# connect to the database file
sra_con = dbConnect(RSQLite::SQLite(), sqlfile)

Exploring SRA submissions

In order to focus on the subject of this post (i.e. downloading SRA files), I will dive directly into this functionality and I will write about using SQL to query the SRA database in another post.

In the example below, I will use an arbitrary SRA study (SRP042080). Let’s start by exploring the hierarchy of this study (Figure 1).

# draw a map of the study hierarchy
sraGraph('SRP042080', sra_con) -> sra_graph
getDefaultAttrs(list(node=list(fillcolor='lightblue', shape='ellipse'))) -> attrs
plot(sra_graph, attrs=attrs)

Figure 1: The hierarchy of SRA submission objects for SRP042080

Next, let’s list the SRA files available in the study.

# list available SRA files in the study
listSRAfile('SRP042080', sra_con) -> sra_project
sra_project

We are interested in the run objects, which contain the sequence files.

# list available runs in the study
sra_project$run -> sra_runs
sra_runs
## [1] "SRR1291260" "SRR1291261"

Before we download the files, let’s check their sizes.

# check the size of the runs files
getSRAinfo(sra_runs, sra_con, sraType = 'sra') -> sra_fileSize
sra_fileSize %>% 
   select(run, study, sample, experiment, `size(KB)`, date, ftp)
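
Before starting a bulk download it can also help to add up the sizes and compare the total with the free disk space. Here is a minimal sketch in base R; `total_size_gb()` is a hypothetical helper, and `mock_info` is a placeholder table with invented sizes that only mimics the `run` and `size(KB)` columns selected above.

```r
# Placeholder table mimicking the `run` and `size(KB)` columns; the sizes are invented
mock_info <- data.frame(
  run = c("SRR1291260", "SRR1291261"),
  `size(KB)` = c("1500000", "2500000"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: total size of all listed files in GB
total_size_gb <- function(info) {
  # as.numeric() covers the case where sizes are stored as text
  sum(as.numeric(info[["size(KB)"]]), na.rm = TRUE) / 1024^2
}

total_size_gb(mock_info)
```

With a live connection, the same helper could be called as `total_size_gb(sra_fileSize)`, assuming the size column is named as in the `select()` call above.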

Installing and configuring Aspera connect

Now that we have a list of the runs (i.e. sequencing files), we can download them using either FTP or FASP. FASP is recommended over FTP because it is usually faster, is optimised for bulk transfers across continents, and is far less sensitive to network delay and packet loss. Let’s download and configure Aspera Connect.

# download Aspera Connect from https://downloads.asperasoft.com/connect2/
cd ~/Downloads
wget https://d3gcli72yxqn2z.cloudfront.net/connect/bin/ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
tar -xzvf ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz

# install it
./ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
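
After installation, it may be worth confirming from R that the Aspera files are where the download commands in this post expect them. The sketch below assumes the default per-user install directory `~/.aspera/connect`; `check_aspera()` is a hypothetical helper, not part of SRAdb or Aspera.

```r
# Hypothetical helper: report whether the ascp binary and the key file exist
check_aspera <- function(connect_dir = "~/.aspera/connect") {
  paths <- c(
    ascp = file.path(connect_dir, "bin", "ascp"),
    key  = file.path(connect_dir, "etc", "asperaweb_id_dsa.openssh")
  )
  # expand "~" explicitly, then keep the names so the result is easy to read
  setNames(file.exists(path.expand(paths)), names(paths))
}

check_aspera()
```

If either entry is FALSE, adjust `connect_dir` to match your installation before attempting a FASP download.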

Downloading sequence files

SRAdb can download sequencing files from NCBI or EBI, using either the FTP or the FASP protocol. In the following steps, I will be using the FASP protocol.

Downloading SRA files

SRAdb downloads SRA files from NCBI.

# download SRA files via fasp 
'~/Downloads' -> destDir
"ascp -T -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh" -> ascpCMD
getSRAfile(sra_runs, sra_con, destDir = destDir, fileType ='sra', srcType = "fasp", ascpCMD = ascpCMD )

Where:

  • -T : disable encryption for maximum throughput
  • -l : set the target transfer rate (Kbps by default; 300m means 300 Mbps)
  • -i : to specify a private key file

Downloading FASTQ files

In addition, SRAdb can download FASTQ files directly from EBI ENA. This route is recommended because it saves the time otherwise needed to convert SRA files to FASTQ.

# download FASTQ files via fasp 
'~/Downloads' -> destDir
"ascp -T -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh" -> ascpCMD
getFASTQfile(sra_runs, sra_con, destDir = destDir, srcType = "fasp", ascpCMD = ascpCMD )

Where:

  • -P : TCP port used for SSH authentication
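
The two `ascpCMD` strings above differ only in the `-P` option, so they could be assembled from one place. This is a minimal sketch in base R; `build_ascp_cmd()` is a hypothetical helper, and its defaults simply repeat the values used in this post.

```r
# Hypothetical helper: assemble the ascpCMD string from its parts
build_ascp_cmd <- function(rate = "300m",
                           key = "~/.aspera/connect/etc/asperaweb_id_dsa.openssh",
                           port = NULL) {
  parts <- c("ascp", "-T", "-l", rate)
  # add the SSH port only when one is given
  if (!is.null(port)) {
    parts <- c(parts, paste0("-P", port))
  }
  paste(c(parts, "-i", key), collapse = " ")
}

build_ascp_cmd()              # form used for the SRA download
build_ascp_cmd(port = 33001)  # form used for the FASTQ download
```

The resulting string can be passed as the `ascpCMD` argument of `getSRAfile()` or `getFASTQfile()`, which keeps the transfer rate and key path consistent between the two calls.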