The Sequence Read Archive (SRA) is a bioinformatics database that hosts short-read DNA sequences generated by high-throughput sequencing. Researchers make the sequences publicly available as part of the publication process. SRA is a collaboration between three archives: NCBI SRA, EBI ENA and DDBJ DRA.

SRA organises sequences in a hierarchical structure. There are six types of SRA submission objects:

  • SRA = Submission; a virtual container for the other objects (below).
  • SRP = Study/Project; describes the project’s metadata.
  • SRS = Sample; describes the sample that was sequenced.
  • SRX = Experiment; describes the library, platform, and processing parameters.
  • SRR = Run; contains the sequencing files.
  • SRZ = Analysis; contains BAM files.
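
Since the three-letter prefix tells you which kind of object an accession refers to, a small lookup can be handy when handling mixed lists of accessions. The sketch below is only an illustration: `sra_object_type()` is a hypothetical helper, not an SRAdb function, and it only covers the NCBI-style prefixes listed above.

```r
# Prefix-to-object lookup, taken from the list above.
sra_types <- c(SRA = 'submission', SRP = 'study', SRS = 'sample',
               SRX = 'experiment', SRR = 'run', SRZ = 'analysis')

# Hypothetical helper (not part of SRAdb): returns the object type for each
# accession, or NA when the prefix is not in the lookup.
sra_object_type <- function(acc) {
  prefix <- toupper(substr(acc, 1, 3))
  unname(sra_types[prefix])
}

sra_object_type(c('SRP042080', 'SRR000001'))
```

Unknown prefixes come back as `NA`, which makes them easy to filter out before querying.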

Why SRAdb?

  • It lets you interact with the SRA database from R, including:
    • Querying SRA data and metadata
    • Checking the availability and size of sequence files
    • Downloading files in bulk using the ftp or fasp protocol

Let’s get started.

Installing and configuring SRAdb

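# install the required libraries (assumes BiocManager is already installed)
BiocManager::install(c('SRAdb', 'Rgraphviz'))
# install extra libraries (dplyr provides %>% and select(), used below)
install.packages('dplyr')
# load the required libraries (Rgraphviz provides getDefaultAttrs(), used below)
library(SRAdb)
library(Rgraphviz)
# load extra libraries
library(dplyr)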

SRAdb relies on an SQLite file which is updated regularly to reflect changes in the SRA database.

I think it is not efficient to rely on a huge SQLite file which needs to be downloaded locally with every update. I hope to see a successful implementation of an application programming interface (API) in future releases.

# download the SRA SQLite database, only if it does not exist locally
sqlfile <- path.expand('~/SRAmetadb.sqlite')
if( ! file.exists(sqlfile) ) {
   # getSRAdbFile() saves to the working directory unless destdir is set
   sqlfile <- getSRAdbFile(destdir = path.expand('~'))
}
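
Because the SQLite file is updated regularly, checking only whether it exists means a local copy can silently go out of date. One possible approach, sketched here with base R only, is to also look at the age of the file. `sqlite_is_stale()` and the 30-day threshold are my own assumptions, not something SRAdb provides.

```r
# Hypothetical helper (not part of SRAdb): TRUE if the file is missing or
# was last modified more than max_age_days ago.
sqlite_is_stale <- function(path, max_age_days = 30) {
  if (!file.exists(path)) return(TRUE)
  age_days <- as.numeric(difftime(Sys.time(), file.info(path)$mtime,
                                  units = 'days'))
  age_days > max_age_days
}

# TRUE means it may be worth downloading the file again
sqlite_is_stale('~/SRAmetadb.sqlite')
```

Given the size of the download, a fairly generous threshold is probably sensible.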

Once the SQLite file is downloaded (~2.3GB) and extracted (~37GB), it is time to establish a connection from SRAdb to the SQLite database.

# connect to the database file
sra_con = dbConnect(RSQLite::SQLite(), sqlfile)

Exploring SRA submissions

In order to focus on the subject of this post (i.e. downloading SRA files), I will dive directly into this functionality and I will write about using SQL to query the SRA database in another post.

In the example below, I will use an arbitrary SRA study (SRP042080). Let’s start by exploring the hierarchy of this study, which the code below draws as a graph.

# draw a map of the study hierarchy
sraGraph('SRP042080', sra_con) -> sra_graph
getDefaultAttrs(list(node=list(fillcolor='lightblue', shape='ellipse'))) -> attrs
plot(sra_graph, attrs=attrs)

Next, let’s list the SRA files available in the study.

# list available SRA files in the study
listSRAfile('SRP042080', sra_con) -> sra_project

We are interested in the run objects, which contain the sequence files.

# list available runs in the study
sra_project$run -> sra_runs

Before we download the files, let’s check their sizes.

# check the size of the runs files
getSRAinfo(sra_runs, sra_con, sraType = 'sra') -> sra_fileSize
sra_fileSize %>%
   select(run, study, sample, experiment, `size(KB)`, date, ftp)
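
To judge whether there is enough disk space before downloading, it helps to add up the reported sizes. The sketch below assumes the sizes are in a column named `size(KB)`, as in the `select()` call above, and uses a small placeholder data frame with made-up values so that it runs on its own. `total_gb()` is a hypothetical helper.

```r
# Placeholder data standing in for sra_fileSize; the values are made up.
sizes <- data.frame(run = c('run_a', 'run_b'),
                    `size(KB)` = c('1048576', '524288'),
                    check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: total size in GB, whether the column is stored as
# text or as numbers.
total_gb <- function(df) {
  sum(as.numeric(df[['size(KB)']]), na.rm = TRUE) / 1024^2
}

total_gb(sizes)  # 1.5
```

With the real table, `total_gb(sra_fileSize)` should give the total for the study, provided the column name matches.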

Installing and configuring Aspera Connect

Now that we have a list of the runs (i.e. sequencing files), we can download them using either ftp or fasp. The latter is recommended over ftp because it is faster, is optimised for bulk data transfer across continents, and is not affected by network delay or packet loss. Let’s download and configure Aspera Connect.

# download Aspera Connect from
cd ~/Downloads
tar -xzvf ibm-aspera-connect-

# install it
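
Once Aspera Connect is installed, a quick check from R can catch path problems before a long download starts. This is a minimal sketch using base R only. `check_aspera()` is a hypothetical helper, and it assumes that `ascp` is on the PATH and that the key file sits at the location used in the download commands later in this post.

```r
# Hypothetical helper: checks that the ascp binary can be found on the PATH
# and that the private key file exists.
check_aspera <- function(ascp = 'ascp',
                         key = '~/.aspera/connect/etc/asperaweb_id_dsa.openssh') {
  c(ascp_on_path = nzchar(unname(Sys.which(ascp))),
    key_exists   = unname(file.exists(key)))
}

check_aspera()
```

If either value is FALSE, adjust the PATH or pass the full locations to the helper.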

Downloading sequence files

SRAdb can download sequencing files from NCBI or EBI, using either the FTP or the FASP protocol. In the following steps, I will be using the FASP protocol.

Downloading SRA files

SRAdb downloads SRA files from NCBI.

# download SRA files via fasp
'~/Downloads' -> destDir
"ascp -T -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh" -> ascpCMD
getSRAfile(sra_runs, sra_con, destDir = destDir, fileType ='sra', srcType = "fasp", ascpCMD = ascpCMD )


  • -T : disable encryption for maximum throughput
  • -l : set the target transfer rate in Kbps (the m suffix in 300m gives the rate in Mbps)
  • -i : specify the private key file
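
If you switch between machines or transfer rates, it can be tidier to assemble the command string from its parts rather than editing it by hand. The sketch below simply pastes together the three flags described above. `build_ascp_cmd()` is a hypothetical helper, not an SRAdb function.

```r
# Hypothetical helper: builds the ascp command prefix from the flags
# described above (-T, -l and -i).
build_ascp_cmd <- function(key, rate = '300m', ascp = 'ascp') {
  paste(ascp, '-T', '-l', rate, '-i', key)
}

ascpCMD <- build_ascp_cmd('~/.aspera/connect/etc/asperaweb_id_dsa.openssh')
ascpCMD
```

The resulting string is the same as the one written out by hand above and can be passed as the `ascpCMD` argument.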

Downloading FASTQ files

In addition, SRAdb can download FASTQ files from the EBI-ENA network. This route is recommended, as it saves the time required to convert SRA files to FASTQ.

# download FASTQ files via fasp
'~/Downloads' -> destDir
"ascp -T -l 300m -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh" -> ascpCMD
getFASTQfile(sra_runs, sra_con, destDir = destDir, srcType = "fasp", ascpCMD = ascpCMD )


  • -P : TCP port used for SSH authentication

It is possible to export a list of links to the files for download using an external download manager.

# to generate links for external download manager - NCBI
'~/Downloads' -> destDir
getSRAfile(sra_runs, sra_con, destDir = destDir, fileType = 'sra', srcType = 'fasp') -> NCBI_cmd
# to generate links for external download manager - EBI
'~/Downloads' -> destDir
getFASTQfile(sra_runs, sra_con, destDir = destDir, srcType = "fasp") -> EBI_cmd

Finally, it is recommended to close the connection to the SQLite file once done.

# disconnect from the database
dbDisconnect(sra_con)


During the process, I faced the following error:

Session Stop (Error: Failed to open TCP connection for SSH)
ascp: Failed to open TCP connection for SSH, exiting.

The solution was to define the TCP port used for SSH authentication with the -P parameter (e.g. -P33001, as in the FASTQ download command above).
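
To avoid forgetting the port, the command string can be patched before it is passed on. This is a small sketch under my own assumptions: `with_port()` is a hypothetical helper, and 33001 is simply the port used in the FASTQ command earlier in this post.

```r
# Hypothetical helper: appends -P<port> to an ascp command string unless a
# -P option is already present.
with_port <- function(cmd, port = 33001) {
  if (grepl('(^|\\s)-P\\s*[0-9]+', cmd, perl = TRUE)) return(cmd)
  paste0(cmd, ' -P', port)
}

with_port('ascp -T -l 300m -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh')
```

A command that already sets a port is returned unchanged.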