The Sequence Read Archive (SRA) is a bioinformatics database that hosts short-read DNA sequences generated by high-throughput sequencing. Researchers make the sequences publicly available as part of the publication process. The SRA is a collaboration between three institutes, each hosting its own archive: NCBI (SRA), EBI (ENA) and DDBJ (DRA).
SRA organises sequences in a hierarchical structure. There are six types of SRA submission objects:
- SRA = Submission; a virtual container for the other objects (below).
- SRP = Study/Project; describes the project’s metadata.
- SRS = Sample; describes the sample which was sequenced.
- SRX = Experiment; describes the library, platform, and processing parameters.
- SRR = Run; contains the sequencing files.
- SRZ = Analysis; contains BAM files.
SRAdb is a Bioconductor package for interacting with the SRA database from R. It can:
- Query SRA data and metadata
- Check for availability and size of sequence files
- Download files in bulk using the FTP or FASP protocol
Let’s get started.
Installing and configuring SRAdb
SRAdb relies on an SQLite file which is updated regularly to reflect changes in the SRA database.
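The setup step can be sketched as follows. This is a minimal sketch, assuming SRAdb is installed through BiocManager and that getSRAdbFile() leaves the extracted file as SRAmetadb.sqlite in the target directory. The helper names install_sradb and fetch_sra_db are hypothetical.

```r
# Install SRAdb from Bioconductor if it is missing.
# Assumption: BiocManager is the installer in use on this R version.
install_sradb <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  if (!requireNamespace("SRAdb", quietly = TRUE)) {
    BiocManager::install("SRAdb")
  }
}

# Hypothetical helper: download the metadata file only when it is not
# already present in destdir, and return its path either way.
fetch_sra_db <- function(destdir = getwd()) {
  sqlfile <- file.path(destdir, "SRAmetadb.sqlite")
  if (!file.exists(sqlfile)) {
    # getSRAdbFile() downloads the compressed file and extracts it.
    sqlfile <- SRAdb::getSRAdbFile(destdir = destdir)
  }
  sqlfile
}

# Usage (large download, so it is left commented out):
# install_sradb()
# sqlfile <- fetch_sra_db()
```

Checking for an existing file first avoids repeating the large download on every run; deleting the file forces a refresh.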
I do not think it is efficient to rely on a huge SQLite file that has to be downloaded again with every update. I hope to see a successful implementation of an application programming interface (API) in future releases.
Once the SQLite file is downloaded (~2.3GB) and extracted (~37GB), it is time to establish a connection from SRAdb to the SQLite database.
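A connection could be opened along these lines. This is a sketch that assumes the DBI and RSQLite packages (both used by SRAdb) are installed; connect_sra is a hypothetical wrapper, and the file name in the usage comment is the default name assumed for the extracted file.

```r
# Hypothetical wrapper: fail early with a clear error if the SQLite file
# is missing, otherwise return a DBI connection to it.
connect_sra <- function(sqlfile) {
  stopifnot(file.exists(sqlfile))
  DBI::dbConnect(RSQLite::SQLite(), sqlfile)
}

# Usage:
# sra_con <- connect_sra("SRAmetadb.sqlite")
# DBI::dbListTables(sra_con)  # lists the tables available for querying
```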
Exploring SRA submissions
In order to focus on the subject of this post (i.e. downloading SRA files), I will dive directly into this functionality and I will write about using SQL to query the SRA database in another post.
In the example below, I will use an arbitrarily chosen SRA study (SRP042080). Let’s start by exploring the hierarchy of this study (Figure @ref(fig:SRA-hierarchy)).
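One way to explore the hierarchy is to expand the study accession into every related accession with sraConvert() and, optionally, draw the result as a graph. This is a sketch rather than the exact code behind the figure; study_hierarchy is a hypothetical helper, sra_con is assumed to be an open connection to the SQLite file, and plotting assumes the Rgraphviz package is installed.

```r
# Hypothetical helper: map a study accession to its submission, samples,
# experiments and runs. Returns a data frame with one column per type.
study_hierarchy <- function(study, sra_con) {
  SRAdb::sraConvert(
    in_acc   = study,
    out_type = c("submission", "study", "sample", "experiment", "run"),
    sra_con  = sra_con
  )
}

# Usage (requires the SQLite file and an open connection):
# acc <- study_hierarchy("SRP042080", sra_con)
# head(acc)
# library(Rgraphviz)
# g <- SRAdb::entityGraph(acc)  # graph object linking the accessions
# plot(g)
```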
Next, let’s list the samples available in the study.
We are interested in the run objects, which contain the sequence files.
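The sample and run listings can both be obtained with sraConvert(). The sketch below wraps it in a hypothetical helper, study_accessions, and assumes sra_con is an open connection to the SQLite file.

```r
# Hypothetical helper: return the unique accessions of one type
# (sample, experiment or run) that belong to a study.
study_accessions <- function(study,
                             type = c("sample", "experiment", "run"),
                             sra_con) {
  type <- match.arg(type)
  conv <- SRAdb::sraConvert(in_acc = study, out_type = type, sra_con = sra_con)
  # sraConvert() returns a data frame with a column named after each out_type.
  sort(unique(conv[[type]]))
}

# Usage:
# samples <- study_accessions("SRP042080", "sample", sra_con)
# runs    <- study_accessions("SRP042080", "run", sra_con)
```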
Before we download the files, let’s check their sizes.
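File sizes can be checked with getSRAinfo(), which looks the files up on the NCBI server. This is a sketch: run_sizes is a hypothetical wrapper, sra_con and runs are assumed to come from the earlier steps, and the call needs network access and may depend on the current layout of the NCBI FTP site.

```r
# Hypothetical wrapper: fetch address, size and date for each run's SRA file.
run_sizes <- function(runs, sra_con) {
  SRAdb::getSRAinfo(in_acc = runs, sra_con = sra_con, sraType = "sra")
}

# Usage:
# info <- run_sizes(runs, sra_con)
# info  # one row per file, including its size and modification date
```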
Installing and configuring Aspera Connect
Now that we have a list of the runs (i.e. sequencing files), we can download them using either FTP or FASP. FASP is recommended because it is faster, is optimised for bulk transfers across continents, and is far less sensitive to network delay and packet loss. Let’s download and configure Aspera Connect.
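After installing Aspera Connect, it is worth confirming from R that the ascp binary and its key file can be found. The sketch below assumes a per-user Linux install, where the key usually lives under ~/.aspera/connect/etc; the path is an assumption and should be adjusted to the local installation.

```r
# Look for the ascp binary on the PATH; the result is "" when it is not found.
ascp_bin <- unname(Sys.which("ascp"))

# Assumed key location for a per-user Linux install of Aspera Connect.
key_file <- path.expand("~/.aspera/connect/etc/asperaweb_id_dsa.openssh")

# TRUE only when both the binary and the key file are available.
aspera_ready <- nzchar(ascp_bin) && file.exists(key_file)
if (!aspera_ready) {
  message("ascp or its key file was not found; check the Aspera Connect install")
}
```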
Downloading sequence files
SRAdb can download sequencing files from NCBI or EBI, using either the FTP or the FASP protocol. In the following steps, I will be using the FASP protocol.
Downloading SRA files
SRAdb downloads SRA files from NCBI. The ascp command that performs the transfer takes the following options:
-T: disable encryption for maximum throughput
-l: set the target transfer rate in Kbps
-i: specify the private key file
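Putting these options together, a FASP download of SRA files might look like the sketch below. build_ascp_cmd and download_sra are hypothetical helpers, the 300m rate is an arbitrary illustration, the key path in the usage comment is an assumption, and runs and sra_con are assumed to come from the earlier steps.

```r
# Hypothetical helper: assemble the ascp command string that SRAdb expects.
# Only the three options described above are used.
build_ascp_cmd <- function(ascp_bin, key_file, rate = "300m") {
  paste(ascp_bin, "-T", "-l", rate, "-i", key_file)
}

# Hypothetical wrapper: download the SRA files for a set of runs over FASP.
download_sra <- function(runs, sra_con, ascp_cmd, dest_dir = getwd()) {
  SRAdb::getSRAfile(
    in_acc   = runs,
    sra_con  = sra_con,
    destDir  = dest_dir,
    fileType = "sra",
    srcType  = "fasp",
    ascpCMD  = ascp_cmd
  )
}

# Usage (the key path is an assumption about the local install):
# ascp_cmd <- build_ascp_cmd("ascp", "~/.aspera/connect/etc/asperaweb_id_dsa.openssh")
# download_sra(runs, sra_con, ascp_cmd)
```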
Downloading FASTQ files
In addition, SRAdb can download FASTQ files from the EBI ENA. This route is recommended because it saves the time otherwise needed to convert SRA files to FASTQ.
-P: TCP port used for SSH authentication
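A FASTQ download over FASP can be sketched in the same way. add_ssh_port and download_fastq are hypothetical helpers; the default port 33001 is taken from the ascp examples in ENA's documentation and is an assumption here, and runs, sra_con and the base ascp command are assumed to come from the earlier steps.

```r
# Hypothetical helper: append the -P option to an existing ascp command.
# Assumption: 33001 is the SSH port used by the EBI Aspera server.
add_ssh_port <- function(ascp_cmd, port = 33001) {
  paste(ascp_cmd, "-P", port)
}

# Hypothetical wrapper: download FASTQ files for a set of runs from EBI ENA.
download_fastq <- function(runs, sra_con, ascp_cmd, dest_dir = getwd()) {
  SRAdb::getFASTQfile(
    in_acc  = runs,
    sra_con = sra_con,
    destDir = dest_dir,
    srcType = "fasp",
    ascpCMD = ascp_cmd
  )
}

# Usage:
# ascp_cmd <- add_ssh_port("ascp -T -l 300m -i /path/to/asperaweb_id_dsa.openssh")
# download_fastq(runs, sra_con, ascp_cmd)
```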
Saving download links
It is also possible to export a list of file links, which can then be passed to an external download manager.
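Exporting the links can be sketched as below. export_links is a hypothetical helper that writes one link per line; it assumes the data frame returned by listSRAfile() stores the addresses in a column named after the source type ("ftp" here), which is why the column name is left as a parameter.

```r
# Hypothetical helper: write the unique, non-missing links from one column
# of a file listing to a plain-text file, one link per line.
export_links <- function(files, path, column = "ftp") {
  stopifnot(column %in% names(files))
  links <- files[[column]]
  links <- unique(links[!is.na(links)])
  writeLines(links, path)
  invisible(links)
}

# Usage:
# files <- SRAdb::listSRAfile("SRP042080", sra_con, fileType = "sra", srcType = "ftp")
# export_links(files, "SRP042080_links.txt")
```

A plain-text file with one URL per line is the format most download managers accept as a batch input.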
Finally, it is recommended to close the connection to the SQLite file once done.
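Closing the connection is a single dbDisconnect() call. The sketch below goes one step further with a hypothetical helper, with_sra, that opens the connection, runs a function against it and always disconnects afterwards, even if that function fails. It assumes the DBI and RSQLite packages are installed.

```r
# Hypothetical helper: run f(con) against the SQLite file and make sure
# the connection is closed when the function exits.
with_sra <- function(sqlfile, f) {
  con <- DBI::dbConnect(RSQLite::SQLite(), sqlfile)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  f(con)
}

# Usage:
# tables <- with_sra("SRAmetadb.sqlite", DBI::dbListTables)
#
# Or, with a connection opened manually earlier:
# DBI::dbDisconnect(sra_con)
```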
During the process, I faced the following error:
The solution was to define the TCP port used for SSH authentication using the -P option of ascp.