In a previous post, I wrote about downloading SRA files from NCBI-SRA or EBI-ENA using the R package SRAdb. In this post, I will write about using SQL to query the SRA SQLite file, with the aim of giving the downloaded sequencing files meaningful titles.
Please follow the instructions below to establish a connection to the SRA SQLite file. For more details, please visit the aforementioned post.
The structure of the SRA SQLite database
The SRA SQLite file consists of multiple tables. Each table has a number of fields (or columns) and a number of records (or rows). Each field has a type which corresponds to the type of data it stores.
Let’s start by listing all the tables in the database.
As you see, we have 13 tables in the database. To explore one of the tables further (e.g. experiment), we can list the fields in this particular table.
In the experiment table, we have 42 fields (or columns). Next, we will explore what type of data each field contains.
Furthermore, we can list the descriptions of what is stored in each field of each table.
Using SQL to query the SRA SQLite database
We can use SQL to submit queries to the database. For example, to access the first ten records (or rows) of the experiment table:
To search for keywords, e.g. accessibility, within the field of study_description in the study table:
Notice that in SQL, we can use
%, which represents zero, one, or multiple characters and
_ which represents a single character. In the example above
%accessibl% means finds any value that has
accessibl in any position.
To get general statistics about the database, we can list all types of study:
Finally, we can list all types of library strategies:
Renaming downloaded sequence files
As in the previous post, I picked a random SRA submission (e.g. SRP042080). One nice feature in SRAdb is the ability to list all SRA submission objects associated with a specific object.
We will build the query to get the run titles as follow:
- Join the experiment and run tables using the experiment_accession field
- Join them with the fastq table using run_accession (optional)
- Filter the experiment_accession of interest.
- Order by library_name
It is time to use our regex skills to generate the rename table.
- look into the exp_title column
- match any character from the start of the line including
:and the space after it
- match any character from the end of the line till last
;and including all
- capture the characters in between using
- store the results in a new column run_title
- look into the fasp column
- use first capturing group
- match any characters from the start of the line till the last
/in the path.
- use second capturing group
- capture the rest of the full path, which is the file name and the extension
- return only the second capturing group using
- store the results in a new column fasp_file
Then we will use str_replace which will, for each row, search in the new fasp_file column to match the string in run and replace it with the string from run_title. The results will be a new column called file_name.
Now that we have the rename_table, we can simply loop through the rows to create a symbolic link named file_name to the fasp_file.