Skip to Content

Renaming SRA/FASTQ files using SRAdb

Introduction

In a previous post, I wrote about downloading SRA files from NCBI-SRA or EBI-ENA using the R package SRAdb. In this post, I will write about using SQL to query the SRA SQLite file, with the aim of giving the downloaded sequencing files meaningful titles.

Please follow the instructions below to establish a connection to the SRA SQLite file. For more details, please visit the aforementioned post.

# load required libraries
library(SRAdb)
library(Rgraphviz)
# load extra libraries
library("dplyr")
library("stringr")
# download the SRA SQL database, only if does not exist locally
# Warning: the download size is ~2.3GB and the extracted size is ~37 GB! 
if( ! file.exists('~/SRAmetadb.sqlite') ) {
   sqlfile <- getSRAdbFile()
} else {
  sqlfile <- '~/SRAmetadb.sqlite'
}
# connect to the SRA SQLite database
sra_con = dbConnect(RSQLite::SQLite(), sqlfile)

The structure of the SRA SQLite database

The SRA SQLite file consists of multiple tables. Each table has a number of fields (or columns) and a number of records (or rows). Each field has a type which corresponds to the type of data it stores.

Let’s start by listing all the tables in the database.

# list the tables in a database
dbListTables(sra_con) -> sra_tables
sra_tables
##  [1] "col_desc"        "experiment"      "fastq"          
##  [4] "metaInfo"        "run"             "sample"         
##  [7] "sra"             "sra_ft"          "sra_ft_content" 
## [10] "sra_ft_segdir"   "sra_ft_segments" "study"          
## [13] "submission"

As you see, we have 13 tables in the database. To explore one of the tables further (e.g. experiment), we can list the fields in this particular table.

# show the fields in a table
dbListFields(sra_con,"experiment")
##  [1] "experiment_ID"                 "bamFile"                      
##  [3] "fastqFTP"                      "experiment_alias"             
##  [5] "experiment_accession"          "broker_name"                  
##  [7] "center_name"                   "title"                        
##  [9] "study_name"                    "study_accession"              
## [11] "design_description"            "sample_name"                  
## [13] "sample_accession"              "sample_member"                
## [15] "library_name"                  "library_strategy"             
## [17] "library_source"                "library_selection"            
## [19] "library_layout"                "targeted_loci"                
## [21] "library_construction_protocol" "spot_length"                  
## [23] "adapter_spec"                  "read_spec"                    
## [25] "platform"                      "instrument_model"             
## [27] "platform_parameters"           "sequence_space"               
## [29] "base_caller"                   "quality_scorer"               
## [31] "number_of_levels"              "multiplier"                   
## [33] "qtype"                         "sra_link"                     
## [35] "experiment_url_link"           "xref_link"                    
## [37] "experiment_entrez_link"        "ddbj_link"                    
## [39] "ena_link"                      "experiment_attribute"         
## [41] "submission_accession"          "sradb_updated"

In the experiment table, we have 42 fields (or columns). Next, we will explore what type of data each field contains.

# check the data type of each column in a table
dbGetQuery(sra_con,'PRAGMA TABLE_INFO(experiment)')

Furthermore, we can list the descriptions of what is stored in each field of each table.

colDescriptions(sra_con=sra_con) -> colDesc
colDesc

Using SQL to query the SRA SQLite database

We can use SQL to submit queries to the database. For example, to access the first ten records (or rows) of the experiment table:

# select the first 10 records from the experiment table
dbGetQuery(sra_con, 
  "SELECT *
   FROM experiment LIMIT 10"
)

To search for keywords, e.g. accessibility, within the field of study_description in the study table:

# search for 'accessibility' and to retrieve the study accession and title

dbGetQuery(sra_con, paste(
  "SELECT study_accession,
          study_title
   FROM study
   WHERE study_description LIKE '%accessib%'",
  sep=" "
))

Notice that in SQL, we can use %, which represents zero, one, or multiple characters and _ which represents a single character. In the example above %accessibl% means finds any value that has accessibl in any position.

To get general statistics about the database, we can list all types of study:

# list all study types and number of studies contained in each type
dbGetQuery(sra_con, paste(
  "SELECT study_type AS 'Study Type',
          COUNT(*) AS Number
   FROM study
   GROUP BY study_type
   ORDER BY Number DESC",
  sep = ""
))

Finally, we can list all types of library strategies:

# list all types of library strategies and number of runs in each type
dbGetQuery(sra_con, paste(
  "SELECT library_strategy AS 'Library Strategy',
          COUNT (*) AS Runs 
  FROM experiment
  GROUP BY library_strategy
  ORDER BY Runs DESC",
  sep = ""
))

Renaming downloaded sequence files

As in the previous post, I picked a random SRA submission (e.g. SRP042080). One nice feature in SRAdb is the ability to list all SRA submission objects associated with a specific object.

sraConvert( c('SRP042080'), sra_con = sra_con ) -> sra_project
sra_project
# list available runs in the study
sra_project$run -> sra_runs
sra_runs
## [1] "SRR1291260" "SRR1291261"
# list the download links
'~/Downloads' -> destDir
getFASTQfile(sra_runs, sra_con, destDir = destDir, srcType = "fasp") -> EBI_cmd
## Files are saved to: 
## '~/Downloads'
EBI_cmd

We will build the query to get the run titles as follow: - Join the experiment and run tables using the experiment_accession field - Join them with the fastq table using run_accession (optional) - Filter the experiment_accession of interest. - Order by library_name

dbGetQuery(sra_con, paste(
  "SELECT experiment.library_name AS 'Library Name',
          experiment.title AS 'exp_title',
          run.run_accession AS 'run',
          fastq.FASTQ_FILES AS 'fastq'
   FROM experiment
   JOIN run ON experiment.experiment_accession = run.experiment_accession
   JOIN fastq ON run.run_accession = fastq.run_accession
   WHERE study_accession = 'SRP042080'
   ORDER BY library_name",
   sep = ""
)) -> runs_titles

runs_titles

It is time to use our regex skills to generate the rename table.

  • The expression ^.*:.()|;.*$ will:
    • look into the exp_title column
    • match any character from the start of the line including : and the space after it
    • match any character from the end of the line till last ; and including all ;
    • capture the characters in between using ()
    • store the results in a new column run_title
  • The expression (^.*/*/)(.*$) will:
    • look into the fasp column
    • use first capturing group (^.*/*/) to
    • match any characters from the start of the line till the last / in the path.
    • use second capturing group (.*$) to
    • capture the rest of the full path, which is the file name and the extension
    • return only the second capturing group using \\2
    • store the results in a new column fasp_file

Then we will use str_replace which will, for each row, search in the new fasp_file column to match the string in run and replace it with the string from run_title. The results will be a new column called file_name.

merge(runs_titles, EBI_cmd) %>% 
  mutate(run_title = str_replace_all(exp_title, "^.*:.()|;.*$", "")) %>% 
  mutate(fasp_file = str_replace(fasp, "(^.*/*/)(.*$)", "\\2")) %>% 
  mutate(file_name = str_replace(fasp_file, run, run_title)) %>% 
  select(file_name, fasp_file) -> rename_table

rename_table

Now that we have the rename_table, we can simply loop through the rows to create a symbolic link named file_name to the fasp_file.

for(seq in seq_along(rename.table$fasp_file)){
    file.symlink(paste0(rename.table$fasp_file [seq]), paste0(rename.table$file_name[seq]))
}
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2      stringr_1.3.1       dplyr_0.7.6        
##  [4] Rgraphviz_2.24.0    SRAdb_1.42.2        RCurl_1.95-4.11    
##  [7] bitops_1.0-6        graph_1.58.0        BiocGenerics_0.26.0
## [10] RSQLite_2.1.1      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     compiler_3.5.1   pillar_1.3.0     bindr_0.1.1     
##  [5] tools_3.5.1      digest_0.6.15    bit_1.1-14       jsonlite_1.5    
##  [9] evaluate_0.11    memoise_1.1.0    tibble_1.4.2     pkgconfig_2.0.1 
## [13] rlang_0.2.1      DBI_1.0.0        GEOquery_2.48.0  yaml_2.2.0      
## [17] blogdown_0.8     xfun_0.3         xml2_1.2.0       knitr_1.20      
## [21] hms_0.4.2        stats4_3.5.1     rprojroot_1.3-2  bit64_0.9-7     
## [25] tidyselect_0.2.4 Biobase_2.40.0   glue_1.3.0       R6_2.2.2        
## [29] rmarkdown_1.10   bookdown_0.7     limma_3.36.2     tidyr_0.8.1     
## [33] readr_1.1.1      purrr_0.2.5      blob_1.1.1       magrittr_1.5    
## [37] backports_1.1.2  htmltools_0.3.6  assertthat_0.2.0 stringi_1.2.4   
## [41] crayon_1.3.4