DaPars for alternative polyadenylation analysis
A colleague asked me for help using DaPars to analyse alternative polyadenylation in their RNA-seq dataset, so I thought I would write a short post describing how I use it.
From Xia et al. 2014:
"Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq."
Installation
Download the source files of DaPars from GitHub and extract them.
Input files
You can find more details on the documentation page, but in essence, DaPars requires the following files:
- BED file: a tab-separated, 12-column file that represents the gene model.
- BedGraph file: stores the read coverage computed from an aligned BAM file.
- Gene Symbol file: two columns containing the NCBI RefSeq ID and the gene symbol.
The BED file of the gene model can be downloaded from UCSC Table Browser.
- genome: mouse
- assembly: July 2007 (NCBI37/mm9)
- group: Genes and Gene Predictions
- track: RefSeq Genes
- table: refGene
- region: genome
- output format: BED - browser extensible data
- output file: mm9_refseq_whole_gene.bed
Click the ‘get output’ button, and on the next page (‘Output refGene as BED’) click ‘get output’ again.
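Before going further, it is worth checking that the downloaded file really has 12 tab-separated columns. The following is a minimal sketch of such a check; the function name is my own, and skipping lines that start with "#", "track" or "browser" is an assumption about the headers such a file might carry.

```python
from typing import Iterable, List, Tuple


def find_malformed_bed12_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for each data line without 12 columns."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        # Assumption: blank lines and header-like lines are not gene records.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        columns = stripped.split("\t")
        if len(columns) != 12:
            problems.append((number, len(columns)))
    return problems
```

Passing an open file handle for mm9_refseq_whole_gene.bed to this function should return an empty list when every record is a valid BED12 line.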
To generate the BedGraph files from BAM files, you need the chromosome size file chromInfo.txt.gz, which can be downloaded from UCSC (hg19 or mm9). Then run BedTools' genomecov with the following options:
- -bg: report depth in BedGraph format.
- -ibam: use a BAM file as input for coverage.
- -g: the genome file.
- -split: treat “split” BAM or BED12 entries as distinct BED intervals when computing coverage.
The Gene Symbol file can be downloaded from UCSC Table Browser.
- genome: mouse
- assembly: July 2007 (NCBI37/mm9)
- group: Genes and Gene Predictions
- track: RefSeq Genes
- table: refGene
- region: genome
- output format: selected fields from primary and related tables
- output file: mm9_30_03_2016_Refseq_id_from_UCSC.txt
Click the ‘get output’ button, and on the next page select:
- name: Name of gene (usually transcript_id from GTF)
- name2: Alternate name (e.g. gene_id from GTF)
Click ‘get output’ and save the file.
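To confirm the saved file is the two-column mapping DaPars expects, it can be loaded into a dictionary. This is only a sketch: the function name is mine, and treating lines that start with "#" as a header is an assumption about the Table Browser output.

```python
from typing import Dict, Iterable


def read_transcript_to_symbol(lines: Iterable[str]) -> Dict[str, str]:
    """Map each transcript ID (name) to its gene symbol (name2)."""
    mapping: Dict[str, str] = {}
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        # Assumption: a leading '#' marks the header row, not a record.
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 columns, found {len(fields)}")
        mapping[fields[0]] = fields[1]
    return mapping
```

If the call raises a ValueError, the wrong set of fields was probably selected in the Table Browser.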
Usage
1. Generate region annotation
DaPars will use the extracted distal polyadenylation sites to infer the proximal polyadenylation sites based on the alignment files.
The annotation extraction script takes the following options:
- -b GENE_BED_FILE: the gene model in BED format.
- -s Gene_Symbol_FILE: the mapping of transcripts to gene symbols, which can be extracted from UCSC Tables.
- -o OUTPUT_FILE: the output file for the extracted annotation regions.
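Putting the three options together, the call can be sketched as below. The script name DaPars_Extract_Anno.py is what I recall from the DaPars distribution, so treat it as an assumption and check your copy; the output file name in the usage note is hypothetical.

```python
from typing import List


def extract_anno_command(
    gene_bed: str,
    gene_symbols: str,
    output_bed: str,
    script: str = "DaPars_Extract_Anno.py",  # assumed script name, verify locally
    python: str = "python",
) -> List[str]:
    """Build the step 1 command from the -b, -s and -o options."""
    return [python, script, "-b", gene_bed, "-s", gene_symbols, "-o", output_bed]
```

For example, extract_anno_command("mm9_refseq_whole_gene.bed", "mm9_30_03_2016_Refseq_id_from_UCSC.txt", "mm9_extracted_3UTR.bed") returns an argument list that can be handed to subprocess.run or joined into a line of a job script.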
The structure of the DaPars folder looks like this:
|
|
Since I am using Sun Grid Engine (SGE), I ran this step as an SGE job script.
2. Sample processing
The files generated in step 1 are used in step 2.
For this step, you also need to create a configure_file for each sample.
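A configure_file is a plain list of key=value lines, so it is easy to generate one per sample. The sketch below only renders whatever settings it is given. The key names in EXAMPLE_SETTINGS are as I recall them from the DaPars documentation, and every file name and cutoff value is a placeholder, so please verify both against the documentation for your version.

```python
from typing import Mapping

# Assumption: key names recalled from the DaPars documentation; values are placeholders.
EXAMPLE_SETTINGS = {
    "Annotated_3UTR": "mm9_extracted_3UTR.bed",
    "Group1_Tophat_aligned_Wig": "control.bedgraph",
    "Group2_Tophat_aligned_Wig": "treated.bedgraph",
    "Output_directory": "DaPars_output/",
    "Output_result_file": "DaPars_result",
    "Num_least_in_group1": 1,
    "Num_least_in_group2": 1,
    "Coverage_cutoff": 30,
    "FDR_cutoff": 0.05,
    "PDUI_cutoff": 0.5,
    "Fold_change_cutoff": 0.59,
}


def render_configure_file(settings: Mapping[str, object]) -> str:
    """Render settings as one key=value line each, in insertion order."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def write_configure_file(path: str, settings: Mapping[str, object]) -> None:
    """Write the rendered settings to the given path."""
    with open(path, "w") as handle:
        handle.write(render_configure_file(settings))
```

Keeping the settings in a dictionary makes it simple to copy the defaults and change only the input files and output names for each comparison.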
I also ran this step as an SGE job script.
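For either step, a job script can be generated along these lines. This is a minimal sketch rather than the exact script I submitted: the job name and command are placeholders, DaPars_main.py is the script name as I recall it from the DaPars distribution, and the #$ lines are common SGE options (-S, -N, -cwd, -j y) that may need adapting to your cluster.

```python
def render_sge_script(job_name: str, command: str) -> str:
    """Return the text of a small SGE job script that runs one command."""
    lines = [
        "#!/bin/bash",
        "#$ -S /bin/bash",   # shell used to interpret the job
        f"#$ -N {job_name}",  # job name shown by qstat
        "#$ -cwd",           # run from the directory the job was submitted in
        "#$ -j y",           # merge standard error into standard output
        "",
        command,
        "",
    ]
    return "\n".join(lines)


# Hypothetical step 2 job; the script and configure file names are assumptions.
STEP2_SCRIPT = render_sge_script("dapars_main", "python DaPars_main.py configure_file.txt")
```

Writing STEP2_SCRIPT to a file and submitting it with qsub runs the step on the cluster; the same function works for step 1 by passing the annotation extraction command instead.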