Skip to Content

DaPars for alternative polyadenylation analysis

A colleague of mine asked me for help in using DaPars for analysing alternative polyadenylation in their RNA-seq dataset. So, I thought to write a short post here to describe how I use it.

From Xia et al. 2014

Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq.

Installation

Download the source files of DaPars from GitHub and extract the files:

$ tar zxvf DaPars-VERSION.tar.gz

Input files

You can find more details on the documentation page, but in essence, DaPars requires the following files:

  • BED file: a tab separated, 12 columns, which represents the gene model.
  • BedGraph file: stores the reads alignment results from an aligned BAM file.
  • Gene Symbol file: two columns containing NCBI RefSeq and gene symbol.

The BED file of the gene model can be downloaded from UCSC Table Browser. * genome: mouse * assembly: July 2007 (NCBI37/mm9) * group: Genes and Gene Predictions * track: REfSeq Genes * table: refGene * region: genome * output format: BED - browser extensible data * output file: mm9_refseq_whole_gene.bed

Click ‘get output’ button, and in the next page ‘Output refGene as BED’ click ‘get output’ button.

To generate the BedGraph files from BAM files, you need the chromsome size file chromInfo.txt.gz which can be downloaded from UCSC (hg19 or mm9) and then use the BedTools’ genomecov as follow:

$ bedtools genomecov -bg -ibam sample_sorted.bam -g mm9_chr_size.txt -split > sample.bedgraph

Where: * -bg : report depth in BedGraph format. * -ibam : use BAM file as input for coverage. * -g : the genome file. * -split : Treat “split” BAM or BED12 entries as distinct BED intervals when computing coverage.

The Gene Symbol file can be downloaded from UCSC Table Browser. * genome: mouse * assembly: July 2007 (NCBI37/mm9) * group: Genes and Gene Predictions * track: REfSeq Genes * table: refGene * region: genome * output format: selected fields from primary and related tables * output file: mm9_30_03_2016_Refseq_id_from_UCSC.txt

Click ‘get output’ button, and in the next page select

  • name: Name of gene (usually transcript_id from GTF)
  • name2: Alternate name (e.g. gene_id from GTF)

Click ‘get output’ and save the file.

Usage

1. Generate region annotation

DaPars will use the extracted distal polyadenylation sites to infer the proximal polyadenylation sites based on the alignment files.

$ python DaPars_Extract_Anno.py -b mm9_refseq_whole_gene.bed -s mm9_30_03_2016_Refseq_id_from_UCSC.txt -o mm9_refseq_extracted_3UTR.bed

Where:

  • -b GENE_BED_FILE : The gene model in BED format
  • -s Gene_Symbol_FILE : The mapping of transcripts to gene symbol, which can be extracted from UCSC Tables.
  • -o OUTPUT_FILE : The output of the extracted annotation region.

The structur of the DaPars folder looks like this:

.
├── DaPars_Extract_Anno.py
├── DaPars_main.py
├── mm9_30_03_2016_Refseq_id_from_UCSC.txt
├── mm9_chr_size_chromInfo.txt
└── mm9_refseq_whole_gene.bed

Since I am using Sun Grid Engine (SGE), I used the following job script to perform this step

#!/bin/sh
module load bedtools

PROJECT_DIR=$HOME/Projects/RNAseq_Project
DAPARS=$HOME/dapars/
CHROMINFO=$HOME/dapars/mm9_chr_size_chromInfo.txt # The extracted chromInfo.txt.gz
MM9_REFSEQID=$HOME/dapars/mm9_30_03_2016_Refseq_id_from_UCSC.txt
MM9_BED=$HOME/dapars/mm9_refseq_whole_gene.bed


cat > $PROJECT_DIR/Region-Annotation.sh <<EOF
python $DAPARS/DaPars_Extract_Anno.py -b $MM9_BED -s $MM9_REFSEQID -o mm9_refseq_extracted_3UTR.bed
EOF

qsub -l h_vmem=2G -V -cwd  -m e -M email@example.com $PROJECT_DIR/Region-Annotation.sh

2. Sample processing

The files generated in step 1 above will be used in step 2. Also for this step, you need to generate configure_file for each sample. For example:

#The following file is the result of step 1.
Annotated_3UTR=mm9_refseq_extracted_3UTR.bed

#A comma-separated list of BedGraph files of samples from condition 1
Group1_Tophat_aligned_Wig=KO1.bedgraph,KO2.bedgraph,KO3.bedgraph

#A comma-separated list of BedGraph files of samples from condition 2
Group2_Tophat_aligned_Wig=WT1.bedgraph,WT2.bedgraph,WT3.bedgraph

Output_directory=DaPars_Output_Condition1/
Output_result_file=DaPars_Output_Condition1

#At least how many samples passing the coverage threshold in two conditions
Num_least_in_group1=1
Num_least_in_group2=1

Coverage_cutoff=30

#Cutoff for FDR of P-values from Fisher exact test.

FDR_cutoff=0.05
PDUI_cutoff=0.5
Fold_change_cutoff=0.59

The following is the SGE job script that I used to perform this step.

#!/bin/sh
module load bedtools

PROJECT_DIR=$HOME/Projects/RNAseq_Project
DAPARS_SOURCE=$HOME/dapars
DAPARS_CONFIG_FILE=$HOME/dapars/configure_file

GROUP01=Cell_type_01
GROUP02=Cell_type_02
GROUP03=Cell_type_03
GROUP04=Cell_type_04

for G_ID in {01..04} #Where G_ID is Group ID
do
  export GROUP=GROUP${G_ID}
  export GROUP_ID=${!GROUP}
cat > $PROJECT_DIR/DaPars_${GROUP_ID}.sh <<EOF
python $DAPARS/DaPars_main.py configure_file_${GROUP_ID}
EOF

qsub -l h_vmem=2G -V -cwd  -m e -M email@example.com $PROJECT_DIR/DaPars_${GROUP_ID}.sh

done