A colleague of mine asked me for help in using DaPars for analysing alternative polyadenylation in their RNA-seq dataset. So, I thought to write a short post here to describe how I use it.

From Xia et al. 2014

Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq.

Installation

Download the source files of DaPars from GitHub and extract the files:

1$ tar zxvf DaPars-VERSION.tar.gz

Input files

You can find more details on the documentation page, but in essence, DaPars requires the following files:

  • BED file: a tab separated, 12 columns, which represents the gene model.
  • BedGraph file: stores the reads alignment results from an aligned BAM file.
  • Gene Symbol file: two columns containing NCBI RefSeq and gene symbol.

The BED file of the gene model can be downloaded from UCSC Table Browser.

  • genome: mouse
  • assembly: July 2007 (NCBI37/mm9)
  • group: Genes and Gene Predictions
  • track: REfSeq Genes
  • table: refGene
  • region: genome
  • output format: BED - browser extensible data
  • output file: mm9_refseq_whole_gene.bed

Click ‘get output’ button, and in the next page ‘Output refGene as BED’ click ‘get output’ button.

To generate the BedGraph files from BAM files, you need the chromsome size file chromInfo.txt.gz which can be downloaded from UCSC (hg19 or mm9) and then use the BedTools’ genomecov as follow:

1$ bedtools genomecov -bg -ibam sample_sorted.bam -g mm9_chr_size.txt -split > sample.bedgraph

Where:

  • -bg : report depth in BedGraph format.
  • -ibam : use BAM file as input for coverage.
  • -g : the genome file.
  • -split : Treat “split” BAM or BED12 entries as distinct BED intervals when computing coverage.

The Gene Symbol file can be downloaded from UCSC Table Browser.

  • genome: mouse
  • assembly: July 2007 (NCBI37/mm9)
  • group: Genes and Gene Predictions
  • track: REfSeq Genes
  • table: refGene
  • region: genome
  • output format: selected fields from primary and related tables
  • output file: mm9_30_03_2016_Refseq_id_from_UCSC.txt

Click ‘get output’ button, and in the next page select

  • name: Name of gene (usually transcript_id from GTF)
  • name2: Alternate name (e.g. gene_id from GTF)

Click ‘get output’ and save the file.

Usage

1. Generate region annotation

DaPars will use the extracted distal polyadenylation sites to infer the proximal polyadenylation sites based on the alignment files.

1$ python DaPars_Extract_Anno.py -b mm9_refseq_whole_gene.bed -s mm9_30_03_2016_Refseq_id_from_UCSC.txt -o mm9_refseq_extracted_3UTR.bed

Where:

  • -b GENE_BED_FILE : The gene model in BED format
  • -s Gene_Symbol_FILE : The mapping of transcripts to gene symbol, which can be extracted from UCSC Tables.
  • -o OUTPUT_FILE : The output of the extracted annotation region.

The structur of the DaPars folder looks like this:

1.
2├── DaPars_Extract_Anno.py
3├── DaPars_main.py
4├── mm9_30_03_2016_Refseq_id_from_UCSC.txt
5├── mm9_chr_size_chromInfo.txt
6└── mm9_refseq_whole_gene.bed

Since I am using Sun Grid Engine (SGE), I used the following job script to perform this step

 1#!/bin/sh
 2module load bedtools
 3
 4PROJECT_DIR=$HOME/Projects/RNAseq_Project
 5DAPARS=$HOME/dapars/
 6CHROMINFO=$HOME/dapars/mm9_chr_size_chromInfo.txt # The extracted chromInfo.txt.gz
 7MM9_REFSEQID=$HOME/dapars/mm9_30_03_2016_Refseq_id_from_UCSC.txt
 8MM9_BED=$HOME/dapars/mm9_refseq_whole_gene.bed
 9
10
11cat > $PROJECT_DIR/Region-Annotation.sh <<EOF
12python $DAPARS/DaPars_Extract_Anno.py -b $MM9_BED -s $MM9_REFSEQID -o mm9_refseq_extracted_3UTR.bed
13EOF
14
15qsub -l h_vmem=2G -V -cwd  -m e -M email@example.com $PROJECT_DIR/Region-Annotation.sh

2. Sample processing

The files generated in step 1 above will be used in step 2. Also for this step, you need to generate configure_file for each sample. For example:

 1#The following file is the result of step 1.
 2Annotated_3UTR=mm9_refseq_extracted_3UTR.bed
 3
 4#A comma-separated list of BedGraph files of samples from condition 1
 5Group1_Tophat_aligned_Wig=KO1.bedgraph,KO2.bedgraph,KO3.bedgraph
 6
 7#A comma-separated list of BedGraph files of samples from condition 2
 8Group2_Tophat_aligned_Wig=WT1.bedgraph,WT2.bedgraph,WT3.bedgraph
 9
10Output_directory=DaPars_Output_Condition1/
11Output_result_file=DaPars_Output_Condition1
12
13#At least how many samples passing the coverage threshold in two conditions
14Num_least_in_group1=1
15Num_least_in_group2=1
16
17Coverage_cutoff=30
18
19#Cutoff for FDR of P-values from Fisher exact test.
20
21FDR_cutoff=0.05
22PDUI_cutoff=0.5
23Fold_change_cutoff=0.59

The following is the SGE job script that I used to perform this step.

 1#!/bin/sh
 2module load bedtools
 3
 4PROJECT_DIR=$HOME/Projects/RNAseq_Project
 5DAPARS_SOURCE=$HOME/dapars
 6DAPARS_CONFIG_FILE=$HOME/dapars/configure_file
 7
 8GROUP01=Cell_type_01
 9GROUP02=Cell_type_02
10GROUP03=Cell_type_03
11GROUP04=Cell_type_04
12
13for G_ID in {01..04} #Where G_ID is Group ID
14do
15  export GROUP=GROUP${G_ID}
16  export GROUP_ID=${!GROUP}
17cat > $PROJECT_DIR/DaPars_${GROUP_ID}.sh <<EOF
18python $DAPARS/DaPars_main.py configure_file_${GROUP_ID}
19EOF
20
21qsub -l h_vmem=2G -V -cwd  -m e -M email@example.com $PROJECT_DIR/DaPars_${GROUP_ID}.sh
22
23done