Preparing genome reference in FASTA format

To prepare a genome reference in FASTA format for the mouse assembly NCBI37/mm9, there are two options:

From UCSC

  • Use the mm9 assembly from the UCSC goldenPath downloads.
  • Do not use the masked file chromFaMasked.tar.gz!
# download `chromFa.tar.gz` from UCSC goldenPath
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
# or, equivalently, with rsync
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz .

# uncompress the downloaded file
tar -xvzf chromFa.tar.gz

# remove the `*_random.fa` files
rm -f *_random.fa

# concatenate all FASTA files into a single file
cat *.fa > mm9.fa

# index the concatenated .fa file using `samtools`
module load samtools # required at HPC
samtools faidx mm9.fa
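
As a quick sanity check on the concatenated file, sequence names and lengths can be recomputed directly from the FASTA and compared with the first two columns (name and length) of the `.fai` index written by `samtools faidx`. The sketch below is only an illustration: it runs on a tiny made-up file, `demo.fa` (a hypothetical name, not part of the UCSC download), so it needs neither samtools nor the real genome.

```shell
# Toy FASTA with two records; file name and sequences are made up for illustration
printf '>chr1\nACGT\nAC\n>chrM\nGG\n' > demo.fa

# Print "name<TAB>length" for every record:
#  - on a header line, emit the previous record and start a new one
#  - on a sequence line, add its length to the running total
awk '/^>/ { if (name != "") print name "\t" len
            name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name != "") print name "\t" len }' demo.fa > demo.sizes

cat demo.sizes
```

On the real data, the same awk command run on `mm9.fa` should agree with `cut -f1,2 mm9.fa.fai`, and `grep -c '^>' mm9.fa` should report one sequence per chromosome file that was concatenated.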

From Ensembl

  • Use Mus musculus release 67 from Ensembl (assembly NCBIM37).
# open the Ensembl release-67 mouse DNA folder with lftp
lftp ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/dna/
# then, at the lftp prompt, download every file in that folder
mget *

# uncompress the downloaded *.fa.gz files
gzip -d *.fa.gz

# delete the masked version of the genome sequence which contains '_rm' in the name
rm -rf *_rm*

# concatenate all FASTA files into a single file
cat *.fa > mm9Ensembl.fa

# index the concatenated .fa file using `samtools`
module load samtools # required at HPC
samtools faidx mm9Ensembl.fa
  • One of the main differences between the two sources, and one that matters for downstream applications, is how the chromosomes are named.
  • Below is a comparison of the header lines (also called identifier or description lines) in the FASTA files from both sources.
  • UCSC puts the chr prefix in front of the chromosome name, while Ensembl uses the bare name followed by a description; the mitochondrial genome is chrM at UCSC and MT at Ensembl.
# mm9 from UCSC
~/f/G/N/mm9_fasta ❯❯❯ cat mm9.fa | grep '>'
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chrM
>chrX
>chrY

# ENS67 from Ensembl
~/f/G/N/ENS67_fasta ❯❯❯ cat ENS67.fa | grep '>'
>10 dna:chromosome chromosome:NCBIM37:10:1:129993255:1
>11 dna:chromosome chromosome:NCBIM37:11:1:121843856:1
>12 dna:chromosome chromosome:NCBIM37:12:1:121257530:1
>13 dna:chromosome chromosome:NCBIM37:13:1:120284312:1
>14 dna:chromosome chromosome:NCBIM37:14:1:125194864:1
>15 dna:chromosome chromosome:NCBIM37:15:1:103494974:1
>16 dna:chromosome chromosome:NCBIM37:16:1:98319150:1
>17 dna:chromosome chromosome:NCBIM37:17:1:95272651:1
>18 dna:chromosome chromosome:NCBIM37:18:1:90772031:1
>19 dna:chromosome chromosome:NCBIM37:19:1:61342430:1
>1 dna:chromosome chromosome:NCBIM37:1:1:197195432:1
>2 dna:chromosome chromosome:NCBIM37:2:1:181748087:1
>3 dna:chromosome chromosome:NCBIM37:3:1:159599783:1
>4 dna:chromosome chromosome:NCBIM37:4:1:155630120:1
>5 dna:chromosome chromosome:NCBIM37:5:1:152537259:1
>6 dna:chromosome chromosome:NCBIM37:6:1:149517037:1
>7 dna:chromosome chromosome:NCBIM37:7:1:152524553:1
>8 dna:chromosome chromosome:NCBIM37:8:1:131738871:1
>9 dna:chromosome chromosome:NCBIM37:9:1:124076172:1
>MT dna:chromosome chromosome:NCBIM37:MT:1:16299:1
>X dna:chromosome chromosome:NCBIM37:X:1:166650296:1
>Y dna:chromosome chromosome:NCBIM37:Y:1:15902555:1
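
If a downstream tool expects UCSC-style names, the Ensembl headers can be rewritten before indexing. The following is a minimal sketch, not a prescribed procedure: it assumes only the header patterns shown above (a bare name followed by a space and a description, with MT for the mitochondrial genome), and the file names `demo_ensembl.fa` and `demo_ucsc_style.fa` are hypothetical. Any sequences named differently, such as unplaced scaffolds, would need their own mapping.

```shell
# Toy Ensembl-style FASTA; sequences are made up, header layout copied from the listing above
printf '>1 dna:chromosome chromosome:NCBIM37:1:1:8:1\nACGTACGT\n>MT dna:chromosome chromosome:NCBIM37:MT:1:4:1\nGGCC\n' > demo_ensembl.fa

# 1) on header lines, drop everything after the first space
# 2) rename the mitochondrial record from MT to M
# 3) prepend "chr" to every header
sed -e '/^>/s/ .*//' \
    -e 's/^>MT$/>M/' \
    -e 's/^>/>chr/' demo_ensembl.fa > demo_ucsc_style.fa

# Show the rewritten headers
grep '>' demo_ucsc_style.fa
```

Sequence lines never start with `>`, so only the headers are changed. After renaming the real file, it has to be re-indexed with `samtools faidx`, because the `.fai` index stores the sequence names.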