To prepare a genome reference in FASTA format for the mouse assembly NCBI37/mm9, we have two options:

From UCSC

  • Use the mm9 assembly from UCSC goldenPath.
  • Do not use the masked file `chromFaMasked.tar.gz`!
# download `chromFa.tar.gz` from UCSC goldenPath
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
# or
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz .

# uncompress the downloaded file
tar -xvzf chromFa.tar.gz

# remove the `*_random.fa` chromosomes
rm -f *_random.fa

# concatenate all FASTA files into a single file
cat *.fa > mm9.fa

# index the concatenated .fa file using `samtools`
module load samtools # required on the HPC
samtools faidx mm9.fa
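The plain `cat *.fa` above concatenates the files in shell glob order, which can put chr10 before chr1. If a downstream tool expects the numbered chromosomes in numeric order, the concatenation can be sketched as below. The function name `concat_fasta_sorted` is made up for this example, and the sketch assumes GNU `sort` (for the `-V` version-sort option) and file names without whitespace.

```shell
# Hypothetical helper: concatenate FASTA files in version-sorted order
# (chr1, chr2, ..., chr10, ..., then chrM, chrX, chrY).
# Assumes GNU sort (-V) and file names that contain no whitespace.
concat_fasta_sorted() {
    out_file="$1"   # first argument: name of the combined FASTA file
    shift           # remaining arguments: the per-chromosome FASTA files
    printf '%s\n' "$@" | sort -V | xargs cat > "$out_file"
}

# usage (run in the directory holding the extracted chr*.fa files):
#   concat_fasta_sorted mm9.fa chr*.fa
```

Whether the order matters depends on the downstream tool, so this is optional; the resulting file is indexed with `samtools faidx` in the same way.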

From Ensembl

  • Use Mus musculus release 67 from Ensembl.
# download the following folder from Ensembl
lftp ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/dna/
# at the lftp prompt, fetch every file in the folder, then leave the session
mget *
exit

# uncompress the downloaded *.fa.gz files
gzip -d *.fa.gz

# delete the masked version of the genome sequence, which contains '_rm' in the name
rm -f *_rm*

# concatenate all FASTA files into a single file
cat *.fa > mm9Ensembl.fa

# index the concatenated .fa file using `samtools`
module load samtools # required on the HPC
samtools faidx mm9Ensembl.fa
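Because `cat *.fa` merges every FASTA file left in the download folder, it can be worth checking that no sequence name ended up in the combined file twice. The following is a minimal sketch of such a check; the function name `fasta_duplicate_names` is an assumption of this example, not part of any tool.

```shell
# Hypothetical helper: print sequence names that occur more than once
# in a FASTA file. The name is the first word of each header line,
# without the leading ">".
fasta_duplicate_names() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# usage:
#   fasta_duplicate_names mm9Ensembl.fa
```

An empty result means every sequence name is unique. Any name that is printed points to a file that was concatenated in addition to the per-chromosome files and should be inspected.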
  • One of the main differences between the two sources, and an important one for downstream applications, is how the chromosomes are named.
  • Here is a comparison of the header lines (also known as identifier or description lines) used in the FASTA files from both sources.
  • UCSC puts the chr prefix in front of the chromosome name, whereas Ensembl uses the bare chromosome name followed by a description. The mitochondrial genome is chrM in UCSC and MT in Ensembl.
# mm9 from UCSC
~/f/G/N/mm9_fasta ❯❯❯ cat mm9.fa | grep '>'
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chrM
>chrX
>chrY

# ENS67 from Ensembl
~/f/G/N/ENS67_fasta ❯❯❯ cat ENS67.fa | grep '>'
>10 dna:chromosome chromosome:NCBIM37:10:1:129993255:1
>11 dna:chromosome chromosome:NCBIM37:11:1:121843856:1
>12 dna:chromosome chromosome:NCBIM37:12:1:121257530:1
>13 dna:chromosome chromosome:NCBIM37:13:1:120284312:1
>14 dna:chromosome chromosome:NCBIM37:14:1:125194864:1
>15 dna:chromosome chromosome:NCBIM37:15:1:103494974:1
>16 dna:chromosome chromosome:NCBIM37:16:1:98319150:1
>17 dna:chromosome chromosome:NCBIM37:17:1:95272651:1
>18 dna:chromosome chromosome:NCBIM37:18:1:90772031:1
>19 dna:chromosome chromosome:NCBIM37:19:1:61342430:1
>1 dna:chromosome chromosome:NCBIM37:1:1:197195432:1
>2 dna:chromosome chromosome:NCBIM37:2:1:181748087:1
>3 dna:chromosome chromosome:NCBIM37:3:1:159599783:1
>4 dna:chromosome chromosome:NCBIM37:4:1:155630120:1
>5 dna:chromosome chromosome:NCBIM37:5:1:152537259:1
>6 dna:chromosome chromosome:NCBIM37:6:1:149517037:1
>7 dna:chromosome chromosome:NCBIM37:7:1:152524553:1
>8 dna:chromosome chromosome:NCBIM37:8:1:131738871:1
>9 dna:chromosome chromosome:NCBIM37:9:1:124076172:1
>MT dna:chromosome chromosome:NCBIM37:MT:1:16299:1
>X dna:chromosome chromosome:NCBIM37:X:1:166650296:1
>Y dna:chromosome chromosome:NCBIM37:Y:1:15902555:1
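If files from the two sources need to be used together, the Ensembl-style headers shown above can be rewritten to the UCSC style. The following is a minimal sketch; the function name `ensembl_to_ucsc_headers` is made up for this example. It assumes the headers look like the ones listed above (chromosome name as the first word), maps MT to chrM, and drops the description after the name. It only renames the sequences and does not check that the two sources contain identical sequence.

```shell
# Hypothetical helper: rewrite Ensembl-style FASTA headers to UCSC style.
# ">10 dna:chromosome ..." becomes ">chr10"; ">MT ..." becomes ">chrM".
# Sequence lines are passed through unchanged.
ensembl_to_ucsc_headers() {
    awk '/^>/ {
             name = substr($1, 2)           # first word without ">"
             if (name == "MT") name = "M"   # Ensembl MT corresponds to UCSC chrM
             print ">chr" name
             next
         }
         { print }' "$1"
}

# usage (the output file name is only an example):
#   ensembl_to_ucsc_headers ENS67.fa > ENS67.ucsc_names.fa
```

After renaming, the new file needs its own `samtools faidx` index, since the index stores the sequence names.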