Today, I am installing a Python package on one of the high-performance computing (HPC) systems that I work with. This CentOS 7 setup resembles a fortress: its stringent security protocols are understandable, since it hosts clinical data, but they make it quite a challenge to install essential Python packages from public repositories.

The package I’m attempting to install is PyEnsembl, a handy tool for accessing Ensembl’s genetic information.

PyEnsembl Setup

I started by creating a new conda environment:

conda create --name ensembl python=3.11

Since conda tends to be sluggish when resolving dependencies, I usually prefer to install mamba first, which serves as a faster alternative. However, recent conda releases are considerably faster because they incorporate the libmamba solver, so updating conda in the base environment is enough (alternatively, you can consider micromamba):

conda install -n base -c defaults 'conda>=23.11'

After activating the environment, I proceeded to install pyensembl:

conda activate ensembl
conda install pyensembl

Genome Setup

Before using PyEnsembl, the first step is to download the genome FASTA sequence files and the GTF (Gene Transfer Format) files, which contain genomic annotations, and convert them into a structured database format, specifically SQLite. PyEnsembl simplifies this process by parsing the annotation files, generating the database, and providing access to genomic information as Python objects, eliminating the need for direct interaction with the underlying database.
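To make "genomic information as Python objects" concrete, here is a minimal sketch of a lookup helper. The helper name `describe_genes` is my own; it relies only on the `genes_by_name` method and the `gene_id`, `contig`, `start`, `end`, and `strand` attributes of the gene objects, and it assumes the release has already been downloaded and indexed.

```python
def describe_genes(genome, gene_name):
    """Return (gene_id, contig, start, end, strand) for each gene with this name.

    `genome` is expected to be a pyensembl EnsemblRelease, or any object
    that offers a compatible `genes_by_name` method.
    """
    return [
        (gene.gene_id, gene.contig, gene.start, gene.end, gene.strand)
        for gene in genome.genes_by_name(gene_name)
    ]


# Typical use, once the annotation data is available locally:
#   from pyensembl import EnsemblRelease
#   genome = EnsemblRelease(release=111, species="homo_sapiens")
#   print(describe_genes(genome, "TP53"))
```

No SQL is written anywhere in this sketch; the SQLite database stays hidden behind the genome object.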

Let’s install the latest Ensembl release (111) of the human genome assembly ‘GRCh38’ using the following commands:

from pyensembl import EnsemblRelease
ensembl_38 = EnsemblRelease(release=111, species="homo_sapiens")
ensembl_38.download()
ensembl_38.index()

However, I encountered an SSL-related error:

urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in the certificate chain (_ssl.c:1002)

This issue stemmed from the SSL configuration in CentOS 7 rather than pyensembl itself. Instead of disabling security as mentioned here, I chose to manually download the required files so that pyensembl could access them from the cache directory.
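As an aside, a quick way to see which certificate stores Python consults on a given machine is to print the defaults reported by the standard `ssl` module, together with the environment variables that commonly override them. This is only a diagnostic sketch: the helper name `describe_ca_configuration` is my own, and whether pyensembl's download path honors `REQUESTS_CA_BUNDLE` is an assumption that depends on the HTTP library it uses.

```python
import os
import ssl


def describe_ca_configuration():
    """Collect the CA locations Python's ssl module would use by default."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,                    # CA bundle file actually found, if any
        "capath": paths.capath,                    # CA directory actually found, if any
        "openssl_cafile": paths.openssl_cafile,    # compiled-in default bundle path
        "openssl_capath": paths.openssl_capath,    # compiled-in default directory
        "SSL_CERT_FILE": os.environ.get("SSL_CERT_FILE"),
        "SSL_CERT_DIR": os.environ.get("SSL_CERT_DIR"),
        "REQUESTS_CA_BUNDLE": os.environ.get("REQUESTS_CA_BUNDLE"),
    }


for key, value in describe_ca_configuration().items():
    print(f"{key}: {value}")
```

The output only shows where certificates are looked up; it does not change any security setting.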

First, I set up the cache directory:

export PYENSEMBL_CACHE_DIR=~/pyensembl-cache

Then, I installed the genome using the command line:

pyensembl install --species human --release 111

Every time I ran this command, it provided the URL of the required file, which I manually downloaded using wget. In total, I needed four files:

wget https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
wget https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz
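The four URLs follow a regular pattern, so they can be generated rather than collected one failed run at a time. The sketch below rebuilds them from the release number and species, and also prints the directory where I would expect pyensembl to look for the files. The helper names are my own, and the cache layout (`pyensembl/<assembly>/ensembl<release>` under `PYENSEMBL_CACHE_DIR`, with `~/.cache` as the Linux fallback) is an assumption that should be checked against the path pyensembl reports on your system.

```python
import os
from pathlib import Path

ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def ensembl_file_urls(release=111, species="homo_sapiens", assembly="GRCh38"):
    """Build the GTF, cDNA, ncRNA and peptide URLs for one Ensembl release."""
    prefix = f"{species.capitalize()}.{assembly}"  # e.g. Homo_sapiens.GRCh38
    base = f"{ENSEMBL_FTP}/release-{release}"
    return [
        f"{base}/gtf/{species}/{prefix}.{release}.gtf.gz",
        f"{base}/fasta/{species}/cdna/{prefix}.cdna.all.fa.gz",
        f"{base}/fasta/{species}/ncrna/{prefix}.ncrna.fa.gz",
        f"{base}/fasta/{species}/pep/{prefix}.pep.all.fa.gz",
    ]


def expected_cache_dir(release=111, assembly="GRCh38"):
    """Assumed location of the cached files; verify before copying anything."""
    root = Path(os.environ.get("PYENSEMBL_CACHE_DIR", "~/.cache")).expanduser()
    return root / "pyensembl" / assembly / f"ensembl{release}"


for url in ensembl_file_urls():
    print(url)
print("expected cache directory:", expected_cache_dir())
```

The script only prints paths and URLs; the downloads themselves still go through wget.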

With the files downloaded, I proceeded to install them:

from pyensembl import EnsemblRelease
ensembl_38 = EnsemblRelease(release=111, species="homo_sapiens")
ensembl_38.download()  # a redundant step after downloading the files above
ensembl_38.index()

Dependencies Troubleshooting

However, I encountered another error:

AttributeError: module 'polars' has no attribute 'enable_string_cache'. Did you mean: 'toggle_string_cache'?

Further investigation led me to a similar issue on the pyensembl library GitHub repository (issue), which was attributed to a breaking change in the polars library. The installed version of polars was outdated (0.14.28), so I attempted to upgrade to the latest version.

Since conda could not upgrade the package from the outdated local offline repository, I chose to download the `.whl` file from PyPI directly and use pip instead. I typically avoid using pip because it can conflict with conda.

# didn't work
conda install -c conda-forge --name ensembl polars

# Download the wheel `.whl` file for `polars`
wget https://files.pythonhosted.org/packages/2a/b6/89628dbea4624ba0c0c4d960441645b4fa77f830ee3dbbe62bccf4cb0a87/polars-0.20.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

# didn't work
conda install --no-deps polars-0.20.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

# worked
pip install polars-0.20.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
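When downloading wheels by hand, it helps to read the tags in the file name before copying anything to an offline machine. The sketch below splits a wheel name into its parts following the standard wheel naming scheme; the helper `parse_wheel_filename` is my own and is not part of pip. For the polars wheel, `cp38-abi3` indicates a build against CPython's stable ABI starting at 3.8, which is why it also installs on Python 3.11, and `manylinux_2_17` requires glibc 2.17 or newer, which is the version CentOS 7 ships.

```python
import platform


def parse_wheel_filename(filename):
    """Split a wheel file name into name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 parts when an optional build tag is present
        raise ValueError(f"unexpected wheel name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tags": python_tag.split("."),
        "abi_tags": abi_tag.split("."),
        "platform_tags": platform_tag.split("."),
    }


wheel = "polars-0.20.7-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
print(parse_wheel_filename(wheel))
print("local C library:", platform.libc_ver())  # compare with the manylinux tag
```

The datacache wheel further down is tagged `py3-none-any`, meaning pure Python with no platform constraint, so none of this matters for it.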

After upgrading polars, I encountered yet another error:

AttributeError: module 'numpy' has no attribute 'typeDict'. Did you mean: 'sctypeDict'?

A search led me to another issue on the pyensembl GitHub repository (issue), which attributed the problem to the datacache package. The installed version was also outdated (1.1.5).
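Both errors came down to stale package versions, so a quick inventory of what is actually installed in the environment can save a round of guesswork. Here is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are my own choices.

```python
from importlib import metadata


def installed_versions(packages=("pyensembl", "polars", "datacache", "numpy")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running it inside the activated `ensembl` environment, before and after each upgrade, shows whether pip actually replaced the version conda had installed.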

Once again, I had to resort to PyPI and pip:

# Download the wheel `.whl` file for `datacache`
wget https://files.pythonhosted.org/packages/86/34/606bbcfa507ddc0209f75865e3684a61853925f99b9ffd164d78e2b86271/datacache-1.4.0-py3-none-any.whl

pip install datacache-1.4.0-py3-none-any.whl

With both packages upgraded, I tried installing again:

pyensembl install --species human --release 111

And finally, it worked as expected. Time to return to my data analysis!