Python Exception Handling

In genomic data analysis, we often use a pipeline function to process data stored in a dataframe by calling several mini-functions. Each mini-function may modify the dataframe by adding a new column with new values and then filter out the rows that do not meet certain criteria. However, this may result in an empty dataframe if none of the rows satisfy the filters and can lead to errors or unexpected results when the pipeline function tries to perform more operations on the empty dataframe....

February 24, 2024 · 8 min · Firas Sadiyah
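
The empty-result problem described in this excerpt can be sketched in plain Python. This is a minimal illustration, not the post's own code: rows are plain dictionaries standing in for a dataframe, and the names `EmptyResultError`, `apply_filter`, and `run_pipeline`, as well as the `depth` and `quality` fields, are hypothetical.

```python
class EmptyResultError(Exception):
    """Raised when a pipeline step filters out every row (hypothetical name)."""


def apply_filter(rows, predicate, step_name):
    """Keep the rows that satisfy predicate; fail loudly if none survive."""
    kept = [row for row in rows if predicate(row)]
    if not kept:
        raise EmptyResultError(f"step '{step_name}' removed all rows")
    return kept


def run_pipeline(rows):
    """Run two illustrative filter steps, stopping early on an empty result."""
    try:
        rows = apply_filter(rows, lambda r: r["depth"] >= 10, "min_depth")
        rows = apply_filter(rows, lambda r: r["quality"] >= 30, "min_quality")
    except EmptyResultError as err:
        # Return an empty result plus the reason instead of failing downstream.
        return [], str(err)
    return rows, None


# Assumed toy data: v2 fails the depth filter, v1 fails the quality filter.
variants = [
    {"id": "v1", "depth": 12, "quality": 20},
    {"id": "v2", "depth": 4, "quality": 40},
]
result, reason = run_pipeline(variants)
```

Raising a dedicated exception at the step that empties the data lets the pipeline report which filter was responsible, rather than surfacing an unrelated error several steps later.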

Directing Python Output in Terminal

When writing Python scripts, it’s common to want to display informational messages in the user’s terminal and possibly save them to a file such as output.log. Therefore, it’s essential to understand the levels of logging in Python, the types of output in Unix-like systems, and how to direct each type of log message to the right output.

Understanding Output Streams in Unix

In Unix systems, there are two primary streams for output: standard output (stdout) and standard error (stderr)....

February 18, 2024 · 4 min · Firas Sadiyah
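
One way to route log levels to the two streams mentioned in this excerpt is sketched below using only the standard `logging` module. It is an illustration under stated assumptions rather than the post's own setup: the helper names `MaxLevelFilter` and `build_logger`, and the choice to send INFO and below to stdout and WARNING and above to stderr, are assumptions.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below a given level (hypothetical helper)."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_logger(name, out_stream=None, err_stream=None):
    """Send DEBUG/INFO to stdout and WARNING and above to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records away from the root logger

    out_handler = logging.StreamHandler(sys.stdout if out_stream is None else out_stream)
    out_handler.setLevel(logging.DEBUG)
    out_handler.addFilter(MaxLevelFilter(logging.INFO))

    err_handler = logging.StreamHandler(sys.stderr if err_stream is None else err_stream)
    err_handler.setLevel(logging.WARNING)

    formatter = logging.Formatter("%(levelname)s: %(message)s")
    for handler in (out_handler, err_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log = build_logger("demo")
log.info("processing started")      # written to stdout
log.warning("input file is empty")  # written to stderr
```

With this split, shell redirection such as `python script.py > output.log` would capture only the informational messages, while warnings and errors would still appear in the terminal.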

Running PyEnsembl on CentOS HPC

Today, I am installing a Python package on one of the high-performance computing (HPC) systems that I work with. This CentOS 7 setup resembles a fortress: its stringent security protocols are understandable given that it hosts clinical data, but they make it quite a challenge to install essential Python packages from public repositories. The package I’m attempting to install is PyEnsembl, a handy tool for accessing Ensembl’s genetic information.

PyEnsembl Setup

I started by creating a new conda environment:...

February 11, 2024 · 3 min · Firas Sadiyah

Debugging Python codebases using PyCharm and VSCode

Debugging, an essential process for identifying and rectifying errors in a computer program, is particularly crucial for computational biologists dealing with complex codebases that often involve intricate mathematical models, data analysis, and simulations. Merely reading the code may not suffice to grasp the logic and functionality of the project. To gain a deeper understanding, you may need to run the code, examine the variables, and observe the outputs. Towards this end, the integrated debuggers in PyCharm and VSCode prove invaluable....

February 4, 2024 · 5 min · Firas Sadiyah

Managing Python virtual environments

When it comes to Python development on macOS, I rely on a combination of two tools that have served me exceptionally well over the past few years: Pyenv and Poetry. Pyenv provides an elegant solution for managing different Python versions on my system, while Poetry simplifies dependency management and the creation of virtual environments for my projects. In this article, I will guide you through the process of setting up Pyenv to install a specific Python version and then using Poetry to create a virtual environment for your project....

January 28, 2024 · 5 min · Firas Sadiyah

Running NetMHCPan on Apple Silicon

NetMHCPan, a widely used tool for predicting peptide binding to major histocompatibility complex (MHC) molecules, is essential for understanding immune responses. However, the tool’s binaries are currently available only for the x86_64 architecture, whether on Darwin (macOS) or Linux. As I intended to conduct test runs on my Apple Silicon device (arm64), I encountered the following error:

    netMHCpan: no binaries found for Darwin_arm64 /net/sund-nas.win.dtu.dk/storage/services/www/packages/netMHCpan/4.1/netMHCpan-4.1/Darwin_arm64/bin/netMHCpan

To address this, one option is to run NetMHCPan using Rosetta, a dynamic binary translator that translates executable code on-the-fly....

January 23, 2024 · 3 min · Firas Sadiyah

Setting up a reproducible R environment on macOS

Using renv is an excellent choice for maintaining a clean and reproducible R environment on macOS. Here, I will share my experiences and provide a guide on setting up R on macOS. The post is divided into the following sections:

- Installing system dependencies required for R libraries using Homebrew.
- Installing R libraries using renv.
- Saving and restoring R environments using renv.
- Installing R libraries hosted on private repositories.
- Building R libraries from source using Makevars....

October 10, 2022 · 5 min · Firas Sadiyah

Running NextFlow using AWS Batch

Processing genomics workflows requires a variety of tools with distinct computing resource requirements. Some of these requirements are simply beyond the capacity of even powerful workstations. High-performance computing (HPC) clusters offer the possibility of scaling up computing resources to meet the increasing demands of these processes. However, maintaining the infrastructure of an on-premises HPC cluster may not be a viable solution for many parties. A more cost-effective approach is to provision computing resources dynamically by leveraging cloud computing....

May 1, 2021 · 6 min · Firas Sadiyah

Using reticulate in RStudio with pyenv

When developing in Python, it is generally good practice not to rely on the Python version that ships with the operating system (OS). This ensures that the system version of Python remains relatively ‘clean’ for the OS processes. In addition, installing one or more custom versions of Python opens up many possibilities. First, it gives us control over which specific version(s) to use in our projects; second, by using a virtual environment manager, we ensure that each project has access to its own tailored list of packages....

April 28, 2021 · 4 min · Firas Sadiyah

Using data.table with OpenMP support

If you are facing difficulties with large data sets in R, using data.table could provide a performance boost. However, when loading data.table, especially on macOS, you might encounter a warning indicating the absence of OpenMP support, causing data.table to operate in a single-threaded mode. This limitation prevents you from fully utilizing the potential benefits of using data.table and taking advantage of the underlying hardware.

    library(data.table)
    data.table 1.14.0 using 1 threads (see ?...

April 26, 2021 · 4 min · Firas Sadiyah

Using RegEx to capture file name in Groovy/NextFlow

My sequencing files are named according to the following pattern: lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz. I would like to capture the sample ID as 00h_Cell_WT3 in order to name all downstream files accordingly. To this end, I wrote the following snippet:

    #!/usr/bin/env nextflow

    // fastq files are stored in reads as paired ends R1 and R4
    params.reads = 'reads/lane*_*_*_*_R{1,4}.fastq.gz'

    Channel
        .fromFilePairs(params.reads, flat: true)
        .map { prefix, file1, file2 -> tuple(getSampleID(prefix), file1, file2) }
        ....

October 6, 2018 · 2 min · Firas Sadiyah
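
The capture described in this excerpt can be illustrated with a regular expression. The sketch below is written in Python rather than the post's Groovy/NextFlow, and it is not the author's `getSampleID` implementation (which is not shown in the excerpt and receives the file-pair prefix rather than a full file name). The assumed layout `lane<digits>_<index sequence>_<sample ID>_L<3 digits>_R<digit>.fastq.gz` is inferred from the single example file name, and the names `SAMPLE_ID_PATTERN` and `get_sample_id` are hypothetical.

```python
import re

# Assumed layout, inferred from one example file name:
# lane<digits>_<index sequence>_<sample ID>_L<3 digits>_R<digit>.fastq.gz
SAMPLE_ID_PATTERN = re.compile(
    r"^lane\d+_[ACGTN]+_(?P<sample_id>.+)_L\d{3}_R\d\.fastq\.gz$"
)


def get_sample_id(file_name):
    """Return the sample ID embedded in file_name, or None if it does not match."""
    match = SAMPLE_ID_PATTERN.match(file_name)
    return match.group("sample_id") if match else None


sample_id = get_sample_id("lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz")
print(sample_id)  # 00h_Cell_WT3
```

Anchoring the pattern on the lane prefix and the `_L008_R1` style suffix lets the sample ID itself contain underscores, as it does in 00h_Cell_WT3.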

An overview of Illumina multiplexing

Multiplexing

Index 1 (i7) is always read. Index 2 (i5) is read only in a dual-index setting.

| | Single indexing | Dual indexing |
| --- | --- | --- |
| Single end (SE) | R1, R2 | R1, R2, R3 |
| Paired end (PE) | R1, R2, R4 | R1, R2, R3, R4 |

Multiplexing - dual index reads - paired end

June 10, 2018 · 1 min · Firas Sadiyah
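
The read combinations listed in this excerpt can be expressed as a small lookup function. This is a sketch under an assumption: the mapping of R2 to Index 1 (i7), R3 to Index 2 (i5), and R4 to the second read of a pair is inferred from the table rather than stated in the post, and the function name `expected_reads` is hypothetical.

```python
def expected_reads(paired_end, dual_index):
    """List the reads expected for a given sequencing configuration."""
    # Assumed mapping, inferred from the table: R1 is the first read and
    # R2 is Index 1 (i7), which is always read.
    reads = ["R1", "R2"]
    if dual_index:
        reads.append("R3")  # assumed to be Index 2 (i5), dual indexing only
    if paired_end:
        reads.append("R4")  # assumed to be the second read of the pair
    return reads


for paired_end in (False, True):
    for dual_index in (False, True):
        print(paired_end, dual_index, expected_reads(paired_end, dual_index))
```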

Preparing genome reference in FASTA format

To prepare a genome reference in FASTA format for the mouse assembly NCBI37/mm9, we have two options:

From UCSC

Use the mm9 assembly from the UCSC golden path. Do not use the masked file chromFaMasked.tar.gz!

    # download `chromFa.tar.gz` from UCSC golden path
    wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
    # or
    rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz .

    # uncompress the downloaded file
    tar -xvzf chromFa.tar.gz

    # remove `*random.fa` chromosomes
    rm -rf *_random.fa

    # concatenate all FASTA files into a single file
    cat *....

December 10, 2017 · 2 min · Firas Sadiyah

macOS setup for data science

Here, I summarise how I set up my macOS for data science. Ultimately, I should invest some time in automating the process.

Install Command Line Developer Tools

    xcode-select --install

Install Homebrew

    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Turn off analytics (optional)

    brew analytics off

Install iTerm

    brew install iterm

Configure iTerm

To download different colour schemes, visit iTerm Themes

Install Go2Shell

Download and install from Go2Shell, then configure it to work with iTerm:...

October 8, 2017 · 3 min · Firas Sadiyah

Ubuntu setup for data science

This post shows how I set up a new Ubuntu installation for data science.

Update Ubuntu

    sudo apt update
    sudo apt upgrade

Configure apt

It’s always a good idea to check the baseline setup. So, let’s check the current apt keys.

    sudo apt-key list

Then, let’s check the sources.

    cat /etc/apt/sources.list
    cd /etc/apt/sources.list.d/

And finally, let’s check the users and groups.

    cut -d: -f1 /etc/passwd   # will list all local users
    cat /etc/passwd           # will list all local users with groups and other properties
    getent passwd             # same as above

    cut -d: -f1 /etc/group    # will list all local groups
    cat /etc/group            # will list all local groups
    id -Gn
    groups

    getent group groupname    # will list all members of groupname
    id username               # will list all the groups a particular username belongs to

Configure Unity

To minimise a window by clicking its icon:...

September 16, 2017 · 7 min · Firas Sadiyah

Mounting remote drives locally using sshfs

To mount a network drive as a local one, you can use sshfs.

Generate SSH key

    $ mkdir ~/.ssh  # if it does not already exist
    $ chmod 700 ~/.ssh
    $ cd ~/.ssh
    $ ssh-keygen -t rsa
    # when prompted, enter a key name
    # when prompted, enter a passphrase
    $ ssh-copy-id -i [path to rsa file] USER@SERVER

Here, -i indicates where the RSA key file is located. The .pub key is the one that needs to be copied....

August 29, 2017 · 1 min · Firas Sadiyah

Managing Python virtual environments

Installation

    # macOS
    brew install python3 python-pip

    # Ubuntu
    sudo apt install python3 python-pip

    # Update pip to the latest version
    pip install --upgrade pip setuptools wheel
    pip3 install --upgrade pip setuptools wheel

    # Install virtualenvwrapper
    pip3 install virtualenv virtualenvwrapper

Create projects directory

    mkdir ~/Projects

Configure .profile (Bash) or .zprofile (ZSH):

    # needed for virtualenvwrapper
    export WORKON_HOME=$HOME/.virtualenvs
    export PROJECT_HOME=$HOME/Projects
    export VIRTUALENVWRAPPER_PYTHON=/bi/home/USER/.linuxbrew/bin/python3
    export VIRTUALENVWRAPPER_VIRTUALENV=/bi/home/USER/.linuxbrew/bin/virtualenv
    export PIP_REQUIRE_VIRTUALENV=true
    source /bi/home/USER/.linuxbrew/bin/virtualenvwrapper.sh

Reload profile

    source ~/....

August 1, 2017 · 2 min · Firas Sadiyah

DaPars for alternative polyadenylation analysis

A colleague of mine asked me for help in using DaPars to analyse alternative polyadenylation in their RNA-seq dataset, so I thought I would write a short post describing how I use it.

From Xia et al. 2014:

"Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq."

Installation

Download the source files of DaPars from GitHub and extract the files:...

March 19, 2017 · 4 min · Firas Sadiyah