<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts on Firas Sadiyah</title>
    <link>https://firas.io/posts/</link>
    <description>Recent content in Posts on Firas Sadiyah</description>
    <image>
      <title>Firas Sadiyah</title>
      <url>https://firas.io/favicon.png</url>
      <link>https://firas.io/favicon.png</link>
    </image>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <lastBuildDate>Sat, 24 Feb 2024 12:00:00 +0000</lastBuildDate>
    <atom:link href="https://firas.io/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Python Exception Handling</title>
      <link>https://firas.io/posts/python-exception-handling/</link>
      <pubDate>Sat, 24 Feb 2024 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/python-exception-handling/</guid>
      <description>In genomic data analysis, we often use a pipeline function to process data stored in a dataframe by calling several mini-functions. Each mini-function may modify the dataframe by adding a new column with new values and then filter out the rows that do not meet certain criteria. However, this may result in an empty dataframe if none of the rows satisfy the filters and can lead to errors or unexpected results when the pipeline function tries to perform more operations on the empty dataframe.</description>
    </item>
    <item>
      <title>Directing Python Output in Terminal</title>
      <link>https://firas.io/posts/python-output-terminal/</link>
      <pubDate>Sun, 18 Feb 2024 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/python-output-terminal/</guid>
      <description>When writing Python scripts, it&amp;rsquo;s common to want to display information messages in the user terminal and possibly save them to a file like output.log. Therefore, it&amp;rsquo;s essential to understand the levels of logging in Python, the types of outputs in Unix-like systems, and how to direct the appropriate type of logging to the right output.
Understanding Output Streams in Unix In Unix systems, there are two primary streams for output: standard output (stdout) and standard error (stderr).</description>
    </item>
    <item>
      <title>Running PyEnsembl on CentOS HPC</title>
      <link>https://firas.io/posts/pyensembl_centos_setup/</link>
      <pubDate>Sun, 11 Feb 2024 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/pyensembl_centos_setup/</guid>
      <description>Today, I am installing a Python package on one of the high-performance computing (HPC) systems that I work with. This CentOS 7 setup resembles a fortress: its stringent security protocols are understandable given that it hosts clinical data, but they make installing essential Python packages from public repositories quite a challenge.
The package I&amp;rsquo;m attempting to install is PyEnsembl, a handy tool for accessing Ensembl&amp;rsquo;s genetic information.
PyEnsembl Setup I started by creating a new conda environment:</description>
    </item>
    <item>
      <title>Debugging Python codebases using PyCharm and VSCode</title>
      <link>https://firas.io/posts/python-code-debugging/</link>
      <pubDate>Sun, 04 Feb 2024 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/python-code-debugging/</guid>
      <description>Debugging, an essential process for identifying and rectifying errors in a computer program, is particularly crucial for computational biologists dealing with complex codebases that often involve intricate mathematical models, data analysis, and simulations. Merely reading the code may not suffice to grasp the logic and functionality of the project. To gain a deeper understanding, you may need to run the code, examine the variables, and observe the outputs. Towards this end, the integrated debuggers in PyCharm and VSCode prove invaluable.</description>
    </item>
    <item>
      <title>Managing Python virtual environments</title>
      <link>https://firas.io/posts/python_env_mgmt/</link>
      <pubDate>Sun, 28 Jan 2024 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/python_env_mgmt/</guid>
      <description>When it comes to Python development on macOS, I rely on a combination of two tools that have served me exceptionally well over the past few years: Pyenv and Poetry. Pyenv provides an elegant solution for managing different Python versions on my system, while Poetry simplifies dependency management and the creation of virtual environments for my projects. In this article, I will guide you through the process of setting up Pyenv to install a specific Python version and then using Poetry to create a virtual environment for your project.</description>
    </item>
    <item>
      <title>Running NetMHCPan on Apple Silicon</title>
      <link>https://firas.io/posts/macos_netmhcpan/</link>
      <pubDate>Tue, 23 Jan 2024 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/macos_netmhcpan/</guid>
      <description>NetMHCPan, a widely used tool for predicting peptide binding to major histocompatibility complex (MHC) molecules, is essential for understanding immune responses. However, the tool&amp;rsquo;s binaries are currently available only for the x86_64 architecture, whether on Darwin (macOS) or Linux. As I intended to conduct test runs on my Apple Silicon device (arm64), I encountered the following error:
netMHCpan: no binaries found for Darwin_arm64 /net/sund-nas.win.dtu.dk/storage/services/www/packages/netMHCpan/4.1/netMHCpan-4.1/Darwin_arm64/bin/netMHCpan To address this, one option is to run NetMHCPan using Rosetta, a dynamic binary translator that translates executable code on-the-fly.</description>
    </item>
    <item>
      <title>Setting up a reproducible R environment on macOS</title>
      <link>https://firas.io/posts/r_macos/</link>
      <pubDate>Mon, 10 Oct 2022 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/r_macos/</guid>
      <description>Using renv is an excellent choice for maintaining a clean and reproducible R environment on macOS. Here, I will share my experiences and provide a guide on setting up R on macOS. The post is divided into the following sections:
Installing system dependencies required for R libraries using Homebrew. Installing R libraries using renv. Saving and restoring R environments using renv. Installing R libraries hosted on private repositories. Building R libraries from source using Makevars.</description>
    </item>
    <item>
      <title>Running NextFlow using AWS Batch</title>
      <link>https://firas.io/posts/nextflow_aws/</link>
      <pubDate>Sat, 01 May 2021 00:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/nextflow_aws/</guid>
      <description>Genomics workflows require a variety of tools with distinct computing resource requirements. Some of these requirements exceed the capacity of even powerful workstations. High-performance clusters (HPC) offer the possibility to scale up computing resources to meet the increasing demands of these processes. However, maintaining the infrastructure of an on-premises HPC may not be a viable solution for many parties. A more cost-effective approach is to provision computing resources dynamically by leveraging cloud computing.</description>
    </item>
    <item>
      <title>Using reticulate in RStudio with pyenv</title>
      <link>https://firas.io/posts/pyenv_rstudio/</link>
      <pubDate>Wed, 28 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/pyenv_rstudio/</guid>
      <description>When developing in Python, it is generally a good practice not to rely on the Python version that ships with the operating system (OS). This ensures that the system version of Python remains relatively &amp;lsquo;clean&amp;rsquo; for the OS processes. In addition, installing custom version(s) of Python opens many possibilities. For one, it gives us control over which specific version(s) to use in our projects; for another, by using a virtual environment manager, we ensure that each project has access to its own tailored list of packages.</description>
    </item>
    <item>
      <title>Using data.table with OpenMP support</title>
      <link>https://firas.io/posts/data_table_openmp/</link>
      <pubDate>Mon, 26 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/data_table_openmp/</guid>
      <description>If you are facing difficulties with large data sets in R, using data.table could provide a performance boost. However, when loading data.table, especially on macOS, you might encounter a warning indicating the absence of OpenMP support, causing data.table to operate in a single-threaded mode. This limitation prevents you from fully utilizing the potential benefits of using data.table and taking advantage of the underlying hardware.
library(data.table) data.table 1.14.0 using 1 threads (see ?</description>
    </item>
    <item>
      <title>Using RegEx to capture file name in Groovy/NextFlow</title>
      <link>https://firas.io/posts/regex_groovy/</link>
      <pubDate>Sat, 06 Oct 2018 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/regex_groovy/</guid>
      <description>My sequencing files are named according to the following pattern lane5651_AAGAGGCA_00h_Cell_WT3_L008_R1.fastq.gz. I would like to capture the Sample ID as 00h_Cell_WT3 in order to name all downstream files accordingly. To this end, I wrote the following snippet:
#!/usr/bin/env nextflow // fastq files are stored in reads as paired ends R1 and R4 params.reads = &amp;#39;reads/lane*_*_*_*_R{1,4}.fastq.gz&amp;#39; Channel .fromFilePairs(params.reads, flat: true) .map { prefix, file1, file2 -&amp;gt; tuple(getSampleID(prefix), file1, file2) } .</description>
    </item>
    <item>
      <title>An overview of Illumina multiplexing</title>
      <link>https://firas.io/posts/illumina_sbs/</link>
      <pubDate>Sun, 10 Jun 2018 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/illumina_sbs/</guid>
      <description>Multiplexing
Index 1 (i7) is always read. Index 2 (i5) is read only in a dual-index setting. Single end (SE): single indexing R1, R2; dual indexing R1, R2, R3. Paired end (PE): single indexing R1, R2, R4; dual indexing R1, R2, R3, R4. Multiplexing - dual index reads - paired end </description>
    </item>
    <item>
      <title>Preparing genome reference in FASTA format</title>
      <link>https://firas.io/posts/genome_reference/</link>
      <pubDate>Sun, 10 Dec 2017 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/genome_reference/</guid>
      <description>To prepare a genome reference in FASTA format for the mouse assembly NCBI37/mm9, we have two options:
From UCSC Using the mm9 assembly from the UCSC Golden Path. Do not use the masked file chromFaMasked.tar.gz! # download `chromFa.tar.gz` from the UCSC Golden Path wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz # or rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz . # uncompress the downloaded file tar -xvzf chromFa.tar.gz # remove `*_random.fa` chromosomes rm -rf *_random.fa # concatenate all FASTA files into a single file cat *.</description>
    </item>
    <item>
      <title>macOS setup for data science</title>
      <link>https://firas.io/posts/macos_setup/</link>
      <pubDate>Sun, 08 Oct 2017 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/macos_setup/</guid>
      <description>Here, I summarise how I set up my macOS for the purpose of Data Science. Ultimately, I should invest some time into automating the process.
Install Command Line Developer Tools xcode-select --install Install HomeBrew /usr/bin/ruby -e &amp;#34;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)&amp;#34; Turn off analytics (optional) brew analytics off Install iTerm brew install iterm Configure iTerm To download different colour schemes, visit iTerm Themes Install Go2Shell Download and install from Go2Shell, then configure it to work with iTerm:</description>
    </item>
    <item>
      <title>Ubuntu setup for data science</title>
      <link>https://firas.io/posts/ubuntu_setup/</link>
      <pubDate>Sat, 16 Sep 2017 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/ubuntu_setup/</guid>
      <description>This post shows how I set up a new Ubuntu installation for Data Science.
Update Ubuntu sudo apt update sudo apt upgrade Configure apt It&amp;rsquo;s always a good idea to check the baseline setup. So, let&amp;rsquo;s check the current apt keys.
sudo apt-key list Then, let&amp;rsquo;s check the sources.
cat /etc/apt/sources.list cd /etc/apt/sources.list.d/ And finally, let&amp;rsquo;s check the users and groups.
cut -d: -f1 /etc/passwd # will list all local users cat /etc/passwd # will list all local users with groups and other properties getent passwd # same as above cut -d: -f1 /etc/group # will list all local groups cat /etc/group # will list all local groups id -Gn groups getent group groupname # will list all members of groupname id username # will list all the groups a particular username belongs to Configure Unity To minimise a window by clicking its icon:</description>
    </item>
    <item>
      <title>Mounting remote drives locally using sshfs</title>
      <link>https://firas.io/posts/macos_sshfs/</link>
      <pubDate>Tue, 29 Aug 2017 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/macos_sshfs/</guid>
      <description>To mount a network drive as a local one, you can use sshfs.
Generate SSH key $ mkdir ~/.ssh # if it does not already exist $ chmod 700 ~/.ssh $ cd ~/.ssh $ ssh-keygen -t rsa # enter a key name and a passphrase when prompted $ ssh-copy-id -i [path to rsa file] USER@SERVER Where -i indicates where the rsa file is located. The .pub key is the one that needs to be copied.</description>
    </item>
    <item>
      <title>Managing Python virtual environments</title>
      <link>https://firas.io/posts/python_virtualwrapper/</link>
      <pubDate>Tue, 01 Aug 2017 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/python_virtualwrapper/</guid>
      <description>Installation # macOS brew install python3 python-pip # Ubuntu sudo apt install python3 python-pip # Update pip to the latest version pip install --upgrade pip setuptools wheel pip3 install --upgrade pip setuptools wheel # Install virtualenvwrapper pip3 install virtualenv virtualenvwrapper Create projects directory mkdir ~/Projects Configure .profile (Bash) or .zprofile (ZSH): # needed for virtualenvwrapper export WORKON_HOME=$HOME/.virtualenvs export PROJECT_HOME=$HOME/Projects export VIRTUALENVWRAPPER_PYTHON=/bi/home/USER/.linuxbrew/bin/python3 export VIRTUALENVWRAPPER_VIRTUALENV=/bi/home/USER/.linuxbrew/bin/virtualenv export PIP_REQUIRE_VIRTUALENV=true source /bi/home/USER/.linuxbrew/bin/virtualenvwrapper.sh Reload profile source ~/.</description>
    </item>
    <item>
      <title>DaPars for alternative polyadenylation analysis</title>
      <link>https://firas.io/posts/dapars/</link>
      <pubDate>Sun, 19 Mar 2017 12:00:00 +0000</pubDate>
      <guid>https://firas.io/posts/dapars/</guid>
      <description>A colleague of mine asked me for help in using DaPars to analyse alternative polyadenylation in their RNA-seq dataset. So, I thought I would write a short post here describing how I use it.
From Xia et al. 2014
Here we develop a novel bioinformatics algorithm (DaPars) for the de novo identification of dynamic APAs from standard RNA-seq.
Installation Download the source files of DaPars from GitHub and extract the files:</description>
    </item>
  </channel>
</rss>
