In genomic data analysis, we often use a pipeline function to process data stored in a dataframe by calling several mini-functions. Each mini-function may add a new column to the dataframe and then filter out the rows that do not meet certain criteria. If none of the rows satisfy the filters, the result is an empty dataframe, which can lead to errors or unexpected results when the pipeline function tries to perform more operations on it. To avoid this situation, we can use two strategies. First, we can check that the dataframe is non-empty before applying any logic in each mini-function. Second, we can make the pipeline function fail gracefully if it receives an empty dataframe from any of the mini-functions by using a custom exception and a try-except block. Let’s take a look.
What is Exception Handling?
At times, Python code encounters errors that halt program execution. These errors, known as exceptions, arise for various reasons. For instance, attempting to divide a number by zero triggers a ZeroDivisionError exception. To prevent the program from crashing, we can use exception handling techniques to deal with the errors in different ways. For example, we can show a clear message to the user, write the error details to a log file, or try a different approach to solve the problem.
One way to handle exceptions in Python is to use the try-except statement. This statement allows us to test a block of code for possible errors and execute another block of code when an error occurs. The general syntax of the try-except statement is:
try:
    # code that might cause an error
except:
    # code to handle the error
We can also specify the type of error we want to handle after the except keyword. This way, we can have different blocks of code for different errors. For example, we can handle the ZeroDivisionError and the ValueError separately:
try:
    # code that might cause a ZeroDivisionError or a ValueError
except ZeroDivisionError:
    # code to handle the ZeroDivisionError
except ValueError:
    # code to handle the ValueError
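As a concrete illustration of this pattern, here is a minimal sketch that parses two strings and divides them. The function name safe_ratio and its messages are hypothetical, chosen only for this example.

```python
def safe_ratio(numerator_text, denominator_text):
    """Return numerator / denominator, or None if the inputs are unusable."""
    try:
        # int() raises ValueError for text that is not a whole number
        numerator = int(numerator_text)
        denominator = int(denominator_text)
        # dividing by zero raises ZeroDivisionError
        return numerator / denominator
    except ZeroDivisionError:
        print("Cannot divide by zero.")
        return None
    except ValueError:
        print("Both inputs must be whole numbers.")
        return None


print(safe_ratio("10", "4"))    # 2.5
print(safe_ratio("10", "0"))    # prints the zero-division message, then None
print(safe_ratio("ten", "4"))   # prints the value-error message, then None
```

Only the except block that matches the raised exception type runs, so each kind of failure gets its own message.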
In addition, we can specify our own custom error types if we need to handle more specific errors, as below.
Minimal Example
In the following minimal example, process_vcf() is a pipeline function that calls several mini-functions, each of which may return an empty dataframe. Let’s explore how we can apply the concept of exception handling to manage the situation where a mini-function returns an empty dataframe.
import pandas as pd

# create a dataframe with some genome variants
vcf_df = pd.DataFrame({"chrom": ["chr1", "chr2", "chr3", "chr4"],
                       "start": [1000, 2000, 3000, 4000],
                       "end": [1010, 2010, 3010, 4010],
                       "ref": ["A", "C", "G", "T"],
                       "alt": ["T", "G", "C", "A"]})

def filter_by_region(df):
    # filter by a region of interest; no variant in vcf_df overlaps this
    # region, so this operation will return an empty dataframe
    roi_chrom = "chr1"
    roi_start = 5000
    roi_end = 6000
    # keep rows on the chromosome whose [start, end] range overlaps the region
    return df[(df["chrom"] == roi_chrom) & (df["start"] < roi_end) & (df["end"] > roi_start)].copy()

def calculate_vaf(df):
    # calculate the variant allele frequency
    df.loc[:, "vaf"] = df["alt"].apply(lambda x: x.count(",") + 1) / (df["ref"].apply(len) + df["alt"].apply(len))
    return df

def process_vcf(df):
    df = filter_by_region(df)
    df = calculate_vaf(df)
    # ... more mini functions
    print(df)

process_vcf(vcf_df)
Adding Check for Empty DataFrame
The first strategy is to check if the dataframe is non-empty before applying any logic in each mini-function. This can be achieved using the .empty property of the dataframe, which returns True if the dataframe has no rows or no columns (that is, if either axis has length 0), and False otherwise.
def calculate_vaf(df):
    # calculate the variant allele frequency
    if not df.empty:  # skip the calculation when the dataframe is empty
        df.loc[:, "vaf"] = df["alt"].apply(lambda x: x.count(",") + 1) / (df["ref"].apply(len) + df["alt"].apply(len))
    return df
This way, we can avoid performing any operations on an empty dataframe that may cause errors or unexpected results.
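To make the behaviour of .empty concrete, here is a small sketch. The dataframe and the chrX filter are made up for illustration.

```python
import pandas as pd

variants = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [1000, 2000]})

# a filter that matches nothing keeps the columns but drops every row
no_rows = variants[variants["chrom"] == "chrX"]

print(variants.empty)          # False
print(no_rows.empty)           # True
print(list(no_rows.columns))   # ['chrom', 'start']
print(pd.DataFrame().empty)    # True
```

Note that a filtered dataframe still carries its column names, so printing it can look superficially normal even though it holds no data.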
Implementing Try-Except
The second strategy is to make the pipeline function fail gracefully if it receives an empty dataframe from any of the mini-functions by using 1) a custom exception and 2) a try-except block.
A custom exception is a user-defined class that inherits from the base Exception class and allows us to create our own type of exception. For example, we can create a custom exception called EmptyDataFrameException as follows:
class EmptyDataFrameException(Exception):
    pass
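The class body is just pass because EmptyDataFrameException inherits everything it needs from Exception, including storing the message we supply when the exception is raised and displaying it later.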
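To see how the message travels with the exception, here is a short sketch. The message text is made up for illustration.

```python
class EmptyDataFrameException(Exception):
    pass


try:
    raise EmptyDataFrameException("filtering left no variants")
except EmptyDataFrameException as e:
    # str(e) returns the message passed when the exception was raised
    message = str(e)
    print(message)                    # filtering left no variants
    print(isinstance(e, Exception))   # True
```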
Now, let’s integrate this custom exception into our functions. First, we’ll modify the calculate_vaf() function to raise EmptyDataFrameException when it receives an empty dataframe:
def calculate_vaf(df):
    # calculate the variant allele frequency
    # if the input dataframe is empty, raise an exception instead of continuing
    if df.empty:
        msg = "an empty dataframe passed to calculate_vaf()"
        raise EmptyDataFrameException(msg)
    df.loc[:, "vaf"] = df["alt"].apply(lambda x: x.count(",") + 1) / (df["ref"].apply(len) + df["alt"].apply(len))
    return df
If the input dataframe is empty, it will raise an EmptyDataFrameException with a clear message that indicates the source of the error.
Next, we’ll modify the process_vcf() function to catch the EmptyDataFrameException and handle it gracefully. We can use a try-except block to wrap the code that may raise the exception. If the exception occurs, we’ll print the message and return the empty dataframe without any further processing. For example:
def process_vcf(df):
    try:
        df = filter_by_region(df)
        df = calculate_vaf(df)
        # ... more mini functions
        print(df)
    except EmptyDataFrameException as e:
        print(f"Exiting the program due to {e}")
        return df
This code will try to execute the pipeline function and print the final dataframe. However, if any of the mini-functions raises an EmptyDataFrameException, it will catch it and print the message that explains the source of the error. This way, we can avoid any further errors or unexpected results that may occur due to the empty dataframe.
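When a pipeline has many mini-functions, repeating the same emptiness check in each one becomes tedious. One possible approach, sketched below, is to move the check into a small helper. The names ensure_not_empty, keep_chromosome and run_pipeline are hypothetical and not part of the original example.

```python
import pandas as pd


class EmptyDataFrameException(Exception):
    pass


def ensure_not_empty(df, step):
    """Return df unchanged, or raise if it is empty (hypothetical helper)."""
    if df.empty:
        raise EmptyDataFrameException(f"{step} received an empty dataframe")
    return df


def keep_chromosome(df, chrom):
    # example mini-function: validate the input, then filter
    df = ensure_not_empty(df, "keep_chromosome")
    return df[df["chrom"] == chrom].copy()


def run_pipeline(df):
    try:
        df = keep_chromosome(df, "chrX")   # matches nothing
        df = keep_chromosome(df, "chr1")   # raises: its input is now empty
    except EmptyDataFrameException as e:
        print(f"Stopping early: {e}")
        return None
    return df


variants = pd.DataFrame({"chrom": ["chr1", "chr2"], "start": [1000, 2000]})
result = run_pipeline(variants)
```

Passing the step name into the helper keeps the error message specific about where the empty dataframe was detected.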
Else and Finally
In addition to the try and except blocks, we can also use the else and finally blocks to handle different scenarios when an exception occurs. The else block is used to execute some code when no exception occurs in the try block.
The syntax of the try-except-else statement is:
try:
    # code that may cause an exception
except:
    # code to run when an exception occurs
else:
    # code to run when no exception occurs
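Before applying this to the genomic example, here is a minimal sketch of the else block on its own. The helper parse_position is hypothetical and used only for illustration.

```python
def parse_position(text):
    """Convert a genomic position given as text to an int (hypothetical helper)."""
    try:
        position = int(text)
    except ValueError:
        print(f"'{text}' is not a valid position.")
        return None
    else:
        # runs only when int() succeeded
        print(f"Parsed position {position}.")
        return position


parse_position("1000")   # prints: Parsed position 1000.
parse_position("abc")    # prints: 'abc' is not a valid position.
```

Keeping only the risky call inside try, and the follow-up work inside else, means an unrelated error in the follow-up code is not caught by the except block by accident.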
Let’s explore how we can incorporate them into our genomic example by introducing a new function that utilises the else block to print a message upon successful completion of an operation. Additionally, this example illustrates the use of multiple except statements, which handle different error types including ZeroDivisionError, ValueError, and KeyError.
def calculate_allele_frequency(df, allele, pop_size):
    # calculate the allele frequency of a variant in the dataframe, given a population size
    try:
        # raise a ValueError if the population size is negative
        if pop_size < 0:
            raise ValueError("The population size must not be negative.")
        # count occurrences of the allele in the ref and alt columns
        # (a missing column raises KeyError); int() converts the NumPy integer
        # to a Python int so that dividing by zero raises ZeroDivisionError
        allele_count = int(df["ref"].str.count(allele).sum() + df["alt"].str.count(allele).sum())
        # calculate the allele frequency
        allele_freq = allele_count / pop_size
        # print the allele frequency
        print(f"The allele frequency of {allele} is {allele_freq:.4f}.")
    except ZeroDivisionError:
        # print an error message if the population size is zero
        print("The population size cannot be zero.")
    except ValueError as e:
        # print the error message if the population size is negative
        print(e)
    except KeyError as e:
        # print an error message if the "ref" or "alt" column is missing
        print(f"The dataframe is missing the column {e}.")
    else:
        # print a message when no exception occurs
        print("The calculation was successful.")
This way, we can provide feedback to the user when the operation is interrupted due to any of the errors defined above, as well as when it is completed without any errors.
Finally, we can use the finally block to execute some code regardless of whether an exception occurs or not. This is useful for cleaning up resources or closing files. The syntax of the try-except-finally statement is:
try:
    # code that may cause an exception
except:
    # code to run when an exception occurs
finally:
    # code to run always
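One detail worth confirming is that the finally block runs even when the try or except block returns early. The sketch below records the order of execution in a list. The names run_step and events are hypothetical.

```python
events = []


def run_step(fail):
    try:
        events.append("try")
        if fail:
            raise ValueError("step failed")
        return "ok"
    except ValueError:
        events.append("except")
        return "recovered"
    finally:
        # runs on both paths, even though try and except both return
        events.append("finally")


print(run_step(fail=False), events)   # ok ['try', 'finally']
events.clear()
print(run_step(fail=True), events)    # recovered ['try', 'except', 'finally']
```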
Here’s how we can modify our genomic example to ensure that the DataFrame is always saved to a file, regardless of whether an exception occurs or not.
def process_vcf(df):
    try:
        df = filter_by_region(df)
        df = calculate_vaf(df)
        calculate_allele_frequency(df, "A", 1000)
        # ... more mini functions
        print(df)
    except EmptyDataFrameException as e:
        print(f"Exiting the program due to {e}")
        return df
    finally:
        df.to_csv("output.csv", index=False)
        print("The DataFrame was saved to output.csv")

process_vcf(vcf_df)
The complete code can be found in this gist
Conclusion
To handle errors due to an empty dataframe, we can use two strategies: check that the dataframe is non-empty before applying any logic in each mini-function, and make the pipeline function fail gracefully if it receives an empty dataframe from any of the mini-functions by using a custom exception and a try-except block.