In genomic data analysis, we often use a pipeline function to process data stored in a dataframe by calling several mini-functions. Each mini-function may add a new column to the dataframe and then filter out the rows that do not meet certain criteria. If no rows satisfy the filters, the result is an empty dataframe, which can lead to errors or unexpected results when the pipeline function performs further operations on it. To avoid this situation, we can use two strategies. First, we can check that the dataframe is non-empty before applying any logic in each mini-function. Second, we can make the pipeline function fail gracefully if it receives an empty dataframe from any of the mini-functions by using a custom exception and a try-except block. Let’s take a look.

What is Exception Handling?

At times, Python code encounters errors that halt program execution. These errors, known as exceptions, arise for various reasons. For instance, attempting to divide a number by zero triggers a ZeroDivisionError exception. To prevent the program from crashing, we can use exception handling techniques to deal with the errors in different ways. For example, we can show a clear message to the user, write the error details to a log file, or try a different approach to solve the problem.

One way to handle exceptions in Python is to use the try-except statement. This statement allows us to test a block of code for possible errors and execute another block of code when an error occurs. The general syntax of the try-except statement is:

try:
    # code that might cause an error
    pass
except:
    # code to handle the error
    pass

We can also specify the type of error we want to handle after the except keyword. This way, we can have different blocks of code for different errors. For example, we can handle the ZeroDivisionError and the ValueError separately:

try:
    # code that might cause a ZeroDivisionError or a ValueError
    pass
except ZeroDivisionError:
    # code to handle the ZeroDivisionError
    pass
except ValueError:
    # code to handle the ValueError
    pass
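To make the template above concrete, here is a minimal sketch in which each error type gets its own handler. The function name safe_ratio and its messages are illustrative assumptions, not part of the pipeline discussed later.

```python
def safe_ratio(numerator_text, denominator_text):
    """Parse two strings as integers and divide them, handling each error separately."""
    try:
        numerator = int(numerator_text)      # may raise ValueError
        denominator = int(denominator_text)  # may raise ValueError
        return numerator / denominator       # may raise ZeroDivisionError
    except ZeroDivisionError:
        print("Cannot divide by zero.")
        return None
    except ValueError:
        print("Both inputs must be whole numbers.")
        return None


print(safe_ratio("10", "4"))   # 2.5
print(safe_ratio("10", "0"))   # handled by the ZeroDivisionError branch, returns None
print(safe_ratio("ten", "4"))  # handled by the ValueError branch, returns None
```

Only the handler that matches the raised error runs; the other one is skipped.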

In addition, we can specify our own custom error types if we need to handle more specific errors, as below.

Minimal Example

In the following minimal example, process_vcf() is a pipeline function that calls several mini-functions, each of which may return an empty dataframe. Let’s explore how we can apply the concept of exception handling to manage the situation where a mini-function returns an empty dataframe.

import pandas as pd

# create a dataframe with some genome variants
vcf_df = pd.DataFrame({"chrom": ["chr1", "chr2", "chr3", "chr4"],
                       "start": [1000, 2000, 3000, 4000],
                       "end": [1010, 2010, 3010, 4010],
                       "ref": ["A", "C", "G", "T"],
                       "alt": ["T", "G", "C", "A"]})

def filter_by_region(df):
    # filter by a region of interest; this operation will return an empty dataframe
    roi_chrom = "chr1"
    roi_start = 0
    roi_end = 5000
    # filter by chromosome and range; the last condition (end < roi_start) can never
    # be true here, so no row survives (a real overlap test would use end > roi_start)
    return df[(df["chrom"] == roi_chrom) & (df["start"] < roi_end) & (df["end"] < roi_start)].copy()

def calculate_vaf(df):
    # calculate the variant allele frequency
    df.loc[:, "vaf"] = df["alt"].apply(lambda x: x.count(",") + 1) / (df["ref"].apply(len) + df["alt"].apply(len))
    return df

def process_vcf(df):
    df = filter_by_region(df)
    df = calculate_vaf(df)
    # ... more mini functions
    print(df)

process_vcf(vcf_df)

Adding Check for Empty DataFrame

The first strategy is to check if the dataframe is non-empty before applying any logic in each mini-function. This can be achieved using the .empty property of the dataframe, which returns True if the dataframe has no rows or no columns (that is, if either axis has length 0), and False otherwise.

def calculate_vaf(df):
    # calculate the variant allele frequency
    if not df.empty:  # only compute when the dataframe has data
        df.loc[:, "vaf"] = df["alt"].apply(lambda x: x.count(",") + 1) / (df["ref"].apply(len) + df["alt"].apply(len))
    # always return the dataframe, even when it is empty, so callers never receive None
    return df

This way, we can avoid performing any operations on an empty dataframe that may cause errors or unexpected results.
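As a quick illustration of how .empty behaves, the sketch below builds three small dataframes. The variable names are assumptions made for this example only.

```python
import pandas as pd

with_rows = pd.DataFrame({"chrom": ["chr1"], "start": [1000]})

# Filtering out every row keeps the columns but leaves zero rows.
no_rows = with_rows[with_rows["chrom"] == "chrX"]

# A dataframe with index labels but no columns also counts as empty.
no_columns = pd.DataFrame(index=[0, 1])

print(with_rows.empty)        # False
print(no_rows.empty)          # True
print(no_columns.empty)       # True
print(list(no_rows.columns))  # ['chrom', 'start']
```

Note that a filtered dataframe keeps its column names even when it has no rows, which is why later column lookups may still succeed on an empty result.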

Implementing Try-Except

The second strategy is to make the pipeline function fail gracefully if it receives an empty dataframe from any of the mini-functions by using 1) a custom exception and 2) a try-except block.

A custom exception is a user-defined class that inherits from the base Exception class and allows us to create our own type of exception. For example, we can create a custom exception called EmptyDataFrameException as follows:

1class EmptyDataFrameException(Exception):
2    pass

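Because the class body is just pass, EmptyDataFrameException behaves exactly like the built-in Exception. The message we supply when raising it is stored on the exception and shown when the exception is printed.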
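The following minimal sketch shows the custom exception being raised and caught on its own, outside the pipeline. The message text and the variable caught_message are illustrative assumptions.

```python
class EmptyDataFrameException(Exception):
    """Raised when a pipeline step has no rows to work with."""
    pass


try:
    raise EmptyDataFrameException("no variants left after filtering")
except EmptyDataFrameException as e:
    # str(e) gives back the message passed when the exception was raised
    caught_message = str(e)
    print(f"Caught: {caught_message}")
```

Since the class inherits from Exception, a handler written for Exception would also catch it; naming the custom class in the except clause keeps the handling specific to empty dataframes.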

Now, let’s integrate this custom exception into our functions. First, we’ll modify the calculate_vaf() function to raise EmptyDataFrameException when the dataframe becomes empty:

def calculate_vaf(df):
    # calculate the variant allele frequency
    # if the input dataframe is empty, raise an exception instead of computing
    if df.empty:
        msg = "an empty dataframe passed to calculate_vaf()"
        raise EmptyDataFrameException(msg)

    df.loc[:, "vaf"] = df["alt"].apply(lambda x: x.count(",") + 1) / (df["ref"].apply(len) + df["alt"].apply(len))
    return df

If the input dataframe is empty, it will raise an EmptyDataFrameException with a clear message that indicates the source of the error.

Next, we’ll modify the process_vcf() function to catch the EmptyDataFrameException and handle it gracefully. We can use a try-except block to wrap the code that may raise the exception. If the exception occurs, we’ll print the message and return the empty dataframe without any further processing. For example:

def process_vcf(df):
    try:
        df = filter_by_region(df)
        df = calculate_vaf(df)
        # ... more mini functions
        print(df)
        return df
    except EmptyDataFrameException as e:
        print(f"Exiting the program due to {e}")
        return df

This code will try to execute the pipeline function and print the final dataframe. However, if any of the mini-functions raises an EmptyDataFrameException, it will catch it and print the message that explains the source of the error. This way, we can avoid any further errors or unexpected results that may occur due to the empty dataframe.
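One possible variation, sketched here under stated assumptions, is to move the emptiness check into a small helper so that every mini-function does not need its own check. The names require_rows, keep_chromosome and run_pipeline are hypothetical and only stand in for the functions discussed above.

```python
import pandas as pd


class EmptyDataFrameException(Exception):
    pass


def require_rows(df, step_name):
    # Hypothetical helper: raise if the given step left no rows behind.
    if df.empty:
        raise EmptyDataFrameException(f"an empty dataframe after {step_name}")
    return df


def keep_chromosome(df, chrom):
    # Stand-in for a filtering mini-function.
    return df[df["chrom"] == chrom].copy()


def run_pipeline(df):
    try:
        df = keep_chromosome(df, "chrX")     # no row matches, so df becomes empty
        require_rows(df, "keep_chromosome")  # raises here
        df["length"] = df["end"] - df["start"]  # skipped because of the exception
    except EmptyDataFrameException as e:
        print(f"Exiting the pipeline due to {e}")
    return df


variants = pd.DataFrame({"chrom": ["chr1", "chr2"],
                         "start": [1000, 2000],
                         "end": [1010, 2010]})
result = run_pipeline(variants)
```

The filter result is assigned to df before the check, so the except branch returns the empty dataframe rather than the original input.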

Else and Finally

In addition to the try and except blocks, we can also use the else and finally blocks to handle different scenarios when an exception occurs. The else block is used to execute some code when no exception occurs in the try block. The syntax of the try-except-else statement is:

try:
    # code that may cause an exception
    pass
except:
    # code to run when an exception occurs
    pass
else:
    # code to run when no exception occurs
    pass

Let’s explore how we can incorporate them into our genomic example by introducing a new function that utilises the else block to print a message upon successful completion of an operation. Additionally, this example illustrates the use of multiple except statements, which handle different error types including ZeroDivisionError, ValueError, and KeyError.

def calculate_allele_frequency(df, allele, pop_size):
    # calculate the allele frequency of a variant in the dataframe, given a population size
    try:
        # raise a ValueError if the population size is negative
        if pop_size < 0:
            raise ValueError("The population size must not be negative.")
        # count occurrences of the allele in the ref and alt columns;
        # int() converts the NumPy integer to a plain Python int so that
        # dividing by zero raises ZeroDivisionError instead of returning inf
        allele_count = int(df['ref'].str.count(allele).sum() + df['alt'].str.count(allele).sum())
        # calculate the allele frequency
        allele_freq = allele_count / pop_size
        # print the allele frequency
        print(f"The allele frequency of {allele} is {allele_freq:.4f}.")
    except ZeroDivisionError:
        # reached when the population size is zero
        print("The population size cannot be zero.")
    except ValueError as e:
        # reached when the population size is negative
        print(e)
    except KeyError as e:
        # reached when the 'ref' or 'alt' column is missing from the dataframe
        print(f"The dataframe is missing the column {e}.")
    else:
        # print a message when no exception occurs
        print("The calculation was successful.")

This way, we can provide feedback to the user when the operation is interrupted due to any of the errors defined above, as well as when it is completed without any errors.

Finally, we can use the finally block to execute some code regardless of whether an exception occurs or not. This is useful for cleaning up resources or closing files. The syntax of the try-except-finally statement is:

try:
    # code that may cause an exception
    pass
except:
    # code to run when an exception occurs
    pass
finally:
    # code to run always
    pass
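The four clauses can also be combined in one statement. The sketch below records which clauses run in each case; the function run_step and its events list are assumptions introduced for illustration.

```python
def run_step(value):
    events = []
    try:
        events.append("try")
        result = 10 / value
    except ZeroDivisionError:
        events.append("except")   # runs only when the division fails
        result = None
    else:
        events.append("else")     # runs only when no exception occurred
    finally:
        events.append("finally")  # runs in both cases
    return result, events


print(run_step(2))  # (5.0, ['try', 'else', 'finally'])
print(run_step(0))  # (None, ['try', 'except', 'finally'])
```

In both runs the finally clause executes last, which is what makes it suitable for clean-up work.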

Here’s how we can modify our genomic example to ensure that the DataFrame is always saved to a file, regardless of whether an exception occurs or not.

def process_vcf(df):
    try:
        df = filter_by_region(df)
        df = calculate_vaf(df)
        calculate_allele_frequency(df, "A", 1000)
        # ... more mini functions
        print(df)
        return df
    except EmptyDataFrameException as e:
        print(f"Exiting the program due to {e}")
        return df
    finally:
        # runs before either return above hands control back to the caller
        df.to_csv("output.csv", index=False)
        print("The DataFrame was saved to output.csv")

process_vcf(vcf_df)

The complete code can be found in this gist

Conclusion

To handle errors due to an empty dataframe, we can use two strategies: check that the dataframe is non-empty before applying any logic in each mini-function, and make the pipeline function fail gracefully if it receives an empty dataframe from any of the mini-functions by using a custom exception and a try-except block.