Checkpoints in Data Analysis Code - Assertions
09 Apr 2020
Image by Nancy Hill from Pixabay
One of the functions of data analysis is to derive meaningful interpretation from the raw data that we collected. This process typically involves data transformation, manipulation, shaping and visualisation. Sometimes, we repeat these steps across several different processing pipelines.
The more we transform or manipulate the data, the higher the probability that we make a mistake somewhere, e.g. an unintentional cartesian join instead of an inner join, an incorrectly coded empty value, incorrect matrix dimensions, mismatched column/row indices etc. If these mistakes are not detected early enough, they invalidate all the analysis performed after that point. That is a strong incentive to detect these problems as early as possible. I am going to talk about a way to put some checkpoints into our code to guard against these mistakes.
Difference between analysis project and software development project
Before we dive into the checkpoints, I would like to distinguish between a software project and a data analysis project. Most commonly, we detect a software ‘bug’ when the software does not perform its function according to our expectations. In a way, we can detect the anomaly by observing the software's behaviour. In a data analysis project, however, the behaviour is less explicit, because most data analysis takes data as input and produces analysis/numbers as output. The code can run without any visible problem, so how do we know that the final numbers produced are correct, or whether we should trust the p-value produced? I hope this comparison strengthens the case for more vigilant checking at each step of the data transformation.
Assertions
The concept of assertion is rather old in software programming. It differs from unit testing: a unit test aims to verify the ‘behaviour’ of an independent function, while an assertion checks that the state of the code at a certain point of execution meets our expectations. Assertions therefore work at a finer level than unit tests.
Most implementations of assertions in programming languages involve calling a function, for example assert(condition, message). This function checks whether the condition is true. If the condition is false, the message is displayed and code execution is halted.
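As a rough illustration of that behaviour, here is a minimal sketch in R of what such a function might look like. The helper name assert_that_holds and the sample values are hypothetical and made up for this example; only stop(), isTRUE(), anyNA() and tryCatch() are real base R functions.

```r
# Hypothetical helper (not part of base R): halts with `message`
# when `condition` is not a single TRUE value.
assert_that_holds <- function(condition, message) {
  if (!isTRUE(condition)) {
    stop(message, call. = FALSE)
  }
  invisible(TRUE)
}

values <- c(2.5, 3.1, 4.8)  # made-up data

# Passes silently: no value is missing.
assert_that_holds(!anyNA(values), "values should not contain NA")

# A failing check stops execution. tryCatch is used here only so the
# example can capture the message instead of aborting the script.
result <- tryCatch(
  assert_that_holds(length(values) == 4, "values should have 4 elements"),
  error = function(e) conditionMessage(e)
)
print(result)
```

In a real pipeline you would normally not wrap the assertion in tryCatch; the whole point is that the script stops at the first broken expectation.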
You might think this is quite straightforward and could easily be implemented as an if-else statement in every piece of code. However, by defining the assertions explicitly, the semantics of the code focus on the condition being checked rather than on the conditional flow, which improves readability. Let’s look at a simple example in Matlab:
matrix = [1 2 3; 4 5 6; 7 8 9];
assert(size(matrix, 1)==3, "The matrix should have 3 rows");
assert(size(matrix, 2)==3, "The matrix should have 3 columns");
In the example above, we construct a 3x3 matrix and assert that, after the matrix construction, the number of rows and the number of columns are both 3. It is a very simple example, but it demonstrates the usage of assertions.
In R, there are several functions, such as stopifnot(all_equal(A, B)), to help you make explicit assertions (all_equal comes from the dplyr package; base R offers all.equal). I find these functions very useful within each step of the data processing, to make sure the data meets the expectations at every step.
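To make this more concrete, here is a small sketch of checkpoints around a join, using only base R. The two data frames and their column names are made up for illustration. The named form of stopifnot(), where the name becomes the error message, assumes R 4.0.0 or later.

```r
# Illustrative data (made up for this sketch).
subjects <- data.frame(id = c(1, 2, 3), group = c("a", "b", "a"))
scores   <- data.frame(id = c(1, 2, 3), score = c(10.5, 8.2, 9.7))

# Checkpoint before the join: the key must be unique on both sides,
# otherwise the join duplicates rows.
stopifnot(
  "subject ids must be unique" = !anyDuplicated(subjects$id),
  "score ids must be unique"   = !anyDuplicated(scores$id)
)

merged <- merge(subjects, scores, by = "id")  # inner join on id

# Checkpoints after the join: no rows gained or lost, no missing scores.
stopifnot(
  "join should keep one row per subject" = nrow(merged) == nrow(subjects),
  "no score should be missing"           = !anyNA(merged$score)
)
```

If a duplicated id slipped into either table, or the join silently multiplied rows, the script would stop at the relevant checkpoint instead of carrying the problem into later steps.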
Thoughts
Although we have various tools to help us put checkpoints along the data pipelines, ultimately it is up to us to define which criteria or checks we want.
In reality, it is a waste of time to put in checks like the example above, because we trust that the programming language we use will give us a 3x3 matrix if we create one. Checkpoints become particularly important when you perform many transformations (filters, transposes) or a lot of row/column index selection.
So next time you have doubts about your result or analysis after performing many transformation/processing gymnastics, it is time to go back and put some checkpoints in each processing stage. They ensure that each sub-dataset is correctly defined, and at the same time they increase your confidence that you are looking at the right result.
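As a sketch of what such checkpoints might look like after a filter and a transpose, consider the following base R example. The matrix, its labels and the filter threshold are all made up for illustration.

```r
# Illustrative data (made up): 4 samples (rows) x 3 measurements (columns).
measurements <- matrix(
  c(1.2, 3.4, 2.2,
    0.9, 3.1, 2.8,
    1.5, 2.9, 2.4,
    1.1, 3.3, 2.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("x", "y", "z"))
)

# Filter: keep samples whose x value is above 1.
keep <- measurements[, "x"] > 1
filtered <- measurements[keep, , drop = FALSE]

# Checkpoint: the filter kept exactly the flagged rows and left columns intact.
stopifnot(
  nrow(filtered) == sum(keep),
  identical(colnames(filtered), colnames(measurements)),
  all(filtered[, "x"] > 1)
)

# Transpose, then check that rows and columns were actually swapped.
transposed <- t(filtered)
stopifnot(
  identical(dim(transposed), rev(dim(filtered))),
  identical(rownames(transposed), colnames(filtered))
)
```

The drop = FALSE argument keeps the result a matrix even when only one row survives the filter, which is exactly the kind of silent shape change these checkpoints are meant to catch.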