Parallel Execution in R - Divide and Conquer
13 Aug 2020 (Image by Robert V. Ruggiero)
Parallel processing is not a new concept in computing. With today's need to process and analyse massive datasets efficiently, parallelism is an attractive idea. There are a few goodies in R that I came across previously that not only help to speed up processing, but also make you think about the structure of your code: splitting it into smaller independent chunks that can be executed concurrently across CPUs or GPUs.
Let’s start with an example. Imagine you have a 5000x5000 matrix of real numbers, and you want to perform some arithmetic operation (say Operation1) on every element to derive another matrix of the same size. The most obvious way to do this is with a nested for-loop:
new_matrix <- matrix(nrow = 5000, ncol = 5000)
original_matrix <- matrix(2, nrow = 5000, ncol = 5000)

Operation1 <- function(a, power) {
  return(a^power)
}

system.time({
  for (i in 1:nrow(original_matrix)) {
    for (j in 1:ncol(original_matrix)) {
      new_matrix[i, j] <- Operation1(original_matrix[i, j], 3)  # power of 3
    }
  }
})
# Post execution check (see my previous post about assertion)
stopifnot(length(which(new_matrix==(2^3))) == nrow(new_matrix) * ncol(new_matrix))
There is nothing wrong with the code above; in fact, it is quite easy to read. Executing it took around 20 seconds:
user system elapsed
18.694 0.341 19.803
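As an aside, for a purely elementwise operation like this, R's arithmetic operators are already vectorised, so the whole nested loop can be replaced by a single expression. A minimal sketch (same values as the example above):

```r
original_matrix <- matrix(2, nrow = 5000, ncol = 5000)

# ^ is applied elementwise across the whole matrix in one vectorised call
new_matrix <- original_matrix^3

stopifnot(all(new_matrix == 8))
```

Vectorisation only applies when the operation is elementwise, though; the loop-based structure is what generalises to arbitrary per-element work, which is why it is a useful starting point for parallelism.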
Before we move into how we can modify the code above to run in parallel, we should first look at making it more “functional”. Functional programming is a programming paradigm that models computation mainly as functions. By design, functional programming makes parallel processing easier, because each unit of work (a function call) can be sent to a different processor (core). The premise, however, is that each unit must be independent, so that all units can be executed concurrently. In other words, one function call cannot depend on the result of another.
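As a stepping stone, here is a sketch of the same computation written in a functional style with sapply, still serial, with no parallelism yet (smaller matrix used here purely for illustration):

```r
Operation1 <- function(a, power) {
  return(a^power)
}

original_matrix <- matrix(2, nrow = 100, ncol = 100)

# Apply Operation1 to one row at a time; each iteration is
# independent of the others, which is what makes it parallelisable later
result <- sapply(1:nrow(original_matrix),
                 function(i) Operation1(original_matrix[i, ], 3))

# sapply binds each row's result as a column, so transpose to restore shape
new_matrix <- t(result)
stopifnot(all(new_matrix == 8))
```

The for-loops are gone, and the per-row function is now an obvious candidate to hand off to multiple cores.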
Assuming that the operations in the example above are independent, we can refactor the code to use some nice “functional” helper functions (such as lapply, sapply, vapply and mapply) to hide away the for-loops. First, we need to identify the actual unit of work. Here, it is obvious that Operation1 can be that unit. We will use the parallel function mclapply to abstract away both nested loops:
new_matrix <- matrix(nrow = 5000, ncol = 5000)
original_matrix <- matrix(2, nrow = 5000, ncol = 5000)

Operation1 <- function(a, power) {
  return(a^power)
}

# Processes one whole row at a time: target_matrix[row, ] is a vector,
# and ^ is applied to it elementwise
Operation1Parallel <- function(row, power, target_matrix) {
  return(target_matrix[row, ]^power)
}

library(parallel)  # provides mclapply and detectCores

cores <- detectCores()
print(paste0("Number of cores: ", cores))

system.time({
  result <- mclapply(1:nrow(original_matrix),
                     FUN = Operation1Parallel,
                     power = 3,
                     target_matrix = original_matrix,
                     mc.cores = cores)
})
# Post execution check (see my previous post about assertion)
# unlist flattens the list of row vectors; rebuild the matrix row by row
new_matrix <- matrix(unlist(result), nrow = 5000, ncol = 5000, byrow = TRUE)
stopifnot(length(which(new_matrix == (2^3))) == nrow(new_matrix) * ncol(new_matrix))
Executing the code above, surprisingly, takes only about 1.5 seconds on my 4-core MacBook Pro:
user system elapsed
0.695 0.432 1.419
Note that the last statement is an assertion checking that all the values in the final result match the output matrix from the nested for-loop. So now we have two ways to achieve the same result: one with a traditional for-loop, and one that takes advantage of multiple cores (parallel processing) and reduces the processing time by about 93%!
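For completeness, the same row-wise computation can also be expressed with the foreach and doParallel packages; a sketch (not benchmarked here, smaller matrix for illustration):

```r
library(foreach)
library(doParallel)  # also loads parallel, for makeCluster/detectCores

original_matrix <- matrix(2, nrow = 1000, ncol = 1000)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Each iteration handles one row; .combine = rbind stacks the row results
new_matrix <- foreach(i = 1:nrow(original_matrix), .combine = rbind) %dopar% {
  original_matrix[i, ]^3
}

stopCluster(cl)
stopifnot(all(new_matrix == 8))
```

One practical difference: mclapply relies on forking, so it falls back to serial execution on Windows, whereas this cluster-based approach works across platforms.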
Although this seems amazing, there are a few complications (or rather, a paradigm shift in how we design code) that I think come with parallelism:
- We need to think in terms of single, stateless units of execution. If different parts of the system need to communicate with each other, then how the parallel layers are modelled becomes important. This is similar to how a basic RNN cannot, by default, take advantage of parallelism across time steps, while a basic CNN can (within the same layer).
- How to aggregate the results once the calculation is completed. Here, one thing I found useful is using unlist to reduce the dimension of the result.
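To illustrate that aggregation step, here is a small sketch of flattening a list of row results with unlist and rebuilding the matrix (byrow = TRUE because each list element corresponds to one row):

```r
# A list of row vectors, as returned by lapply/mclapply over row indices
result <- lapply(1:3, function(i) c(i, i * 10, i * 100))

flat <- unlist(result)  # collapses the list into one numeric vector
new_matrix <- matrix(flat, nrow = 3, byrow = TRUE)

# Row i of the matrix is the i-th list element
stopifnot(all(new_matrix[2, ] == c(2, 20, 200)))
```

Forgetting byrow = TRUE here would silently fill the matrix column by column, which is easy to miss when every element has the same value (as in the 2^3 example above).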