Are you doing parallel computations in R? Then use BiocParallel

It’s the morning of the first day of oral conferences at #ENAR2016. I feel like I have a spidey sense since I woke up 3 min after an email from Jeff Leek; just a funny coincidence. Anyhow, I promised Valerie Obenchain at #Bioc2014 that I would write a post about one of my favorite Bioconductor packages: BiocParallel (Morgan, Obenchain, Lang, and Thompson, 2016). By now it’s on the top 5% of downloaded Bioconductor packages, so many people know about it or are unaware that their favorite package uses it behind the scenes.

BioconductorLogo

While I haven’t blogged about BiocParallel yet, I did give a presentation about it at our computing club back in April 2nd, 2015. See it here (source). I’m going to follow its structure in this post.

Parallel computing

Before even thinking about using BiocParallel you have to decide whether parallel computing is the thing you need.

Cloud joke

While I’m not talking about cloud computing, I still find this picture funny.

Parallel diagram

There’s different types of parallel computing, but what I’m referring to here is called embarrassingly parallel where you have a task to do for a set of inputs, you split your inputs into subsets and perform the task on these subsets. Performing this task for one input a a time is called serial programming and it’s what we do in most cases when using functions like lapply() or for loops.

plot(y = 10 / (1:10), 1:10, xlab = 'Number of cores', ylab = 'Time',
    main = 'Ideal scenario', type = 'o', col = 'blue',
    cex = 2, cex.axis = 2, cex.lab = 1.5, cex.main = 2, pch = 16)

center

You might be running a simulation for a different set of parameters (a parameter grid) and running each simulation could take some time. Parallel computing can help you speed up this problem. In the ideal scenario, the higher number of computing cores (units that evaluate subsets of your inputs) the less time you need to run your full analysis.

plot(y = 10 / (1:10), 1:10, xlab = 'Number of cores', ylab = 'Time',
    main = 'Reality', type = 'o', col = 'blue',
    cex = 2, cex.axis = 2, cex.lab = 1.5, cex.main = 2, pch = 16)
lines(y = 10 / (1:10) * c(1, 1.05^(2:10) ), 1:10, col = 'red',
    type = 'o', cex = 2)

center

However, in reality parallel computing is not cost-free. It involves some communication costs, like sending the data to the cores, aggregating the results in a way that you can then easily use, among other things. So, it’ll be a bit slower than the ideal scenario but you can potentially still greatly reduce the overall time.

Having said all of the above, lets say that you now want to do some parallel computing in R. Where do you start? A pretty good place to start is the CRAN Task View: High-Performance and Parallel Computing with R. There you’ll find a lot of information about different packages that enable you to do parallel computing with R.

Confusing word cloud

But you’ll soon be lost in a sea of new terms.

Why use BiocParallel?

  • It’s simple to use.
  • You can try different parallel backends without changing your code.
  • You can use it to submit cluster jobs.
  • You’ll have access to great support from the Bioconductor developer team.

Those are the big reasons of why I use BiocParallel. But let me go through them a bit more slowly.

Birthday example

I’m going to use as an example the birthday problem where you want to find out empirically the probability that two people share the same birthday in a room.

birthday <- function(n) {
    m <- 10000
    x <- numeric(m)
    for(i in seq_len(m)) {
        b <- sample(seq_len(365), n, replace = TRUE)
        x[i] <- ifelse(length(unique(b)) == n, 0, 1)
    }
    mean(x)
}

Naive birthday code

Once you have written the code for it, you can then use lapply() or a for loop to calculate the results.

system.time( lapply(seq_len(100), birthday) )
##    user  system elapsed 
##  25.610   0.442  27.430

Takes around 25 seconds.

Via doMC

If you looked at CRAN Task View: High-Performance and Parallel Computing with R you might have found the doMC (Analytics and Weston, 2015).

It allows you to run computations in parallel as shown below.

library('doMC')
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(2)
system.time( x <- foreach(j = seq_len(100)) %dopar% birthday(j) )
##    user  system elapsed 
##  12.819   0.246  13.309

While it’s a bit faster, the main problem is that you had to change your code in order to be able to use it.

With BiocParallel

This is how you would run things with BiocParallel.

library('BiocParallel')
system.time( y <- bplapply(seq_len(100), birthday) )
##    user  system elapsed 
##   0.021   0.011  16.095

The only change here is using bplapply() instead of lapply(), so just 2 characters. Well, that and loading the BiocParallel package.

BiocParallel’s advantages

There are many computation backends and one of the strongest features of BiocParallel is that it’s easy to switch between them. For example, my computer can run the following options:

registered()
## $MulticoreParam
## class: MulticoreParam 
##   bpjobname:BPJOB; bpworkers:2; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
##   bpstopOnError:FALSE; bpprogressbar:FALSE
##   bpresultdir:NA
## cluster type: FORK 
## 
## $SnowParam
## class: SnowParam 
##   bpjobname:BPJOB; bpworkers:2; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
##   bpstopOnError:FALSE; bpprogressbar:FALSE
##   bpresultdir:NA
## cluster type: SOCK 
## 
## $SerialParam
## class: SerialParam 
##   bplog:FALSE; bpthreshold:INFO
##   bpcatchErrors:FALSE

If I was doing this in our computing cluster, I would see even more options.

Now lets say that I want to test different computation backends, or even run things in serial mode so I can trace a bug down more easily. Well, all I have to do is change the BPPARAM argument as shown below.

## Test in serial mode
system.time( y.serial <- bplapply(1:10, birthday,
    BPPARAM = SerialParam()) )
##    user  system elapsed 
##   2.577   0.033   2.733
## Try Snow
system.time( y.snow <- bplapply(1:10, birthday, 
    BPPARAM = SnowParam(workers = 2)) )
##    user  system elapsed 
##   0.027   0.006   2.436

Talking about computing clusters, you might be interested in using BatchJobs (Bischl, Lang, Mersmann, Rahnenführer, et al., 2015) just like Prasad Patil did for his PhD work. Well, with BiocParallel you can also chose to use the BatchJobs backend. I have code showing this at the presentation I referenced earlier.

Where do I start?

If you are convinced about using BiocParallel, which I hope you are by now, check out the Introduction to BiocParallel vignette available at BiocParallel’s landing page. It explains in more detail how to use it and it’s rich set of features. But if you just want to jump right in and start playing around with it, install it by running the following code:

install.packages("BiocManager")
BiocManager::install("BiocParallel")

Conclusions

Like I said earlier, BiocParallel is simple to use and has definite advantages over other solutions.

  • You can try different parallel backends without changing your code.
  • You can use it to submit cluster jobs.
  • You’ll have access to great support from the Bioconductor developer team. See the biocparallel tag at the support website.

Have fun using it!

Reproducibility

## Reproducibility info
library('devtools')
session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-03-07
## Packages ------------------------------------------------------------------
##  package        * version  date       source        
##  bibtex           0.4.0    2014-12-31 CRAN (R 3.2.0)
##  BiocParallel   * 1.4.3    2015-12-16 Bioconductor  
##  bitops           1.0-6    2013-08-17 CRAN (R 3.2.0)
##  codetools        0.2-14   2015-07-15 CRAN (R 3.2.2)
##  devtools       * 1.10.0   2016-01-23 CRAN (R 3.2.3)
##  digest           0.6.9    2016-01-08 CRAN (R 3.2.3)
##  doMC           * 1.3.4    2015-10-13 CRAN (R 3.2.0)
##  evaluate         0.8      2015-09-18 CRAN (R 3.2.0)
##  foreach        * 1.4.3    2015-10-13 CRAN (R 3.2.0)
##  formatR          1.2.1    2015-09-18 CRAN (R 3.2.0)
##  futile.logger    1.4.1    2015-04-20 CRAN (R 3.2.0)
##  futile.options   1.0.0    2010-04-06 CRAN (R 3.2.0)
##  httr             1.1.0    2016-01-28 CRAN (R 3.2.3)
##  iterators      * 1.0.8    2015-10-13 CRAN (R 3.2.0)
##  knitcitations  * 1.0.7    2015-10-28 CRAN (R 3.2.0)
##  knitr          * 1.12.3   2016-01-22 CRAN (R 3.2.3)
##  lambda.r         1.1.7    2015-03-20 CRAN (R 3.2.0)
##  lubridate        1.5.0    2015-12-03 CRAN (R 3.2.3)
##  magrittr         1.5      2014-11-22 CRAN (R 3.2.0)
##  memoise          1.0.0    2016-01-29 CRAN (R 3.2.3)
##  plyr             1.8.3    2015-06-12 CRAN (R 3.2.1)
##  R6               2.1.2    2016-01-26 CRAN (R 3.2.3)
##  Rcpp             0.12.3   2016-01-10 CRAN (R 3.2.3)
##  RCurl            1.95-4.7 2015-06-30 CRAN (R 3.2.1)
##  RefManageR       0.10.6   2016-02-15 CRAN (R 3.2.3)
##  RJSONIO          1.3-0    2014-07-28 CRAN (R 3.2.0)
##  snow             0.4-1    2015-10-31 CRAN (R 3.2.0)
##  stringi          1.0-1    2015-10-22 CRAN (R 3.2.0)
##  stringr          1.0.0    2015-04-30 CRAN (R 3.2.0)
##  XML              3.98-1.3 2015-06-30 CRAN (R 3.2.0)

References

Citations made with knitcitations (Boettiger, 2015).

[1] R. Analytics and S. Weston. doMC: Foreach Parallel Adaptor for 'parallel'. R package version 1.3.4. 2015. URL: http://CRAN.R-project.org/package=doMC.

[2] B. Bischl, M. Lang, O. Mersmann, J. Rahnenführer, et al. “BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments”. In: Journal of Statistical Software 64.11 (2015), pp. 1–25. URL: http://www.jstatsoft.org/v64/i11/.

[3] C. Boettiger. knitcitations: Citations for 'Knitr' Markdown Files. R package version 1.0.7. 2015. URL: http://CRAN.R-project.org/package=knitcitations.

[4] M. Morgan, V. Obenchain, M. Lang and R. Thompson. BiocParallel: Bioconductor facilities for parallel evaluation. R package version 1.4.3. 2016.

Want more?

Check other @jhubiostat student blogs at Bmore Biostats as well as topics on #rstats.

Leonardo Collado-Torres
Leonardo Collado-Torres
Investigator @ LIBD, Assistant Professor, Department of Biostatistics @ JHBSPH

#rstats @Bioconductor/🧠 genomics @LieberInstitute/@lcgunam @jhubiostat @jtleek @andrewejaffe alumni/@LIBDrstats @CDSBMexico co-founder

comments powered by Disqus