Using plyr and doMC for quick and easy apply-family functions
A few weeks back I set aside a short amount of time to actually read what plyr (Wickham, 2011) is about, and I was surprised. The whole idea behind plyr is very simple: extend the apply() family to make things easy. plyr has many functions whose names end with ply, which is short for apply. The functions are then identified by the two letters before ply, which abbreviate the input type (first letter) and the output type (second letter). For instance, ddply takes a data.frame as input and returns a data.frame, while ldply takes a list as input and returns a data.frame.
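To make the two-letter naming scheme concrete, here is a minimal sketch. The `scores` list and the anonymous functions are made up for illustration; `mtcars` is a dataset that ships with R.

```r
library(plyr)

# Hypothetical input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# ldply: list in, data.frame out (one row per list element);
# the names of the list end up in a column called .id
res <- ldply(scores, function(x) data.frame(n = length(x), mean = mean(x)))
res

# dlply: data.frame in, list out (one element per value of cyl)
cyl_means <- dlply(mtcars, .(cyl), function(piece) mean(piece$mpg))
names(cyl_means)
```

Swapping the first or second letter is all it takes to change the input or output type, which is what makes the family easy to remember.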
The syntax is pretty straightforward. For example, here are the arguments for ddply:
library(plyr)
args(ddply)
## function (.data, .variables, .fun = NULL, ..., .progress = "none",
## .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
## NULL
What we basically have to specify are:
- .data, which in general is the name of the input data.frame,
- .variables, which is a vector (note the use of the . function) of variable names. ddply is very useful for applying some function to the subsets of the data defined by these variables,
- .fun, which is the actual function we want to run,
- and ..., which are additional arguments passed on to the function we are running.
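As a small sketch of how these four pieces fit together, the example below passes an extra argument through `...`. The data frame `df` and the helper `group_mean` are hypothetical names invented for this illustration.

```r
library(plyr)

# Hypothetical data with a missing value in group A
df <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(1, NA, 3, 5)
)

# Hypothetical helper: receives one subset of the data at a time,
# plus any extra arguments supplied in the ddply call
group_mean <- function(piece, na.rm = FALSE) {
  data.frame(mean_value = mean(piece$value, na.rm = na.rm))
}

# na.rm is not an argument of ddply, so it travels through ... to group_mean
res <- ddply(df, .(group), group_mean, na.rm = TRUE)
res
```

Without `na.rm = TRUE` the mean for group A would be `NA`, so the result shows that the extra argument reached the function.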
From the ddply help page we have the following examples:
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)
# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2))
## group sex mean sd
## 1 A F 40.48 12.72
## 2 A M 34.48 15.28
## 3 B F 36.05 9.98
## 4 B M 38.35 7.97
## 5 C F 20.04 1.86
## 6 C M 43.81 10.72
# An example using a formula for .variables
ddply(baseball[1:100, ], ~year, nrow)
## year V1
## 1 1871 7
## 2 1872 13
## 3 1873 13
## 4 1874 15
## 5 1875 17
## 6 1876 15
## 7 1877 17
## 8 1878 3
# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))
## lg nrow ncol
## 1 65 22
## 2 AA 171 22
## 3 AL 10007 22
## 4 FL 37 22
## 5 NL 11378 22
## 6 PL 32 22
## 7 UA 9 22
But this is not the end of the story! Something I really like about plyr is that it can be parallelized via the foreach (Revolution Analytics, 2012) package. I don’t know much about foreach, but what I have learned is that you need another package, such as doMC (Revolution Analytics, 2013), to actually run the code in parallel. foreach provides the infrastructure for splitting jobs and communicating between workers, while packages like doMC tailor it to specific environments, such as running on multiple cores.
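For a feel of that division of labor, here is a minimal sketch that uses foreach directly with doMC as the backend. It assumes a Unix-like system with both packages installed (doMC relies on forking, which is not available on Windows), and the choice of 2 workers is arbitrary.

```r
library(foreach)
library(doMC)

# Register the multi-core backend; 2 workers is an arbitrary choice here
registerDoMC(2)

# foreach describes the loop, %dopar% hands each iteration to the
# registered backend, and .combine = c joins the results into a vector
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}
squares
```

If no backend is registered, `%dopar%` falls back to running sequentially with a warning, which is why the registration step matters.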
Running things in parallel can then be very easy. Basically, you load the packages, specify the number of cores, and run your ply function. Here is a short example:
## Load packages
library(plyr)
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Specify the number of cores
registerDoMC(4)
## Check how many cores we are using
getDoParWorkers()
## [1] 4
## Run your ply function
ddply(dfx, .(group, sex), summarize,
  mean = round(mean(age), 2),
  sd = round(sd(age), 2),
  .parallel = TRUE)
## group sex mean sd
## 1 A F 40.48 12.72
## 2 A M 34.48 15.28
## 3 B F 36.05 9.98
## 4 B M 38.35 7.97
## 5 C F 20.04 1.86
## 6 C M 43.81 10.72
In case you are interested, here is a short shell script for knitting an Rmd file on the cluster while requesting the appropriate number of cores for plyr and doMC.
#!/bin/bash
# To run it in the current working directory
#$ -cwd
# To get an email after the job is done
#$ -m e
# To specify that we want 4 cores
#$ -pe local 4
# The name of the job
#$ -N myPlyJob
echo "**** Job starts ****"
date
# Knit your file: assuming it's called FileToKnit.Rmd
Rscript -e "library(knitr); knit2html('FileToKnit.Rmd')"
echo "**** Job ends ****"
date
Let’s say that the bash script is named script.sh. Then you can submit it to the cluster queue using
qsub script.sh
This is what I used to re-format a large data.frame in a few minutes on the cluster for the #jhsph753 class homework project.
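One way to keep the number of registered cores in sync with what the job requested is to read it from the environment inside the Rmd file. The sketch below assumes a Grid Engine style scheduler that exports the granted slot count in the NSLOTS environment variable; `cores_from_scheduler` is a hypothetical helper name, and your cluster may use a different variable.

```r
# Hypothetical helper: read the slot count granted by the scheduler,
# falling back to a default when NSLOTS is unset or not a valid number
cores_from_scheduler <- function(default = 1L) {
  slots <- suppressWarnings(as.integer(Sys.getenv("NSLOTS", unset = NA)))
  if (is.na(slots) || slots < 1L) default else slots
}

n_cores <- cores_from_scheduler()

# Register the backend only if doMC is installed
if (requireNamespace("doMC", quietly = TRUE)) {
  doMC::registerDoMC(n_cores)
}
```

With this in place, changing the number after `-pe local` in the shell script is enough, and the R code does not need a hard-coded `registerDoMC(4)`.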
So, thank you again Hadley Wickham for making awesome R packages!
Citations made with knitcitations (Boettiger, 2013).
- Revolution Analytics (2013) doMC: Foreach parallel adaptor for the multicore package. http://CRAN.R-project.org/package=doMC
- Revolution Analytics (2012) foreach: Foreach looping construct for R. http://CRAN.R-project.org/package=foreach
- Carl Boettiger (2013) knitcitations: Citations for knitr markdown files. https://github.com/cboettig/knitcitations
- Hadley Wickham (2011) The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software 40 (1). http://www.jstatsoft.org/v40/i01/