R packagesAs a modern statistician, one of the most fundamental contributions you can make is to create and distribute software that implements the methods you develop. I have gone so far as to say if you write a paper without software, that paper doesn't exist.
The purposes of this guide are:
R package.R package?Cause you know, you do what your advisor says and stuff.
But there are some real reasons to write software as a statistician that I think are critically important:
Most importantly might be that creating an R package is building something. It is something you can point to and say, "I made that". Leaving aside all the tangible benefits to your career, the profession, etc. it is maybe the most gratifying feeling you get when working on research.
R packageAs soon as you have 2 functions.
Why 2? After you have more than one function it starts to get easy to lose track of what your functions do, it starts to be tempting to name your functions foo() or tempfunction() or some other such nonsense. You are also tempted to put all of the functions in one file and just source it. That was what I did with my first project, which ended up being an epically comical set of about 3,000 lines of code in one R file. Ask my advisor about it sometime, he probably is still laughing about it.
To start writing an R package you need:
R (R Core Team, 2014)- I mean, unless you are a wizard.devtools (Wickham and Chang, 2014), roxygen2 (Wickham, Danenberg, and Eugster, 2014) (suggested by devtools), knitr (Xie, 2014b).Rcpp .The first step in creating your R package is to give it a name. Hadley has some ideas about it. Here are our rules:
Almost all of our packages will eventually go on Bioconductor (Gentleman, Carey, Bates, and others, 2004). So we are going to use the versioning scheme that is compatible with that platform (with some helpful suggestions from Kasper H.).
The format of the version number will always be x.y.z. When you start any new package the version number should be 0.1.0. Every time you make any change public (e.g., push to GitHub) you should increase z in the version number. If you are making local commits but not making them public to other people you don't need to increase z. You should stay in version 0.1.z basically up until you are ready to submit to Bioconductor (or CRAN) for release.
Before release you can increase y if you perform a major redesign of how the functions are organized or are used. You should never increase x before release.
The first time you submit the package to Bioconductor you should submit it as version number 0.99.z. That way on the next release of Bioconductor it will get bumped to 1.0.0. The next devel version will get bumped to 1.1.0 on Bioconductor. Immediately after releasing, if you plan to keep working on the project, you should bump your version on GitHub to 1.1.0.
Thereafter, again you should keep increasing z every time you make a public change. If you do a major reorganization you should increase y.
Run this code from R to create your package. It will create a directory called packagename and put some stuff in it (more on this stuff in a second).
## Setup
install.packages(c("devtools", "roxygen2", "knitr"))
## Load the libraries
library("devtools")
library("roxygen2")
library("knitr")
## Create the package directory
create("packagename")
All packages that are developed by the Leek group will be hosted on GitHub. The accounts are free and it makes it so much easier to share code/get other people to help you with your code. Here are the steps to getting your package on GitHub:
packagename directory on your local machine, run the commands: git initgit remote add origin git@github.com:yourusername/packagename.gitpackagename directory called README.mdgit add *git commit -m 'initial commit'git push -u origin masterIn summary:
mkdir packagename
cd packagename
git init
git remote add origin git@github.com:yourusername/packagename.git
git add *
git commit -m 'initial commit'
git push -u origin master
Once you're familiar with basic git and GitHub workflows, GitHub has some more advanced features that you can take advantage of. In particular, github flow is an excellent way to manage contributions, and GitHub organizations can provide a central location for multiple people (e.g. in a lab) to collaborate on software.
R packageR functionsThe R functions you have written go in the R/ directory in the packagename folder. Each of your R functions should go in a separate file with a .R extension. We are going to use capital R for the extension of the files.
Why? Don't ask questions.
If you define a new class call the .R file classname-class.R. For example, if you are creating the leek class of objects it would be called leek-class.R. If you are defining a new method for the class it should be named newclass-methodname-method.R. For example, a plotting method for the leek class would go in a .R file called leek-plot-method.R.
DESCRIPTIONThe DESCRIPTION file is a plain text file that gets generated with the devtools::create() command.
author name <authoremail@host.com> for example: Jeff Leek <jleek@jhsph.edu> and should be comma separated.R packages your software uses/depends on) should be listed in a comma separated list after the R version. One of the dependencies should be the knitr (Xie, 2014b) package for the vignette.VignetteBuilder: knitrSuggests: knitr, BiocStyleFor example:
Package: packagename
Type: Package
Title: A sentence
Version: 0.1.0
Date: 2013-09-30
Authors@R: c(person("Jeff", "Leek", role = c("aut", "cre", "ths"),
email = "jleek@jhsph.edu"))
Depends:
R(>= 3.0.2)
Suggests:
knitr,
BiocStyle
Description: A couple sentences that expand the title
License: Artistic-2.0
I will try to keep the stylistic requirements minimal because they would likely drive you nuts. For now there are:
You can set these as defaults (say in Sublime, Textmate or RStudio) then you don't have to worry about them anymore. If you find lines going longer than 80 columns, you will need to write the lines into functions, etc.
We also recommend using formatR (Xie, 2014a) to nicely format your code. For example, use the tidy_source() function to reformat the code in a R source file.
This is how I feel about the relative importance of various components of statistical software development:

Ideally your software is easy to understand and just works. But this isn't Apple and you don't have a legion of test users to try it out for you. So more likely than not, at least the first several versions of your software will be at least a little hard to use. The first versions will also probably be slower than you would like them to be.
But if your software solves a real problem (it should!) and is well documented (it will be!) then people will use it and you will have a positive impact on the world.
Documentation has two main components. The first component is help files, which go into the man/ folder. The second component is vignettes which will go in a folder called `vignettes/ which you will have to create. I'll tackle each of these separately.
These files document each of the functions/methods/classes you will expose to your users. The good news is that you don't have to write these files yourself. You will use the roxygen2 (Wickham, Danenberg, and Eugster, 2014) package to create the man files. To use roxygen2 you will need to document your functions in the .R files with comments formatted in a specific way. Right before your functions you should have a set of comments that are denoted by the symbol #'. They are structured in the following way:
#' A one sentence description of what your function does
#'
#' A more detailed description of what the function is and how
#' it works. It may be a paragraph that should not be separated
#' by any spaces.
#'
#' @param inputParameter1 A description of the input parameter \code{inputParameter1}
#' @param inputParameter2 A description of the input parameter \code{inputParameter2}
#'
#' @return output A description of the object the function outputs
#'
#' @keywords keywords
#'
#' @export
#'
#' @examples
#' R code here showing how your function works
myfunction <- function(inputParameter1, inputParameter2){
## Awesome code!
return(result)
}
You include the @export command if you want the function to be exported (i.e. visible) to your end users. Hadley has a pretty comprehensive guide where you can learn a lot more about how roxygen2 works. Your function follows immediately after the comments.
When you have saved functions with roxygen2 style comments you can create the .Rd files (the man files themselves) by running:
library("devtools")
document("packagename")
on the package folder. The package folder must be in the current working directory where you are editing.
Please read Hadley's guide in its entirety to understand how to document packages and in particular, how roxygen2 deals with collation and namespaces.
Documentation in the help files is important and is the primary way that people will figure out your functions if they get stuck. But it is equally (maybe more) critical that you help people get started. The way that you do that is to create a vignette. For our purposes, a vignette is a tutorial that includes the following components:
We will write Vignettes in knitr (Xie, 2014b). Vignettes can generate either HTML from R markdown, or pdf from latex. In either case, the files should be in packagename/vignettes/. During package building they will be moved to packagename/inst/doc. For HTML vignettes, the vignette files should be vignette.Rmd. For PDF vignettes, it should be vignette.Rnw. Here is some information from Yihui about building vignettes in knitr (Xie, 2014b).
For latex vignettes, you should use the BiocStyle (Morgan, Oles, and Huber, 2014) package to style your vignette. This means you will need to add this code to the preamble of your Rnw file:
<<style, eval=TRUE, echo=FALSE, results='asis'>>=
BiocStyle::latex()
@
See the BiocStyle vignette for commands that you can use when creating your vignette (e.g. \Biocpkg{IRanges} for referencing a Bioconductor package).
Alternatively, you can make HTML vignettes. See the BiocStyle HTML vignette on how to do so. Or check how we've made others using knitrBootstrap (Hester, 2014) in the derfinder package which render into these pages.
For our purposes anyone who wrote a function exposed to users in the package (a function that has an `@export in the documentation) will be listed as an author.
You will be a maintainer and Jeff will be a maintainer. If you can sucker one of your fellow students into maintaining the package as well, you can list them, but they must make the same commitment to 5 years of support.
One thing I think a lot of academics (definitely including myself here) struggle with is the need to be "new" and "innovative" with everything they do. There is a strong selective pressure for these qualities in academia.
But when writing software it is very, very important not to reinvent every single wheel you see. One person who is awesome at blending existing tools and exponentially building value is Ramnath Vaidyanathan. He built slidify (Vaidyanathan, 2012) on top of knitr (Xie, 2014b) and rCharts (Vaidyanathan, 2013) on top of existing D3 libraries. They allowed him to create awesome software without having to solve every single problem.
Before writing general purpose functions (say for regression or for calculating p-values or for making plots) make sure you search for functions that already exist (or ask Jeff/your fellow students if you aren't sure).
An important and balancing principle is that you should try to keep the number of dependencies for your package as small as possible. You should also try to use packages that you trust will be maintained. Some ways to tell if a package is "trustworthy" are to check the number of downloads/users (higher is better), check to see if the package is being actively updated (on GitHub/Bioconductor/CRAN) and there is a history of updates, and check to see if the authors of the packages routinely maintain important packages (like Hadley, Yihui, Ramnath, Martin Morgan, etc.). Keep in mind your commitment (see below) when making decisions about whose functions to use.
A major temptation for everyone creating code is to generate a bunch of features they think that users will want without actually testing those features out. The problem is that each new feature you create in your package will monotonically increase the number of dependencies and the amount of code you have to maintain. In general, the principle should be to create exactly enough functions that the users can install your package, perform your analysis, and return the results, with no extraneous functionality.
Specifically, be wary of things like GUIs or Shiny apps. Given the heavy emphasis placed on reproducibility these days, it is rarely the case that real/important analyses will be performed in a point and click format.
If you are way into creating products that point-and-click users will be interested in I'm very happy to support you in that, since I think those things are cool, probably the future, and can certainly raise the interest in your work. But they present a potentially major difficulty in maintenance and should be placed in separate packages on your own account.
Unit tests are an important for the following reasons:
Your advisor tends to rush through things. While there is a strong interest in being competitive/getting things done in the Leek group, there is an even stronger interest in doing them right and for the long term. We will use the testthat (Wickham, 2011) framework (described here) for building unit tests. A couple of suggestions for using the framework are:
To use the testthat (Wickham, 2011) package you will put a file called test-area-packagename.R in the inst/tests directory, where area is the name of the broad class of functions you are testing. Then add another file called run-all.R to a separate directory called tests (see for example the stringr (Wickham, 2012) package). The run-all.R function has this code in it:
library(testthat)
library(mypackage)
test_package("mypackage")
This means that whenever you run R CMD check on your package, you will get an error if one of the unit tests fails. This is important, because it will be one way for me to check your code and to tell you if you have broken your code with an update.
Here is an example unit test, so you get the idea of what I'm looking for. Below is a function for performing differential expression analysis on a matrix.
#' A function for differential expression analysis
#'
#' This function takes a matrix of gene expression data
#' (genes in rows, samples in columns) and a factor variable
#' with two levels and performs differential expression analysis
#'
#' @param data A gene expression data matrix with genes in rows and samples in columns
#' @param grp A two-level factor variable with two levels
#'
#' @return pValues The p-values from the differential expression test.
#'
#' @keywords differential expression
#'
#' @export
#'
#' @examples
#' R code here showing how your function works
deFunction <- function(dat, grp){
if(!is.factor(grp)){stop("grp variable must be a factor")}
if(length(unique(grp))!=2){stop("grp variable must have exactly two levels")}
if(any(genefilter::rowSds(dat)==0)){stop("some genes have zero variance")}
result = genefilter::rowttests(dat,grp)$p.value
return(result)
}
Now create a new file called test-defunction.R. The code in that file is:
context("tests on inputs")
test_that("tests for grp variable",{
set.seed(12345)
dat <- matrix(rnorm(100*30),nrow=100,ncol=30)
grp <- rep(c(0,1),each=15)
expect_that(deFunction(dat,grp),throws_error("grp variable must be a factor"))
grp <- as.factor(rep(c(0,1,2),each=10))
expect_that(deFunction(dat,grp),throws_error("grp variable must have exactly two levels"))
grp <- as.factor(rep(0,30))
expect_that(deFunction(dat,grp),throws_error("grp variable must have exactly two levels"))
})
test_that("tests for dat variable",{
set.seed(12345)
grp <- as.factor(rep(c(0,1),each=15))
dat <- matrix(0,nrow=100,ncol=30)
expect_that(deFunction(dat,grp),throws_error("some genes have zero variance; t-test won't work"))
})
context("test on outputs")
test_that("test p-values are numeric and non-zero",{
set.seed(12345)
grp <- as.factor(rep(c(0,1),each=15))
dat <- matrix(matrix(rnorm(100*30)),nrow=100,ncol=30)
expect_that(deFunction(dat,grp),is_a("numeric"))
expect_that(all(deFunction(dat,grp) > 0),is_true())
})
If you run the command: test_file("test-defunction.R") you should get the output:
tests on inputs : ....
test on outputs : ..
without any error messages. But if you accidentally delete one of your error checking messages at the beginning of the function and rerun the tests, it will tell you which context and which test broke.
The thing to keep in mind here is that you would be doing these tests on the fly/manually anyway. So you might as well write them into a set of unit tests that will give you an idea of where your functions are breaking. Some things that you should be testing:
For a full example, check derfinder's unit tests.
You may be interested in automatically testing your packages with R CMD check while you are developing it in GitHub. As a Leek group protocol, we have described how to do so in Routinely testing your R package with Travis.
Hopefully your package will have a ton of users. Inevitably, they will try to use the software for purposes you did not intend. Some of these will be happy things (software you wrote for microarrays being used in sequencing). Sometimes they will be unhappy - people using it completely out of context and getting wonky answers.
A major component of dummy proofing is writing thorough documentation and vignettes (see the above sections). When you see weird cases (reported from users or cases you find yourself) add them to the documents. There is one additional important component of dummy proofing we will use: input dummy proofing (related to unit testing).
Most of the time, if you input non-matching arguments or arguments of the wrong class into R functions a cryptic error will result. This will cause you headaches when maintaining your software. So you should add some commands with the stop function at the beginning of each function that throw errors if the arguments don't have the correct class or would result in silly output (like the zero variance gene test in the unit testing example above).
When possible, assume some default parameters for the users and reduce the list of arguments they see. You can achieve this using the triple dots argument (...) to hide a set of advanced arguments.
In the derfinder (Collado-Torres, Frazee, Jaffe, and Leek, 2014) package we decided to use a helper function to identify some advanced arguments: check it here and how it's used in another function. There are other ways you can reduce the number of arguments, but the idea is that doing so will reduce the user's confusion level.
Once you have developed a package you should use it yourself for a couple of weeks. Then you should have at least one other student use it without you giving them any instructions other than telling them where some data are - that way you can test your documentation. Finally, you should meet with Jeff and get ready to release it to Bioconductor.
The steps in releasing to Bioconductor are:
library("devtools")
check_doc("packagename") ## Only for checking the documentation
system.time(check("packagename", args='--no-build-vignettes')) ## R CMD check with time information
y and z) to the next odd number for z. In the commit comments, state that this is the new devel version.You are vested in creating the software you are working on now since it is part of your training and will definitely help your short term career in terms of getting jobs/gaining visibility. But your time as a graduate student is (happily for you) limited. After a few years you will graduate and head off to an awesome job. The software you built as a student may be less important to you.
Hopefully, though, it is still very important to the users who are applying that software to solve the most pressing problems in molecular biology and medicine. It is unfair to those users if your software breaks down and can't be installed/used.
So if you undertake a research project in the Leek group which involves software development (pretty much all of them will) you commit to maintaining that software for at least 5 years after you graduate. Of course, I can't hold you to that like a contract or anything, but think of it as an honor thing.
5 years is a long time. It is most of the way toward tenure (in academia) or probably 3 years after you have started your own awesome company and appeared on Techcrunch. So it is worth thinking about ways you can ensure that the maintenance will be as low as possible. Specifically:
The first version of this tutorial was written by Jeff Leek (@simplystats). Hopefully he can sucker his students into contributing since they are much, much better R programmers than he is.
Thanks to the beauty of Github a few others have made contributions.
knitrBootstrap (Hester, 2014) to format the page, contributed the protocol to test your R package with Travis.Citations made with knitcitations (Boettiger, 2014).
[1] C. Boettiger. knitcitations: Citations for knitr markdown files. R package version 1.0.2. 2014. URL: https://github.com/cboettig/knitcitations.
[2] L. Collado-Torres, A. Frazee, A. Jaffe and J. Leek. derfinder: Fast differential expression analysis of RNA-seq data at base-pair resolution. R package version 0.99.5. 2014. URL: https://github.com/lcolladotor/derfinder.
[3] R. C. Gentleman, V. J. Carey, D. M. Bates and others. “Bioconductor: Open software development for computational biology and bioinformatics”. In: Genome Biology 5 (2004), p. R80. URL: http://genomebiology.com/2004/5/10/R80.
[4] J. Hester. knitrBootstrap: Knitr Bootstrap framework. R package version 1.0.0. 2014. URL: https://github.com/jimhester/.
[5] M. Morgan, A. Oles and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 1.3.14. 2014.
[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2014. URL: http://www.R-project.org/.
[7] R. Vaidyanathan. rCharts: Interactive Charts using Javascript Visualization Libraries. R package version 0.4.5. 2013.
[8] R. Vaidyanathan. slidify: Generate reproducible html5 slides from R markdown. R package version 0.4.5. 2012. URL: http://ramnathv.github.com/slidify/.
[9] H. Wickham. stringr: Make it easier to work with strings. R package version 0.6.2. 2012. URL: http://CRAN.R-project.org/package=stringr.
[10] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[11] H. Wickham and W. Chang. devtools: Tools to make developing R code easier. R package version 1.6. 2014. URL: http://CRAN.R-project.org/package=devtools.
[12] H. Wickham, P. Danenberg and M. Eugster. roxygen2: In-source documentation for R. R package version 4.0.2. 2014. URL: http://CRAN.R-project.org/package=roxygen2.
[13] Y. Xie. formatR: Format R Code Automatically. R package version 1.0. 2014. URL: http://CRAN.R-project.org/package=formatR.
[14] Y. Xie. knitr: A general-purpose package for dynamic report generation in R. R package version 1.6. 2014. URL: http://yihui.name/knitr/.
## To render this page use:
source('render.R')
Packages used:
Date this page was last rendered:
## [1] "2014-10-04 12:30:34 EDT"