This protocol was contributed by L. Collado-Torres as is available at jtleek.com/protocols/.
This protocol explains how to create a feature count matrix from coverage data stored in BigWig files. This feature count matrix can then be used for differential expression analyses using packages such as DESeq2
, edgeR-robust
(Zhou, Lindsay, and Robinson, 2014), limma
(Ritchie, Phipson, Wu, Hu, et al., 2015), or derfinder-2014
(Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014).
Frequently researchers are interested in differential expression analyses when working with RNA-seq data. Doing so involves aligning the sequence reads to the reference genome/transcriptome to identify which features (exons, genes) were expressed. Commonly, a features count matrix is created with one row per feature and one column per sample. The number in each cell of the matrix corresponds to the number of reads overlapping the feature in question for the given sample.
Going into more detail, the sequence reads mapped to the genome are stored in sequence alignment files, most commonly in BAM format. This format stores information which can be useful for other workflows. However, for differential expression analyses, the information in BAM files can be significantly compressed by storing just the coverage information. That is, the number of reads overlapping each base of the genome. BigWig files can efficiently store the coverage information. Inside R, the rtracklayer
(Lawrence, Gentleman, and Carey, 2009) allows users to read data from BigWig files, and with some code, we can construct the count matrix.
This protocol exemplifies some code for creating such a matrix using example BigWig files available in derfinderData
(Collado-Torres, Jaffe, and Leek, 2015) which contains data from the BrainSpan
project (BrainSpan, 2011).
First, lets locate a set of BigWig files from the amygdala.
## Load data
library('derfinderData')
## Locate bigWig files
files <- dir(system.file('extdata', 'AMY', package = 'derfinderData'), full.names = TRUE)
names(files) <- gsub('\\.bw', '', dir(system.file('extdata', 'AMY', package = 'derfinderData')))
head(files)
## HSB113
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB113.bw"
## HSB123
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB123.bw"
## HSB126
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB126.bw"
## HSB130
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB130.bw"
## HSB135
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB135.bw"
## HSB136
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB136.bw"
Next, we need to have a GRanges
object with the features we are interested in. Normally, that would be all the exons or genes of a given organism. However, in this example we will only use the exons for chromosome 21 of the human genome (version hg19). That information is conveniently stored in derfinder
(Collado-Torres, Frazee, Love, Irizarry, et al., 2015).
## Find exons
library('derfinder')
exons <- genomicState$fullGenome[genomicState$fullGenome$theRegion == 'exon']
exons
## GRanges object with 2658 ranges and 4 metadata columns:
## seqnames ranges strand | theRegion
## |
## 1 chr21 [9690071, 9690100] + | exon
## 2 chr21 [9692178, 9692207] + | exon
## 3 chr21 [9711935, 9712038] + | exon
## 5 chr21 [9746121, 9746151] + | exon
## 7 chr21 [9748830, 9748899] + | exon
## ... ... ... ... ... ...
## 4985 chr21 [47740937, 47744424] - | exon
## 4986 chr21 [48018531, 48019416] - | exon
## 4988 chr21 [48020195, 48020288] - | exon
## 4990 chr21 [48022191, 48022329] - | exon
## 4992 chr21 [48024926, 48025035] - | exon
## tx_id tx_name
##
## 1 72512 uc002zkg.1
## 2 72513 uc021wgt.1
## 3 72514 uc011abu.2
## 5 72514 uc011abu.2
## 7 72514 uc011abu.2
## ... ... ...
## 4985 73460,73461,73462,... uc002zjf.3,uc002zjg.1,uc010gqj.2,...
## 4986 73464,73465 uc002zju.1,uc002zjv.1
## 4988 73465 uc002zjv.1
## 4990 73464,73465 uc002zju.1,uc002zjv.1
## 4992 73464,73465 uc002zju.1,uc002zjv.1
## gene
##
## 1
## 2
## 3
## 5
## 7
## ... ...
## 4985 194
## 4986 227
## 4988 227
## 4990 227
## 4992 227
## -------
## seqinfo: 1 sequence from hg19 genome
Now, using rtracklayer
we can import the coverage data into our R session and create the count matrix.
## Import data and create count matrix
library('rtracklayer')
bw <- BigWigFileList(files)
counts <- matrix(NA, nrow = length(exons), ncol = length(bw))
colnames(counts) <- names(bw)
for(i in seq_len(length(bw))) {
coverage <- import(bw[[i]], as = 'RleList')$chr21
counts[, i] <- sum(Views(coverage, ranges(exons)))
}
## Divide by read length and round to integer numbers
counts <- round(counts / 76, 0)
## Explore a little portion of the count matrix
dim(counts)
## [1] 2658 12
counts[2653:2658, 1:6]
## HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## [1,] 0 0 0 0 0 0
## [2,] 5 1 2 1 1 1
## [3,] 7 291 252 152 440 264
## [4,] 0 1 1 0 0 0
## [5,] 2 65 70 36 102 77
## [6,] 0 8 10 2 10 14
Note that derfinder
has functions for performing the above operation in parallel for a large set of files and/or chromosomes.
Once we have created the count matrix, we can proceed to use the differential expression analysis tool of our choice. In this example we'll use DESeq2
to find differentially expressed exons between the adult and fetal samples. The phenotypic information is stored in derfinderData
and once we extract it, we can use the DESeqDataSetFromMatrix()
function to create the type of object DESeq2
uses. Once we have that object, we can perform the differential expression analysis and continue from there. Please check the documentation for DESeq2
for more information.
## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')
## Perform DESeq2 analysis
library('DESeq2')
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, data.frame(group = pheno$group), ~ group)
## Perform DE analysis
dse <- DESeq(dse, test = 'LRT', reduced = ~ 1)
## Extract results
deseq <- exons
mcols(deseq) <- cbind(mcols(deseq), results(dse))
## Explore results
deseq
## GRanges object with 2658 ranges and 10 metadata columns:
## seqnames ranges strand | theRegion
## |
## 1 chr21 [9690071, 9690100] + | exon
## 2 chr21 [9692178, 9692207] + | exon
## 3 chr21 [9711935, 9712038] + | exon
## 5 chr21 [9746121, 9746151] + | exon
## 7 chr21 [9748830, 9748899] + | exon
## ... ... ... ... ... ...
## 4985 chr21 [47740937, 47744424] - | exon
## 4986 chr21 [48018531, 48019416] - | exon
## 4988 chr21 [48020195, 48020288] - | exon
## 4990 chr21 [48022191, 48022329] - | exon
## 4992 chr21 [48024926, 48025035] - | exon
## tx_id tx_name
##
## 1 72512 uc002zkg.1
## 2 72513 uc021wgt.1
## 3 72514 uc011abu.2
## 5 72514 uc011abu.2
## 7 72514 uc011abu.2
## ... ... ...
## 4985 73460,73461,73462,... uc002zjf.3,uc002zjg.1,uc010gqj.2,...
## 4986 73464,73465 uc002zju.1,uc002zjv.1
## 4988 73465 uc002zjv.1
## 4990 73464,73465 uc002zju.1,uc002zjv.1
## 4992 73464,73465 uc002zju.1,uc002zjv.1
## gene baseMean log2FoldChange lfcSE stat
##
## 1 0
## 2 0
## 3 0
## 5 0
## 7 0
## ... ... ... ... ... ...
## 4985 194 3.443649 1.358225 0.7013654 3.9489755
## 4986 227 166.809942 -2.645397 1.0123859 6.0519380
## 4988 227 0.267442 -1.258079 2.1787538 0.3414066
## 4990 227 41.821053 -2.597086 1.0032177 5.9903045
## 4992 227 5.025438 -3.021444 1.0248336 8.3198395
## pvalue padj
##
## 1
## 2
## 3
## 5
## 7
## ... ... ...
## 4985 0.046899914 0.28338405
## 4986 0.013891042 0.15765652
## 4988 0.559018411
## 4990 0.014384719 0.15765652
## 4992 0.003921439 0.08339072
## -------
## seqinfo: 1 sequence from hg19 genome
## How many have significant p-values?
sum(deseq$pvalue < 0.05, na.rm = TRUE)
## [1] 87
## How many have significant FDR adjusted p-values?
sum(deseq$padj < 0.05, na.rm = TRUE)
## [1] 7
From here, you can proceed to use other Bioconductor packages for different downstream analyses.
Citations made with knitcitations
(Boettiger, 2015).
[1] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.6. 2015. URL: http://CRAN.R-project.org/package=knitcitations.
[2] BrainSpan. “Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.” 2011. URL: http://developinghumanbrain.org.
[3] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.
[4] L. Collado-Torres, A. Jaffe and J. Leek. derfinderData: Processed BigWigs from BrainSpan for examples. R package version 0.103.1. 2015. URL: https://github.com/lcolladotor/derfinderData.
[5] A. C. Frazee, S. Sabunciyan, K. D. Hansen, R. A. Irizarry, et al. “Differential expression analysis of RNA-seq data at single-base resolution”. In: Biostatistics 15 (3) (2014), pp. 413-426. DOI: 10.1093/biostatistics/kxt053. URL: http://biostatistics.oxfordjournals.org/content/15/3/413.long.
[6] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.
[7] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, et al. “limma powers differential expression analyses for RNA-sequencing and microarray studies”. In: Nucleic Acids Research 43.7 (2015), p. e47.
[8] X. Zhou, H. Lindsay and M. D. Robinson. “Robustly detecting differential expression in RNA sequencing data using observation weights”. In: Nucleic Acids Research 42 (2014), p. e91.
R
session information
## setting value
## version R version 3.2.0 Patched (2015-05-18 r68382)
## system x86_64, darwin10.8.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz Europe/Vienna
## package * version date source
## acepack 1.3-3.3 2014-11-24 CRAN (R 3.2.0)
## annotate 1.47.0 2015-04-17 Bioconductor
## AnnotationDbi 1.31.16 2015-06-10 Bioconductor
## bibtex 0.4.0 2014-12-31 CRAN (R 3.2.0)
## Biobase * 2.29.1 2015-05-03 Bioconductor
## BiocGenerics * 0.15.2 2015-06-05 Bioconductor
## BiocParallel 1.3.25 2015-06-12 Bioconductor
## biomaRt 2.25.1 2015-04-21 Bioconductor
## Biostrings 2.37.2 2015-05-07 Bioconductor
## bitops 1.0-6 2013-08-17 CRAN (R 3.2.0)
## bumphunter 1.9.1 2015-05-05 Bioconductor
## cluster 2.0.1 2015-01-31 CRAN (R 3.2.0)
## codetools 0.2-11 2015-03-10 CRAN (R 3.2.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## curl 0.9 2015-06-19 CRAN (R 3.2.1)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
## derfinder * 1.3.3 2015-06-15 Bioconductor
## derfinderData * 0.103.1 2015-06-16 Bioconductor
## derfinderHelper 1.3.0 2015-04-17 Bioconductor
## DESeq2 * 1.9.10 2015-05-28 Bioconductor
## devtools * 1.8.0 2015-05-09 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## doRNG 1.6 2014-03-07 CRAN (R 3.2.0)
## evaluate 0.7 2015-04-21 CRAN (R 3.2.0)
## foreach 1.4.2 2014-04-11 CRAN (R 3.2.0)
## foreign 0.8-63 2015-02-20 CRAN (R 3.2.0)
## formatR 1.2 2015-04-21 CRAN (R 3.2.0)
## Formula 1.2-1 2015-04-07 CRAN (R 3.2.0)
## futile.logger 1.4.1 2015-04-20 CRAN (R 3.2.0)
## futile.options 1.0.0 2010-04-06 CRAN (R 3.2.0)
## genefilter 1.51.0 2015-04-17 Bioconductor
## geneplotter 1.47.0 2015-04-17 Bioconductor
## GenomeInfoDb * 1.5.7 2015-06-02 Bioconductor
## GenomicAlignments 1.5.9 2015-05-13 Bioconductor
## GenomicFeatures 1.21.13 2015-06-09 Bioconductor
## GenomicFiles 1.5.3 2015-05-13 Bioconductor
## GenomicRanges * 1.21.15 2015-06-09 Bioconductor
## ggplot2 1.0.1.9000 2015-06-12 Github (hadley/ggplot2@3b6a126)
## git2r 0.10.1 2015-05-07 CRAN (R 3.2.0)
## gridExtra 0.9.1 2012-08-09 CRAN (R 3.2.0)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## Hmisc 3.16-0 2015-04-30 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.1)
## IRanges * 2.3.12 2015-06-16 Bioconductor
## iterators 1.0.7 2014-04-11 CRAN (R 3.2.0)
## knitcitations * 1.0.6 2015-05-26 CRAN (R 3.2.0)
## knitr 1.10.5 2015-05-06 CRAN (R 3.2.0)
## knitrBootstrap 1.0.0 2015-06-09 Github (jimhester/knitrBootstrap@76c41f0)
## lambda.r 1.1.7 2015-03-20 CRAN (R 3.2.0)
## lattice 0.20-31 2015-03-30 CRAN (R 3.2.0)
## latticeExtra 0.6-26 2013-08-15 CRAN (R 3.2.0)
## locfit 1.5-9.1 2013-04-20 CRAN (R 3.2.0)
## lubridate 1.3.3 2013-12-31 CRAN (R 3.2.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## markdown 0.7.7 2015-04-22 CRAN (R 3.2.0)
## MASS 7.3-40 2015-03-21 CRAN (R 3.2.0)
## Matrix 1.2-1 2015-06-01 CRAN (R 3.2.0)
## matrixStats 0.14.0 2015-02-14 CRAN (R 3.2.0)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
## nnet 7.3-9 2015-02-11 CRAN (R 3.2.0)
## pkgmaker 0.22 2014-05-14 CRAN (R 3.2.0)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
## proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
## qvalue 2.1.0 2015-04-17 Bioconductor
## R6 2.0.1 2014-10-29 CRAN (R 3.2.0)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.2.0)
## Rcpp * 0.11.6 2015-05-01 CRAN (R 3.2.0)
## RcppArmadillo * 0.5.200.1.0 2015-06-05 CRAN (R 3.2.0)
## RCurl 1.95-4.6 2015-04-24 CRAN (R 3.2.0)
## RefManageR 0.8.63 2015-06-09 CRAN (R 3.2.0)
## registry 0.2 2012-01-24 CRAN (R 3.2.0)
## reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
## RJSONIO 1.3-0 2014-07-28 CRAN (R 3.2.0)
## rmarkdown * 0.7 2015-06-13 CRAN (R 3.2.0)
## rngtools 1.2.4 2014-03-06 CRAN (R 3.2.0)
## rpart 4.1-9 2015-02-24 CRAN (R 3.2.0)
## Rsamtools 1.21.8 2015-06-03 Bioconductor
## RSQLite 1.0.0 2014-10-25 CRAN (R 3.2.0)
## rtracklayer * 1.29.10 2015-06-13 Bioconductor
## rversions 1.0.1 2015-06-06 CRAN (R 3.2.0)
## S4Vectors * 0.7.6 2015-06-16 Bioconductor
## scales 0.2.5 2015-06-12 CRAN (R 3.2.0)
## stringi 0.5-5 2015-06-29 CRAN (R 3.2.1)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## SummarizedExperiment * 0.1.5 2015-05-17 Bioconductor
## survival 2.38-2 2015-06-12 CRAN (R 3.2.0)
## XML 3.98-1.2 2015-05-31 CRAN (R 3.2.0)
## xml2 0.1.1 2015-06-02 CRAN (R 3.2.0)
## xtable 1.7-4 2014-09-12 CRAN (R 3.2.0)
## XVector 0.9.1 2015-04-30 Bioconductor
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
## zlibbioc 1.15.0 2015-04-17 Bioconductor