BigWigs to count matrix

This protocol was contributed by L. Collado-Torres as is available at jtleek.com/protocols/.

Overview

This protocol explains how to create a feature count matrix from coverage data stored in BigWig files. This feature count matrix can then be used for differential expression analyses using packages such as DESeq2 , edgeR-robust (Zhou, Lindsay, and Robinson, 2014), limma (Ritchie, Phipson, Wu, Hu, et al., 2015), or derfinder-2014 (Frazee, Sabunciyan, Hansen, Irizarry, et al., 2014).

Introduction

Frequently researchers are interested in differential expression analyses when working with RNA-seq data. Doing so involves aligning the sequence reads to the reference genome/transcriptome to identify which features (exons, genes) were expressed. Commonly, a features count matrix is created with one row per feature and one column per sample. The number in each cell of the matrix corresponds to the number of reads overlapping the feature in question for the given sample.

Going into more detail, the sequence reads mapped to the genome are stored in sequence alignment files, most commonly in BAM format. This format stores information which can be useful for other workflows. However, for differential expression analyses, the information in BAM files can be significantly compressed by storing just the coverage information. That is, the number of reads overlapping each base of the genome. BigWig files can efficiently store the coverage information. Inside R, the rtracklayer (Lawrence, Gentleman, and Carey, 2009) allows users to read data from BigWig files, and with some code, we can construct the count matrix.

This protocol exemplifies some code for creating such a matrix using example BigWig files available in derfinderData (Collado-Torres, Jaffe, and Leek, 2015) which contains data from the BrainSpan project (BrainSpan, 2011).

Example

First, lets locate a set of BigWig files from the amygdala.

## Load data
library('derfinderData')

## Locate bigWig files
files <- dir(system.file('extdata', 'AMY', package = 'derfinderData'),  full.names = TRUE)
names(files) <- gsub('\\.bw', '', dir(system.file('extdata', 'AMY', package = 'derfinderData')))
head(files)
##                                                                                                      HSB113 
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB113.bw" 
##                                                                                                      HSB123 
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB123.bw" 
##                                                                                                      HSB126 
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB126.bw" 
##                                                                                                      HSB130 
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB130.bw" 
##                                                                                                      HSB135 
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB135.bw" 
##                                                                                                      HSB136 
## "/Library/Frameworks/R.framework/Versions/3.2patched/Resources/library/derfinderData/extdata/AMY/HSB136.bw"

Next, we need to have a GRanges object with the features we are interested in. Normally, that would be all the exons or genes of a given organism. However, in this example we will only use the exons for chromosome 21 of the human genome (version hg19). That information is conveniently stored in derfinder (Collado-Torres, Frazee, Love, Irizarry, et al., 2015).

## Find exons
library('derfinder')
## Warning: replacing previous import by 'scales::alpha' when loading 'Hmisc'
exons <- genomicState$fullGenome[genomicState$fullGenome$theRegion == 'exon']
exons
## GRanges object with 2658 ranges and 4 metadata columns:
##        seqnames               ranges strand   |   theRegion
##                            | 
##      1    chr21   [9690071, 9690100]      +   |        exon
##      2    chr21   [9692178, 9692207]      +   |        exon
##      3    chr21   [9711935, 9712038]      +   |        exon
##      5    chr21   [9746121, 9746151]      +   |        exon
##      7    chr21   [9748830, 9748899]      +   |        exon
##    ...      ...                  ...    ... ...         ...
##   4985    chr21 [47740937, 47744424]      -   |        exon
##   4986    chr21 [48018531, 48019416]      -   |        exon
##   4988    chr21 [48020195, 48020288]      -   |        exon
##   4990    chr21 [48022191, 48022329]      -   |        exon
##   4992    chr21 [48024926, 48025035]      -   |        exon
##                        tx_id                              tx_name
##                                      
##      1                 72512                           uc002zkg.1
##      2                 72513                           uc021wgt.1
##      3                 72514                           uc011abu.2
##      5                 72514                           uc011abu.2
##      7                 72514                           uc011abu.2
##    ...                   ...                                  ...
##   4985 73460,73461,73462,... uc002zjf.3,uc002zjg.1,uc010gqj.2,...
##   4986           73464,73465                uc002zju.1,uc002zjv.1
##   4988                 73465                           uc002zjv.1
##   4990           73464,73465                uc002zju.1,uc002zjv.1
##   4992           73464,73465                uc002zju.1,uc002zjv.1
##                 gene
##        
##      1              
##      2              
##      3              
##      5              
##      7              
##    ...           ...
##   4985           194
##   4986           227
##   4988           227
##   4990           227
##   4992           227
##   -------
##   seqinfo: 1 sequence from hg19 genome

Now, using rtracklayer we can import the coverage data into our R session and create the count matrix.

## Import data and create count matrix
library('rtracklayer')
## Warning: package 'rtracklayer' was built under R version 3.2.1
## Loading required package: GenomicRanges
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rep.int, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
bw <- BigWigFileList(files)
counts <- matrix(NA, nrow = length(exons), ncol = length(bw))
colnames(counts) <- names(bw)
for(i in seq_len(length(bw))) {
    coverage <- import(bw[[i]], as = 'RleList')$chr21
    counts[, i] <- sum(Views(coverage, ranges(exons)))
}
## Divide by read length and round to integer numbers
counts <- round(counts / 76, 0)

## Explore a little portion of the count matrix 
dim(counts)
## [1] 2658   12
counts[2653:2658, 1:6]
##      HSB113 HSB123 HSB126 HSB130 HSB135 HSB136
## [1,]      0      0      0      0      0      0
## [2,]      5      1      2      1      1      1
## [3,]      7    291    252    152    440    264
## [4,]      0      1      1      0      0      0
## [5,]      2     65     70     36    102     77
## [6,]      0      8     10      2     10     14

Note that derfinder has functions for performing the above operation in parallel for a large set of files and/or chromosomes.

Once we have created the count matrix, we can proceed to use the differential expression analysis tool of our choice. In this example we'll use DESeq2 to find differentially expressed exons between the adult and fetal samples. The phenotypic information is stored in derfinderData and once we extract it, we can use the DESeqDataSetFromMatrix() function to create the type of object DESeq2 uses. Once we have that object, we can perform the differential expression analysis and continue from there. Please check the documentation for DESeq2 for more information.

## Get pheno table
pheno <- subset(brainspanPheno, structure_acronym == 'AMY')

## Perform DESeq2 analysis
library('DESeq2')
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
## Round matrix and specify design
dse <- DESeqDataSetFromMatrix(counts, data.frame(group = pheno$group), ~ group)
## converting counts to integer mode
## Perform DE analysis
dse <- DESeq(dse, test = 'LRT', reduced = ~ 1)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## Extract results
deseq <- exons
mcols(deseq) <- cbind(mcols(deseq), results(dse))

## Explore results
deseq
## GRanges object with 2658 ranges and 10 metadata columns:
##        seqnames               ranges strand   |   theRegion
##                            | 
##      1    chr21   [9690071, 9690100]      +   |        exon
##      2    chr21   [9692178, 9692207]      +   |        exon
##      3    chr21   [9711935, 9712038]      +   |        exon
##      5    chr21   [9746121, 9746151]      +   |        exon
##      7    chr21   [9748830, 9748899]      +   |        exon
##    ...      ...                  ...    ... ...         ...
##   4985    chr21 [47740937, 47744424]      -   |        exon
##   4986    chr21 [48018531, 48019416]      -   |        exon
##   4988    chr21 [48020195, 48020288]      -   |        exon
##   4990    chr21 [48022191, 48022329]      -   |        exon
##   4992    chr21 [48024926, 48025035]      -   |        exon
##                        tx_id                              tx_name
##                                      
##      1                 72512                           uc002zkg.1
##      2                 72513                           uc021wgt.1
##      3                 72514                           uc011abu.2
##      5                 72514                           uc011abu.2
##      7                 72514                           uc011abu.2
##    ...                   ...                                  ...
##   4985 73460,73461,73462,... uc002zjf.3,uc002zjg.1,uc010gqj.2,...
##   4986           73464,73465                uc002zju.1,uc002zjv.1
##   4988                 73465                           uc002zjv.1
##   4990           73464,73465                uc002zju.1,uc002zjv.1
##   4992           73464,73465                uc002zju.1,uc002zjv.1
##                 gene   baseMean log2FoldChange     lfcSE      stat
##                  
##      1                        0                       
##      2                        0                       
##      3                        0                       
##      5                        0                       
##      7                        0                       
##    ...           ...        ...            ...       ...       ...
##   4985           194   3.443649       1.358225 0.7013654 3.9489755
##   4986           227 166.809942      -2.645397 1.0123859 6.0519380
##   4988           227   0.267442      -1.258079 2.1787538 0.3414066
##   4990           227  41.821053      -2.597086 1.0032177 5.9903045
##   4992           227   5.025438      -3.021444 1.0248336 8.3198395
##             pvalue       padj
##            
##      1               
##      2               
##      3               
##      5               
##      7               
##    ...         ...        ...
##   4985 0.046899914 0.28338405
##   4986 0.013891042 0.15765652
##   4988 0.559018411       
##   4990 0.014384719 0.15765652
##   4992 0.003921439 0.08339072
##   -------
##   seqinfo: 1 sequence from hg19 genome
## How many have significant p-values?
sum(deseq$pvalue < 0.05, na.rm = TRUE)
## [1] 87
## How many have significant FDR adjusted p-values?
sum(deseq$padj < 0.05, na.rm = TRUE)
## [1] 7

From here, you can proceed to use other Bioconductor packages for different downstream analyses.

References

Citations made with knitcitations (Boettiger, 2015).

[1] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.6. 2015. URL: http://CRAN.R-project.org/package=knitcitations.

[2] BrainSpan. “Atlas of the Developing Human Brain [Internet]. Funded by ARRA Awards 1RC2MH089921-01, 1RC2MH090047-01, and 1RC2MH089929-01.” 2011. URL: http://developinghumanbrain.org.

[3] L. Collado-Torres, A. C. Frazee, M. I. Love, R. A. Irizarry, et al. “derfinder: Software for annotation-agnostic RNA-seq differential expression analysis”. In: bioRxiv (2015). DOI: 10.1101/015370. URL: http://www.biorxiv.org/content/early/2015/02/19/015370.abstract.

[4] L. Collado-Torres, A. Jaffe and J. Leek. derfinderData: Processed BigWigs from BrainSpan for examples. R package version 0.103.1. 2015. URL: https://github.com/lcolladotor/derfinderData.

[5] A. C. Frazee, S. Sabunciyan, K. D. Hansen, R. A. Irizarry, et al. “Differential expression analysis of RNA-seq data at single-base resolution”. In: Biostatistics 15 (3) (2014), pp. 413-426. DOI: 10.1093/biostatistics/kxt053. URL: http://biostatistics.oxfordjournals.org/content/15/3/413.long.

[6] M. Lawrence, R. Gentleman and V. Carey. “rtracklayer: an R package for interfacing with genome browsers”. In: Bioinformatics 25 (2009), pp. 1841-1842. DOI: 10.1093/bioinformatics/btp328. URL: http://bioinformatics.oxfordjournals.org/content/25/14/1841.abstract.

[7] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, et al. “limma powers differential expression analyses for RNA-sequencing and microarray studies”. In: Nucleic Acids Research 43.7 (2015), p. e47.

[8] X. Zhou, H. Lindsay and M. D. Robinson. “Robustly detecting differential expression in RNA sequencing data using observation weights”. In: Nucleic Acids Research 42 (2014), p. e91.

R session information

## Session info -----------------------------------------------------------------------------------------------------------
##  setting  value                                      
##  version  R version 3.2.0 Patched (2015-05-18 r68382)
##  system   x86_64, darwin10.8.0                       
##  ui       X11                                        
##  language (EN)                                       
##  collate  en_US.UTF-8                                
##  tz       Europe/Vienna
## Packages ---------------------------------------------------------------------------------------------------------------
##  package              * version     date       source                                   
##  acepack                1.3-3.3     2014-11-24 CRAN (R 3.2.0)                           
##  annotate               1.47.0      2015-04-17 Bioconductor                             
##  AnnotationDbi          1.31.16     2015-06-10 Bioconductor                             
##  bibtex                 0.4.0       2014-12-31 CRAN (R 3.2.0)                           
##  Biobase              * 2.29.1      2015-05-03 Bioconductor                             
##  BiocGenerics         * 0.15.2      2015-06-05 Bioconductor                             
##  BiocParallel           1.3.25      2015-06-12 Bioconductor                             
##  biomaRt                2.25.1      2015-04-21 Bioconductor                             
##  Biostrings             2.37.2      2015-05-07 Bioconductor                             
##  bitops                 1.0-6       2013-08-17 CRAN (R 3.2.0)                           
##  bumphunter             1.9.1       2015-05-05 Bioconductor                             
##  cluster                2.0.1       2015-01-31 CRAN (R 3.2.0)                           
##  codetools              0.2-11      2015-03-10 CRAN (R 3.2.0)                           
##  colorspace             1.2-6       2015-03-11 CRAN (R 3.2.0)                           
##  curl                   0.9         2015-06-19 CRAN (R 3.2.1)                           
##  DBI                    0.3.1       2014-09-24 CRAN (R 3.2.0)                           
##  derfinder            * 1.3.3       2015-06-15 Bioconductor                             
##  derfinderData        * 0.103.1     2015-06-16 Bioconductor                             
##  derfinderHelper        1.3.0       2015-04-17 Bioconductor                             
##  DESeq2               * 1.9.10      2015-05-28 Bioconductor                             
##  devtools             * 1.8.0       2015-05-09 CRAN (R 3.2.0)                           
##  digest                 0.6.8       2014-12-31 CRAN (R 3.2.0)                           
##  doRNG                  1.6         2014-03-07 CRAN (R 3.2.0)                           
##  evaluate               0.7         2015-04-21 CRAN (R 3.2.0)                           
##  foreach                1.4.2       2014-04-11 CRAN (R 3.2.0)                           
##  foreign                0.8-63      2015-02-20 CRAN (R 3.2.0)                           
##  formatR                1.2         2015-04-21 CRAN (R 3.2.0)                           
##  Formula                1.2-1       2015-04-07 CRAN (R 3.2.0)                           
##  futile.logger          1.4.1       2015-04-20 CRAN (R 3.2.0)                           
##  futile.options         1.0.0       2010-04-06 CRAN (R 3.2.0)                           
##  genefilter             1.51.0      2015-04-17 Bioconductor                             
##  geneplotter            1.47.0      2015-04-17 Bioconductor                             
##  GenomeInfoDb         * 1.5.7       2015-06-02 Bioconductor                             
##  GenomicAlignments      1.5.9       2015-05-13 Bioconductor                             
##  GenomicFeatures        1.21.13     2015-06-09 Bioconductor                             
##  GenomicFiles           1.5.3       2015-05-13 Bioconductor                             
##  GenomicRanges        * 1.21.15     2015-06-09 Bioconductor                             
##  ggplot2                1.0.1.9000  2015-06-12 Github (hadley/ggplot2@3b6a126)          
##  git2r                  0.10.1      2015-05-07 CRAN (R 3.2.0)                           
##  gridExtra              0.9.1       2012-08-09 CRAN (R 3.2.0)                           
##  gtable                 0.1.2       2012-12-05 CRAN (R 3.2.0)                           
##  Hmisc                  3.16-0      2015-04-30 CRAN (R 3.2.0)                           
##  htmltools              0.2.6       2014-09-08 CRAN (R 3.2.0)                           
##  httr                   1.0.0       2015-06-25 CRAN (R 3.2.1)                           
##  IRanges              * 2.3.12      2015-06-16 Bioconductor                             
##  iterators              1.0.7       2014-04-11 CRAN (R 3.2.0)                           
##  knitcitations        * 1.0.6       2015-05-26 CRAN (R 3.2.0)                           
##  knitr                  1.10.5      2015-05-06 CRAN (R 3.2.0)                           
##  knitrBootstrap         1.0.0       2015-06-09 Github (jimhester/knitrBootstrap@76c41f0)
##  lambda.r               1.1.7       2015-03-20 CRAN (R 3.2.0)                           
##  lattice                0.20-31     2015-03-30 CRAN (R 3.2.0)                           
##  latticeExtra           0.6-26      2013-08-15 CRAN (R 3.2.0)                           
##  locfit                 1.5-9.1     2013-04-20 CRAN (R 3.2.0)                           
##  lubridate              1.3.3       2013-12-31 CRAN (R 3.2.0)                           
##  magrittr               1.5         2014-11-22 CRAN (R 3.2.0)                           
##  markdown               0.7.7       2015-04-22 CRAN (R 3.2.0)                           
##  MASS                   7.3-40      2015-03-21 CRAN (R 3.2.0)                           
##  Matrix                 1.2-1       2015-06-01 CRAN (R 3.2.0)                           
##  matrixStats            0.14.0      2015-02-14 CRAN (R 3.2.0)                           
##  memoise                0.2.1       2014-04-22 CRAN (R 3.2.0)                           
##  munsell                0.4.2       2013-07-11 CRAN (R 3.2.0)                           
##  nnet                   7.3-9       2015-02-11 CRAN (R 3.2.0)                           
##  pkgmaker               0.22        2014-05-14 CRAN (R 3.2.0)                           
##  plyr                   1.8.3       2015-06-12 CRAN (R 3.2.0)                           
##  proto                  0.3-10      2012-12-22 CRAN (R 3.2.0)                           
##  qvalue                 2.1.0       2015-04-17 Bioconductor                             
##  R6                     2.0.1       2014-10-29 CRAN (R 3.2.0)                           
##  RColorBrewer           1.1-2       2014-12-07 CRAN (R 3.2.0)                           
##  Rcpp                 * 0.11.6      2015-05-01 CRAN (R 3.2.0)                           
##  RcppArmadillo        * 0.5.200.1.0 2015-06-05 CRAN (R 3.2.0)                           
##  RCurl                  1.95-4.6    2015-04-24 CRAN (R 3.2.0)                           
##  RefManageR             0.8.63      2015-06-09 CRAN (R 3.2.0)                           
##  registry               0.2         2012-01-24 CRAN (R 3.2.0)                           
##  reshape2               1.4.1       2014-12-06 CRAN (R 3.2.0)                           
##  RJSONIO                1.3-0       2014-07-28 CRAN (R 3.2.0)                           
##  rmarkdown            * 0.7         2015-06-13 CRAN (R 3.2.0)                           
##  rngtools               1.2.4       2014-03-06 CRAN (R 3.2.0)                           
##  rpart                  4.1-9       2015-02-24 CRAN (R 3.2.0)                           
##  Rsamtools              1.21.8      2015-06-03 Bioconductor                             
##  RSQLite                1.0.0       2014-10-25 CRAN (R 3.2.0)                           
##  rtracklayer          * 1.29.10     2015-06-13 Bioconductor                             
##  rversions              1.0.1       2015-06-06 CRAN (R 3.2.0)                           
##  S4Vectors            * 0.7.6       2015-06-16 Bioconductor                             
##  scales                 0.2.5       2015-06-12 CRAN (R 3.2.0)                           
##  stringi                0.5-5       2015-06-29 CRAN (R 3.2.1)                           
##  stringr                1.0.0       2015-04-30 CRAN (R 3.2.0)                           
##  SummarizedExperiment * 0.1.5       2015-05-17 Bioconductor                             
##  survival               2.38-2      2015-06-12 CRAN (R 3.2.0)                           
##  XML                    3.98-1.2    2015-05-31 CRAN (R 3.2.0)                           
##  xml2                   0.1.1       2015-06-02 CRAN (R 3.2.0)                           
##  xtable                 1.7-4       2014-09-12 CRAN (R 3.2.0)                           
##  XVector                0.9.1       2015-04-30 Bioconductor                             
##  yaml                   2.1.13      2014-06-12 CRAN (R 3.2.0)                           
##  zlibbioc               1.15.0      2015-04-17 Bioconductor
Date this protocol was last modified: 2015-07-06 02:21:32.