R/regionMatrix.R
regionMatrix.Rd
Given a set of un-filtered coverage data (see fullCoverage), create candidate regions by applying a cutoff on the coverage values, and obtain a count matrix where the number of rows corresponds to the number of candidate regions and the number of columns corresponds to the number of samples. The values are the mean coverage for a given sample for a given region.
regionMatrix(
fullCov,
cutoff = 5,
L,
totalMapped = 8e+07,
targetSize = 8e+07,
runFilter = TRUE,
returnBP = TRUE,
...
)
A list where each element is the result from
loadCoverage used with returnCoverage = TRUE
. Can be generated
using fullCoverage. If runFilter = FALSE
, then
returnMean = TRUE
must have been used.
The base-pair level cutoff to use. It's behavior is controlled
by filter
.
The width of the reads used. Either a vector of length 1 or length equal to the number of samples.
A vector with the total number of reads mapped for each
sample. The vector should be in the same order as the samples in
fullCov
. Providing this argument adjusts the coverage to reads in
targetSize
library prior to filtering. See getTotalMapped for
calculating this vector.
The target library size to adjust the coverage to. Used
only when totalMapped
is specified. By default, it adjusts to
libraries with 80 million reads.
This controls whether to run filterData or not. If
set to FALSE
then returnMean = TRUE
must have been used to
create each element of fullCov
and the data must have been
normalized (totalMapped
equal to targetSize
).
If TRUE
, returns $bpCoverage
explained below.
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
If TRUE
basic status updates will be printed along
the way.
Default: UCSC
. Passed to
extendedMapSeqlevels via getRegionCoverage.
Default: homo_sapiens
. Passed to
extendedMapSeqlevels via getRegionCoverage.
Default: NULL
. Passed to
extendedMapSeqlevels via getRegionCoverage.
Passed to filterData, findRegions and define_cluster.
Note that filterData is used internally
by loadCoverage (and hence fullCoverage
) and has the important
arguments totalMapped
and targetSize
which can be used to
normalize the coverage by library size. If you already used these arguments
when creating the fullCov
object, then don't specify them a second
time in regionMatrix. If you have not used these arguments, we
recommend using them to normalize the mean coverage.
A list with one entry per chromosome. Then per chromosome, a list with three components.
A set of regions based on the coverage filter cutoff as returned by findRegions.
A list with one element per region. Each element is a matrix with numbers of rows equal to the number of base pairs in the region and number of columns equal to the number of samples. It contains the base-level coverage information for the regions. Only returned when returnBP = TRUE
.
A matrix with the mean coverage by sample for each candidate region.
This function uses several other derfinder-package functions. Inspect the code if interested.
You should use at most one core per chromosome.
## Create some toy data
library("IRanges")
x <- Rle(round(runif(1e4, max = 10)))
y <- Rle(round(runif(1e4, max = 10)))
z <- Rle(round(runif(1e4, max = 10)))
fullCov <- list("chr21" = DataFrame(x, y, z))
## Calculate a proxy of library size
libSize <- sapply(fullCov$chr21, sum)
## Run region matrix normalizing the coverage
regionMat <- regionMatrix(
fullCov = fullCov, maxRegionGap = 10L,
maxClusterGap = 300L, L = 36, totalMapped = libSize, targetSize = 4e4
)
#> 2024-12-13 15:13:25.576159 regionMatrix: processing chr21
#> 2024-12-13 15:13:25.576608 filterData: normalizing coverage
#> 2024-12-13 15:13:25.583704 filterData: done normalizing coverage
#> 2024-12-13 15:13:25.596135 filterData: originally there were 10000 rows, now there are 2586 rows. Meaning that 74.14 percent was filtered.
#> 2024-12-13 15:13:25.59662 findRegions: identifying potential segments
#> 2024-12-13 15:13:25.598913 findRegions: segmenting information
#> 2024-12-13 15:13:25.601395 findRegions: identifying candidate regions
#> 2024-12-13 15:13:25.626213 findRegions: identifying region clusters
#> 2024-12-13 15:13:25.71001 getRegionCoverage: processing chr21
#> 2024-12-13 15:13:25.721348 getRegionCoverage: done processing chr21
#> 2024-12-13 15:13:25.723182 regionMatrix: calculating coverageMatrix
#> 2024-12-13 15:13:25.727549 regionMatrix: adjusting coverageMatrix for 'L'
if (FALSE) { # \dontrun{
## You can alternatively use filterData() on fullCov to reduce the required
## memory before using regionMatrix(). This can be useful when mc.cores > 1
filteredCov <- lapply(fullCov, filterData,
returnMean = TRUE, filter = "mean",
cutoff = 5, totalMapped = libSize, targetSize = 4e4
)
regionMat2 <- regionMatrix(filteredCov,
maxRegionGap = 10L,
maxClusterGap = 300L, L = 36, runFilter = FALSE
)
} # }
## regionMatrix() can work with multiple chrs as shown below.
fullCov2 <- list("chr21" = DataFrame(x, y, z), "chr22" = DataFrame(x, y, z))
regionMat2 <- regionMatrix(
fullCov = fullCov2, maxRegionGap = 10L,
maxClusterGap = 300L, L = 36, totalMapped = libSize, targetSize = 4e4
)
#> 2024-12-13 15:13:25.739236 regionMatrix: processing chr21
#> 2024-12-13 15:13:25.739638 filterData: normalizing coverage
#> 2024-12-13 15:13:25.745563 filterData: done normalizing coverage
#> 2024-12-13 15:13:25.757288 filterData: originally there were 10000 rows, now there are 2586 rows. Meaning that 74.14 percent was filtered.
#> 2024-12-13 15:13:25.75773 findRegions: identifying potential segments
#> 2024-12-13 15:13:25.76004 findRegions: segmenting information
#> 2024-12-13 15:13:25.762624 findRegions: identifying candidate regions
#> 2024-12-13 15:13:25.78816 findRegions: identifying region clusters
#> 2024-12-13 15:13:25.870698 getRegionCoverage: processing chr21
#> 2024-12-13 15:13:25.881839 getRegionCoverage: done processing chr21
#> 2024-12-13 15:13:25.883481 regionMatrix: calculating coverageMatrix
#> 2024-12-13 15:13:25.887555 regionMatrix: adjusting coverageMatrix for 'L'
#> 2024-12-13 15:13:25.888007 regionMatrix: processing chr22
#> 2024-12-13 15:13:25.888372 filterData: normalizing coverage
#> 2024-12-13 15:13:25.894159 filterData: done normalizing coverage
#> 2024-12-13 15:13:25.906066 filterData: originally there were 10000 rows, now there are 2586 rows. Meaning that 74.14 percent was filtered.
#> 2024-12-13 15:13:25.906513 findRegions: identifying potential segments
#> 2024-12-13 15:13:25.908762 findRegions: segmenting information
#> 2024-12-13 15:13:25.911284 findRegions: identifying candidate regions
#> 2024-12-13 15:13:25.936164 findRegions: identifying region clusters
#> 2024-12-13 15:13:26.018848 getRegionCoverage: processing chr22
#> 2024-12-13 15:13:26.050456 getRegionCoverage: done processing chr22
#> 2024-12-13 15:13:26.052435 regionMatrix: calculating coverageMatrix
#> 2024-12-13 15:13:26.056613 regionMatrix: adjusting coverageMatrix for 'L'
## Combine results from multiple chromosomes
library("GenomicRanges")
## First extract the data
regs <- unlist(GRangesList(lapply(regionMat2, "[[", "regions")))
covMat <- do.call(rbind, lapply(regionMat2, "[[", "coverageMatrix"))
covBp <- do.call(c, lapply(regionMat2, "[[", "bpCoverage"))
## Force the names to match
names(regs) <- rownames(covMat) <- names(covBp) <- seq_len(length(regs))
## Combine into a list (not really needed)
mergedRegionMat <- list(
"regions" = regs, "coverageMatrix" = covMat,
"bpCoverage" = covBp
)