Find genomic regions for which a numeric vector is above (or below) predefined thresholds. In other words, this function finds the candidate Differentially Expressed Regions (candidate DERs). This is similar to regionFinder and is a helper function for calculatePvalues.
findRegions(
position = NULL,
fstats,
chr,
oneTable = TRUE,
maxClusterGap = 300L,
cutoff = quantile(fstats, 0.99, na.rm = TRUE),
segmentIR = NULL,
smooth = FALSE,
weights = NULL,
smoothFunction = bumphunter::locfitByCluster,
...
)
A logical Rle of genomic positions. This is generated in
loadCoverage. Note that it gets updated in preprocessCoverage
if colsubset
is not NULL
.
A numeric Rle with the F-statistics. Usually obtained using calculateStats.
A single element character vector specifying the chromosome name.
If TRUE
only one GRanges is returned.
Otherwise, a GRangesList with two components is returned: one for the
regions with positive values and one for the negative values.
This determines the maximum gap between candidate DERs.
It should be greater than maxRegionGap
(0 by default).
Threshold applied to the fstats
used to determine the
regions.
An IRanges object with the genomic positions that are potentials DERs. This is used in calculatePvalues to speed up permutation calculations.
Whether to smooth the F-statistics (fstats
) or not. This
is by default FALSE
. For RNA-seq data we recommend using FALSE
.
Weights used by the smoother as described in smoother.
A function to be used for smoothing the F-statistics.
Two functions are provided by the bumphunter
package:
loessByCluster and runmedByCluster. If
you are using your own custom function, it has to return a named list with
an element called $fitted
that contains the smoothed F-statistics and
an element claled $smoothed
that is a logical vector indicating
whether the F-statistics were smoothed or not. If they are not smoothed, the
original values will be used.
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
If TRUE
basic status updates will be printed along
the way.
If TRUE
a DataFrame is returned that has only basic
information on the candidate DERs. This is used in calculatePvalues
to speed up permutation calculations. Default: FALSE
.
This determines the maximum number of gaps between two genomic positions to be considered part of the same candidate region. The default is 0L.
Passed to extendedMapSeqlevels and the internal function
.getSegmentsRle
that has by default verbose = FALSE
.
When smooth = TRUE
, ...
is passed to the internal function
.smootherFstats
. This internal function has the advanced argument
maxClusterGap
(same as above) and passes ...
to
define_cluster and the formal arguments of smoothFun
.
Either a GRanges or a GRangesList as determined by oneTable
.
Each of them has the following metadata variables.
The mean of the values of y
for the given region.
The absolute value of the sum of the values of y
for
the given region.
The start position of the region in terms of the index
for y
.
The end position of the region in terms of the index for
y
.
The cluser ID.
The total length of the cluster.
regionFinder adapted to Rle world.
Rafael A. Irizarry, Martin Aryee, Hector Corrada Bravo, Kasper D. Hansen and Harris A. Jaffee. bumphunter: Bump Hunter. R package version 1.1.10.
## Preprocess the data
prep <- preprocessCoverage(genomeData,
cutoff = 0, scalefac = 32, chunksize = 1e3,
colsubset = NULL
)
## Get the F statistics
fstats <- genomeFstats
## Find the regions
regs <- findRegions(prep$position, fstats, "chr21", verbose = TRUE)
#> 2023-05-07 06:01:18.768514 findRegions: identifying potential segments
#> 2023-05-07 06:01:18.771864 findRegions: segmenting information
#> 2023-05-07 06:01:18.772179 .getSegmentsRle: segmenting with cutoff(s) 19.7936614060235
#> 2023-05-07 06:01:18.779094 findRegions: identifying candidate regions
#> 2023-05-07 06:01:18.815964 findRegions: identifying region clusters
regs
#> GRanges object with 3 ranges and 6 metadata columns:
#> seqnames ranges strand | value area indexStart
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer>
#> up chr21 47409011-47409016 * | 20.2182 121.3091 82
#> up chr21 47409051 * | 20.3485 20.3485 122
#> up chr21 47409522-47409533 * | 21.6702 260.0421 128
#> indexEnd cluster clusterL
#> <integer> <Rle> <Rle>
#> up 87 1 41
#> up 122 1 41
#> up 139 2 12
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
if (FALSE) {
## Once you have the regions you can proceed to annotate them
library("bumphunter")
genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene)
annotation <- matchGenes(regs, genes)
annotation
}
# Find regions with smoothing the F-statistics by bumphunter::runmedByCluster
regs_smooth <- findRegions(prep$position, fstats, "chr21",
verbose = TRUE,
smoothFunction = bumphunter::runmedByCluster
)
#> 2023-05-07 06:01:18.872142 findRegions: identifying potential segments
#> 2023-05-07 06:01:18.875353 findRegions: segmenting information
#> 2023-05-07 06:01:18.875669 .getSegmentsRle: segmenting with cutoff(s) 19.7936614060235
#> 2023-05-07 06:01:18.882329 findRegions: identifying candidate regions
#> 2023-05-07 06:01:18.919534 findRegions: identifying region clusters
## Compare against the object regs obtained earlier
regs_smooth
#> GRanges object with 3 ranges and 6 metadata columns:
#> seqnames ranges strand | value area indexStart
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer>
#> up chr21 47409011-47409016 * | 20.2182 121.3091 82
#> up chr21 47409051 * | 20.3485 20.3485 122
#> up chr21 47409522-47409533 * | 21.6702 260.0421 128
#> indexEnd cluster clusterL
#> <integer> <Rle> <Rle>
#> up 87 1 41
#> up 122 1 41
#> up 139 2 12
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths