First, this function finds the regions of interest according to specified cutoffs. Then it permutes the samples and re-calculates the F-statistics. The area of the statistics from these segments are then used to calculate p-values for the original regions.
calculatePvalues(
coveragePrep,
models,
fstats,
nPermute = 1L,
seeds = as.integer(gsub("-", "", Sys.Date())) + seq_len(nPermute),
chr,
cutoff = quantile(fstats, 0.99, na.rm = TRUE),
significantCut = c(0.05, 0.1),
lowMemDir = NULL,
smooth = FALSE,
weights = NULL,
smoothFunction = bumphunter::locfitByCluster,
...
)
A list with $coverageProcessed
,
$mclapplyIndex
, and $position
normally generated using
preprocessCoverage.
A list with $mod
and $mod0
normally generated
using makeModels.
A numerical Rle with the F-statistics normally generated using calculateStats.
The number of permutations. Note that for a full chromosome,
a small amount (10) of permutations is sufficient. If set to 0, no
permutations are performed and thus no null regions are used, however, the
$regions
component is created.
An integer vector of length nPermute
specifying the
seeds to be used for each permutation. If NULL
no seeds are used.
A single element character vector specifying the chromosome name. This argument is passed to findRegions.
F-statistic cutoff to use to determine segments.
A vector of length two specifiying the cutoffs used to determine significance. The first element is used to determine significance for the P-values, while the second element is used for the Q-values (FDR adjusted P-values).
The directory where the processed chunks are saved when
using preprocessCoverage with a specified lowMemDir
.
Whether to smooth the F-statistics (fstats
) or not. This
is by default FALSE
. For RNA-seq data we recommend using FALSE
.
Weights used by the smoother as described in smoother.
A function to be used for smoothing the F-statistics.
Two functions are provided by the bumphunter
package:
loessByCluster and runmedByCluster. If
you are using your own custom function, it has to return a named list with
an element called $fitted
that contains the smoothed F-statistics and
an element claled $smoothed
that is a logical vector indicating
whether the F-statistics were smoothed or not. If they are not smoothed, the
original values will be used.
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
If TRUE
basic status updates will be printed along
the way.
This argument is passed to fstats.apply and should be the same as the one used in preprocessCoverage. Default: 32.
Has to be either 'Matrix' (default), 'Rle' or 'regular'. See details in fstats.apply.
A single value to adjust that is added in the denominator of the F-stat calculation. Useful when the Residual Sum of Squares of the alternative model is very small. Default: 0.
If TRUE
then the regions are saved before
calculating q-values, and then overwritten once the q-values are written.
This argument was introduced to save the results from the permutations (can
take some time) to investigate the problem described at
https://support.bioconductor.org/p/62026/
Passed to internal functions of findRegions. Default: 0.
Passed to findRegions, smoothFunction
and
define_cluster.
A list with four components:
is a GRanges with metadata columns given by
findRegions with the additional metadata column pvalues
:
p-value of the region calculated via permutations of the samples;
qvalues
: the qvalues calculated using qvalue;
significant
: whether the p-value is less than 0.05 (by default);
significantQval
: whether the q-value is less than 0.10 (by default).
It also includes the mean coverage of the region (mean from the mean
coverage at each base calculated in preprocessCoverage). Furthermore,
if groupInfo
was not NULL
in preprocessCoverage, then
the group mean coverage is calculated as well as the log 2 fold change
(using group 1 as the reference).
is a numeric Rle with the mean of the null statistics by segment.
is a numeric Rle with the length of each of the segments
in the null distribution. The area can be obtained by multiplying the
absolute nullstats
by the corresponding lengths.
is a Rle with the permutation number from which the null region originated from.
## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage),
verbose = TRUE
)
#> 2023-05-07 06:01:07.083085 collapseFullCoverage: Sorting fullCov
#> 2023-05-07 06:01:07.08555 collapseFullCoverage: Collapsing chromosomes information by sample
## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), verbose = TRUE)
#> 2023-05-07 06:01:07.087735 sampleDepth: Calculating sample quantiles
#> 2023-05-07 06:01:07.093398 sampleDepth: Calculating sample adjustments
## Build the models
group <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = group, adjustvars = adjustvars)
## Preprocess the data
## Automatic chunksize used to then compare 1 vs 4 cores in the 'do not run'
## section
prep <- preprocessCoverage(genomeData,
groupInfo = group, cutoff = 0,
scalefac = 32, chunksize = NULL, colsubset = NULL, mc.cores = 4
)
## Get the F statistics
fstats <- genomeFstats
## We recommend determining the cutoff to use based on the F-distribution
## although you could also based it on the observed F-statistics.
## In this example we use a low cutoff used for illustrative purposes
cutoff <- 1
## Calculate the p-values and define the regions of interest.
regsWithP <- calculatePvalues(prep, models, fstats,
nPermute = 1, seeds = 1,
chr = "chr21", cutoff = cutoff, mc.cores = 1, method = "regular"
)
#> 2023-05-07 06:01:07.420015 calculatePvalues: identifying data segments
#> 2023-05-07 06:01:07.423844 findRegions: segmenting information
#> 2023-05-07 06:01:07.427514 findRegions: identifying candidate regions
#> 2023-05-07 06:01:07.472595 findRegions: identifying region clusters
#> 2023-05-07 06:01:07.572004 calculatePvalues: calculating F-statistics for permutation 1 and seed 1
#> 2023-05-07 06:01:07.844803 findRegions: segmenting information
#> 2023-05-07 06:01:07.848173 findRegions: identifying candidate regions
#> 2023-05-07 06:01:07.897063 calculatePvalues: calculating the p-values
#> 2023-05-07 06:01:07.919936 calculatePvalues: skipping q-value calculation.
regsWithP
#> $regions
#> GRanges object with 33 ranges and 14 metadata columns:
#> seqnames ranges strand | value area indexStart
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <integer>
#> up chr21 47409522-47409560 * | 15.05364 587.092 128
#> up chr21 47411924-47411986 * | 7.09670 447.092 618
#> up chr21 47417352-47417397 * | 8.77845 403.808 1292
#> up chr21 47412662-47412724 * | 5.67122 357.287 843
#> up chr21 47408998-47409026 * | 10.68525 309.872 69
#> .. ... ... ... . ... ... ...
#> up chr21 47417666-47417672 * | 1.21552 8.50863 1390
#> up chr21 47409685-47409688 * | 1.21585 4.86342 192
#> up chr21 47417335-47417337 * | 1.59162 4.77487 1275
#> up chr21 47418068 * | 3.99766 3.99766 1434
#> up chr21 47410311-47410312 * | 1.14363 2.28725 246
#> indexEnd cluster clusterL meanCoverage meanCEU meanYRI
#> <integer> <Rle> <Rle> <numeric> <numeric> <numeric>
#> up 166 3 167 0.415219 0.612943 0.0000000
#> up 680 6 389 0.584741 0.837491 0.0539683
#> up 1337 10 338 0.364656 0.533126 0.0108696
#> up 905 7 63 0.501280 0.739985 0.0000000
#> up 97 2 232 0.195773 0.288998 0.0000000
#> .. ... ... ... ... ... ...
#> up 1396 10 338 0.2258065 0.333333 0.0
#> up 195 3 167 0.4758065 0.654762 0.1
#> up 1277 10 338 0.1290323 0.190476 0.0
#> up 1434 11 1 0.0322581 0.047619 0.0
#> up 247 4 159 0.7258065 1.071429 0.0
#> log2FoldChangeYRIvsCEU pvalues significant qvalues significantQval
#> <numeric> <numeric> <factor> <logical> <logical>
#> up -Inf 0.0175439 TRUE <NA> <NA>
#> up -3.95589 0.0175439 TRUE <NA> <NA>
#> up -5.61611 0.0526316 FALSE <NA> <NA>
#> up -Inf 0.1578947 FALSE <NA> <NA>
#> up -Inf 0.1929825 FALSE <NA> <NA>
#> .. ... ... ... ... ...
#> up -Inf 0.859649 FALSE <NA> <NA>
#> up -2.71097 0.894737 FALSE <NA> <NA>
#> up -Inf 0.894737 FALSE <NA> <NA>
#> up -Inf 0.894737 FALSE <NA> <NA>
#> up -Inf 0.929825 FALSE <NA> <NA>
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $nullStats
#> numeric-Rle of length 56 with 54 runs
#> Lengths: 1 1 1 1 ... 1 1 1 1
#> Values : 2.25473 3.63681 4.12769 9.73262 ... 1.16174 8.10124 5.20438 1.16174
#>
#> $nullWidths
#> integer-Rle of length 56 with 48 runs
#> Lengths: 1 1 1 1 2 2 1 1 2 1 1 ... 1 1 1 1 2 1 1 1 1 1
#> Values : 9 20 54 45 27 54 38 21 36 3 8 ... 63 45 36 38 36 3 1 47 39 1
#>
#> $nullPermutation
#> integer-Rle of length 56 with 1 run
#> Lengths: 56
#> Values : 1
#>
if (FALSE) {
## Calculate again, but with 10 permutations instead of just 1
regsWithP <- calculatePvalues(prep, models, fstats,
nPermute = 10, seeds = 1:10,
chr = "chr21", cutoff = cutoff, mc.cores = 2, method = "regular"
)
## Check that they are the same as the previously calculated regions
library(testthat)
expect_that(regsWithP, equals(genomeRegions))
## Histogram of the theoretical p-values by region
hist(pf(regsWithP$regions$value, df1 - df0, n - df1), main = "Distribution
original p-values by region", freq = FALSE)
## Histogram of the permutted p-values by region
hist(regsWithP$regions$pvalues, main = "Distribution permutted p-values by
region", freq = FALSE)
## MA style plot
library("ggplot2")
ma <- data.frame(
mean = regsWithP$regions$meanCoverage,
log2FoldChange = regsWithP$regions$log2FoldChangeYRIvsCEU
)
ggplot(ma, aes(x = log2(mean), y = log2FoldChange)) +
geom_point() +
ylab("Fold Change (log2)") +
xlab("Mean coverage (log2)") +
labs(title = "MA style plot")
## Annotate the results
library("bumphunter")
genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene)
annotation <- matchGenes(regsWithP$regions, genes)
head(annotation)
}