Calculate p-values and identify regions

First, this function finds the regions of interest according to specified cutoffs. Then it permutes the samples and re-calculates the F-statistics. The area of the statistics from these segments are then used to calculate p-values for the original regions.

calculatePvalues(
  coveragePrep,
  models,
  fstats,
  nPermute = 1L,
  seeds = as.integer(gsub("-", "", Sys.Date())) + seq_len(nPermute),
  chr,
  cutoff = quantile(fstats, 0.99, na.rm = TRUE),
  significantCut = c(0.05, 0.1),
  lowMemDir = NULL,
  smooth = FALSE,
  weights = NULL,
  smoothFunction = bumphunter::locfitByCluster,
  ...
)

Arguments

coveragePrep

A list with $coverageProcessed, $mclapplyIndex, and $position normally generated using preprocessCoverage.

models

A list with $mod and $mod0 normally generated using makeModels.

fstats

A numerical Rle with the F-statistics normally generated using calculateStats.

nPermute

The number of permutations. Note that for a full chromosome, a small amount (10) of permutations is sufficient. If set to 0, no permutations are performed and thus no null regions are used, however, the $regions component is created.

seeds

An integer vector of length nPermute specifying the seeds to be used for each permutation. If NULL no seeds are used.

chr

A single element character vector specifying the chromosome name. This argument is passed to findRegions.

cutoff

F-statistic cutoff to use to determine segments.

significantCut

A vector of length two specifiying the cutoffs used to determine significance. The first element is used to determine significance for the P-values, while the second element is used for the Q-values (FDR adjusted P-values).

lowMemDir

The directory where the processed chunks are saved when using preprocessCoverage with a specified lowMemDir.

smooth

Whether to smooth the F-statistics (fstats) or not. This is by default FALSE. For RNA-seq data we recommend using FALSE.

weights

Weights used by the smoother as described in smoother.

smoothFunction

A function to be used for smoothing the F-statistics. Two functions are provided by the bumphunter package: loessByCluster and runmedByCluster. If you are using your own custom function, it has to return a named list with an element called $fitted that contains the smoothed F-statistics and an element claled $smoothed that is a logical vector indicating whether the F-statistics were smoothed or not. If they are not smoothed, the original values will be used.

...

Arguments passed to other methods and/or advanced arguments. Advanced arguments:

verbose: If TRUE basic status updates will be printed along the way.
scalefac: This argument is passed to fstats.apply and should be the same as the one used in preprocessCoverage. Default: 32.
method: Has to be either 'Matrix' (default), 'Rle' or 'regular'. See details in fstats.apply.
adjustF: A single value to adjust that is added in the denominator of the F-stat calculation. Useful when the Residual Sum of Squares of the alternative model is very small. Default: 0.
writeOutput: If TRUE then the regions are saved before calculating q-values, and then overwritten once the q-values are written. This argument was introduced to save the results from the permutations (can take some time) to investigate the problem described at https://support.bioconductor.org/p/62026/
maxRegionGap: Passed to internal functions of findRegions. Default: 0.

Passed to findRegions, smoothFunction and define_cluster.

Value

A list with four components:

regions: is a GRanges with metadata columns given by findRegions with the additional metadata column pvalues: p-value of the region calculated via permutations of the samples; qvalues: the qvalues calculated using qvalue; significant: whether the p-value is less than 0.05 (by default); significantQval: whether the q-value is less than 0.10 (by default). It also includes the mean coverage of the region (mean from the mean coverage at each base calculated in preprocessCoverage). Furthermore, if groupInfo was not NULL in preprocessCoverage, then the group mean coverage is calculated as well as the log 2 fold change (using group 1 as the reference).
nullStats: is a numeric Rle with the mean of the null statistics by segment.
nullWidths: is a numeric Rle with the length of each of the segments in the null distribution. The area can be obtained by multiplying the absolute nullstats by the corresponding lengths.
nullPermutation: is a Rle with the permutation number from which the null region originated from.

Author

Leonardo Collado-Torres

Examples

## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage),
    verbose = TRUE
)
#> 2023-05-07 06:01:07.083085 collapseFullCoverage: Sorting fullCov
#> 2023-05-07 06:01:07.08555 collapseFullCoverage: Collapsing chromosomes information by sample

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), verbose = TRUE)
#> 2023-05-07 06:01:07.087735 sampleDepth: Calculating sample quantiles
#> 2023-05-07 06:01:07.093398 sampleDepth: Calculating sample adjustments

## Build the models
group <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = group, adjustvars = adjustvars)

## Preprocess the data
## Automatic chunksize used to then compare 1 vs 4 cores in the 'do not run'
## section
prep <- preprocessCoverage(genomeData,
    groupInfo = group, cutoff = 0,
    scalefac = 32, chunksize = NULL, colsubset = NULL, mc.cores = 4
)

## Get the F statistics
fstats <- genomeFstats

## We recommend determining the cutoff to use based on the F-distribution
## although you could also based it on the observed F-statistics.

## In this example we use a low cutoff used for illustrative purposes
cutoff <- 1

## Calculate the p-values and define the regions of interest.
regsWithP <- calculatePvalues(prep, models, fstats,
    nPermute = 1, seeds = 1,
    chr = "chr21", cutoff = cutoff, mc.cores = 1, method = "regular"
)
#> 2023-05-07 06:01:07.420015 calculatePvalues: identifying data segments
#> 2023-05-07 06:01:07.423844 findRegions: segmenting information
#> 2023-05-07 06:01:07.427514 findRegions: identifying candidate regions
#> 2023-05-07 06:01:07.472595 findRegions: identifying region clusters
#> 2023-05-07 06:01:07.572004 calculatePvalues: calculating F-statistics for permutation 1 and seed 1
#> 2023-05-07 06:01:07.844803 findRegions: segmenting information
#> 2023-05-07 06:01:07.848173 findRegions: identifying candidate regions
#> 2023-05-07 06:01:07.897063 calculatePvalues: calculating the p-values
#> 2023-05-07 06:01:07.919936 calculatePvalues: skipping q-value calculation.
regsWithP
#> $regions
#> GRanges object with 33 ranges and 14 metadata columns:
#>      seqnames            ranges strand |     value      area indexStart
#>         <Rle>         <IRanges>  <Rle> | <numeric> <numeric>  <integer>
#>   up    chr21 47409522-47409560      * |  15.05364   587.092        128
#>   up    chr21 47411924-47411986      * |   7.09670   447.092        618
#>   up    chr21 47417352-47417397      * |   8.77845   403.808       1292
#>   up    chr21 47412662-47412724      * |   5.67122   357.287        843
#>   up    chr21 47408998-47409026      * |  10.68525   309.872         69
#>   ..      ...               ...    ... .       ...       ...        ...
#>   up    chr21 47417666-47417672      * |   1.21552   8.50863       1390
#>   up    chr21 47409685-47409688      * |   1.21585   4.86342        192
#>   up    chr21 47417335-47417337      * |   1.59162   4.77487       1275
#>   up    chr21          47418068      * |   3.99766   3.99766       1434
#>   up    chr21 47410311-47410312      * |   1.14363   2.28725        246
#>       indexEnd cluster clusterL meanCoverage   meanCEU   meanYRI
#>      <integer>   <Rle>    <Rle>    <numeric> <numeric> <numeric>
#>   up       166       3      167     0.415219  0.612943 0.0000000
#>   up       680       6      389     0.584741  0.837491 0.0539683
#>   up      1337      10      338     0.364656  0.533126 0.0108696
#>   up       905       7       63     0.501280  0.739985 0.0000000
#>   up        97       2      232     0.195773  0.288998 0.0000000
#>   ..       ...     ...      ...          ...       ...       ...
#>   up      1396      10      338    0.2258065  0.333333       0.0
#>   up       195       3      167    0.4758065  0.654762       0.1
#>   up      1277      10      338    0.1290323  0.190476       0.0
#>   up      1434      11        1    0.0322581  0.047619       0.0
#>   up       247       4      159    0.7258065  1.071429       0.0
#>      log2FoldChangeYRIvsCEU   pvalues significant   qvalues significantQval
#>                   <numeric> <numeric>    <factor> <logical>       <logical>
#>   up                   -Inf 0.0175439       TRUE       <NA>            <NA>
#>   up               -3.95589 0.0175439       TRUE       <NA>            <NA>
#>   up               -5.61611 0.0526316       FALSE      <NA>            <NA>
#>   up                   -Inf 0.1578947       FALSE      <NA>            <NA>
#>   up                   -Inf 0.1929825       FALSE      <NA>            <NA>
#>   ..                    ...       ...         ...       ...             ...
#>   up                   -Inf  0.859649       FALSE      <NA>            <NA>
#>   up               -2.71097  0.894737       FALSE      <NA>            <NA>
#>   up                   -Inf  0.894737       FALSE      <NA>            <NA>
#>   up                   -Inf  0.894737       FALSE      <NA>            <NA>
#>   up                   -Inf  0.929825       FALSE      <NA>            <NA>
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $nullStats
#> numeric-Rle of length 56 with 54 runs
#>   Lengths:       1       1       1       1 ...       1       1       1       1
#>   Values : 2.25473 3.63681 4.12769 9.73262 ... 1.16174 8.10124 5.20438 1.16174
#> 
#> $nullWidths
#> integer-Rle of length 56 with 48 runs
#>   Lengths:  1  1  1  1  2  2  1  1  2  1  1 ...  1  1  1  1  2  1  1  1  1  1
#>   Values :  9 20 54 45 27 54 38 21 36  3  8 ... 63 45 36 38 36  3  1 47 39  1
#> 
#> $nullPermutation
#> integer-Rle of length 56 with 1 run
#>   Lengths: 56
#>   Values :  1
#> 
if (FALSE) {
## Calculate again, but with 10 permutations instead of just 1
regsWithP <- calculatePvalues(prep, models, fstats,
    nPermute = 10, seeds = 1:10,
    chr = "chr21", cutoff = cutoff, mc.cores = 2, method = "regular"
)

## Check that they are the same as the previously calculated regions
library(testthat)
expect_that(regsWithP, equals(genomeRegions))

## Histogram of the theoretical p-values by region
hist(pf(regsWithP$regions$value, df1 - df0, n - df1), main = "Distribution
    original p-values by region", freq = FALSE)

## Histogram of the permutted p-values by region
hist(regsWithP$regions$pvalues, main = "Distribution permutted p-values by
    region", freq = FALSE)

## MA style plot
library("ggplot2")
ma <- data.frame(
    mean = regsWithP$regions$meanCoverage,
    log2FoldChange = regsWithP$regions$log2FoldChangeYRIvsCEU
)
ggplot(ma, aes(x = log2(mean), y = log2FoldChange)) +
    geom_point() +
    ylab("Fold Change (log2)") +
    xlab("Mean coverage (log2)") +
    labs(title = "MA style plot")

## Annotate the results
library("bumphunter")
genes <- annotateTranscripts(TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene)
annotation <- matchGenes(regsWithP$regions, genes)
head(annotation)
}

Arguments

Value

See also

Author

Examples