Run the derfinder analysis on a chromosome

This is a major wrapper for running several key functions from this package. It is meant to be used after loadCoverage has been used for a specific chromosome. The steps run include makeModels, preprocessCoverage, calculateStats, calculatePvalues and annotating with annotateTranscripts and matchGenes.

analyzeChr(
  chr,
  coverageInfo,
  models,
  cutoffPre = 5,
  cutoffFstat = 1e-08,
  cutoffType = "theoretical",
  nPermute = 1,
  seeds = as.integer(gsub("-", "", Sys.Date())) + seq_len(nPermute),
  groupInfo,
  txdb = NULL,
  writeOutput = TRUE,
  runAnnotation = TRUE,
  lowMemDir = file.path(chr, "chunksDir"),
  smooth = FALSE,
  weights = NULL,
  smoothFunction = bumphunter::locfitByCluster,
  ...
)

Arguments

chr

Used for naming the output files when writeOutput=TRUE and the resulting GRanges object.

coverageInfo

A list containing a DataFrame –$coverage– with the coverage data and a logical Rle –$position– with the positions that passed the cutoff. This object is generated using loadCoverage. You should have specified a cutoff value for loadCoverage unless that you are using colsubset which will force a filtering step with filterData when running preprocessCoverage.

models

The output from makeModels.

cutoffPre

This argument is passed to preprocessCoverage (cutoff).

cutoffFstat

This is used to determine the cutoff argument of calculatePvalues and it's behaviour is determined by cutoffType.

cutoffType

If set to empirical, the cutoffFstat (example: 0.99) quantile is used via quantile. If set to theoretical, the theoretical cutoffFstats (example: 1e-08) is calculated via qf. If set to manual, cutoffFstats is passed to calculatePvalues without any other calculation.

nPermute

The number of permutations. Note that for a full chromosome, a small amount (10) of permutations is sufficient. If set to 0, no permutations are performed and thus no null regions are used, however, the $regions component is created.

seeds

An integer vector of length nPermute specifying the seeds to be used for each permutation. If NULL no seeds are used.

groupInfo

A factor specifying the group membership of each sample that can later be used with the plotting functions in the derfinderPlot package.

txdb

This argument is passed to annotateTranscripts. If NULL, TxDb.Hsapiens.UCSC.hg19.knownGene is used.

writeOutput

If TRUE, output Rdata files are created at each step inside a directory with the chromosome name (example: 'chr21' if chrnum='21'). One Rdata file is created for each component described in the return section.

runAnnotation

If TRUE annotateTranscripts and matchGenes are run. Otherwise these steps are skipped.

lowMemDir

If specified, each chunk is saved into a separate Rdata file under lowMemDir and later loaded in fstats.apply when running calculateStats and calculatePvalues. Using this option helps reduce the memory load as each fork in bplapply loads only the data needed for the chunk processing. The downside is a bit longer computation time due to input/output.

smooth

Whether to smooth the F-statistics (fstats) or not. This is by default FALSE. For RNA-seq data we recommend using FALSE.

weights

Weights used by the smoother as described in smoother.

smoothFunction

A function to be used for smoothing the F-statistics. Two functions are provided by the bumphunter package: loessByCluster and runmedByCluster. If you are using your own custom function, it has to return a named list with an element called $fitted that contains the smoothed F-statistics and an element claled $smoothed that is a logical vector indicating whether the F-statistics were smoothed or not. If they are not smoothed, the original values will be used.

...

Arguments passed to other methods and/or advanced arguments. Advanced arguments:

verbose: If TRUE basic status updates will be printed along the way. Default TRUE.
scalefac: This argument is passed to preprocessCoverage.
chunksize: This argument is passed to preprocessCoverage.
returnOutput: If TRUE, it returns a list with the results from each step. Otherwise, it returns NULL. Default: the opposite of writeOutput.

Passed to extendedMapSeqlevels, preprocessCoverage, calculateStats, calculatePvalues, annotateTranscripts, matchGenes, and define_cluster.

Value

If returnOutput=TRUE, a list with six components:

timeinfo: The wallclock timing information for each step.
optionsStats: The main options used when running this function.
coveragePrep: The output from preprocessCoverage.
fstats: The output from calculateStats.
regions: The output from calculatePvalues.
annotation: The output from matchGenes.

These are the same components that are written to Rdata files if writeOutput=TRUE.

Details

If you are working with data from an organism different from 'Homo sapiens' specify so by setting the global 'species' and 'chrsStyle' options. For example: options(species = 'arabidopsis_thaliana') options(chrsStyle = 'NCBI')

Author

Leonardo Collado-Torres

Examples

## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list(genomeData$coverage),
    verbose = TRUE
)
#> 2024-12-13 15:13:02.935753 collapseFullCoverage: Sorting fullCov
#> 2024-12-13 15:13:02.939384 collapseFullCoverage: Collapsing chromosomes information by sample

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull,
    probs = c(0.5), nonzero = TRUE,
    verbose = TRUE
)
#> 2024-12-13 15:13:02.940978 sampleDepth: Calculating sample quantiles
#> 2024-12-13 15:13:02.946909 sampleDepth: Calculating sample adjustments

## Build the models
groupInfo <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars = groupInfo, adjustvars = adjustvars)

## Analyze the chromosome
results <- analyzeChr(
    chr = "21", coverageInfo = genomeData, models = models,
    cutoffFstat = 1, cutoffType = "manual", groupInfo = groupInfo, mc.cores = 1,
    writeOutput = FALSE, returnOutput = TRUE, method = "regular",
    runAnnotation = FALSE
)
#> extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
#> 2024-12-13 15:13:02.984882 analyzeChr: Pre-processing the coverage data
#> 2024-12-13 15:13:03.290077 analyzeChr: Calculating statistics
#> 2024-12-13 15:13:03.297256 calculateStats: calculating the F-statistics
#> 2024-12-13 15:13:03.360988 analyzeChr: Calculating pvalues
#> 2024-12-13 15:13:03.361671 analyzeChr: Using the following manual cutoff for the F-statistics 1
#> 2024-12-13 15:13:03.363219 calculatePvalues: identifying data segments
#> 2024-12-13 15:13:03.374242 findRegions: segmenting information
#> 2024-12-13 15:13:03.382751 findRegions: identifying candidate regions
#> 2024-12-13 15:13:03.461652 findRegions: identifying region clusters
#> 2024-12-13 15:13:03.570194 calculatePvalues: calculating F-statistics for permutation 1 and seed 20241214
#> 2024-12-13 15:13:03.617957 findRegions: segmenting information
#> 2024-12-13 15:13:03.622317 findRegions: identifying candidate regions
#> 2024-12-13 15:13:03.712589 calculatePvalues: calculating the p-values
#> 2024-12-13 15:13:03.746724 calculatePvalues: skipping q-value calculation.
#> 2024-12-13 15:13:03.760955 analyzeChr: Annotating regions
names(results)
#> [1] "timeinfo"     "optionsStats" "coveragePrep" "fstats"       "regions"     
#> [6] "annotation"