Load the coverage information from a group of BAM files

For a group of samples this function reads the coverage information for a specific chromosome directly from the BAM files. It then merges them into a DataFrame and removes the bases that do not pass the cutoff.

loadCoverage(
  files,
  chr,
  cutoff = NULL,
  filter = "one",
  chrlen = NULL,
  output = NULL,
  bai = NULL,
  ...
)

Arguments

files

A character vector with the full path to the sample BAM files (or BigWig files). The names are used for the column names of the DataFrame. Check rawFiles for constructing files. files can also be a BamFileList, BamFile, BigWigFileList, or BigWigFile object.

chr

Chromosome to read. Should be in the format matching the one used in the raw data.

cutoff

This argument is passed to filterData.

filter

Has to be either 'one' (default) or 'mean'. In the first case, at least one sample has to have coverage above cutoff. In the second case, the mean coverage has to be greater than cutoff.

chrlen

The chromosome length in base pairs. If it's NULL, the chromosome length is extracted from the BAM files.

output

If NULL then no output is saved in disk. If auto then an automatic name is constructed using UCSC names (chrXCovInfo.Rdata for example). If another character is specified, then that name is used for the output file.

bai

The full path to the BAM index files. If NULL it is assumed that the BAM index files are in the same location as the BAM files and that they have the .bai extension. Ignored if files is a BamFileList object or if inputType=='BigWig'.

...

Arguments passed to other methods and/or advanced arguments. Advanced arguments:

verbose: If TRUE basic status updates will be printed along the way.
inputType: Has to be either bam or BigWig. It specifies the format of the raw data files. By default it's set to bam before trying to guess it from the file names.
tilewidth: When specified, tileGenome is used to break up the chromosome into chunks. We don't recommend this for BAM files as the coverage in the borders of the chunks might be slightly off.
which: NULL by default. When a GRanges is specified, this specific region of the genome is loaded instead of the full chromosome.
fileStyle: The naming style of the chromosomes in the input files. If the global option chrsStyle is not set, the naming style won't be changed. This option is useful when you want to use a specific naming style but the raw files use another style.
protectWhich: When not NULL and which is specified, this argument specifies by how much the ranges in which will be expanded. This helps get the same base level coverage you would get from reading the coverage for a full chromosome from BAM files. Otherwise some reads might be excluded and thus the base level coverage will be lower. NULL by default.
drop.D: Whether to drop the bases with 'D' in the CIGAR strings or to include them. Only used with BAM files. FALSE by default.
sampleNames: Column names to be used the samples. By default it's names(files).

Passed to extendedMapSeqlevels, define_cluster, scanBamFlag and filterData. Note that filterData is used internally by loadCoverage and has the important arguments totalMapped and targetSize which can be used to normalize the coverage by library size. See getTotalMapped for calculating totalMapped.

Value

A list with two components.

coverage: is a DataFrame object where each column represents a sample. The number of rows depends on the number of base pairs that passed the cutoff and the information stored is the coverage at that given base.
position: is a logical Rle with the positions of the chromosome that passed the cutoff.

Details

The ... argument can be used to control which alignments to consider when reading from BAM files. See scanBamFlag.

Parallelization for loading the data in chunks is used only used when tilewidth is specified. You may use up to one core per tile.

If you set the advanced argument drop.D = TRUE, bases with CIGAR string "D" (deletion from reference) will be excluded from the base-level coverage calculation.

If you are working with data from an organism different from 'Homo sapiens' specify so by setting the global 'species' and 'chrsStyle' options. For example: options(species = 'arabidopsis_thaliana') options(chrsStyle = 'NCBI')

Author

Leonardo Collado-Torres, Andrew Jaffe

Examples

datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
    datadir = datadir, samplepatt = "*accepted_hits.bam$",
    fileterm = NULL
)
## Shorten the column names
names(files) <- gsub("_accepted_hits.bam", "", names(files))

## Read and filter the data, only for 2 files
dataSmall <- loadCoverage(files = files[1:2], chr = "21", cutoff = 0)
#> 2024-12-13 15:13:15.453254 loadCoverage: finding chromosome lengths
#> 2024-12-13 15:13:15.462297 loadCoverage: loading BAM file /__w/_temp/Library/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
#> 2024-12-13 15:13:15.495133 loadCoverage: loading BAM file /__w/_temp/Library/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
#> 2024-12-13 15:13:15.527323 loadCoverage: applying the cutoff to the merged data
#> 2024-12-13 15:13:15.539829 filterData: originally there were 48129895 rows, now there are 454 rows. Meaning that 100 percent was filtered.
if (FALSE) { # \dontrun{
## Export to BigWig files
createBw(list("chr21" = dataSmall))

## Load data from BigWig files
dataSmall.bw <- loadCoverage(c(
    ERR009101 = "ERR009101.bw", ERR009102 =
        "ERR009102.bw"
), chr = "chr21")

## Compare them
mapply(function(x, y) {
    x - y
}, dataSmall$coverage, dataSmall.bw$coverage)

## Note that the only difference is the type of Rle (integer vs numeric) used
## to store the data.
} # }