For a group of samples this function reads the coverage information for a specific chromosome directly from the BAM files. It then merges them into a DataFrame and removes the bases that do not pass the cutoff.
loadCoverage(
files,
chr,
cutoff = NULL,
filter = "one",
chrlen = NULL,
output = NULL,
bai = NULL,
...
)
A character vector with the full path to the sample BAM files
(or BigWig files).
The names are used for the column names of the DataFrame. Check
rawFiles for constructing files
. files
can also be a
BamFileList, BamFile,
BigWigFileList, or BigWigFile object.
Chromosome to read. Should be in the format matching the one used in the raw data.
This argument is passed to filterData.
Has to be either 'one'
(default) or 'mean'
. In
the first case, at least one sample has to have coverage above cutoff
.
In the second case, the mean coverage has to be greater than cutoff
.
The chromosome length in base pairs. If it's NULL
, the
chromosome length is extracted from the BAM files.
If NULL
then no output is saved in disk. If auto
then an automatic name is constructed using UCSC names (chrXCovInfo.Rdata
for example). If another character is specified, then that name is used for
the output file.
The full path to the BAM index files. If NULL
it is
assumed that the BAM index files are in the same location as the BAM files
and that they have the .bai extension. Ignored if files
is a
BamFileList
object or if inputType=='BigWig'
.
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
If TRUE
basic status updates will be printed along
the way.
Has to be either bam
or BigWig
. It specifies
the format of the raw data files. By default it's set to bam
before
trying to guess it from the file names.
When specified, tileGenome is used to break up the chromosome into chunks. We don't recommend this for BAM files as the coverage in the borders of the chunks might be slightly off.
NULL
by default. When a GRanges
is specified,
this specific region of the genome is loaded instead of the full chromosome.
The naming style of the chromosomes in the input files.
If the global option chrsStyle
is not set, the naming style won't be
changed. This option is useful when you want to use a specific naming style
but the raw files use another style.
When not NULL
and which
is specified,
this argument specifies by how much the ranges in which
will be
expanded. This helps get the same base level coverage you would get from
reading the coverage for a full chromosome from BAM files. Otherwise some
reads might be excluded and thus the base level coverage will be lower.
NULL
by default.
Whether to drop the bases with 'D' in the CIGAR strings
or to include them. Only used with BAM files. FALSE
by default.
Column names to be used the samples. By default it's
names(files)
.
Passed to extendedMapSeqlevels, define_cluster,
scanBamFlag and filterData.
Note that filterData is used internally
by loadCoverage and has the important arguments totalMapped
and targetSize
which can be used to normalize the coverage by
library size. See getTotalMapped for calculating totalMapped
.
A list with two components.
is a DataFrame object where each column represents a sample. The number of rows depends on the number of base pairs that passed the cutoff and the information stored is the coverage at that given base.
is a logical Rle with the positions of the chromosome that passed the cutoff.
The ...
argument can be used to control which alignments to consider
when reading from BAM files. See scanBamFlag.
Parallelization for loading the data in chunks is used only used when
tilewidth
is specified. You may use up to one core per tile.
If you set the advanced argument drop.D = TRUE
, bases with CIGAR
string "D" (deletion from reference) will be excluded from the base-level
coverage calculation.
If you are working with data from an organism different from 'Homo sapiens'
specify so by setting the global 'species' and 'chrsStyle' options. For
example:
options(species = 'arabidopsis_thaliana')
options(chrsStyle = 'NCBI')
datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
datadir = datadir, samplepatt = "*accepted_hits.bam$",
fileterm = NULL
)
## Shorten the column names
names(files) <- gsub("_accepted_hits.bam", "", names(files))
## Read and filter the data, only for 2 files
dataSmall <- loadCoverage(files = files[1:2], chr = "21", cutoff = 0)
#> 2023-05-07 06:01:21.817745 loadCoverage: finding chromosome lengths
#> 2023-05-07 06:01:21.830585 loadCoverage: loading BAM file /__w/_temp/Library/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
#> 2023-05-07 06:01:21.879642 loadCoverage: loading BAM file /__w/_temp/Library/derfinder/extdata/genomeData/ERR009102_accepted_hits.bam
#> 2023-05-07 06:01:21.927497 loadCoverage: applying the cutoff to the merged data
#> 2023-05-07 06:01:21.94533 filterData: originally there were 48129895 rows, now there are 454 rows. Meaning that 100 percent was filtered.
if (FALSE) {
## Export to BigWig files
createBw(list("chr21" = dataSmall))
## Load data from BigWig files
dataSmall.bw <- loadCoverage(c(
ERR009101 = "ERR009101.bw", ERR009102 =
"ERR009102.bw"
), chr = "chr21")
## Compare them
mapply(function(x, y) {
x - y
}, dataSmall$coverage, dataSmall.bw$coverage)
## Note that the only difference is the type of Rle (integer vs numeric) used
## to store the data.
}