R/fullCoverage.R
fullCoverage.Rd
For a group of samples this function reads the coverage information for several chromosomes directly from the BAM files. Per chromosome, it merges the unfiltered coverage by sample into a DataFrame. The end result is a list with one such DataFrame objects per chromosome.
fullCoverage(
files,
chrs,
bai = NULL,
chrlens = NULL,
outputs = NULL,
cutoff = NULL,
...
)
A character vector with the full path to the sample BAM files
(or BigWig files).
The names are used for the column names of the DataFrame. Check
rawFiles for constructing files
. files
can also be a
BamFileList
object created with BamFileList or a
BigWigFileList
object created with BigWigFileList.
The chromosome of the files to read. The format has to match the one used in the input files.
The full path to the BAM index files. If NULL
it is
assumed that the BAM index files are in the same location as the BAM files
and that they have the .bai extension. Ignored if files
is a
BamFileList
object.
The chromosome lengths in base pairs. If it's NULL
,
the chromosome length is extracted from the BAM files. Otherwise, it should
have the same length as chrs
.
This argument is passed to the output
argument of
loadCoverage. If NULL
or 'auto'
it is then recycled.
This argument is passed to filterData.
Arguments passed to other methods and/or advanced arguments. Advanced arguments:
If TRUE
basic status updates will be printed along
the way.
How many cores to use for reading the chromosome
information. There's no benefit of using a number greater than the number
of chromosomes. Also note that your harddisk speed will limit how much
you get from using a higher mc.cores
value.
Controls the number of cores to be used per chr for
loading the data which is only useful in the scenario that you are loading
in genome tiles. If not supplied, it uses mc.cores
for
loadCoverage. Default: 1. If you are using genome tiles, the total
number of cores you'll use will be mc.cores
times
mc.cores.load
.
Passed to loadCoverage, define_cluster and
extendedMapSeqlevels.
Note that filterData is used internally
by loadCoverage (and hence fullCoverage) and has the important
arguments totalMapped
and targetSize
which can be used to
normalize the coverage by library size. See getTotalMapped for
calculating totalMapped
.
A list with one element per chromosome.
datadir <- system.file("extdata", "genomeData", package = "derfinder")
files <- rawFiles(
datadir = datadir, samplepatt = "*accepted_hits.bam$",
fileterm = NULL
)
## Shorten the column names
names(files) <- gsub("_accepted_hits.bam", "", names(files))
## Read and filter the data, only for 1 file
fullCov <- fullCoverage(files = files[1], chrs = c("21", "22"))
#> 2023-05-07 06:01:19.264883 fullCoverage: processing chromosome 21
#> 2023-05-07 06:01:19.273236 loadCoverage: finding chromosome lengths
#> 2023-05-07 06:01:19.293643 loadCoverage: loading BAM file /__w/_temp/Library/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
#> 2023-05-07 06:01:19.353115 loadCoverage: applying the cutoff to the merged data
#> 2023-05-07 06:01:19.367733 filterData: originally there were 48129895 rows, now there are 48129895 rows. Meaning that 0 percent was filtered.
#> 2023-05-07 06:01:19.368349 fullCoverage: processing chromosome 22
#> 2023-05-07 06:01:19.371603 loadCoverage: finding chromosome lengths
#> 2023-05-07 06:01:19.383013 loadCoverage: loading BAM file /__w/_temp/Library/derfinder/extdata/genomeData/ERR009101_accepted_hits.bam
#> 2023-05-07 06:01:19.43013 loadCoverage: applying the cutoff to the merged data
#> 2023-05-07 06:01:19.44397 filterData: originally there were 51304566 rows, now there are 51304566 rows. Meaning that 0 percent was filtered.
#> extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
fullCov
#> $chr21
#> DataFrame with 48129895 rows and 1 column
#> ERR009101
#> <Rle>
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> ... ...
#> 48129891 0
#> 48129892 0
#> 48129893 0
#> 48129894 0
#> 48129895 0
#>
#> $chr22
#> DataFrame with 51304566 rows and 1 column
#> ERR009101
#> <Rle>
#> 1 0
#> 2 0
#> 3 0
#> 4 0
#> 5 0
#> ... ...
#> 51304562 0
#> 51304563 0
#> 51304564 0
#> 51304565 0
#> 51304566 0
#>
if (FALSE) {
## You can then use filterData() to filter the data if you want to.
## Use bplapply() if you want to do so with multiple cores as shown below.
library("BiocParallel")
p <- SnowParam(2L)
bplapply(fullCov, function(x) {
library("derfinder")
filterData(x, cutoff = 0)
}, BPPARAM = p)
}