For a group of samples this function reads the coverage information for a specific chromosome directly from the BAM files. It then merges them into a DataFrame and removes the bases that do not pass the cutoff. This is a helper function for loadCoverage and preprocessCoverage.
filterData(
data,
cutoff = NULL,
index = NULL,
filter = "one",
totalMapped = NULL,
targetSize = 8e+07,
...
)
data: Either a list of Rle objects or a DataFrame with the coverage information.
cutoff: The base-pair level cutoff to use. Its behavior is controlled by filter.
index: A logical Rle with the positions of the chromosome that passed the cutoff. If NULL, it is assumed that this is the first time using filterData and thus no previous index exists.
filter: Has to be either 'one' (default) or 'mean'. In the first case, at least one sample has to have coverage above cutoff. In the second case, the mean coverage has to be greater than cutoff.
totalMapped: A vector with the total number of reads mapped for each sample. The vector should be in the same order as the samples in data. Providing this vector adjusts the coverage to a library of targetSize reads prior to filtering. See getTotalMapped for calculating this vector.
targetSize: The target library size to adjust the coverage to. Used only when totalMapped is specified. By default, it adjusts to libraries with 80 million reads.
...: Arguments passed to other methods and/or advanced arguments. Advanced arguments:
verbose: If TRUE, basic status updates will be printed along the way.
returnMean: If TRUE, the mean coverage is included in the result. FALSE by default.
returnCoverage: If TRUE, the coverage DataFrame is returned. TRUE by default.
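The interplay of cutoff, filter, totalMapped and targetSize described in the arguments can be illustrated with a small base R sketch. It is not the package's implementation: the matrix `raw`, the helper names `adj`, `keep_one` and `keep_mean`, and the scaling formula (coverage divided by totalMapped / targetSize) are assumptions made for illustration, and plain matrices stand in for Rle and DataFrame objects.

```r
## Hypothetical raw coverage: rows are bases, columns are samples
raw <- cbind(s1 = c(0, 12, 4, 10), s2 = c(1, 2, 4, 5), s3 = c(0, 1, 9, 5))

## Hypothetical library sizes, in the same order as the columns of raw
totalMapped <- c(s1 = 160e6, s2 = 80e6, s3 = 80e6)
targetSize <- 80e6
cutoff <- 3

## Assumed adjustment: express each sample's coverage as if its library
## had targetSize mapped reads, before any filtering takes place
adj <- sweep(raw, 2, totalMapped / targetSize, "/")

## filter = "one": keep a base if at least one sample is above the cutoff
keep_one <- rowSums(adj > cutoff) >= 1

## filter = "mean": keep a base if the mean across samples is above the cutoff
keep_mean <- rowMeans(adj) > cutoff

keep_one
keep_mean
```

In this toy data the second base passes under 'one' (sample s1 has adjusted coverage 6) but not under 'mean' (the mean is exactly 3, which is not greater than the cutoff), so 'mean' is the stricter rule here.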
A list with up to three components.
coverage is a DataFrame object where each column represents a sample. The number of rows depends on the number of base pairs that passed the cutoff and the information stored is the coverage at that given base. Included only when returnCoverage = TRUE.
position is a logical Rle with the positions of the chromosome that passed the cutoff.
meanCoverage is a numeric Rle with the mean coverage at each base. Included only when returnMean = TRUE.
colnames: Specifies the column names to be used for the results DataFrame. If NULL, names from data are used.
smoothMean: Whether to smooth the mean. Used only when filter = 'mean'. This option is used internally by regionMatrix.
Passed to the internal function .smootherFstats, see findRegions.
If cutoff is NULL, the data is grouped into a DataFrame without applying any cutoff. This can be useful if you want to use loadCoverage to build the coverage DataFrame without filtering, for downstream purposes such as plotting the coverage values of a given region. You can always specify the colsubset argument in preprocessCoverage to filter the data before calculating the F statistics.
## Construct some toy data
library("IRanges")
#> Loading required package: BiocGenerics
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: ‘S4Vectors’
#> The following object is masked from ‘package:utils’:
#>
#> findMatches
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
## filterData() is provided by the derfinder package
library("derfinder")
x <- Rle(round(runif(1e4, max = 10)))
y <- Rle(round(runif(1e4, max = 10)))
z <- Rle(round(runif(1e4, max = 10)))
DF <- DataFrame(x, y, z)
## Filter the data
filt1 <- filterData(DF, 5)
#> 2023-05-07 06:01:17.92625 filterData: originally there were 10000 rows, now there are 8300 rows. Meaning that 17 percent was filtered.
filt1
#> $coverage
#> DataFrame with 8300 rows and 3 columns
#> x y z
#> <Rle> <Rle> <Rle>
#> 1 1 3 7
#> 2 7 3 10
#> 3 4 10 8
#> 4 8 6 3
#> 5 6 1 1
#> ... ... ... ...
#> 8296 9 4 2
#> 8297 4 8 10
#> 8298 9 2 6
#> 8299 8 0 9
#> 8300 6 1 9
#>
#> $position
#> logical-Rle of length 10000 with 2850 runs
#> Lengths: 13 1 1 1 1 ... 1 1 1 1 1
#> Values : TRUE FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE TRUE FALSE
#>
## Filter again but only using the first two samples
filt2 <- filterData(filt1$coverage[, 1:2], 5, index = filt1$position)
#> 2023-05-07 06:01:17.976002 filterData: originally there were 8300 rows, now there are 6924 rows. Meaning that 16.58 percent was filtered.
filt2
#> $coverage
#> DataFrame with 6924 rows and 2 columns
#> x y
#> <Rle> <Rle>
#> 1 7 3
#> 2 4 10
#> 3 8 6
#> 4 6 1
#> 5 8 9
#> ... ... ...
#> 6920 9 4
#> 6921 4 8
#> 6922 9 2
#> 6923 8 0
#> 6924 6 1
#>
#> $position
#> logical-Rle of length 10000 with 4265 runs
#> Lengths: 1 12 1 1 1 ... 1 1 1 1 1
#> Values : FALSE TRUE FALSE TRUE FALSE ... FALSE TRUE FALSE TRUE FALSE
#>
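In the second call above, the returned position still has length 10000 even though only 8300 rows were passed in, which suggests that the new filter result is folded back into the previous index. A minimal base R sketch of that bookkeeping follows. It uses plain logical vectors instead of Rle objects, the names `position`, `keep_again` and `new_position` are hypothetical, and it only illustrates the idea rather than the package's code.

```r
## Index from a first round of filtering over 6 chromosome positions:
## TRUE marks the positions whose rows were kept (here 1, 3, 4 and 6)
position <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)

## Result of a second filter applied to the 4 rows that were kept
keep_again <- c(TRUE, FALSE, TRUE, FALSE)

## Fold the second result back into chromosome coordinates:
## only positions that were TRUE before can remain TRUE
new_position <- position
new_position[position] <- keep_again

new_position
which(new_position)
```

The number of TRUE values in the updated index matches the number of rows that survive both filters, and `which()` recovers their chromosome coordinates.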