This function takes the coverage data from loadCoverage, scales it, applies the log2 transformation, and splits it into chunks suitable for calculateStats.
preprocessCoverage(
  coverageInfo,
  groupInfo = NULL,
  cutoff = 5,
  colsubset = NULL,
  lowMemDir = NULL,
  ...
)
coverageInfo: A list containing a DataFrame ($coverage) with the coverage data and a logical Rle ($position) with the positions that passed the cutoff. This object is generated using loadCoverage. You should have specified a cutoff value for loadCoverage unless you are using colsubset, which forces a filtering step with filterData when running preprocessCoverage.
groupInfo: A factor specifying the group membership of each sample. If NULL, no group mean coverages are calculated. If the factor has more than one level, the first one will be used to calculate the log2 fold change in calculatePvalues.
cutoff: The base-pair level cutoff to use. Its behavior is controlled by filter.
colsubset: Optional vector of column indices of coverageInfo$coverage that denote the samples you wish to include in the analysis.
lowMemDir: If specified, each chunk is saved into a separate Rdata file under lowMemDir and later loaded in fstats.apply when running calculateStats and calculatePvalues. Using this option helps reduce the memory load, as each fork in bplapply loads only the data needed to process its chunk. The downside is a slightly longer computation time due to input/output.
...: Arguments passed to other methods and/or advanced arguments. Advanced arguments:
verbose: If TRUE, basic status updates will be printed along the way. Default: FALSE.
toMatrix: Determines whether the data in each chunk should already be saved as a Matrix object, which can help reduce the computation time of the F-statistics. Only used when lowMemDir is not NULL, in which case it is set to TRUE by default.
mc.cores: Number of cores you will use for calculating the statistics.
scalefac: A log2 transformation is used on the count tables, so zero counts present a problem. This is the number added to the entire matrix before the transformation. Default: 32.
chunksize: How many rows of coverageInfo$coverage should be processed at a time? Default: 5 million. Reduce this number if you have hundreds of samples to reduce the memory burden while sacrificing some speed.
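The offset and log2 step described for scalefac can be sketched in base R. This is only an illustration, assuming the transformation is log2(x + scalefac); the vector raw_coverage is made-up data, not part of the package. Under that assumption a raw coverage of 0 maps to 5 and a raw coverage of 1 maps to about 5.0444, the two values that appear in the example output for this function.

```r
# Sketch of the offset + log2 transformation (assumed form: log2(x + scalefac)).
scalefac <- 32                      # default offset added to every count
raw_coverage <- c(0, 1, 5, 100)     # hypothetical raw base-pair coverage values

# Adding the offset first keeps zero counts finite after the log.
transformed <- log2(raw_coverage + scalefac)

print(transformed[1:2])
```

Without the offset, log2(0) would be -Inf, which is the problem the argument description refers to.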
A list with five components.
coverageProcessed contains the processed coverage information in a DataFrame object. Each column represents a sample and the coverage information is scaled and log2 transformed. Note that if colsubset is not NULL, the number of columns will be less than in coverageInfo$coverage. The total number of rows depends on the number of base pairs that passed the cutoff, and the information stored is the coverage at that given base. Further note that filterData is re-applied if colsubset is not NULL, which could lead to fewer rows than in coverageInfo$coverage.
mclapplyIndex is a list of logical Rle objects. They contain the partitioning information according to chunksize.
position is a logical Rle with the positions of the chromosome that passed the cutoff.
meanCoverage is a numeric Rle with the mean coverage at each filtered base.
groupMeans is a list of Rle objects containing the mean coverage at each filtered base calculated by group. This list has length 0 if groupInfo = NULL.
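As a rough illustration of the mean coverage and per-group mean components, the following base R sketch computes a per-base mean across all samples and a per-base mean within each group. It uses a plain matrix and ordinary numeric vectors instead of DataFrame and Rle objects, and cov_mat and group_info are made-up inputs, so it shows the idea and not the package's implementation.

```r
# Hypothetical coverage matrix: 3 bases (rows) x 4 samples (columns).
cov_mat <- matrix(
  c(0, 1, 2, 3,
    0, 0, 4, 1,
    2, 2, 2, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, paste0("sample", 1:4))
)
group_info <- factor(c("A", "A", "B", "B"))  # group membership per sample

# Mean coverage at each base across all samples.
mean_coverage <- rowMeans(cov_mat)

# Mean coverage at each base within each group: one vector per factor level.
group_means <- lapply(
  split(seq_len(ncol(cov_mat)), group_info),
  function(cols) rowMeans(cov_mat[, cols, drop = FALSE])
)

print(mean_coverage)
print(group_means)
```

The result is a list with one element per level of group_info, which matches the shape described for the per-group means when a grouping factor is supplied.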
Passed to filterData when colsubset is specified.
If chunksize is NULL, then mc.cores is used to determine the chunksize. This is useful if you want to split the data so each core gets the same amount of data (up to rounding).
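The relationship between chunksize, mc.cores, and the chunk index can be sketched as follows. The helper make_chunk_index is hypothetical, and the rule chunksize = ceiling(rows / mc.cores) is an assumption about how the rows are divided evenly among cores; the sketch also uses base R logical vectors where the package returns logical Rle objects.

```r
# Hypothetical helper: build one logical index per chunk.
make_chunk_index <- function(n_rows, chunksize = NULL, mc.cores = 1) {
  # Assumed rule: with no chunksize, divide the rows evenly among the cores.
  if (is.null(chunksize)) {
    chunksize <- ceiling(n_rows / mc.cores)
  }
  chunk_id <- ceiling(seq_len(n_rows) / chunksize)
  lapply(seq_len(max(chunk_id)), function(i) chunk_id == i)
}

# 1434 rows with chunksize = 1000, as in the example for this function.
index <- make_chunk_index(1434, chunksize = 1000)
rows_per_chunk <- vapply(index, sum, integer(1))

# Same rows, chunksize left as NULL and derived from 4 cores.
index_by_core <- make_chunk_index(1434, mc.cores = 4)
rows_per_core <- vapply(index_by_core, sum, integer(1))

print(rows_per_chunk)
print(rows_per_core)
```

With chunksize = 1000 this gives two chunks of 1000 and 434 rows, the same run lengths shown for $mclapplyIndex in the example output. With four cores each chunk holds 359 rows except the last, which holds 357.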
Computing the indexes and using those for mclapply reduces memory copying, as described by Ryan Thompson and illustrated in approach #4 (see http://lcolladotor.github.io/2013/11/14/Reducing-memory-overhead-when-using-mclapply).
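A minimal sketch of the index-based pattern, using only the parallel package that ships with R: the list handed to mclapply holds row indexes, and each worker subsets the shared matrix itself, so no pre-split copies of the data are built up front. The matrix cov_mat and the per-row summary are made up for illustration, and mc.cores = 1 is used so the sketch also runs on platforms without forking.

```r
library(parallel)

set.seed(1)
# Hypothetical coverage matrix: 12 bases (rows) x 5 samples (columns).
cov_mat <- matrix(rpois(60, lambda = 3), nrow = 12)

# Compute row indexes per chunk (chunks of 5 rows) without splitting the data.
chunk_id <- ceiling(seq_len(nrow(cov_mat)) / 5)
index <- lapply(unique(chunk_id), function(i) which(chunk_id == i))

# Each worker receives only the indexes and subsets the matrix on its own.
res <- mclapply(
  index,
  function(rows) rowSums(cov_mat[rows, , drop = FALSE]),
  mc.cores = 1
)

chunk_sums <- unlist(res)
print(chunk_sums)
```

Concatenating the per-chunk results reproduces the result of processing the whole matrix at once.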
If lowMemDir is specified, then $coverageProcessed is NULL and $mclapplyIndex is a vector with the chunk identifiers.
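The low-memory idea can be sketched in base R: each chunk is written to its own Rdata file, and later steps load only the chunk they need by its identifier. The directory, the file naming scheme (chunk1.Rdata and so on), and the helper load_chunk are hypothetical choices for this illustration, not the names the package uses.

```r
# Hypothetical low-memory directory inside the session's temporary folder.
low_mem_dir <- file.path(tempdir(), "chunks")
dir.create(low_mem_dir, showWarnings = FALSE)

# Made-up processed coverage: 10 bases (rows) x 2 samples (columns).
cov_mat <- matrix(log2(0:19 + 32), nrow = 10)

# Assign rows to chunks of 4 and keep only the chunk identifiers.
chunk_id <- ceiling(seq_len(nrow(cov_mat)) / 4)
chunk_names <- unique(chunk_id)

# Save each chunk into a separate Rdata file.
for (i in chunk_names) {
  chunk <- cov_mat[chunk_id == i, , drop = FALSE]
  save(chunk, file = file.path(low_mem_dir, paste0("chunk", i, ".Rdata")))
}

# Hypothetical loader: read one chunk back without touching the others.
load_chunk <- function(i) {
  env <- new.env()
  load(file.path(low_mem_dir, paste0("chunk", i, ".Rdata")), envir = env)
  env$chunk
}

chunk_rows <- vapply(chunk_names, function(i) nrow(load_chunk(i)), integer(1))
print(chunk_rows)
```

Only the vector of chunk identifiers needs to stay in memory between steps, at the cost of the extra file reads and writes.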
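## Load the package, which provides preprocessCoverage() and the genomeData example data
library("derfinder")
## Split the data and transform appropriately before using calculateStats()
dataReady <- preprocessCoverage(genomeData,
    cutoff = 0, scalefac = 32,
    chunksize = 1e3, colsubset = NULL, verbose = TRUE
)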
#> 2023-05-07 06:01:28.076262 preprocessCoverage: splitting the data
names(dataReady)
#> [1] "coverageProcessed" "mclapplyIndex" "position"
#> [4] "meanCoverage" "groupMeans"
dataReady
#> $coverageProcessed
#> DataFrame with 1434 rows and 31 columns
#> ERR009101 ERR009102 ERR009105 ERR009107 ERR009108 ERR009112
#> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
#> 1 5 5 5.04439411935845 5 5 5
#> 2 5 5 5.04439411935845 5 5 5
#> 3 5 5 5.04439411935845 5 5 5
#> 4 5 5 5.04439411935845 5 5 5
#> 5 5 5 5.04439411935845 5 5 5
#> ... ... ... ... ... ... ...
#> 1430 5.04439411935845 5 5 5 5 5
#> 1431 5.04439411935845 5 5 5 5 5
#> 1432 5.04439411935845 5 5 5 5 5
#> 1433 5.04439411935845 5 5 5 5 5
#> 1434 5 5 5 5 5 5
#> ERR009115 ERR009116 ERR009131 ERR009138 ERR009144 ERR009145
#> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
#> 1 5 5 5 5 5 5
#> 2 5 5 5 5 5 5
#> 3 5 5 5 5 5 5
#> 4 5 5 5 5 5 5
#> 5 5 5 5 5 5 5
#> ... ... ... ... ... ... ...
#> 1430 5 5 5 5 5 5
#> 1431 5 5 5 5 5 5
#> 1432 5 5 5 5 5 5
#> 1433 5 5 5 5 5 5
#> 1434 5 5 5 5 5 5
#> ERR009148 ERR009151 ERR009152 ERR009153 ERR009159 ERR009161
#> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
#> 1 5 5 5 5 5 5
#> 2 5 5 5 5 5.04439411935845 5
#> 3 5 5 5 5 5.04439411935845 5
#> 4 5 5 5 5 5.04439411935845 5
#> 5 5 5 5 5 5.04439411935845 5
#> ... ... ... ... ... ... ...
#> 1430 5.04439411935845 5 5 5 5 5
#> 1431 5.04439411935845 5 5 5 5 5
#> 1432 5.04439411935845 5 5 5 5 5
#> 1433 5.04439411935845 5 5 5 5 5
#> 1434 5 5 5 5 5 5
#> ERR009163 ERR009164 ERR009167 SRR031812 SRR031835 SRR031867
#> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
#> 1 5 5 5 5 5 5
#> 2 5 5 5 5 5 5
#> 3 5 5 5 5 5 5
#> 4 5 5 5 5 5 5
#> 5 5 5 5 5 5 5
#> ... ... ... ... ... ... ...
#> 1430 5 5 5.04439411935845 5 5 5
#> 1431 5 5 5.04439411935845 5 5 5
#> 1432 5 5 5.04439411935845 5 5 5
#> 1433 5 5 5.04439411935845 5 5 5
#> 1434 5 5 5.04439411935845 5 5 5
#> SRR031868 SRR031900 SRR031904 SRR031914 SRR031936 SRR031958 SRR031960
#> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle> <Rle>
#> 1 5 5 5 5 5 5 5
#> 2 5 5 5 5 5 5 5
#> 3 5 5 5 5 5 5 5
#> 4 5 5 5 5 5 5 5
#> 5 5 5 5 5 5 5 5
#> ... ... ... ... ... ... ... ...
#> 1430 5 5 5 5 5 5 5
#> 1431 5 5 5 5 5 5 5
#> 1432 5 5 5 5 5 5 5
#> 1433 5 5 5 5 5 5 5
#> 1434 5 5 5 5 5 5 5
#>
#> $mclapplyIndex
#> $mclapplyIndex[[1]]
#> logical-Rle of length 1434 with 2 runs
#> Lengths: 1000 434
#> Values : TRUE FALSE
#>
#> $mclapplyIndex[[2]]
#> logical-Rle of length 1434 with 2 runs
#> Lengths: 1000 434
#> Values : FALSE TRUE
#>
#>
#> $position
#> logical-Rle of length 48129895 with 67 runs
#> Lengths: 47407536 32 1256 36 ... 358 34 711827
#> Values : FALSE TRUE FALSE TRUE ... FALSE TRUE FALSE
#>
#> $meanCoverage
#> numeric-Rle of length 1434 with 330 runs
#> Lengths: 1 5 3 ... 1 5 1
#> Values : 0.0322581 0.0645161 0.0967742 ... 0.1290323 0.0967742 0.0322581
#>
#> $groupMeans
#> list()
#>