This function takes the coverage data from loadCoverage, scales it, applies the log2 transformation, and splits it into chunks suitable for calculateStats.

preprocessCoverage(
  coverageInfo,
  groupInfo = NULL,
  cutoff = 5,
  colsubset = NULL,
  lowMemDir = NULL,
  ...
)

Arguments

coverageInfo

A list containing a DataFrame ($coverage) with the coverage data and a logical Rle ($position) with the positions that passed the cutoff. This object is generated using loadCoverage. You should have specified a cutoff value for loadCoverage unless you are using colsubset, which forces a filtering step with filterData when running preprocessCoverage.

groupInfo

A factor specifying the group membership of each sample. If NULL, no group mean coverages are calculated. If the factor has more than one level, the first one will be used to calculate the log2 fold change in calculatePvalues.

cutoff

The base-pair level cutoff to use. Its behavior is controlled by filter.

colsubset

Optional vector of column indices of coverageInfo$coverage that denote samples you wish to include in analysis.

lowMemDir

If specified, each chunk is saved into a separate Rdata file under lowMemDir and later loaded in fstats.apply when running calculateStats and calculatePvalues. Using this option helps reduce the memory load as each fork in bplapply loads only the data needed for the chunk processing. The downside is a bit longer computation time due to input/output.

...

Arguments passed to other methods and/or advanced arguments. Advanced arguments:

verbose

If TRUE basic status updates will be printed along the way. Default: FALSE.

toMatrix

Determines whether the data in the chunk should already be saved as a Matrix object, which can be useful to reduce the computation time of the F-statistics. Only used when lowMemDir is not NULL, in which case it is set to TRUE by default.

mc.cores

Number of cores you will use for calculating the statistics.

scalefac

A log2 transformation is used on the count tables, so zero counts present a problem. What number should we add to the entire matrix? Default: 32.

chunksize

How many rows of coverageInfo$coverage should be processed at a time? Default: 5 million. Reduce this number if you have hundreds of samples to reduce the memory burden while sacrificing some speed.
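As a rough illustration of why scalefac is needed, the sketch below applies the offset and the log2 transformation to a toy matrix in base R. The counts matrix and its sample names are hypothetical; derfinder itself stores coverage as Rle columns in a DataFrame, so this shows only the arithmetic, not the package's implementation.

```r
# Toy coverage counts: rows are bases, columns are samples (hypothetical data)
counts <- matrix(c(0, 1, 0, 0, 3, 0),
    nrow = 3,
    dimnames = list(NULL, c("sampleA", "sampleB"))
)

scalefac <- 32

# Adding the offset before taking log2 keeps zero counts finite
transformed <- log2(counts + scalefac)
transformed
```

With scalefac = 32, a zero count maps to log2(32) = 5 and a count of 1 maps to log2(33), roughly 5.044. These are the two values that appear in the $coverageProcessed output in the Examples section.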

Value

A list with five components.

coverageProcessed

contains the processed coverage information in a DataFrame object. Each column represents a sample and the coverage information is scaled and log2 transformed. Note that if colsubset is not NULL the number of columns will be fewer than in coverageInfo$coverage. The total number of rows depends on the number of base pairs that passed the cutoff, and the information stored is the coverage at that given base. Further note that filterData is re-applied if colsubset is not NULL and could thus lead to fewer rows compared to coverageInfo$coverage.

mclapplyIndex

is a list of logical Rle objects. They contain the partitioning information according to chunksize.

position

is a logical Rle with the positions of the chromosome that passed the cutoff.

meanCoverage

is a numeric Rle with the mean coverage at each filtered base.

groupMeans

is a list of Rle objects containing the mean coverage at each filtered base calculated by group. This list has length 0 if groupInfo=NULL.
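To make the meanCoverage and groupMeans components concrete, here is a minimal base R sketch of per-base means overall and by group. The toy_cov matrix, the sample names, and the group labels are all hypothetical, and plain numeric vectors stand in for the Rle objects the function returns.

```r
# Toy coverage: 4 bases (rows) by 4 samples (columns); hypothetical values
toy_cov <- matrix(
    c(
        0, 2, 4, 6,
        2, 2, 4, 8,
        0, 0, 1, 1,
        4, 4, 0, 0
    ),
    nrow = 4, byrow = TRUE,
    dimnames = list(NULL, paste0("s", 1:4))
)
group <- factor(c("ctrl", "ctrl", "case", "case"), levels = c("ctrl", "case"))

# Mean coverage at each base across all samples
mean_coverage <- rowMeans(toy_cov)

# Mean coverage at each base within each group: one vector per factor level
group_means <- lapply(split(seq_len(ncol(toy_cov)), group), function(cols) {
    rowMeans(toy_cov[, cols, drop = FALSE])
})
```

The list is named by factor level, in level order, so the first element corresponds to the first level of the grouping factor.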

Additional arguments in ... are passed to filterData when colsubset is specified.

Details

If chunksize is NULL, then mc.cores is used to determine the chunksize. This is useful if you want to split the data so each core gets the same amount of data (up to rounding).
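The sketch below shows one way an even split could be derived from the number of cores. The rounding rule (ceiling of rows divided by cores) is an assumption made for illustration, not a statement of the package's exact formula, and the sizes are hypothetical.

```r
# Hypothetical sizes, for illustration only
n_rows <- 1434 # number of filtered bases
mc_cores <- 4 # number of workers

# Assumed rule: spread rows evenly, rounding up so no row is left out
chunksize <- ceiling(n_rows / mc_cores)
n_chunks <- ceiling(n_rows / chunksize)

# Rows per chunk; the last chunk absorbs the rounding remainder
chunk_sizes <- tabulate(ceiling(seq_len(n_rows) / chunksize), nbins = n_chunks)
chunk_sizes
```

Under this assumed rule, 1434 rows over 4 cores gives chunks of 359, 359, 359 and 357 rows, which is what "the same amount of data (up to rounding)" looks like in practice.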

Computing the indexes and using those for mclapply reduces memory copying as described by Ryan Thompson and illustrated in approach #4 at http://lcolladotor.github.io/2013/11/14/Reducing-memory-overhead-when-using-mclapply
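A minimal sketch of the index-based idea, using base R logical vectors in place of logical Rle objects and lapply in place of a parallel apply. The toy matrix and variable names are hypothetical; a parallel backend such as parallel::mclapply could be substituted for lapply on platforms that support forking.

```r
# Toy data: 1434 rows, 3 samples (hypothetical values)
set.seed(1)
n_rows <- 1434
chunksize <- 1000
toy <- matrix(rpois(n_rows * 3, lambda = 2), nrow = n_rows)

# One logical index per chunk; each is TRUE only for that chunk's rows
chunk_id <- ceiling(seq_len(n_rows) / chunksize)
index_list <- lapply(unique(chunk_id), function(i) chunk_id == i)

# Each worker receives a small index and subsets the shared data itself,
# rather than being handed a pre-split copy of the data
chunk_sums <- lapply(index_list, function(idx) {
    colSums(toy[idx, , drop = FALSE])
})

# Run-length view of the first index, analogous to a logical Rle
rle(index_list[[1]])
```

The first index has 1000 TRUE values followed by 434 FALSE values, the same run structure as $mclapplyIndex[[1]] in the Examples section.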

If lowMemDir is specified then $coverageProcessed is NULL and $mclapplyIndex is a vector with the chunk identifiers.
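The following sketch illustrates the general save-then-load pattern behind a low-memory mode, using base R only. The directory, the file naming scheme (chunk1.Rdata, chunk2.Rdata, ...), and the load_chunk helper are hypothetical and are not taken from the package.

```r
# Hypothetical directory and toy data
low_mem_dir <- file.path(tempdir(), "chunks_demo")
dir.create(low_mem_dir, showWarnings = FALSE, recursive = TRUE)

toy <- matrix(seq_len(20), nrow = 10)
chunk_id <- ceiling(seq_len(nrow(toy)) / 4) # chunks of up to 4 rows

# Save each chunk to its own file; keep only the identifiers in memory
chunk_ids <- unique(chunk_id)
for (i in chunk_ids) {
    chunk <- toy[chunk_id == i, , drop = FALSE]
    save(chunk, file = file.path(low_mem_dir, paste0("chunk", i, ".Rdata")))
}

# Later, a worker loads only the chunk it needs (hypothetical helper)
load_chunk <- function(i) {
    env <- new.env()
    load(file.path(low_mem_dir, paste0("chunk", i, ".Rdata")), envir = env)
    env$chunk
}
chunk_totals <- vapply(
    chunk_ids, function(i) as.numeric(sum(load_chunk(i))), numeric(1)
)
```

Only the vector of chunk identifiers stays in memory between steps, which is the trade-off described above: lower memory use per worker in exchange for extra disk input/output.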

Author

Leonardo Collado-Torres

Examples

## Split the data and transform appropriately before using calculateStats()
dataReady <- preprocessCoverage(genomeData,
    cutoff = 0, scalefac = 32,
    chunksize = 1e3, colsubset = NULL, verbose = TRUE
)
#> 2024-12-13 15:13:19.820465 preprocessCoverage: splitting the data
names(dataReady)
#> [1] "coverageProcessed" "mclapplyIndex"     "position"         
#> [4] "meanCoverage"      "groupMeans"       
dataReady
#> $coverageProcessed
#> DataFrame with 1434 rows and 31 columns
#>             ERR009101 ERR009102        ERR009105 ERR009107 ERR009108 ERR009112
#>                 <Rle>     <Rle>            <Rle>     <Rle>     <Rle>     <Rle>
#> 1                   5         5 5.04439411935845         5         5         5
#> 2                   5         5 5.04439411935845         5         5         5
#> 3                   5         5 5.04439411935845         5         5         5
#> 4                   5         5 5.04439411935845         5         5         5
#> 5                   5         5 5.04439411935845         5         5         5
#> ...               ...       ...              ...       ...       ...       ...
#> 1430 5.04439411935845         5                5         5         5         5
#> 1431 5.04439411935845         5                5         5         5         5
#> 1432 5.04439411935845         5                5         5         5         5
#> 1433 5.04439411935845         5                5         5         5         5
#> 1434                5         5                5         5         5         5
#>      ERR009115 ERR009116 ERR009131 ERR009138 ERR009144 ERR009145
#>          <Rle>     <Rle>     <Rle>     <Rle>     <Rle>     <Rle>
#> 1            5         5         5         5         5         5
#> 2            5         5         5         5         5         5
#> 3            5         5         5         5         5         5
#> 4            5         5         5         5         5         5
#> 5            5         5         5         5         5         5
#> ...        ...       ...       ...       ...       ...       ...
#> 1430         5         5         5         5         5         5
#> 1431         5         5         5         5         5         5
#> 1432         5         5         5         5         5         5
#> 1433         5         5         5         5         5         5
#> 1434         5         5         5         5         5         5
#>             ERR009148 ERR009151 ERR009152 ERR009153        ERR009159 ERR009161
#>                 <Rle>     <Rle>     <Rle>     <Rle>            <Rle>     <Rle>
#> 1                   5         5         5         5                5         5
#> 2                   5         5         5         5 5.04439411935845         5
#> 3                   5         5         5         5 5.04439411935845         5
#> 4                   5         5         5         5 5.04439411935845         5
#> 5                   5         5         5         5 5.04439411935845         5
#> ...               ...       ...       ...       ...              ...       ...
#> 1430 5.04439411935845         5         5         5                5         5
#> 1431 5.04439411935845         5         5         5                5         5
#> 1432 5.04439411935845         5         5         5                5         5
#> 1433 5.04439411935845         5         5         5                5         5
#> 1434                5         5         5         5                5         5
#>      ERR009163 ERR009164        ERR009167 SRR031812 SRR031835 SRR031867
#>          <Rle>     <Rle>            <Rle>     <Rle>     <Rle>     <Rle>
#> 1            5         5                5         5         5         5
#> 2            5         5                5         5         5         5
#> 3            5         5                5         5         5         5
#> 4            5         5                5         5         5         5
#> 5            5         5                5         5         5         5
#> ...        ...       ...              ...       ...       ...       ...
#> 1430         5         5 5.04439411935845         5         5         5
#> 1431         5         5 5.04439411935845         5         5         5
#> 1432         5         5 5.04439411935845         5         5         5
#> 1433         5         5 5.04439411935845         5         5         5
#> 1434         5         5 5.04439411935845         5         5         5
#>      SRR031868 SRR031900 SRR031904 SRR031914 SRR031936 SRR031958 SRR031960
#>          <Rle>     <Rle>     <Rle>     <Rle>     <Rle>     <Rle>     <Rle>
#> 1            5         5         5         5         5         5         5
#> 2            5         5         5         5         5         5         5
#> 3            5         5         5         5         5         5         5
#> 4            5         5         5         5         5         5         5
#> 5            5         5         5         5         5         5         5
#> ...        ...       ...       ...       ...       ...       ...       ...
#> 1430         5         5         5         5         5         5         5
#> 1431         5         5         5         5         5         5         5
#> 1432         5         5         5         5         5         5         5
#> 1433         5         5         5         5         5         5         5
#> 1434         5         5         5         5         5         5         5
#> 
#> $mclapplyIndex
#> $mclapplyIndex[[1]]
#> logical-Rle of length 1434 with 2 runs
#>   Lengths:  1000   434
#>   Values :  TRUE FALSE
#> 
#> $mclapplyIndex[[2]]
#> logical-Rle of length 1434 with 2 runs
#>   Lengths:  1000   434
#>   Values : FALSE  TRUE
#> 
#> 
#> $position
#> logical-Rle of length 48129895 with 67 runs
#>   Lengths: 47407536       32     1256       36 ...      358       34   711827
#>   Values :    FALSE     TRUE    FALSE     TRUE ...    FALSE     TRUE    FALSE
#> 
#> $meanCoverage
#> numeric-Rle of length 1434 with 330 runs
#>   Lengths:         1         5         3 ...         1         5         1
#>   Values : 0.0322581 0.0645161 0.0967742 ... 0.1290323 0.0967742 0.0322581
#> 
#> $groupMeans
#> list()
#>