Memory issues with BiocParallel

This repo is about a memory issue with BiocParallel in the computing cluster I have access to managed by a SGE system. The full post can be found at the Bioc-devel mailing list. Basically, it seems that after R changed to version 3.2 -- which matches other major changes to BiocParallel -- the memory used greatly increased when using SnowParam().

I first noticed this issue when attempting to reproduce the results of step1-fullCoverage.sh using R 3.2.x (using sh step1-fullCoverage.sh brainspan). Previously with R 3.1.x, the script used 173.5 GB of RAM and in R 3.2 and 3.2.x it reached 450 GB before getting shut down by the cluster system. The main package used in that script is basically identical between R versions 3.1.x and 3.2, and while SnowParam() did change it's also likely that the issue is related to another package(s). Note that the results and memory used are reproducible with R 3.1.x.

Here I made some scripts to try to show the problem with a smaller data set. The actual scripts are SnowParam-memory.R and SnowParam-memory.sh which are run using sh SnowParam-memory.sh. It runs the same code for R versions 3.1.x, 3.2 and 3.2.x.

Data

Below is the summary information for these jobs with memory shown in GB used, as reported by the cluster I'm using. The actual files are at mem_emails/. The log files are available at logs/.

For example, mem_emails/der-snow-3.1.x.txt has an entry with the email information for the job per replicate run. The email information includes among other things, the maximum memory as measured by the cluster under "Max vmem". Each entry starts with the job name, for example job 6416599. The job number is then useful to identify which are the corresponding log files. The log files will have names with extentions .e and .o for stderr and stdout respectively, and with the same digits as the job id. So for this example, the log files are logs/der-snow-3.1.x.e6416599 and logs/der-snow-3.1.x.o6416599.

df <- data.frame(
    memory = c(
        6.404, 5.170, 5.036,
        12.926, 12.899, 11.735,
        1.573, 1.573, 1.508,
        7.286, 7.285, 5.036,
        12.370, 12.303, 11.676,
        1.574, 1.574, 1.508,
        7.286, 7.285, 5.036,
        12.394, 12.452, 11.974,
        1.574, 1.574, 1.508,
        7.286, 7.285, 5.036,
        12.161, 12.422, 11.974,
        1.306, 1.306, 1.210
    ),
    R = factor(rep(c('3.2.x', '3.2', '3.1.x'), 3),
        levels = c('3.1.x', '3.2', '3.2.x')),
    param = rep(c('snow', 'multicore', 'serial'), each = 3),
    run = factor(rep(1:4, each = 9))
)
knitr::kable(df, format = 'html')

memory	R	param	run
6.404	3.2.x	snow	1
5.170	3.2	snow	1
5.036	3.1.x	snow	1
12.926	3.2.x	multicore	1
12.899	3.2	multicore	1
11.735	3.1.x	multicore	1
1.573	3.2.x	serial	1
1.573	3.2	serial	1
1.508	3.1.x	serial	1
7.286	3.2.x	snow	2
7.285	3.2	snow	2
5.036	3.1.x	snow	2
12.370	3.2.x	multicore	2
12.303	3.2	multicore	2
11.676	3.1.x	multicore	2
1.574	3.2.x	serial	2
1.574	3.2	serial	2
1.508	3.1.x	serial	2
7.286	3.2.x	snow	3
7.285	3.2	snow	3
5.036	3.1.x	snow	3
12.394	3.2.x	multicore	3
12.452	3.2	multicore	3
11.974	3.1.x	multicore	3
1.574	3.2.x	serial	3
1.574	3.2	serial	3
1.508	3.1.x	serial	3
7.286	3.2.x	snow	4
7.285	3.2	snow	4
5.036	3.1.x	snow	4
12.161	3.2.x	multicore	4
12.422	3.2	multicore	4
11.974	3.1.x	multicore	4
1.306	3.2.x	serial	4
1.306	3.2	serial	4
1.210	3.1.x	serial	4

The same information is shown below in two plots. The first one use loess to summarize the overall trend while the second one shows one line per each replicate run.

library('ggplot2')
ggplot(df, aes(y = memory, x = R, colour = param, shape = run, group = param)) + geom_point() + geom_smooth(method = loess, se = FALSE)

## One line per run
ggplot(df, aes(y = memory, x = R, colour = param, shape = run, group = paste0(df$param, df$run))) + geom_point() + geom_line()

Note that starting in R 3.2, the type of cluster is SOCK instead of PSOCK. I ignore if this could explain the observed difference.

`derfinder` example

SnowParam-memory-derfinder.R and SnowParam-memory-derfinder.sh compose a second example using objects and some of the code from the original use case. I wrote to see if it lead to the same memory fold changes I observe with my analysis script. Here are some of the results.

df2 <- data.frame(
    memory = c(
        13.091, 12.308, 7.175,
        13.856, 13.268, 8.475,
        1.027, 1.040, 927.723 / 1024,
        12.871, 10.904, 7.174,
        13.671, 13.789, 8.475,
        1.029, 1.042, 927.723 / 1024,
        12.871, 10.940, 7.174,
        13.675, 13.793, 8.465,
        1.028, 1.042, 931.301 / 1024,
        12.866, 11.072, 7.175,
        13.671, 13.793, 8.473,
        1.011, 1.030, 929.055 / 1024
    ),
    R = factor(rep(c('3.2.x', '3.2', '3.1.x'), 3),
        levels = c('3.1.x', '3.2', '3.2.x')),
    param = rep(c('snow', 'multicore', 'serial'), each = 3),
    run = factor(rep(1:4, each = 9))
)
knitr::kable(df2, format = 'html', digits = 3)

memory	R	param	run
13.091	3.2.x	snow	1
12.308	3.2	snow	1
7.175	3.1.x	snow	1
13.856	3.2.x	multicore	1
13.268	3.2	multicore	1
8.475	3.1.x	multicore	1
1.027	3.2.x	serial	1
1.040	3.2	serial	1
0.906	3.1.x	serial	1
12.871	3.2.x	snow	2
10.904	3.2	snow	2
7.174	3.1.x	snow	2
13.671	3.2.x	multicore	2
13.789	3.2	multicore	2
8.475	3.1.x	multicore	2
1.029	3.2.x	serial	2
1.042	3.2	serial	2
0.906	3.1.x	serial	2
12.871	3.2.x	snow	3
10.940	3.2	snow	3
7.174	3.1.x	snow	3
13.675	3.2.x	multicore	3
13.793	3.2	multicore	3
8.465	3.1.x	multicore	3
1.028	3.2.x	serial	3
1.042	3.2	serial	3
0.909	3.1.x	serial	3
12.866	3.2.x	snow	4
11.072	3.2	snow	4
7.175	3.1.x	snow	4
13.671	3.2.x	multicore	4
13.793	3.2	multicore	4
8.473	3.1.x	multicore	4
1.011	3.2.x	serial	4
1.030	3.2	serial	4
0.907	3.1.x	serial	4

The same information is shown below in two plots. The first one use loess to summarize the overall trend while the second one shows one line per each replicate run.

ggplot(df2, aes(y = memory, x = R, colour = param, shape = run, group = param)) + geom_point() + geom_smooth(method = loess, se = FALSE)

## One line per run
ggplot(df2, aes(y = memory, x = R, colour = param, shape = run, group = paste0(df2$param, df2$run))) + geom_point() + geom_line()

In this example, the memory mean fold change when using SnowParam() between R 3.2.x and 3.1.x is 1.8x, compared to 1.4x from the first example. In comparison, with SerialParam() the same mean fold changes are 1.13x and 1.05x respectively.

As shown in the compare view nothing changed in filterData() between R 3.1.x and 3.2.x.

Longer tests

In the branch longerTest, I made it so both examples had 10 times more data while still using 10 cores.

This longer test was motivated by the observed difference in memory use between MulticoreParam() and SnowParam(). I've been told before that the memory use should be similar if not lower for MulticoreParam() given that objects are shared between the parent and child processes. In order to inspect the jobs, the cluster admin required longer jobs because the other ones were too quick.

I also took advantage of this longer test to record more detailed information about the memory used at a given time by the jobs instead of just the maximum used memory recorded in the emails. I believe that this is the best I can do to answer Valerie's second point from this email.

## Process memory log info
#raw <- readLines('logs/testLog.txt')
raw <- readLines('logs/logMemory.txt')

clean_mem <- function(m) {
    if(m == "N/A") res <- NA
    if(grepl('M', m)) res <- as.numeric(gsub('M', '', m)) / 1024
    if(grepl('G', m)) res <- as.numeric(gsub('G', '', m))
    return(res)
}

clean_time <- function(time) {
    clean <- strsplit(time, ':')[[1]]
    sum(as.integer(rev(clean)) * 60^(seq_len(length(clean)) - 1))
}

loginfo <- lapply(strsplit(raw[!grepl('logMemory', raw)], '\t'), function(x) { 
    clean <- unlist(strsplit(x, ','))
    clean <- gsub(' ', '', clean)
    info <- sapply(strsplit(clean, '='), function(z) { z[length(z)]})
    der <- grepl('der-', info[7])
    if(der) info[7] <- gsub('der-', '', info[7])
    res <- data.frame(
        jobid = info[1],
        node = info[3],
        vmem = clean_mem(info[4]),
        maxvmem = clean_mem(info[5]),
        elapsed = clean_time(info[6]),
        param = strsplit(info[7], '-')[[1]][1],
        rversion = strsplit(info[7], '-')[[1]][2],
        example = ifelse(der, 'derfinder', 'generic')
    )
})
loginfo <- do.call(rbind, loginfo)
loginfo$percentMem <- loginfo$vmem / loginfo$maxvmem * 100

The following 3 plots show the memory log results from every 2 second intervals for the generic example. The vmem used (first plot) is always lower than or equal to but not always equal to the maximum vmem used (second plot), hence why I plotted the percent of maximum vmem used (third plot).

## Plots from memory logs for generic example
ggplot(subset(loginfo, example == 'generic'), aes(y = vmem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('vmem used (in GB)')

ggplot(subset(loginfo, example == 'generic'), aes(y = maxvmem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('Max vmem used (in GB)')

ggplot(subset(loginfo, example == 'generic'), aes(y = percentMem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('Percent of max vmem used')

The data is a little bit hard to see in the previous plots because the serial runs took much longer than the parallel ones. The next three plots show the same data up to minute 55.

In the vmem used plot (first one), we can see how the runs with MulticoreParam() have high yet short peaks at the beginning, and then reach similar levels to SnowParam(), and even lower than SnowParam() under R 3.1.x.

In the maximum vmem plot (second one), the difference between SnowParam() and MulticoreParam() looks huge, as it does when looking at the memory reported in the email. But that comes from the early peak in MulticoreParam(). Also, in this plot, we can see the large increase after R 3.1.x in memory use when using SnowParam(). Note that this increase is also observed on the first plot and it's not a peak.

The third plot shows that SnowParam() has the closest vmem usage to the maximum, regardless of the R version.

## Plots from memory logs for generic example, truncated at minute 55
ggplot(subset(loginfo, example == 'generic'), aes(y = vmem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('vmem used (in GB)') + xlim(c(0, 55))

ggplot(subset(loginfo, example == 'generic'), aes(y = maxvmem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('Max vmem used (in GB)') + xlim(c(0, 55))

ggplot(subset(loginfo, example == 'generic'), aes(y = percentMem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('Percent of max vmem used') + xlim(c(0, 55))

The next three plots show the results for the derfinder example. This example is much faster and didn't increase much in memory time after increasing the data by 10. However, the memory did increase, as expected.

The vmem used plot (first plot) shows that the memory used increases in a step function. The number of steps doesn't match the number of cores used. Results from R 3.1.x are faster than the other R versions for SnowParam() and MulticoreParam().

The maxvmem used (second plot) looks very similar to the first plot, if not identical. That is reflected on the third plot where nearly all the lines are at 100 percent except for SerialParam().

## Plots from memory logs for derfinder example
ggplot(subset(loginfo, example == 'derfinder'), aes(y = vmem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('vmem used (in GB)')

ggplot(subset(loginfo, example == 'derfinder'), aes(y = maxvmem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('Max vmem used (in GB)')

ggplot(subset(loginfo, example == 'derfinder'), aes(y = percentMem, x = elapsed / 60, colour = param, shape = rversion)) + geom_line(aes(linetype=rversion)) + xlab('Time elapsed (in min)') + ylab('Percent of max vmem used')

This page was last updated at 2015-07-16 22:49:19.