Brandon LeBeau
May 31, 2015

I have a simulation package that allows for the simulation of regression models, including nested data structures. You can see the package on GitHub here: simReg. Over the weekend I updated the package to allow for the simulation of unbalanced designs. I'm hoping to put together a new vignette soon highlighting the functionality.

I am working on a simulation that uses the unbalanced functionality, and while simulating longitudinal data I've found the function is much slower than the cross-sectional counterparts (and balanced designs). I've run some additional testing and I believe I have the speed issues narrowed down to the fact that I am generating a time variable. Essentially, I have a vector of the number of observations per cluster. The function then turns this vector of lengths into a time variable running from 0 up to each cluster's number of observations minus 1, in steps of 1. As an example:

x <- round(runif(5, min = 3, max = 10), 0)
unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1))
[1] 0 1 2 3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6
From the code above, you can see that the number of observations per cluster is generated using runif and saved to the object x. Then I use a combination of lapply, unlist, and the ':' operator to generate the sequence. This is the same code used in my package to generate the time variable.
As such, I was interested in testing various ways to generate the sequence and doing a performance comparison. I compared the following: the ':' operator, seq.int, seq, do.call with mapply, and rep.int for the balanced case as a reference for how it was done before. This was all done with the great microbenchmark package.
Here are the results from the 7 comparisons:
library(microbenchmark)
x <- round(runif(100, min = 3, max = 15), 0)
microbenchmark(
  colon = unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1)),
  seq.int = unlist(lapply(1:length(x), function(xx) seq.int(0, x[xx] - 1, 1))),
  seq = unlist(lapply(1:length(x), function(xx) seq(0, x[xx] - 1, 1))),
  seq.int_mapply = do.call(c, mapply(seq.int, 0, x - 1)),
  seq_mapply = do.call(c, mapply(seq, 0, x - 1)),
  colon_mapply = do.call(c, mapply(':', 0, x - 1)),
  rep.int = rep.int(1:8 - 1, times = 100), # balanced case for reference
  times = 1000L
)
Warning in microbenchmark(colon = unlist(lapply(1:length(x), function(xx) (1:x[xx]) - : less
accurate nanosecond times to avoid potential integer overflows
Unit: nanoseconds
expr min lq mean median uq max neval
colon 34768 35793.0 38108.680 36285.0 36941.0 918851 1000
seq.int 44813 46207.0 50598.592 46986.0 47867.5 826888 1000
seq 557190 568895.5 593151.428 575927.0 584475.5 1492031 1000
seq.int_mapply 42271 43952.0 47280.872 44854.0 46002.0 834186 1000
seq_mapply 196226 200592.5 221808.032 203544.5 207009.0 3461835 1000
colon_mapply 36859 38335.0 41831.480 39196.0 40098.0 957350 1000
rep.int 984 1107.0 1247.056 1189.0 1271.0 10209 1000
The results (in nanoseconds) show that this is where the significant slowdown in my package comes from when implementing the unbalanced cases, although the ':' operator is the second-best alternative. Those who have not seen the significant speed bump of seq.int and rep.int over the seq and rep alternatives should also pay close attention (compare the seq.int and seq rows above).
I’d be interested in alternative procedures that I am not aware of as well. Although not a big deal when running the package once, doing it 50,000 times does add up.
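One alternative worth noting is base R's sequence(), which concatenates seq_len() over a vector of lengths, so subtracting 1 yields the same time variable in a single vectorized call. A quick sketch (not part of the benchmark above):

```r
# sequence(x) returns c(seq_len(x[1]), seq_len(x[2]), ...),
# so subtracting 1 starts each cluster's time variable at 0.
x <- round(runif(100, min = 3, max = 15), 0)
time_var <- sequence(x) - 1

# same values as the lapply/':' approach used in the package
all.equal(time_var,
          unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1)))
```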
Lastly, for those that are interested, we can check whether the methods produce the same values (except for the rep.int case). Note that identical() compares only its first two arguments (the rest are matched to options such as num.eq), and it distinguishes integer from double storage, which is why the call below returns FALSE even though the values agree.
identical(
unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1)),
unlist(lapply(1:length(x), function(xx) seq.int(0, x[xx] - 1, 1))),
unlist(lapply(1:length(x), function(xx) seq(0, x[xx] - 1, 1))),
do.call(c, mapply(seq.int, 0, x - 1)),
do.call(c, mapply(seq, 0, x - 1)),
do.call(c, mapply(':', 0, x - 1))
)
[1] FALSE
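Since identical() really only compares the first pair, a pairwise check against one reference method is more informative; wrapping each comparison in all.equal() ignores the integer-versus-double storage differences. A sketch:

```r
x <- round(runif(100, min = 3, max = 15), 0)

# each entry builds the time variable with one of the methods compared above
methods <- list(
  colon   = unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1)),
  seq.int = unlist(lapply(1:length(x), function(xx) seq.int(0, x[xx] - 1, 1))),
  seq     = unlist(lapply(1:length(x), function(xx) seq(0, x[xx] - 1, 1))),
  seq.int_mapply = do.call(c, mapply(seq.int, 0, x - 1)),
  seq_mapply     = do.call(c, mapply(seq, 0, x - 1)),
  colon_mapply   = do.call(c, mapply(':', 0, x - 1))
)

# compare every method's values against the first, ignoring storage mode
sapply(methods[-1], function(m) isTRUE(all.equal(methods$colon, m)))
# all TRUE
```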