I have a simulation package that allows for the simulation of regression models including nested data structures. You can see the package on github here: simReg. Over the weekend I updated the package to allow for the simulation of unbalanced designs. I’m hoping to put together a new vigenette soon highlighting the functionality.

I am working on a simulation that uses the unbalanced functionality and while simulating longitudinal data I’ve found the function is much slower than the cross sectional counterparts (and balanced designs). I’ve ran some additional testing and I believe I have the speed issues narrowed down to the fact that I am generating a time variable. Essentially, I have a vector of number of observations per cluster. The function then turns this vector of lengths into a time variable starting at 0 up to the maximum number of observations minus 1 by 1. As an example:

```
x <- round(runif(5, min = 3, max = 10), 0)
unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1))
```

`## [1] 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 6`

From the code above, you can see that there the number of observations is generated using `runif`

which is saved to the object `x`

. Then I use a combination of lapply, unlist, and the ‘:’ operator to generate the sequence. This is the same code used in my package above to generate the time variable.

As such, I was interested in testing various ways to generate the sequence and do a performance comparison. I compared the following ways, the `':'`

operator, `seq.int`

, `seq`

, `do.call`

with `mapply`

, and `rep.int`

for the balanced case as a comparison to how it was done before. This was all done with the great `microbenchmark`

package.

Here are the results from the 7 comparisons:

```
library(microbenchmark)
x <- round(runif(100, min = 3, max = 15), 0)
microbenchmark(
colon = unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1)),
seq.int = unlist(lapply(1:length(x), function(xx) seq.int(0, x[xx] - 1, 1))),
seq = unlist(lapply(1:length(x), function(xx) seq(0, x[xx] - 1, 1))),
seq.int_mapply = do.call(c, mapply(seq.int, 0, x - 1)),
seq_mapply = do.call(c, mapply(seq, 0, x - 1)),
colon_mapply = do.call(c, mapply(':', 0, x - 1)),
rep.int = rep.int(1:8 - 1, times = 100), # balanced case for reference.
times = 1000L
)
```

```
## Unit: microseconds
## expr min lq mean median uq max neval cld
## colon 96.890 119.8810 175.096814 132.608 209.3800 3007.675 1000 b
## seq.int 123.986 145.9505 209.669046 161.346 246.5345 2408.685 1000 b
## seq 1787.116 2045.5555 2914.030581 2357.367 3505.6695 18720.988 1000 d
## seq.int_mapply 135.482 163.8090 240.447820 180.642 271.9885 4331.693 1000 b
## seq_mapply 723.386 828.4860 1297.687641 908.338 1515.1280 52677.410 1000 c
## colon_mapply 119.059 144.5140 207.652022 162.783 244.4820 1878.668 1000 b
## rep.int 2.053 3.6950 5.845916 4.927 6.9800 210.201 1000 a
```

The results (in microseconds) show that this is where the significant slowdown is coming in my package implementing the unbalanced cases, although it appears that the ‘:’ operator is the second best alternative. For those that have not seen the significant speed bump of the `seq.int`

and `rep.int`

over the `seq`

and `rep`

alternatives should also pay close attention (compare lines 2 and 3 above).

I’d be interested in alternative procedures that I am not aware of as well. Although not a big deal when running the package once, doing it 50,000 times does add up.

Lastly, for those that are interested, we can show they are all equivalent methods (except for the `rep.int`

case).

```
identical(
unlist(lapply(1:length(x), function(xx) (1:x[xx]) - 1)),
unlist(lapply(1:length(x), function(xx) seq.int(0, x[xx] - 1, 1))),
unlist(lapply(1:length(x), function(xx) seq(0, x[xx] - 1, 1))),
do.call(c, mapply(seq.int, 0, x - 1)),
do.call(c, mapply(seq, 0, x - 1)),
do.call(c, mapply(':', 0, x - 1))
)
```

`## [1] TRUE`

## Share this post

Twitter

Google+

Facebook

Reddit

LinkedIn

StumbleUpon

Email