Announcing `simglm` version 0.6.0: Power and Simulation of Generalized Linear Mixed Models

The simglm package has an update on CRAN bumping the version up to 0.6.0. This update has added the ability to simulate count data (poisson) and also has fixed (I think) the Shiny app that comes with the package. As I have not posted about this package since the first CRAN release (v 0.5.0), I plan to give an overview of all that the package offers in addition to the new additions.

Installation

The package can be installed directly from CRAN:

install.packages("simglm")

simglm Features

The package allows for flexible simulation of general(-ized) linear (mixed) models with up to three levels of nesting. The flexibility comes from the ability to easily vary data generation procedures. For example one can use any R data generating function to simulate error, random effect, or covariate distributions. Other data generating options can include:

  • heterogeneity of variance
  • missing data
  • ARIMA models for within error structure
  • flexible time specification for longitudinal models
  • unbalanced data generation for nested designs
  • correlated covariates
  • factor/categorical variable generation

The primary functions for data generation within the package are sim_reg and sim_glm. The main distinction being that sim_reg is used for continuous outcomes whereas sim_glm is used for non-continuous outcomes. Currently sim_glm supports dichotomous outcomes (ie., 0/1; logistic) and count outcomes (poisson distributed).

Below are two examples using the sim_reg and sim_glm to generate a simple two level cross sectional model.

fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(3.2, 0.4, 0.02, 0.6)
random <- ~ 1 
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
                  var_type = 'level1')
n <- 50
p <- 5
error_var <- 5
with_err_gen <- 'rnorm'
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2), 
                  var_type = c('level1', 'level1'))

sim_reg(fixed = fixed, fixed_param = fixed_param, random = random,
        random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
        error_var = error_var, with_err_gen = with_err_gen, 
        data_str = data_str, fact_vars = fact_vars)
## # A tibble: 250 x 14
##    X.Intercept.  income act_o factor.grad_deg… income1 act_o1 grad_degree_f
##           <dbl>   <dbl> <dbl>            <dbl>   <dbl>  <dbl>         <dbl>
##  1            1 -0.583      4                1 -0.583       4             2
##  2            1  0.536      6                1  0.536       6             2
##  3            1 -2.12      36                1 -2.12       36             2
##  4            1  0.572     25                0  0.572      25             1
##  5            1  0.986     14                0  0.986      14             1
##  6            1 -0.454     22                1 -0.454      22             2
##  7            1 -0.239     15                0 -0.239      15             1
##  8            1 -0.0458    22                1 -0.0458     22             2
##  9            1  0.334      6                1  0.334       6             2
## 10            1  1.91       6                0  1.91        6             1
## # … with 240 more rows, and 7 more variables: b0 <dbl>, Fbeta <dbl>,
## #   randEff <dbl>, err <dbl>, sim_data <dbl>, withinID <int>,
## #   clustID <int>

Similar values are used for the sim_glm function, however the within error is not specified.

fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(0.5, 0.4, 0.02, 0.6)
random <- ~ 1 
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
                  var_type = 'level1')
n <- 50
p <- 5
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2), 
                  var_type = c('level1', 'level1'))

sim_glm(fixed = fixed, fixed_param = fixed_param, random = random,
        random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
        data_str = data_str, fact_vars = fact_vars, outcome_type = 'logistic')
## # A tibble: 250 x 15
##    X.Intercept. income act_o factor.grad_deg… income1 act_o1 grad_degree_f
##           <dbl>  <dbl> <dbl>            <dbl>   <dbl>  <dbl>         <dbl>
##  1            1 -0.169     3                1  -0.169      3             2
##  2            1 -1.66     17                1  -1.66      17             2
##  3            1  0.502    28                0   0.502     28             1
##  4            1  0.236     3                0   0.236      3             1
##  5            1 -0.393     6                1  -0.393      6             2
##  6            1  1.56     22                0   1.56      22             1
##  7            1 -1.03     13                1  -1.03      13             2
##  8            1 -1.64     36                1  -1.64      36             2
##  9            1  0.642    36                1   0.642     36             2
## 10            1  0.373    29                1   0.373     29             2
## # … with 240 more rows, and 8 more variables: b0 <dbl>, Fbeta <dbl>,
## #   randEff <dbl>, logistic <dbl>, prob <dbl>, sim_data <dbl>,
## #   withinID <int>, clustID <int>

You could easily simulate a count outcome by changing outcome_type = 'logistic' in the last example to outcome_type = 'poisson'.

Power

The package also includes wrappers around the simulation functions to explore empirical power under these frameworks. For example for a simple regression:

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 2),
                              list(mean = 0, sd = 2),
                              list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'

pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 50

power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
                     n = n, error_var = error_var, with_err_gen = with_err_gen, 
                     data_str = "single", pow_param = pow_param, alpha = alpha,
                     pow_dist = pow_dist, pow_tail = pow_tail, 
                     replicates = replicates, raw_power = FALSE)
power_out
## # A tibble: 4 x 6
##   var         avg_test_stat sd_test_stat power num_reject num_repl
##   <fct>               <dbl>        <dbl> <dbl>      <dbl>    <dbl>
## 1 (Intercept)          1.42        0.901  0.1           5       50
## 2 act                  5.84        1.12   1            50       50
## 3 diff                 3.12        1.01   0.7          35       50
## 4 numCourse            2.47        0.961  0.42         21       50

Where now the simulation function sim_reg replicates 50 times and the estimates are used to generate empirical power for each parameter in the model. This idea becomes more powerful when a Monte Carlo like varying of parameters are used. This can ensure a much larger, explorative, and descriptive power analysis. These analyses if done well with the coverage of the parameters varied, may be more appropriate/realistic based on the data to be collected (or more similar to the data that was collected for post-hoc power analyses). Below is a simple example of varying parameters in the power analysis.

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 2),
                              list(mean = 0, sd = 2),
                              list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"

pow_tail <- 2
replicates <- 50
terms_vary <- list(n = c(20, 40, 60, 80, 100), 
                   error_var = c(5, 10, 20))

power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
                     n = n, error_var = error_var, with_err_gen = with_err_gen, 
                     data_str = "single", pow_param = pow_param, alpha = alpha,
                     pow_dist = pow_dist, pow_tail = pow_tail, 
                     replicates = replicates, terms_vary = terms_vary, 
                     raw_power = FALSE)

power_out
## # A tibble: 60 x 8
## # Groups:   var, n [20]
##    var       n error_var avg_test_stat sd_test_stat power num_reject
##    <fct> <dbl>     <dbl>         <dbl>        <dbl> <dbl>      <dbl>
##  1 (Int…    20         5         1.12         0.809  0.02          1
##  2 (Int…    20        10         1.18         0.952  0.04          2
##  3 (Int…    20        20         0.787        0.600  0             0
##  4 (Int…    40         5         1.40         1.02   0.12          6
##  5 (Int…    40        10         1.31         0.692  0.06          3
##  6 (Int…    40        20         1.06         0.691  0.04          2
##  7 (Int…    60         5         1.61         0.984  0.14          7
##  8 (Int…    60        10         1.15         0.704  0.02          1
##  9 (Int…    60        20         1.10         0.753  0.02          1
## 10 (Int…    80         5         1.84         1.01   0.2          10
## # … with 50 more rows, and 1 more variable: num_repl <dbl>

Similar syntax can be used for nested designs as well as with generalized models. These models and data types take more time to run compared to the single level designs above. Therefore, it is recommended to perform small simulations initially to ensure no errors and estimate how much time the simulation may take.

Shiny App

The package also comes with a Shiny app that contains most (but not all) of the features found in the full package. These will likely be implemented slowly over time. You can run the Shiny app locally with the following function:

simglm::run_shiny()

One note, there are errors that occur in the stack trace, however as far as I can tell in my testing the app runs okay.

JSM

Of note to anyone attending JSM this year, I am talking about the simglm package https://ww2.amstat.org/meetings/jsm/2017/onlineprogram/ActivityDetails.cfm?SessionID=214411. Slides of this talk will be posted likely after the talk on my website.

Bugs and feature requests are welcomed and can be submitted directly to GitHub https://github.com/lebebr01/simglm/issues.