Brandon LeBeau

7 minute read

The simglm package has an update on CRAN bumping the version up to 0.6.0. This update has added the ability to simulate count data (poisson) and also has fixed (I think) the Shiny app that comes with the package. As I have not posted about this package since the first CRAN release (v 0.5.0), I plan to give an overview of all that the package offers in addition to the new additions.

Installation

The package can be installed directly from CRAN:

install.packages("simglm")

simglm Features

The package allows for flexible simulation of general(-ized) linear (mixed) models with up to three levels of nesting. The flexibility comes from the ability to easily vary data generation procedures. For example one can use any R data generating function to simulate error, random effect, or covariate distributions. Other data generating options can include:

  • heterogeneity of variance
  • missing data
  • ARIMA models for within error structure
  • flexible time specification for longitudinal models
  • unbalanced data generation for nested designs
  • correlated covariates
  • factor/categorical variable generation

The primary functions for data generation within the package are sim_reg and sim_glm. The main distinction being that sim_reg is used for continuous outcomes whereas sim_glm is used for non-continuous outcomes. Currently sim_glm supports dichotomous outcomes (ie., 0/1; logistic) and count outcomes (poisson distributed).

Below are two examples using the sim_reg and sim_glm to generate a simple two level cross sectional model.

fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(3.2, 0.4, 0.02, 0.6)
random <- ~ 1 
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
                  var_type = 'level1')
n <- 50
p <- 5
error_var <- 5
with_err_gen <- 'rnorm'
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2), 
                  var_type = c('level1', 'level1'))

sim_reg(fixed = fixed, fixed_param = fixed_param, random = random,
        random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
        error_var = error_var, with_err_gen = with_err_gen, 
        data_str = data_str, fact_vars = fact_vars)
## # A tibble: 250 x 14
##    X.Intercept.       income act_o factor.grad_degree_f.2      income1
##           <dbl>        <dbl> <dbl>                  <dbl>        <dbl>
##  1            1 -0.772872289    28                      1 -0.772872289
##  2            1  0.416096972     7                      0  0.416096972
##  3            1 -0.022392623     8                      1 -0.022392623
##  4            1 -0.141803138     1                      0 -0.141803138
##  5            1  0.010888225    21                      0  0.010888225
##  6            1 -0.007103202    17                      0 -0.007103202
##  7            1 -0.612343775    22                      0 -0.612343775
##  8            1  0.203837800    30                      0  0.203837800
##  9            1  1.949610181    21                      1  1.949610181
## 10            1  1.003149409    31                      1  1.003149409
## # ... with 240 more rows, and 9 more variables: act_o1 <dbl>,
## #   grad_degree_f <dbl>, b0 <dbl>, Fbeta <dbl>, randEff <dbl>, err <dbl>,
## #   sim_data <dbl>, withinID <int>, clustID <int>

Similar values are used for the sim_glm function, however the within error is not specified.

fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(0.5, 0.4, 0.02, 0.6)
random <- ~ 1 
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
                  var_type = 'level1')
n <- 50
p <- 5
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2), 
                  var_type = c('level1', 'level1'))

sim_glm(fixed = fixed, fixed_param = fixed_param, random = random,
        random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
        data_str = data_str, fact_vars = fact_vars, outcome_type = 'logistic')
## # A tibble: 250 x 15
##    X.Intercept.      income act_o factor.grad_degree_f.2     income1
##           <dbl>       <dbl> <dbl>                  <dbl>       <dbl>
##  1            1  0.05600817     6                      1  0.05600817
##  2            1  0.56322606    24                      1  0.56322606
##  3            1 -0.11876126     1                      0 -0.11876126
##  4            1  0.79058375    11                      1  0.79058375
##  5            1  2.45043806    21                      0  2.45043806
##  6            1  1.77433262    10                      1  1.77433262
##  7            1 -1.56422506     3                      1 -1.56422506
##  8            1 -1.70684183    11                      1 -1.70684183
##  9            1 -0.79035323    31                      0 -0.79035323
## 10            1  0.03784287     7                      0  0.03784287
## # ... with 240 more rows, and 10 more variables: act_o1 <dbl>,
## #   grad_degree_f <dbl>, b0 <dbl>, Fbeta <dbl>, randEff <dbl>,
## #   logistic <dbl>, prob <dbl>, sim_data <dbl>, withinID <int>,
## #   clustID <int>

You could easily simulate a count outcome by changing outcome_type = 'logistic' in the last example to outcome_type = 'poisson'.

Power

The package also includes wrappers around the simulation functions to explore empirical power under these frameworks. For example for a simple regression:

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 2),
                              list(mean = 0, sd = 2),
                              list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'

pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 50

power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
                     n = n, error_var = error_var, with_err_gen = with_err_gen, 
                     data_str = "single", pow_param = pow_param, alpha = alpha,
                     pow_dist = pow_dist, pow_tail = pow_tail, 
                     replicates = replicates, raw_power = FALSE)
power_out
## # A tibble: 4 x 6
##           var avg_test_stat sd_test_stat power num_reject num_repl
##        <fctr>         <dbl>        <dbl> <dbl>      <dbl>    <dbl>
## 1 (Intercept)      1.650382    0.9693836  0.14          7       50
## 2         act      6.006373    1.0708427  1.00         50       50
## 3        diff      3.416535    0.9326546  0.82         41       50
## 4   numCourse      2.370347    0.9836364  0.38         19       50

Where now the simulation function sim_reg replicates 50 times and the estimates are used to generate empirical power for each parameter in the model. This idea becomes more powerful when a Monte Carlo like varying of parameters are used. This can ensure a much larger, explorative, and descriptive power analysis. These analyses if done well with the coverage of the parameters varied, may be more appropriate/realistic based on the data to be collected (or more similar to the data that was collected for post-hoc power analyses). Below is a simple example of varying parameters in the power analysis.

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 2),
                              list(mean = 0, sd = 2),
                              list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"

pow_tail <- 2
replicates <- 50
terms_vary <- list(n = c(20, 40, 60, 80, 100), 
                   error_var = c(5, 10, 20))

power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
                     n = n, error_var = error_var, with_err_gen = with_err_gen, 
                     data_str = "single", pow_param = pow_param, alpha = alpha,
                     pow_dist = pow_dist, pow_tail = pow_tail, 
                     replicates = replicates, terms_vary = terms_vary, 
                     raw_power = FALSE)

power_out
## # A tibble: 60 x 8
## # Groups:   var, n [?]
##            var     n error_var avg_test_stat sd_test_stat power num_reject
##         <fctr> <dbl>     <dbl>         <dbl>        <dbl> <dbl>      <dbl>
##  1 (Intercept)    20         5     1.1585985    1.1659626  0.10          5
##  2 (Intercept)    20        10     0.9295801    0.8381110  0.04          2
##  3 (Intercept)    20        20     0.7667661    0.5516210  0.00          0
##  4 (Intercept)    40         5     1.8054688    1.0987028  0.24         12
##  5 (Intercept)    40        10     1.0318538    0.7035785  0.02          1
##  6 (Intercept)    40        20     0.9198477    0.7038914  0.02          1
##  7 (Intercept)    60         5     2.0447040    0.9094806  0.30         15
##  8 (Intercept)    60        10     1.1719409    0.7894543  0.08          4
##  9 (Intercept)    60        20     0.8776733    0.7000668  0.02          1
## 10 (Intercept)    80         5     2.0065774    1.0137946  0.24         12
## # ... with 50 more rows, and 1 more variables: num_repl <dbl>

Similar syntax can be used for nested designs as well as with generalized models. These models and data types take more time to run compared to the single level designs above. Therefore, it is recommended to perform small simulations initially to ensure no errors and estimate how much time the simulation may take.

Shiny App

The package also comes with a Shiny app that contains most (but not all) of the features found in the full package. These will likely be implemented slowly over time. You can run the Shiny app locally with the following function:

simglm::run_shiny()

One note, there are errors that occur in the stack trace, however as far as I can tell in my testing the app runs okay.

JSM

Of note to anyone attending JSM this year, I am talking about the simglm package https://ww2.amstat.org/meetings/jsm/2017/onlineprogram/ActivityDetails.cfm?SessionID=214411. Slides of this talk will be posted likely after the talk on my website.

Bugs and feature requests are welcomed and can be submitted directly to GitHub https://github.com/lebebr01/simglm/issues.

comments powered by Disqus