The simglm
package has an update on CRAN bumping the version up to 0.6.0. This update has added the ability to simulate count data (poisson) and also has fixed (I think) the Shiny app that comes with the package. As I have not posted about this package since the first CRAN release (v 0.5.0), I plan to give an overview of all that the package offers in addition to the new additions.
Installation
The package can be installed directly from CRAN:
install.packages("simglm")
simglm
Features
The package allows for flexible simulation of general(-ized) linear (mixed) models with up to three levels of nesting. The flexibility comes from the ability to easily vary data generation procedures. For example one can use any R data generating function to simulate error, random effect, or covariate distributions. Other data generating options can include:
- heterogeneity of variance
- missing data
- ARIMA models for within error structure
- flexible time specification for longitudinal models
- unbalanced data generation for nested designs
- correlated covariates
- factor/categorical variable generation
The primary functions for data generation within the package are sim_reg
and sim_glm
. The main distinction being that sim_reg
is used for continuous outcomes whereas sim_glm
is used for non-continuous outcomes. Currently sim_glm
supports dichotomous outcomes (ie., 0/1; logistic) and count outcomes (poisson distributed).
Below are two examples using the sim_reg
and sim_glm
to generate a simple two level cross sectional model.
fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(3.2, 0.4, 0.02, 0.6)
random <- ~ 1
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
var_type = 'level1')
n <- 50
p <- 5
error_var <- 5
with_err_gen <- 'rnorm'
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2),
var_type = c('level1', 'level1'))
sim_reg(fixed = fixed, fixed_param = fixed_param, random = random,
random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
error_var = error_var, with_err_gen = with_err_gen,
data_str = data_str, fact_vars = fact_vars)
## # A tibble: 250 x 14
## X.Intercept. income act_o factor.grad_deg… income1 act_o1 grad_degree_f
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.583 4 1 -0.583 4 2
## 2 1 0.536 6 1 0.536 6 2
## 3 1 -2.12 36 1 -2.12 36 2
## 4 1 0.572 25 0 0.572 25 1
## 5 1 0.986 14 0 0.986 14 1
## 6 1 -0.454 22 1 -0.454 22 2
## 7 1 -0.239 15 0 -0.239 15 1
## 8 1 -0.0458 22 1 -0.0458 22 2
## 9 1 0.334 6 1 0.334 6 2
## 10 1 1.91 6 0 1.91 6 1
## # … with 240 more rows, and 7 more variables: b0 <dbl>, Fbeta <dbl>,
## # randEff <dbl>, err <dbl>, sim_data <dbl>, withinID <int>,
## # clustID <int>
Similar values are used for the sim_glm
function, however the within error is not specified.
fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(0.5, 0.4, 0.02, 0.6)
random <- ~ 1
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
var_type = 'level1')
n <- 50
p <- 5
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2),
var_type = c('level1', 'level1'))
sim_glm(fixed = fixed, fixed_param = fixed_param, random = random,
random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
data_str = data_str, fact_vars = fact_vars, outcome_type = 'logistic')
## # A tibble: 250 x 15
## X.Intercept. income act_o factor.grad_deg… income1 act_o1 grad_degree_f
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.169 3 1 -0.169 3 2
## 2 1 -1.66 17 1 -1.66 17 2
## 3 1 0.502 28 0 0.502 28 1
## 4 1 0.236 3 0 0.236 3 1
## 5 1 -0.393 6 1 -0.393 6 2
## 6 1 1.56 22 0 1.56 22 1
## 7 1 -1.03 13 1 -1.03 13 2
## 8 1 -1.64 36 1 -1.64 36 2
## 9 1 0.642 36 1 0.642 36 2
## 10 1 0.373 29 1 0.373 29 2
## # … with 240 more rows, and 8 more variables: b0 <dbl>, Fbeta <dbl>,
## # randEff <dbl>, logistic <dbl>, prob <dbl>, sim_data <dbl>,
## # withinID <int>, clustID <int>
You could easily simulate a count outcome by changing outcome_type = 'logistic'
in the last example to outcome_type = 'poisson'
.
Power
The package also includes wrappers around the simulation functions to explore empirical power under these frameworks. For example for a simple regression:
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 50
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
n = n, error_var = error_var, with_err_gen = with_err_gen,
data_str = "single", pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, raw_power = FALSE)
power_out
## # A tibble: 4 x 6
## var avg_test_stat sd_test_stat power num_reject num_repl
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.42 0.901 0.1 5 50
## 2 act 5.84 1.12 1 50 50
## 3 diff 3.12 1.01 0.7 35 50
## 4 numCourse 2.47 0.961 0.42 21 50
Where now the simulation function sim_reg
replicates 50 times and the estimates are used to generate empirical power for each parameter in the model. This idea becomes more powerful when a Monte Carlo like varying of parameters are used. This can ensure a much larger, explorative, and descriptive power analysis. These analyses if done well with the coverage of the parameters varied, may be more appropriate/realistic based on the data to be collected (or more similar to the data that was collected for post-hoc power analyses). Below is a simple example of varying parameters in the power analysis.
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 50
terms_vary <- list(n = c(20, 40, 60, 80, 100),
error_var = c(5, 10, 20))
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
n = n, error_var = error_var, with_err_gen = with_err_gen,
data_str = "single", pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, terms_vary = terms_vary,
raw_power = FALSE)
power_out
## # A tibble: 60 x 8
## # Groups: var, n [20]
## var n error_var avg_test_stat sd_test_stat power num_reject
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 20 5 1.12 0.809 0.02 1
## 2 (Int… 20 10 1.18 0.952 0.04 2
## 3 (Int… 20 20 0.787 0.600 0 0
## 4 (Int… 40 5 1.40 1.02 0.12 6
## 5 (Int… 40 10 1.31 0.692 0.06 3
## 6 (Int… 40 20 1.06 0.691 0.04 2
## 7 (Int… 60 5 1.61 0.984 0.14 7
## 8 (Int… 60 10 1.15 0.704 0.02 1
## 9 (Int… 60 20 1.10 0.753 0.02 1
## 10 (Int… 80 5 1.84 1.01 0.2 10
## # … with 50 more rows, and 1 more variable: num_repl <dbl>
Similar syntax can be used for nested designs as well as with generalized models. These models and data types take more time to run compared to the single level designs above. Therefore, it is recommended to perform small simulations initially to ensure no errors and estimate how much time the simulation may take.
Shiny App
The package also comes with a Shiny app that contains most (but not all) of the features found in the full package. These will likely be implemented slowly over time. You can run the Shiny app locally with the following function:
simglm::run_shiny()
One note, there are errors that occur in the stack trace, however as far as I can tell in my testing the app runs okay.
JSM
Of note to anyone attending JSM this year, I am talking about the simglm
package https://ww2.amstat.org/meetings/jsm/2017/onlineprogram/ActivityDetails.cfm?SessionID=214411. Slides of this talk will be posted likely after the talk on my website.
Bugs and feature requests are welcomed and can be submitted directly to GitHub https://github.com/lebebr01/simglm/issues.