The `simglm`

package has an update on CRAN bumping the version up to 0.6.0. This update has added the ability to simulate count data (poisson) and also has fixed (I think) the Shiny app that comes with the package. As I have not posted about this package since the first CRAN release (v 0.5.0), I plan to give an overview of all that the package offers in addition to the new additions.

## Installation

The package can be installed directly from CRAN:

`install.packages("simglm")`

`simglm`

Features

The package allows for flexible simulation of general(-ized) linear (mixed) models with up to three levels of nesting. The flexibility comes from the ability to easily vary data generation procedures. For example one can use any R data generating function to simulate error, random effect, or covariate distributions. Other data generating options can include:

- heterogeneity of variance
- missing data
- ARIMA models for within error structure
- flexible time specification for longitudinal models
- unbalanced data generation for nested designs
- correlated covariates
- factor/categorical variable generation

The primary functions for data generation within the package are `sim_reg`

and `sim_glm`

. The main distinction being that `sim_reg`

is used for continuous outcomes whereas `sim_glm`

is used for non-continuous outcomes. Currently `sim_glm`

supports dichotomous outcomes (ie., 0/1; logistic) and count outcomes (poisson distributed).

Below are two examples using the `sim_reg`

and `sim_glm`

to generate a simple two level cross sectional model.

```
fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(3.2, 0.4, 0.02, 0.6)
random <- ~ 1
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
var_type = 'level1')
n <- 50
p <- 5
error_var <- 5
with_err_gen <- 'rnorm'
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2),
var_type = c('level1', 'level1'))
sim_reg(fixed = fixed, fixed_param = fixed_param, random = random,
random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
error_var = error_var, with_err_gen = with_err_gen,
data_str = data_str, fact_vars = fact_vars)
```

```
## # A tibble: 250 x 14
## X.Intercept. income act_o factor.grad_degree_f.2 income1
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.772872289 28 1 -0.772872289
## 2 1 0.416096972 7 0 0.416096972
## 3 1 -0.022392623 8 1 -0.022392623
## 4 1 -0.141803138 1 0 -0.141803138
## 5 1 0.010888225 21 0 0.010888225
## 6 1 -0.007103202 17 0 -0.007103202
## 7 1 -0.612343775 22 0 -0.612343775
## 8 1 0.203837800 30 0 0.203837800
## 9 1 1.949610181 21 1 1.949610181
## 10 1 1.003149409 31 1 1.003149409
## # ... with 240 more rows, and 9 more variables: act_o1 <dbl>,
## # grad_degree_f <dbl>, b0 <dbl>, Fbeta <dbl>, randEff <dbl>, err <dbl>,
## # sim_data <dbl>, withinID <int>, clustID <int>
```

Similar values are used for the `sim_glm`

function, however the within error is not specified.

```
fixed <- ~ 1 + act_o + income + grad_degree_f
fixed_param <- c(0.5, 0.4, 0.02, 0.6)
random <- ~ 1
random_param <- list(random_var = 10, rand_gen = 'rnorm')
cov_param <- list(dist_fun = 'rnorm',
var_type = 'level1')
n <- 50
p <- 5
data_str <- 'cross'
fact_vars <- list(numlevels = c(36, 2),
var_type = c('level1', 'level1'))
sim_glm(fixed = fixed, fixed_param = fixed_param, random = random,
random_param = random_param, cov_param = cov_param, k = NULL, n = n, p = p,
data_str = data_str, fact_vars = fact_vars, outcome_type = 'logistic')
```

```
## # A tibble: 250 x 15
## X.Intercept. income act_o factor.grad_degree_f.2 income1
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.05600817 6 1 0.05600817
## 2 1 0.56322606 24 1 0.56322606
## 3 1 -0.11876126 1 0 -0.11876126
## 4 1 0.79058375 11 1 0.79058375
## 5 1 2.45043806 21 0 2.45043806
## 6 1 1.77433262 10 1 1.77433262
## 7 1 -1.56422506 3 1 -1.56422506
## 8 1 -1.70684183 11 1 -1.70684183
## 9 1 -0.79035323 31 0 -0.79035323
## 10 1 0.03784287 7 0 0.03784287
## # ... with 240 more rows, and 10 more variables: act_o1 <dbl>,
## # grad_degree_f <dbl>, b0 <dbl>, Fbeta <dbl>, randEff <dbl>,
## # logistic <dbl>, prob <dbl>, sim_data <dbl>, withinID <int>,
## # clustID <int>
```

You could easily simulate a count outcome by changing `outcome_type = 'logistic'`

in the last example to `outcome_type = 'poisson'`

.

## Power

The package also includes wrappers around the simulation functions to explore empirical power under these frameworks. For example for a simple regression:

```
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 50
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
n = n, error_var = error_var, with_err_gen = with_err_gen,
data_str = "single", pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, raw_power = FALSE)
power_out
```

```
## # A tibble: 4 x 6
## var avg_test_stat sd_test_stat power num_reject num_repl
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.650382 0.9693836 0.14 7 50
## 2 act 6.006373 1.0708427 1.00 50 50
## 3 diff 3.416535 0.9326546 0.82 41 50
## 4 numCourse 2.370347 0.9836364 0.38 19 50
```

Where now the simulation function `sim_reg`

replicates 50 times and the estimates are used to generate empirical power for each parameter in the model. This idea becomes more powerful when a Monte Carlo like varying of parameters are used. This can ensure a much larger, explorative, and descriptive power analysis. These analyses if done well with the coverage of the parameters varied, may be more appropriate/realistic based on the data to be collected (or more similar to the data that was collected for post-hoc power analyses). Below is a simple example of varying parameters in the power analysis.

```
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 50
terms_vary <- list(n = c(20, 40, 60, 80, 100),
error_var = c(5, 10, 20))
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
n = n, error_var = error_var, with_err_gen = with_err_gen,
data_str = "single", pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, terms_vary = terms_vary,
raw_power = FALSE)
power_out
```

```
## # A tibble: 60 x 8
## # Groups: var, n [?]
## var n error_var avg_test_stat sd_test_stat power num_reject
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 20 5 1.1585985 1.1659626 0.10 5
## 2 (Intercept) 20 10 0.9295801 0.8381110 0.04 2
## 3 (Intercept) 20 20 0.7667661 0.5516210 0.00 0
## 4 (Intercept) 40 5 1.8054688 1.0987028 0.24 12
## 5 (Intercept) 40 10 1.0318538 0.7035785 0.02 1
## 6 (Intercept) 40 20 0.9198477 0.7038914 0.02 1
## 7 (Intercept) 60 5 2.0447040 0.9094806 0.30 15
## 8 (Intercept) 60 10 1.1719409 0.7894543 0.08 4
## 9 (Intercept) 60 20 0.8776733 0.7000668 0.02 1
## 10 (Intercept) 80 5 2.0065774 1.0137946 0.24 12
## # ... with 50 more rows, and 1 more variables: num_repl <dbl>
```

Similar syntax can be used for nested designs as well as with generalized models. These models and data types take more time to run compared to the single level designs above. Therefore, it is recommended to perform small simulations initially to ensure no errors and estimate how much time the simulation may take.

## Shiny App

The package also comes with a Shiny app that contains most (but not all) of the features found in the full package. These will likely be implemented slowly over time. You can run the Shiny app locally with the following function:

`simglm::run_shiny()`

One note, there are errors that occur in the stack trace, however as far as I can tell in my testing the app runs okay.

### JSM

Of note to anyone attending JSM this year, I am talking about the `simglm`

package https://ww2.amstat.org/meetings/jsm/2017/onlineprogram/ActivityDetails.cfm?SessionID=214411. Slides of this talk will be posted likely after the talk on my website.

Bugs and feature requests are welcomed and can be submitted directly to GitHub https://github.com/lebebr01/simglm/issues.

Twitter

Google+

Facebook

Reddit

LinkedIn

StumbleUpon

Email