<h1>Assessing the validity of item response theory models when calibrating field test items</h1>
<h2>Brandon LeBeau</h2>
<h3>University of Iowa</h3>
# Validity for IRT Models
- Validity is important for any assessment and the argument should begin with psychometrics.
- How the psychometric calibration is performed directly shapes the properties of the assessment that are later examined for validity evidence.
+ Are scores reported below chance level?
- The validity of the psychometrics is particularly important for field test data.
# IRT Model
<img src="/figs/irt.PNG" alt="IRT model" height="200" width="1200"/>
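As a minimal sketch (not the study's code), the three-parameter logistic (3PL) item response function pictured above can be written as, with `a` the discrimination, `b` the difficulty, and `c` the pseudo-guessing lower asymptote:

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability
    a: item discrimination
    b: item difficulty
    c: pseudo-guessing (lower asymptote)
    D: scaling constant (1.7 approximates the normal ogive metric)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

Setting `c = 0` recovers the 2PL model discussed later; at `theta = b` the probability is exactly halfway between `c` and 1.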
# Field Testing
- Field testing (FT) is essential to new assessment development or form building.
+ A way to gather information to make informed decisions about which items become operational.
- Limitations:
+ Many items are tried that do not become operational.
- This spreads a fixed pool of individuals (respondents) across many field test items.
- Ultimately, sample size can be significantly smaller compared to operational assessments.
+ Issues with distractors.
# Threats to Validity in FT
- Generalizability
+ Is the FT sample representative of the desired population?
+ Over-fitting with 3PL model?
- Uncertainty in estimates
+ Sample size and lower asymptote estimation
+ Interconnected parameter estimates
# Generalizability
- We assume respondents are randomly sampled from some population.
+ Are item responses truly randomly sampled from the population of interest?
- Selection or measurement bias
+ If not, estimates are extremely sample dependent.
- 3PL model may provide better fit, but is this at the cost of overfitting?
+ Fit should not be the only consideration when deciding on an IRT model for FT data.
# Uncertainty
- Sample size (1000 commonly cited for 3PL model):
+ Tends to be smaller in field test designs.
+ Even with small samples, convergence can be achieved with the 3PL model with the help of priors, ridge adjustments, etc.
- Are our estimates now biased?
- Estimating Lower Asymptote (pseudo-guessing):
+ Difficulty in estimating this term ($c_{j}$) has direct impact on estimation of the other two terms.
- This leads to a cascading vortex of problems.
+ The pseudo-guessing term is commonly a nuisance parameter; why allow a nuisance parameter to drastically affect estimation of the other terms?
# Methodology
- Individual response strings were resampled in a two-stage framework:
+ First, individuals who took the field test were resampled with replacement within each field test booklet.
+ Second, individuals who only took operational items were resampled with replacement to fill out the remaining observations.
- After resampling, items were calibrated with BILOG-MG.
- This process was replicated 5000 times to generate bootstrapped item parameters.
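The two-stage resampling above can be sketched as follows; the data layout and names are illustrative assumptions (the actual calibration step used BILOG-MG, not Python):

```python
import random
from collections import defaultdict

def two_stage_resample(ft_records, op_records, rng=random):
    """One bootstrap replicate of the two-stage resampling.

    ft_records: list of (booklet_id, response_string) for field-test takers
    op_records: list of response_string for operational-only takers
    """
    # Stage 1: resample FT takers with replacement within each booklet,
    # keeping each booklet's sample size fixed.
    by_booklet = defaultdict(list)
    for booklet, resp in ft_records:
        by_booklet[booklet].append(resp)
    sample = []
    for resps in by_booklet.values():
        sample.extend(rng.choices(resps, k=len(resps)))
    # Stage 2: resample operational-only takers with replacement
    # to fill out the remaining observations.
    sample.extend(rng.choices(op_records, k=len(op_records)))
    return sample
```

Repeating this 5000 times and recalibrating each replicate yields the bootstrapped item parameter distributions summarized in the following slides.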
# Example ICC Math FT Item 3PL
<img src="/figs/iccgr3math57.png" alt="Bootstrapped ICCs for a math FT item, 3PL model" height="500" width="1200"/>
# Example ICC Math FT Item 2PL
<img src="/figs/iccgr3math572pl.png" alt="Bootstrapped ICCs for a math FT item, 2PL model" height="500" width="1200"/>
# Example ICC ELA FT Item 3PL
<img src="/figs/iccread653pl.png" alt="Bootstrapped ICCs for an ELA FT item, 3PL model" height="500" width="1200"/>
# Example ICC ELA FT Item 2PL
<img src="/figs/iccread652pl.png" alt="Bootstrapped ICCs for an ELA FT item, 2PL model" height="500" width="1200"/>
# ICC Summary
- For individual items, the variation in the ICCs for a 3PL model can be large.
+ This may lower the usefulness of the estimates in selecting the best items to become operational.
- How can these 3PL curves be expected to generalize beyond this sample with so much variability?
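One simple way to quantify the ICC variability discussed above is the pointwise range of the curves across bootstrapped parameter draws. A sketch with made-up draws (not the study's actual estimates):

```python
import math

def icc(theta, a, b, c, D=1.7):
    # 3PL item characteristic curve (c = 0 gives the 2PL).
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def pointwise_icc_range(param_draws, thetas):
    """Max minus min ICC probability at each theta across
    bootstrapped (a, b, c) draws -- a simple spread measure."""
    ranges = []
    for t in thetas:
        ps = [icc(t, a, b, c) for a, b, c in param_draws]
        ranges.append(max(ps) - min(ps))
    return ranges

# Hypothetical bootstrapped draws for one FT item.
draws = [(1.2, 0.0, 0.25), (0.8, -0.4, 0.10), (1.5, 0.3, 0.30)]
thetas = [-3 + 0.5 * i for i in range(13)]
spread = pointwise_icc_range(draws, thetas)
```

A wide pointwise range over the ability scale signals exactly the instability shown in the 3PL ICC plots.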
# b and c Estimates and SEs
<img src="/figs/pairs_bc_ela.png" alt="Pairwise plot of b and c estimates and SEs, ELA" height="500" width="1200"/>
# a and c Estimates and SEs
<img src="/figs/pairs_ac_ela.png" alt="Pairwise plot of a and c estimates and SEs, ELA" height="500" width="1200"/>
# a and b Estimates and SEs
<img src="/figs/pairs_ab_ela.png" alt="Pairwise plot of a and b estimates and SEs, ELA" height="500" width="1200"/>
# Uncertainty Summary
- The pseudo-guessing estimates are:
+ negatively related to the estimates of the b and a parameters.
+ positively related to the uncertainty in the b parameter, likely the parameter of most interest.
- In turn, increasing the uncertainty in the b parameter:
+ further increases the uncertainty in the a parameter.
+ is also negatively related to estimates of the a parameter.
- Thus, creating the cascading vortex of problems.
# Conclusions
- Item parameters estimated from FT data should not be treated as truth.
+ Variation in these parameter estimates needs to be considered.
- Fit should not be the only concern when selecting an IRT model; uncertainty, generalizability, and usefulness should also be considered.
- Estimates are much more stable when using the 2PL model.
+ Thus providing a stronger foundation with which to start the validity argument for an assessment.
# Questions?
- Twitter: @blebeau11
- Website: <http://brandonlebeau.org> <br/> <http://www2.education.uiowa.edu/directories/person?id=bleb>
- Slides: <http://brandonlebeau.org/2016/04/09/ncme2016.html>