A New Study Examines Longitudinal Scoring Stability in a Large-Scale Assessment Program

Researchers at Measurement Incorporated conducted the first comprehensive examination of the longitudinal scoring stability of a large-scale assessment program. Corey Palermo, Michael B. Bunch, and Kirk Ridge analyzed scoring data collected from 2016 to 2018, during three consecutive administrations of a large-scale, multi-state summative assessment program. The purpose of the research was to determine the overall extent of leniency/severity during scoring, the extent to which leniency/severity effects were stable across administrations, and the impact of the stability of leniency/severity effects on students’ scores.

The summative assessments required hand-scoring for English language arts (ELA) reading, writing, and research items. Hand-scoring was also required for mathematics items assessing problem-solving, modeling and data analysis, and communicating reasoning. Validity responses (i.e., expert-scored benchmark responses selected to represent the range of responses that raters encounter operationally) were used to determine rater leniency/severity (i.e., the extent to which responses were assigned higher or lower scores than warranted given an external criterion of performance). The authors used cross-classified multilevel models to model raters’ scores and the responses and items with which they were associated. Then, they applied model results to students’ scaled scores to estimate the net impact of leniency/severity effects across scoring administrations on students’ scores.
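
To make the idea of leniency/severity concrete, the sketch below computes the average signed difference between a rater's score and the expert criterion score on validity responses, grouped by administration. It is only a descriptive illustration and not the cross-classified multilevel model the authors used. The records, the function name mean_deviation_by_year, and the score values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical validity-response records (not data from the study):
# (administration year, score assigned by rater, expert criterion score)
records = [
    (2016, 2, 2),
    (2016, 3, 2),
    (2017, 1, 2),
    (2017, 2, 2),
    (2018, 1, 2),
    (2018, 2, 3),
]

def mean_deviation_by_year(rows):
    """Return the mean of (rater score - criterion score) per administration.

    Positive values indicate leniency (scores higher than warranted);
    negative values indicate severity (scores lower than warranted).
    """
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of deviations, count]
    for year, rater_score, criterion_score in rows:
        totals[year][0] += rater_score - criterion_score
        totals[year][1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}

drift = mean_deviation_by_year(records)
```

In this toy data the per-administration means move from positive to negative, which is the kind of pattern that would be read as increasing severity over time. A model-based analysis like the one described above would additionally account for differences among raters, responses, and items.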

The results showed relative stability across administrations in mathematics scoring and slightly increasing severity associated with ELA short answer item scoring. ELA essay scoring had mixed results, showing evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when applied to students’ scaled scores, the cumulative effect of drift was estimated to be a change of less than one scaled score point for tests having scaled score ranges of 432–582 points. These results suggest that overall, rater effects had little impact on students’ scores.

This research was the first of its kind to provide a comprehensive examination of the longitudinal scoring stability of a large-scale assessment program. It was also the first study to examine longitudinal scoring stability of mathematics and ELA short constructed response items. The study was published in the Fall 2019 special issue of the Journal of Educational Measurement on Rater-Mediated Assessments.

Palermo, C., Bunch, M., & Ridge, K. (2019). Scoring stability in a large-scale assessment program: A longitudinal analysis of leniency/severity effects. Journal of Educational Measurement, 56(3), 626–652. https://doi.org/10.1111/jedm.12228