Research & Expertise

This page presents research supported by Measurement Incorporated (MI) as well as unaffiliated research conducted by current MI staff. These publications and presentations illustrate our staff’s wide-ranging expertise in the field of educational measurement.

Featured Research

Industry-leading, open-access publications authored by our staff.

Palermo, C. (2022). Rater characteristics, response content, and scoring contexts: Decomposing determinates of scoring accuracy. Frontiers in Psychology, 13:937097. https://doi.org/10.3389/fpsyg.2022.937097

Abstract: Raters may introduce construct-irrelevant variance when evaluating written responses to performance assessments, threatening the validity of students’ scores. Numerous factors in the rating process, including the content of students’ responses, the characteristics of raters, and the context in which the scoring occurs, are thought to influence the quality of raters’ scores. Despite considerable study of rater effects, little research has examined the relative impacts of the factors that influence rater accuracy. In practice, such integrated examinations are needed to afford evidence-based decisions of rater selection, training, and feedback. This study provides the first naturalistic, integrated examination of rater accuracy in a large-scale assessment program. Leveraging rater monitoring data from an English language arts (ELA) summative assessment program, I specified cross-classified, multilevel models via Bayesian (i.e., Markov chain Monte Carlo) estimation to decompose the impact of response content, rater characteristics, and scoring contexts on rater accuracy. Results showed relatively little variation in accuracy attributable to teams, items, and raters. Raters did not collectively exhibit differential accuracy over time, though there was significant variation in individual rater’s scoring accuracy from response to response and day to day. I found considerable variation in accuracy across responses, which was in part explained by text features and other measures of response content that influenced scoring difficulty. Some text features differentially influenced the difficulty of scoring research and writing content. Multiple measures of raters’ qualification performance predicted their scoring accuracy, but general rater background characteristics including experience and education did not. Site-based and remote raters demonstrated comparable accuracy, while evening-shift raters were slightly less accurate, on average, than day-shift raters. This naturalistic, integrated examination of rater accuracy extends previous research and provides implications for rater recruitment, training, monitoring, and feedback to improve human evaluation of written responses.

Keywords: rater effects, writing assessment, rater-mediated assessment, multilevel modeling (MLM), rater monitoring

He, Y., & Cui, Z. (2020). Evaluating robust scale transformation methods with multiple outlying common items under IRT true score equating. Applied Psychological Measurement, 44, 296–310. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7262993/

Abstract: Item parameter estimates of a common item on a new test form may change abnormally due to reasons such as item overexposure or change of curriculum. A common item, whose change does not fit the pattern implied by the normally behaved common items, is defined as an outlier. Although improving equating accuracy, detecting and eliminating of outliers may cause a content imbalance among common items. Robust scale transformation methods have recently been proposed to solve this problem when only one outlier is present in the data, although it is not uncommon to see multiple outliers in practice. In this simulation study, the authors examined the robust scale transformation methods under conditions where there were multiple outlying common items. Results indicated that the robust scale transformation methods could reduce the influences of multiple outliers on scale transformation and equating. The robust methods performed similarly to a traditional outlier detection and elimination method in terms of reducing the influence of outliers while keeping adequate content balance.

Keywords: equating, item response theory, multiple outliers, robust scale transformation

White Papers

Our collection of white papers, written by our own industry experts and psychometricians, includes the latest industry research and best practices.

A Gentle Introduction to Automated Scoring

by Corey Palermo, Ph.D. (October 2017)
"Automated scoring also offers a variety of benefits for assessment of learning. One benefit is that it is much faster than scoring by teachers or professional raters; once models have been generated, responses can be scored in seconds. This allows assessment results to be available to stakeholders very rapidly. A second benefit is that automated scoring tends to be as accurate or more accurate than multiple professional raters. Furthermore, automated‐scoring engines are perfectly reliable in ways that raters are not—an automated‐scoring engine will assign the same score to a response every time."

White Paper: PEG Changes

by Michael B. Bunch, Thomas Davis, Ann Hayes, Derek Justice, Julie St. John (July 2017)
"MI continues to monitor advancements in the automated essay scoring field while searching for ways to make PEG as effective as possible in helping students learn to write. As a result, PEG will be ever-evolving."

The Case for Professional Learning Communities

by Tina B. Clayton (2017)
"A Professional Learning Community (PLC) is a small group of professionals who continuously seek cutting-edge ideas and collaboratively evaluate how to best apply the new information to the work. The PLC operates under the assumption that to stay ahead of the competition, an organization must learn faster than the competition and consistently produce exceptional work."

Project Essay Grade (PEG) Current Usage and Research

by Shayne Miel (2014)
"With subsequent improvements in PEG and general advances in the reliability of machine scoring, artificial intelligence (AI) scoring has become a valuable, and in some cases, essential, tool in a variety of contexts."

The Future of Testing

by Michael B. Bunch, Ph.D. (2013)
"The future still looks a lot like it did 25 years ago: cognitive-based assessment, online assessment, widespread use of computer adaptive testing, universal access to technology, and instantaneous reporting of test results. So many wonderful things, still within our view but just beyond our grasp!"

It Takes Three

by Michael B. Bunch, Ph.D. (2012)
"Making sure all students are college and career ready requires not only an alignment of curriculum and instruction with college and career requirements but also an approach to monitoring student progress on a continual basis, with in-class formative assessments, frequent interim assessments, and focused summative assessments. Taken together, formative, interim, and summative assessments, aligned to Common Core State Standards (CCSS), will support instructional decision making and enhance daily learning activities."

Aligning Curriculum, Assessment, and Instruction

by Michael B. Bunch, Ph.D. (2012)
"A key component of educational achievement test validation is alignment of the test to both curriculum and instruction. By alignment, we mean the degree to which the items of the test, both individually and collectively, match the structure and intent of the curriculum and instruction."

Publications

Peer-reviewed scholarly works by our staff.

2021

Clauser, B. E., & Bunch, M. B. (Eds.). (2021). The history of educational measurement: Key advancements in theory, policy, and practice. Routledge. https://doi.org/10.4324/9780367815318

Jiang, N., Rogers, B., Fan, X., Hu, X., Lewis, A., & Cai, B. (2021). School-level factors related to visual arts achievement for fourth graders: A longitudinal analysis. Studies in Art Education, 62(1), 47–62. https://doi.org/10.1080/00393541.2020.1858263

2020

DiStefano, C., & Jiang, N. (2020). Applying the Rasch rating scale method to questionnaire data. In M. Khine (Ed.), Rasch measurement. Springer. https://doi.org/10.1007/978-981-15-1800-3_3

Fan, X., Jiang, N., & Lewis, A. (2020). Factors associated with fourth graders’ music knowledge assessed by SCAAP. International Journal of Music Education, 38(4), 644–656. https://doi.org/10.1177/0255761420926664

Wang, W., Chen, J., & Kingston, N. (2020). How well do simulation studies inform decisions about multistage testing? Journal of Applied Measurement, 21(3), 271–281. PMID: 33983899.

2019

Liu, J., Burgess, Y., DiStefano, C., Pan, F., & Jiang, N. (2019). Validating the Pediatric Symptoms Checklist–17 in the preschool environment. Journal of Psychoeducational Assessment, 38(4), 460–474. https://doi.org/10.1177/0734282919828234

Murray, A. K., Daoust, C. J., & Chen, J. (2019). Developing instruments to measure Montessori instructional practices. Journal of Montessori Research, 5(1), 50–87. https://doi.org/10.17161/jomr.v5i1.9797

Palermo, C., Bunch, M., & Ridge, K. (2019). Scoring stability in a large-scale assessment program: A longitudinal analysis of leniency/severity effects. Journal of Educational Measurement, 56(3), 626–652. https://doi.org/10.1111/jedm.12228

Palermo, C., & Thomson, M. M. (2019). Large-scale assessment as professional development: Teachers’ motivations, ability beliefs, and values. Teacher Development, 23(2), 192–212. https://doi.org/10.1080/13664530.2018.1536612

2018

Chen, J. (2018). KR-20. In B. Frey (Ed.), Encyclopedia of educational research, measurement, and evaluation. Sage Publishing.

Chen, J. (2018). Interstate School Leaders Licensure Consortium (ISLLC) standards. In B. Frey (Ed.), Encyclopedia of educational research, measurement, and evaluation. Sage Publishing.

Chen, J., & Perie, M. (2018). Comparability with computer-based assessment: Does screen size matter? Computers in the Schools, 35(4), 268–283. https://doi.org/10.1080/07380569.2018.1531599

Cui, Z., Liu, C., He, Y., & Chen, H. (2018). Evaluation of a new method for providing full review opportunities in computerized adaptive testing—Computerized adaptive testing with salt. Journal of Educational Measurement, 55(4), 582–594. https://doi.org/10.1111/jedm.12193

2017

DiStefano, C., Liu, J., Jiang, N., & Shi, D. (2017). Examination of the weighted root mean square residual: Evidence for trustworthiness? Structural Equation Modeling: A Multidisciplinary Journal, 25(3), 453–466. https://doi.org/10.1080/10705511.2017.1390394

2016

Bunch, M. B., Vaughn, D., & Miel, S. (2016). Automated scoring in assessment systems. In Y. Rosen, S. Ferrara, & M. Mosharraf (Eds.), Handbook of research on technology tools for real-world skill development (pp. 611–626). IGI Global. https://doi.org/10.4018/978-1-4666-9441-5.ch023

McClintock, J. C. (2016). Reduction in cheating following a forensic investigation on a statewide summative assessment. Applied Measurement in Education, 29(2), 132–143. https://doi.org/10.1080/08957347.2016.1138958

2015

He, Y., Cui, Z., & Osterlind, S. J. (2015). New robust scale transformation methods in the presence of outlying common items. Applied Psychological Measurement, 39(8), 613–626. https://doi.org/10.1177/0146621615587003

McClintock, J. C. (2015). Erasure analyses: Reducing the number of false positives. Applied Measurement in Education, 28(1), 14–32. https://doi.org/10.1080/08957347.2014.973563

2014

Sotaridona, L. S., Wibowo, A., & Hendrawan, I. (2014). A parametric approach to detect a disproportionate number of identical item responses on a test. In N. M. Kingston & A. K. Clark (Eds.), Test fraud: Statistical detection and methodology (pp. 54–68). Routledge.

Presentations

Conference paper and poster presentations by our staff.

2022

He, Y., Jing, S., & Lu, Y. (2022, April). A multilevel multinomial logit approach to bias detection. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Jiang, N., Zhang, T., Gao, R., DiStefano, C., & Dou, J. (2022, April). Measurement invariance testing using multiple-group CFA: A systematic review. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Justice, D. (2022, April). A linear model approach to bias detection. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Palermo, C. (2022, April). Examining hybrid automated scoring/handscoring results in a multi-state design. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

2021

Brown, R. (2021, September). How medical boards responded to COVID by making oral exams virtual and the lessons they learned. Poster presented at the American Board of Medical Specialties Virtual Conference.

Cui, Z., Liu, C., & He, Y. (2021, June). Using machine learning to administer salt items in computerized adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education (virtual).

Gao, R., Jiang, N., DiStefano, C., & Liu, J. (2021, April). Young children’s behavior adjustment trajectories: A latent growth curve analysis. Paper presented at the annual meeting of the American Educational Research Association, Orlando, FL.

Gao, R., DiStefano, C., Liu, J., & Jiang, N. (2021, April). Longitudinal invariance analysis of Pediatric Symptom Checklist-17 (PSC-17). Paper presented at the annual meeting of the American Educational Research Association, Orlando, FL.

Jiang, N., Gao, R., DiStefano, C., & Liu, J. (2021, April). Using latent profile analysis to classify primary student’s social-emotional and behavioral functioning. Paper presented at the annual meeting of the American Educational Research Association, Orlando, FL.

Thacker, A., Word, A., Sinclair, A., Nash, B., & Chen, J. (2021, June). Moving bookmark standards setting from in-person to virtual: Best practices/lessons learned. Paper presented at the annual meeting of the National Council on Measurement in Education, online.

2020

He, Y., Wu, Y.F., & Tao, W. (2020, September). Comparing CTT postequating and IRT preequating in the embedded field-test model. Paper presented at the annual meeting of the National Council on Measurement in Education.

Jiang, N., Pompey, K., & Burgess, Y. (2020, April). A comparison of two DIF methods for analyzing Rasch model data: A Monte Carlo investigation. Paper presented virtually at the annual meeting of the American Educational Research Association, San Francisco, CA.

Murray, A., Daoust, C., & Chen, J. (2020, April). Validating tools for measuring Montessori implementation. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Wu, Y.F., He, Y., & Tao, W. (2020, September). Evaluating impacts on operational item performance in the embedded field-test model. Paper presented at the annual meeting of the National Council on Measurement in Education.

2019

Cui, Z., Liu, C., & He, Y. (2019, April). On administering salt items in computerized adaptive testing with salt. Paper presented at the annual meeting of the National Council on Measurement in Education, Toronto, Canada.

Daoust, C., Murray, A., & Chen, J. (2019, March). A reexamination of implementation practices in Montessori early childhood education. Paper presented at The Montessori Event, Washington, DC.

Murray, A., Chen, J., Daoust, C., & Amos, A. (2019, April). Dimensions of fidelity in a constructivist classroom. Paper presented at the annual meeting of the American Educational Research Association, Toronto, Canada.

Pompey, K., Jiang, N., Burgess, Y., Lewis, A., & Dou, J. (2019, April). Differential item functioning analysis of a state-wide visual arts assessment using a two-stage procedure. Paper presented at the annual meeting of the American Educational Research Association, Toronto, Canada.

2018

Chen, T., Tao, W., & Gao, X. (2018, July). Evaluating item position effects on scrambled form pre-equating. Paper presented at the annual meeting of the International Test Commission Conference, Montreal, Canada.

Jiang, N., DiStefano, C., Liu, J., & Shi, D. (2018, April). An investigation of statistical power and sample size for CFA models with ordinal data: A Monte Carlo study. Paper presented at the Modern Modeling Methods Conference, Storrs, CT.

Jiang, N., Liu, J., Shi, D., & DiStefano, C. (2018, April). Performance of the weighted root mean square residual with categorical and continuous data. Paper presented at the Modern Modeling Methods Conference, Storrs, CT.

Jiang, N., Liu, J., Shi, D., & DiStefano, C. (2018, July). Sample size and statistical power for SEM: A simulation study. Paper presented at the International Meeting of the Psychometric Society, New York, NY.

Jiang, N., Zheng, J., & Lewis, A. (2018, April). An HLM approach to investigate factors influencing visual arts achievement for elementary school students. Paper presented at the Chinese American Educational Research and Development Association Conference, New York, NY.

Wang, W., Zheng, Z., & Chen, J. (2018, October). Clustering students in a state classroom assessment system: Exploring the usages for classroom assessment. Paper presented at the National Council on Measurement in Education Special Conference on Classroom Assessment, Lawrence, KS.

2017

Burgess, Y., Lewis, A., & Jiang, N. (2017, November). Increasing stakeholder use of assessment data through improved reporting. Paper presented at the annual meeting of the American Evaluation Association, Washington, DC.

Chen, T., Huang, C.H., & Liu, C. (2017). An imputation approach to handling incomplete computerized tests. Paper presented at the annual meeting of the International Association of Computerized Adaptive Testing, Niigata, Japan.

Fang, Y., Lu, Y., & He, Y. (2017, April). Can subtest equating borrow information from the full test? Paper presented at the annual meeting of the National Council on Measurement in Education, San Antonio, TX.

Guo, Z., Jiang, N., & Robert, J. (2017, April). Interrater reliability estimator accuracy and double-rated percentages: A Monte Carlo investigation. Paper presented at the annual meeting of the American Educational Research Association, San Antonio, TX.

He, Y., & Yi, Q. (2017, April). Impact of item parameter drift on mixed-format tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Antonio, TX.

Leighton, E., Fan, X., Jiang, N., & Lewis, A. (2017, April). Using item response theory to investigate assessment quality in a large-scale music assessment program. Paper presented at the 6th International Symposium on Assessment in Music Education, Context Matters, Birmingham, UK.

Liu, J., Jiang, N., & DiStefano, C. (2017, May). Performance of weighted root mean square residual (WRMR) in structural equation modeling. Poster presented at the Modern Modeling Methods Conference, Mansfield, CT.

2016

Cui, Z., Liu, C., He, Y., & Chen, H. (2016, April). A modified procedure in applying CATS to allow unrestricted answer changing. Paper presented at the annual meeting of the National Council on Measurement in Education, Washington, D.C.

Cui, Z., Liu, C., He, Y., & Chen, H. (2016, April). Evaluation of a new method in providing full review opportunities in computerized adaptive testing—Computerized adaptive testing with salt. Paper presented at the annual meeting of the National Council on Measurement in Education SRERA Distinguished Paper Session, Washington, D.C.

He, Y., Liu, R., & Cui, Z. (2016, April). Bayesian estimation of null categories in constructed-response items. Paper presented at the annual meeting of the National Council on Measurement in Education, Washington, D.C.

Jiang, N., Pan, F., Liu, J. & DiStefano, C. (2016, February). Paper presented at the South Carolina Educators for the Practical Use of Research (SCEPUR) annual conference, Columbia, SC.

Yi, Q., He, Y., & Wei, H. (2016, April). Sample size requirement for trend scoring in mixed-format test equating. Paper presented at the annual meeting of the National Council on Measurement in Education, Washington, D.C.

2015

Chen, T., & Tao, W. (2015, April). Linking multiple scaling tests under IRT. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Cui, Z., & He, Y. (2015, July). Practical considerations in choosing an anchor form for equating. Poster presented at the annual meeting of the Psychometric Society, Beijing, China.

Cui, Z., Liu, C., He, Y., & Chen, H. (2015, April). Allowing unrestricted answer changing through computerized adaptive testing with salt. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Cui, Z., Liu, C., He, Y., & Chen, H. (2015, September). Comparing CATS and the block review method in providing review options in CAT. Paper presented at the International Association for Computerized Adaptive Testing Summit, Cambridge, UK.

Cui, Z., Liu, C., He, Y., & Chen, H. (2015, December). Evaluation of a new method in providing full review opportunities in computerized adaptive testing—Computerized adaptive testing with salt. Paper received the IEREA Distinguished Research Award from Iowa Educational Research and Evaluation Association, Iowa City, IA.

Harris, D. J., Liu, C., & Chen, T. (2015). An exploratory study of starting a CAT with a non-scaled item pool. Paper presented at the annual meeting of the International Association of Computerized Adaptive Testing, Cambridge, England.

He, Y., Cui, Z., & Osterlind, S.J. (2015, April). Using robust scale transformation methods for multiple outlying common items. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Hurtz, G., & Brown, R. (2015, April). Establishing meaningful expectations for test performance via invariant latent standards. Symposium presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Hurtz, G., Brown, R., & Tucker, N. (2015, March). Evolution of psychological measurement models and their applications in practical testing and assessment. Paper presented at the annual conference of the Association of Test Publishers, Palm Springs, CA.

Wibowo, A., & Sotaridona, L.S. (2015, July). Incorporating suspicious answer changes in the detection of aberrant response patterns. Paper presented at the International Meeting of the Psychometric Society (IMPS), Beijing, China.

2014

Cui, Z., He, Y., & Osterlind, S.J. (2014, August). New robust scale transformation methods in the presence of outlying common items. Paper presented at the annual meeting of the Psychometric Society, Madison, WI.

Cui, Z., Liu, C., He, Y., & Chen, H. (2014, October). Comparison of algorithms that allow item review in computerized adaptive testing. Paper presented at the International Association for Computerized Adaptive Testing Summit, Princeton, NJ.

He, Y., & Cui, Z. (2014, April). Comparison of IRT preequating methods when item positions change. Paper presented at the annual meeting of the National Council on Measurement in Education, Philadelphia, PA.

Su, I., He, Y., & Osterlind, S. J. (2014, April). Comparing different model-based standard setting procedures. Poster presented at the NCME Graduate Student Issues Committee (GSIC) poster session, Philadelphia, PA.

Yang, P., He, Y., & Wang, Z. (2014, April). Hierarchical Bayesian modeling for two-parameter nested logit model with parallel computing. Poster presented at the NCME Graduate Student Issues Committee (GSIC) poster session, Philadelphia, PA.

2013

He, Y., Yang, P., & Osterlind, S. J. (2013, April). Weighted moment approaches in scale transformation for IRT equating. Poster presented at the NCME Graduate Student Issues Committee (GSIC) poster session, San Francisco, CA.

Sotaridona, L. S., Wibowo, A., & Hendrawan, I. (2013, October). Item-level analysis of wrong-to-right erasures. Paper presented at the 2nd Annual Conference on Statistical Detection of Possible Test Fraud, Madison, WI.

Sotaridona, L. S., Wibowo, A., & Hendrawan, I. (2013, April–May). The utility of dichotomous IRT models on group-level cheating detection method. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Sotaridona, L. S., Wibowo, A., Hendrawan, I., & Pornel, J. (2013, October). An application of nominal response model to identify erroneously scored test items. Invited paper presentation at the 12th National Convention on Statistics, Manila, Philippines.

Sotaridona, L. S., Wibowo, A., & Pornel, J. (2013, October). The stability of point biserial correlation coefficient estimates against different sampling schemes. Paper presented at the 12th National Convention on Statistics, Manila, Philippines.

Wibowo, A., Sotaridona, L. S., & Hendrawan, I. (2013, October). Item-level analysis of response similarity. Paper presented at the 2nd Annual Conference on Statistical Detection of Possible Test Fraud, Madison, WI.

Wibowo, A., Sotaridona, L. S., & Hendrawan, I. (2013, April–May). Statistical models for flagging unusual number of wrong-to-right erasures. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Zopluoglu, C., Chen, T., Huang, C., & Mroch, A. (2013, April). Using previous test performance to improve the efficiency of statistical indices in detecting answer copying. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Zopluoglu, C., Chen, T., Huang, C., & Mroch, A. (2013, October). The performance of statistical indices in detecting answer copying on multiple-choice examinations using dichotomous item scores. Paper presented at the 2nd Annual Statistical Detection of Potential Test Fraud Conference, Madison, WI.