Symposium 1 – The CAT in the language assessments bag
Organizer: Alina A. von Davier – Duolingo, USA
Abstract: With the growth of digital technology and advances in automated test development tools, ranging from automated item generation to automated scoring, new opportunities have emerged to develop innovative forms of technology-based assessments. This symposium offers an overview of how innovative computer adaptive algorithms, especially when coupled with other advanced technologies, support language tests. The four selected papers cover a bag of CATs with a wide range of specific applications that span the language itself (English and German), the country (Brazil, Germany, USA) and the supportive methodologies and technologies (from automatic item and test development to delivery). The first paper presents a computer adaptive English test from Brazil. It provides a historical perspective that reflects on the changes to the test over time. The second paper provides an overview of the Duolingo English Test automatic item generation (AIG) and CAT algorithm procedures. The third paper describes a German language test for professionals, the Goethe Test PRO, and illustrates the psychometric considerations for the test. The fourth paper introduces a new paradigm in which the AIG and the CAT algorithms are blended together into a dynamic assessment design. This paper will build on the existing methodologies at Duolingo but integrate them with an Elo rating system for real-time difficulty estimation. These studies illustrate the similarities and differences in CAT design across language tests and also contribute to computational psychometric research by blending the computational models behind the automatic algorithms into the more traditional CAT approaches.
- Mariana Curi (University of São Paulo, Brazil), Elias Silva de Oliveira, & Lohan Rodrigues
- Burr Settles (Duolingo, USA), & Geoff LaFlair
- Aron Fink (University of Frankfurt, Germany), & Katharina Klein
- Yigal Attali (Duolingo, USA), & Alina A. von Davier
Symposium 2 – Adaptive testing in PISA: past, present and future
Organizer: Janine Buchholz, & Mario Piacentini – OECD, France
Abstract: Starting with the 2018 cycle, the Programme for International Student Assessment (PISA) uses a multi-stage adaptive testing (MSAT) design to assign different test forms that are matched to students’ ability. This initial foray into adaptive testing helped PISA address test-fairness concerns (by limiting the share of respondents who are given tests that do not allow them to demonstrate their full proficiency); eliminated the need for country-level adaptations; and achieved some reductions in measurement error, especially for students with exceptionally low or high performance. Specifically, an MSAT design with two branching points and a non-adaptive (random probability) layer was chosen to control exposure of items (for item calibration) and manage non-statistical constraints (coverage of sub-constructs). Only preliminary estimates of item characteristics were available for the adaptive decisions, and all item parameters were re-calibrated after the adaptive administration. The lessons learned in this first experience have informed the design for PISA 2022. Starting with PISA 2022, multiple domains are being administered in adaptive fashion. Looking beyond 2022, there is clear potential to improve the designs and associated methodologies to further increase both measurement precision and student engagement during the test. Several papers presented at this symposium will illustrate the challenges that the introduction of adaptive testing in PISA faced and the opportunities that exist to introduce methodological innovations. The first set of presentations reviews the past and present of MSAT designs in PISA, while the second set explores ways of introducing future innovations in the context of PISA. The opening presentation by Hyo Jeong Shin will demonstrate the robustness of the adaptive design first implemented in the PISA 2018 reading test.
Building on this, Peter van Rijn will review the technical challenges encountered in the design of the PISA 2022 mathematics test and explain how they were addressed. The presentation by Janine Buchholz will examine the differential effects of the increased ability-difficulty match, a result of MSAT, on test-taking engagement. Finally, the presentations by Andreas Frey and Hua-Hua Chang will explore the potential benefits of introducing greater adaptivity in the design, such as through testlet-based computerised adaptive testing combined with shadow testing (ST), and on-the-fly assembled multistage adaptive testing (OMST). In the discussion, Matthias von Davier will reflect on the five presentations and provide some concluding remarks.
- Hyo Jeong Shin (ETS, USA), Christoph König, Frederic Robin, Kentaro Yamamoto, & Andreas Frey
- Peter van Rijn (ETS, USA), Usama Ali, Hyo Jeong Shin, & Fred Robin
- Janine Buchholz (OECD, France), Hyo Jeong Shin, & Maria Bolsinova
- Andreas Frey (University of Frankfurt, Germany), Christoph König, & Aron Fink
- Hua-Hua Chang (Purdue University, USA), Xiuxiu Tang, Yi Zheng, Tong Wu, & Kit-Tai Hau
- Discussant: Matthias von Davier (Lynch School of Education, Boston College, USA)
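The Symposium 2 abstract mentions an MSAT design with two branching points and a non-adaptive (random probability) layer. The following is a minimal sketch of one possible reading of that idea, not the actual PISA routing rules: the function name `route`, the module labels, the cutoff scores, and the mismatch probability are all hypothetical and chosen only for illustration.

```python
import random


def route(score: int, cutoff: int, mismatch_prob: float, rng: random.Random) -> str:
    """Choose the next module at one branching point.

    The score-matched module is 'hard' when score >= cutoff, else 'easy'.
    With probability mismatch_prob the other module is assigned instead,
    so every module keeps being seen by students of all ability levels.
    All values here are illustrative assumptions.
    """
    matched = "hard" if score >= cutoff else "easy"
    if rng.random() < mismatch_prob:
        return "easy" if matched == "hard" else "hard"
    return matched


if __name__ == "__main__":
    rng = random.Random(2018)
    # Branching point 1: after the core module (hypothetical score 7, cutoff 5).
    stage1_module = route(score=7, cutoff=5, mismatch_prob=0.1, rng=rng)
    # Branching point 2: after stage 1 (hypothetical cumulative score 9, cutoff 12).
    stage2_module = route(score=9, cutoff=12, mismatch_prob=0.1, rng=rng)
    print(stage1_module, stage2_module)
```

Setting `mismatch_prob` to 0 gives purely adaptive routing, while larger values trade some ability-difficulty match for more balanced item exposure.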
Symposium 3 – Standardizing the measurement of physical, mental, and social health in adults and children with or without (chronic) conditions – The Patient-Reported Outcomes Measurement Information System (PROMIS®)
Organizer: Caroline B. Terwee – Universiteit Amsterdam, the Netherlands
Abstract: There is increasing interest in healthcare for measuring physical, mental, and social health from the patients' perspective in individual patient care to obtain transparent and comparable outcomes for health care evaluations and improvement initiatives. However, health care providers do not yet measure patient-reported health outcomes consistently because of a lack of consensus on what to measure, the time investment required, and an excess of questionnaires that differ in content and quality and have incomparable scores. The Patient-Reported Outcomes Measurement Information System (PROMIS®) was developed by a collaboration between the US National Institutes of Health and eight US research institutes to develop one state-of-the-art assessment system to measure patient-reported health with highly accurate, precise and short measures for use across adult and pediatric (patient) populations. A wide range of generic item banks was developed, targeting various constructs, such as pain, physical function, anxiety, depression, fatigue, sleep disturbances, and participation in social roles and activities. Item banks were developed using item response theory (IRT) methods and can be used as standard short forms (e.g. 4-, 6-, and 8-item versions), custom short forms (selection of relevant items for a specific context) and computerized adaptive tests (CAT). To make PROMIS widely available and maintain its scientific quality, a number of resources have been established: the PROMIS Health Organization (PHO) was established to maintain and encourage the application of PROMIS. PHO is a growing open membership society with education (e.g. workshops and annual conferences) and on-demand resources. The “HealthMeasures” team (Northwestern University, Chicago) and website are the official information (helpdesk) and distribution center for PROMIS and also coordinate all translations.
The Assessment Center Application Programming Interface (API) was developed to connect any data collection software application (e.g. REDCap) with the full library of PROMIS measures, CAT software, and standardized item parameters. PROMIS CATs have been built into electronic health record systems, such as Epic, and are available through the PROMIS iPad App. Scoring manuals and interpretation guidelines were developed for research and clinical practice. Linking studies are being performed to convert PROMIS scores to scores of related commonly used questionnaires. PROMIS National Centers have been established in 19 countries. Their role is to coordinate all translation efforts, communicate the value of PROMIS to the scientific and research community, and encourage, facilitate, and support the application of PROMIS in the local country. PROMIS measures have been translated into more than 60 languages. Cross-cultural validation studies are being performed to evaluate content validity, confirm the underlying calibration model, and assess differential item functioning between language versions. These studies test the PROMIS convention of using a single set of IRT item parameters across populations and language versions to express scores on a common scale (T-score metric). The ultimate aim is to develop PROMIS into a gold-standard outcome metric for measuring patient-reported health outcomes in an efficient, precise, and comparable way across the world.
- Matthias Rose (Charité-Universitätsmedizin Berlin, Germany)
- Felix Fischer (Charité-Universitätsmedizin Berlin, Germany)
- Leo D. Roorda (Amsterdam Rehabilitation Research Center|Reade, Amsterdam, the Netherlands)
- Benjamin D. Schalet (Feinberg School of Medicine, Northwestern University Chicago, USA)
- Discussant: Ulf Kröhne (DIPF | Leibniz Institute for Research and Information in Education, Germany)
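The Symposium 3 abstract refers to expressing scores on a common T-score metric. A T-score scale conventionally has a mean of 50 and a standard deviation of 10 in the reference population, so an IRT ability estimate on the standard-normal scale can be rescaled linearly. The sketch below only illustrates that rescaling; the function names are hypothetical and it is not the official PROMIS scoring software.

```python
def theta_to_t_score(theta: float) -> float:
    """Rescale an IRT estimate (mean 0, SD 1) to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


def se_to_t_metric(se_theta: float) -> float:
    """Rescale the standard error of the estimate by the same factor of 10."""
    return 10.0 * se_theta


if __name__ == "__main__":
    # Hypothetical estimate: one SD above the reference mean, standard error 0.3.
    print(theta_to_t_score(1.0), se_to_t_metric(0.3))
```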
Symposium 4 – Applications of CAT across multiple fields using the Concerto platform
Organizer: David Stillwell & Luning Sun – The Psychometrics Centre at the University of Cambridge, UK
Abstract: The University of Cambridge Psychometrics Centre strives towards making online adaptive testing available to everyone. That is why we’ve created Concerto: a powerful and user-friendly platform that empowers experts and beginners alike to make better tests, with little to no coding experience required. There are minimal set-up costs, no licence fees and no limitations. Concerto harmonises the statistical power of the R programming language, the security of MySQL databases and the flexibility of HTML to deliver advanced online tests. These instruments work in unison, giving users unparalleled freedom and control over the design of their assessments. In-built algorithms for score calculation and report generation ensure a rewarding experience for participants, whatever the context. In this symposium, scholars from around the world will share their experience of developing online adaptive tests using the Concerto platform. These projects bring forward a number of successful applications of CAT across multiple fields in educational, psychological and clinical assessments.
- Conrad Harrison (University of Oxford, UK), Bao Sheng Loe, Przemysław Lis, & Chris J. Sidey-Gibbons
- Eren Can Aybek (Pamukkale University, Turkey)
- Ecosse Lamoureaux (National University of Singapore, Singapore)
- Bao Sheng Loe (University of Cambridge, UK), Przemysław Lis, & Vesselin Popov
Symposium 5 – Computerized adaptive practicing
Organizer: Han L. J. van der Maas – University of Amsterdam, the Netherlands
Abstract: Computerized adaptive practicing (CAP) is a variant of computerized adaptive testing (CAT) combining the goals of formative and summative measurement, i.e., practicing and testing. Both are essential in education. It is well known that learning skills such as arithmetic requires intensive practice adapted to the level of ability of the individual (cf. zone of proximal development, deliberate practice). It is also evident that adaptive practicing requires precise assessments of ability, the goal of adaptive measurement. Over the last 15 years we have developed an algorithm for CAP and applied this technology in a popular online educational system used by 2000 Dutch primary schools, in which we collect about two million item responses per day in about 50 games concerning arithmetic, intelligence, and language (Dutch and English). The algorithm is based on the Elo rating system developed for chess competitions, but incorporates response time in scoring responses to items. Both item and person parameters are estimated on the fly, such that pre-testing the 60,000 items in the item bank is no longer required. In this symposium we a) explain the educational and psychological concepts underlying this approach and introduce the Elo estimation algorithm, b) describe how and why this algorithm has been optimized in 12 years of Math Garden practice, c) explain what role A/B testing plays in this optimization and how the data can be utilized to provide learning analytics beyond the basic IRT estimates of ability, d) discuss limitations of the Elo algorithm and provide insights into trackers of ability in a developmental (learning) context, and e) propose a new algorithm for computerized adaptive practicing that allows for unbiased statistical testing of educational and developmental hypotheses.
- Han L. J. van der Maas (University of Amsterdam, the Netherlands)
- Maria Bolsinova (Tilburg University, the Netherlands)
- Abe Hofman (University of Amsterdam, the Netherlands)
- Matthieu Brinkhuis (Utrecht University, the Netherlands)
- Alexander Savi (University of Amsterdam, the Netherlands)
- Discussant: Gunter Maris (ACT-Next, USA, the Netherlands)
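The Symposium 5 abstract describes an Elo-based algorithm that incorporates response time and estimates item and person parameters on the fly. The sketch below shows what such an update could look like; it is an illustration under stated assumptions, not the symposium's actual algorithm. The assumptions are: the score is the fraction of time remaining, signed by correctness; the expected score is coth(Δ) − 1/Δ for Δ = ability − difficulty; and the step size `k` is fixed at an arbitrary value. All function and parameter names are hypothetical.

```python
import math


def time_weighted_score(correct: bool, response_time: float, time_limit: float) -> float:
    """Score in [-1, 1]: fast correct answers earn the most, fast errors cost the most.

    Assumed scoring rule: fraction of time remaining, signed by correctness.
    """
    remaining = max(0.0, 1.0 - response_time / time_limit)
    return remaining if correct else -remaining


def expected_score(ability: float, difficulty: float) -> float:
    """Assumed expected score: coth(delta) - 1/delta, which lies in (-1, 1)."""
    delta = ability - difficulty
    if abs(delta) < 1e-4:
        return delta / 3.0  # first-order series term avoids dividing by ~0
    return 1.0 / math.tanh(delta) - 1.0 / delta


def elo_update(ability: float, difficulty: float, correct: bool,
               response_time: float, time_limit: float, k: float = 0.1):
    """Move person and item ratings in opposite directions after one response."""
    surprise = (time_weighted_score(correct, response_time, time_limit)
                - expected_score(ability, difficulty))
    return ability + k * surprise, difficulty - k * surprise


if __name__ == "__main__":
    # Hypothetical response: correct after 5 s on an item with a 20 s limit.
    print(elo_update(0.0, 0.0, correct=True, response_time=5.0, time_limit=20.0))
```

Because each response updates both ratings, a new item can start at a default difficulty and be calibrated during practice, which is the sense in which pre-testing becomes unnecessary.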