Descriptive Summary Statistics
Apart from data that were missing by design because of the complex booklet structure used in the assessment overall, some student responses were not legible, some students produced too little text to be assessed, and some students did not respond to their assigned tasks. All of these cases led to missing ratings, which accounted for 24.8% of the ratings in the HSA sample but only 5.0% in the MSA sample. These magnitudes reflect the different mean achievement levels of the two samples as well as a poorer match between the assigned tasks and the students in the HSA sample, who found some tasks at CEFR Levels B1 and B2 too challenging and opted to skip them. Because these missing data can be interpreted as unsuccessful attempts at the tasks, they were coded as “below pass.”
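The recoding rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the variable names (`ratings`, `MISSING`) and the toy data are assumptions, not taken from the study's data.

```python
# Illustrative sketch: treat every missing rating as an unsuccessful
# attempt, i.e., recode it as "below pass" (0). Names and data are
# hypothetical, not from the study.
MISSING = None  # stands in for illegible, too-short, or skipped responses


def recode_missing(ratings):
    """Recode missing ratings as 0 (= below pass); keep 0/1 ratings as-is."""
    return [0 if r is MISSING else r for r in ratings]


ratings = [1, MISSING, 0, 1, MISSING]  # 1 = pass, 0 = below pass
recoded = recode_missing(ratings)      # [1, 0, 0, 1, 0]
missing_rate = ratings.count(MISSING) / len(ratings)  # 0.4
```

In a real analysis one would also keep the original missingness indicator, since the missing rate itself (24.8% vs. 5.0% here) is substantively informative.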
We first address the design factor raters. Table 4 shows the percentage of students who received a pass on the global rating for each of the 19 tasks from each of the 13 raters; we show the global rating here to conserve space. Table 4 also presents the a priori CEFR levels expected by the task developers before the data were collected. As Table 4 shows, the variation among the raters in marginal pass percentages across all tasks within a particular student sample is relatively small, with a few exceptions such as Raters 1 and 2 in the MSA design, who gave, on average, much higher ratings than the other raters. The average ratings for each sample across all tasks and raters (i.e., .37 for the HSA sample and .63 for the MSA sample) reflect the expected proficiency difference between students in the two samples.3
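The rater comparison above amounts to computing each rater's marginal pass rate and flagging raters who sit well above the overall mean. The following sketch shows one way to do this; the data, the rater labels, and the 0.15 leniency cutoff are assumptions for illustration, not values from the study.

```python
# Illustrative sketch: marginal pass rates per rater and a simple
# leniency flag. All names, data, and the cutoff are hypothetical.


def rater_means(by_rater):
    """by_rater: dict mapping rater id -> list of 0/1 global ratings."""
    return {r: sum(v) / len(v) for r, v in by_rater.items()}


def lenient_raters(means, margin=0.15):
    """Flag raters whose mean pass rate exceeds the overall mean of the
    rater means by more than `margin` (an assumed cutoff)."""
    overall = sum(means.values()) / len(means)
    return [r for r, m in means.items() if m - overall > margin]


by_rater = {
    "R1": [1, 1, 1, 1],  # very lenient
    "R2": [1, 1, 1, 0],
    "R3": [0, 1, 0, 0],
    "R4": [0, 0, 1, 0],
}
means = rater_means(by_rater)   # {"R1": 1.0, "R2": 0.75, "R3": 0.25, "R4": 0.25}
flags = lenient_raters(means)   # ["R1", "R2"]
```

A full analysis would model rater severity jointly with task difficulty (e.g., in a many-facet Rasch model) rather than relying on marginal percentages alone.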
Table 4. Percentage of Student Responses Classified as “Pass” by Sample, Task, and Rater
On the task side, tasks classified at lower CEFR proficiency levels had higher marginal “pass” proportions across raters, and vice versa; the only exception is Task 12 in the MSA sample, whose empirical characteristics are more similar to those of tasks at Level A2. Despite this desirable ordering of the tasks, from a discriminatory perspective Tasks 13, 17, and 19 did not function well in the HSA sample, and Tasks 7, 11, and 18 did not function well in the MSA sample. With the exception of Task 7, which was too easy in the MSA sample (i.e., % “pass” = .99), the other five tasks were too difficult in their respective samples (i.e., % “pass” ≤ .11), thus providing little discriminatory information about students.
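The task-level screen described above can be expressed as a small computation: derive each task's marginal pass proportion and flag tasks whose proportion is extreme. The thresholds .11 and .99 come from the text; the task labels and toy ratings are hypothetical.

```python
# Illustrative sketch: marginal "pass" proportions per task and a flag
# for tasks that are too easy or too hard to discriminate among students.
# Data are hypothetical; thresholds (.11, .99) are taken from the text.


def marginal_pass(by_task):
    """by_task: dict mapping task id -> list of 0/1 ratings pooled over raters."""
    return {t: sum(v) / len(v) for t, v in by_task.items()}


def low_discrimination(props, lo=0.11, hi=0.99):
    """Flag tasks whose pass proportion is at or beyond either threshold."""
    return [t for t, p in props.items() if p <= lo or p >= hi]


by_task = {
    "Task7": [1] * 99 + [0],        # too easy: pass proportion .99
    "Task13": [1] + [0] * 9,        # too hard: pass proportion .10
    "Task1": [1] * 6 + [0] * 4,     # informative: pass proportion .60
}
props = marginal_pass(by_task)
flags = low_discrimination(props)   # ["Task7", "Task13"]
```

Tasks flagged this way contribute little information in the flagged sample, although the same task may function well in a sample at a better-matched proficiency level.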