Assessing writing.
Contents.
1 Introduction.
2 Effectively addresses the topic and task.
3 Is well organized and well developed, using clearly appropriate explanations, exemplifications and/or details.
4 Displays unity, progression and coherence.
5 Displays consistent facility in the use of language, demonstrating syntactic variety, appropriate word choice and idiomaticity, though it may have minor lexical or grammatical errors.
6 Conclusion.
Writers at the Superior level demonstrate a high degree of control of grammar and syntax, of both general and specialized/professional vocabulary, of spelling or symbol production, of cohesive devices, and of punctuation. Their vocabulary is precise and varied. Writers at this level direct their writing to their audiences; their writing fluency eases the reader's task. Writers at the Superior level do not typically control target-language cultural, organizational, or stylistic patterns. At the Superior level, writers demonstrate no pattern of error; however, occasional errors may occur, particularly in low-frequency structures. When present, these errors do not interfere with comprehension, and they rarely distract the native reader (p. 11).

In this description, there is mention of grammar, syntax, lexis, spelling, cohesion, fluency, style, organization, and other factors. Because of the complex nature of the construct of writing proficiency, it is important to more fully understand the pieces that contribute to the whole in order to assess it more accurately. Specifically, in relation to lexis, the TOEFL iBT rubric mentions appropriate word choice and idiomaticity, and the ACTFL rubric mentions control of general and specialized vocabulary as well as lexical precision and variation as part of writing proficiency.
Additionally, lexical knowledge has consistently been shown to account for a large amount of variance in writing proficiency scores. For instance, Stæhr (2008) showed that as much as 72% of the variance in writing proficiency could be accounted for by vocabulary knowledge. Others found weaker, but still highly meaningful, results: Milton, Wade, and Hopkins (2010) and Miralpeix and Muñoz (2018) found that vocabulary size accounted for 57.8% and 32.1% of the variance in writing proficiency scores, respectively. Therefore, in the present study, we explore the relationship between the construct of writing proficiency and an aspect of lexical knowledge, lexical diversity (LD), using a battery of LD measures in an English for Academic Purposes setting. While several other studies relating writing proficiency and LD exist (see, e.g., Yu, 2010; Gebril & Plakans, 2016; González, 2017; Crossley & McNamara, 2012; Wang, 2014), the purpose of this study is to build on this literature by: 1) using multiple LD measures separately and in combination, 2) exploring the factor of timing condition, and 3) building understanding of the relationship between different LD measures in writing assessment.
The study of LD and how it relates to language ability has existed for over a century. Thomson and Thompson (1915) first proposed an empirical method for using a person's vocabulary usage patterns to estimate his or her language knowledge. Twenty years later, Carroll (1938) introduced the term "diversity of vocabulary" and defined it as "the relative amount of repetitiveness or the relative variety in vocabulary" (p. 379). Although there remains debate on the definition of LD, for the purposes of this study we adhere to the definition of Malvern, Richards, Chipere, and Durán (2004), who state that LD is the range or variety of vocabulary within a text. More recently, Vidal and Jarvis (2020) define lexical diversity as "the variety of words used in speech or writing", which is also consistent with how LD will be considered in this study. That is, it is synonymous with lexical variation, and it is an aspect of lexical richness and complexity.
When LD first began to be studied, the most frequent measure used was the Type-to-Token Ratio (TTR) (Johnson, Fairbanks, Mann, & Chotlos, 1944; Osgood & Walker, 1959); however, as research in the field has progressed, the validity of TTR as a measure of LD has been repeatedly challenged, and new measures of LD have been proposed and validated. One of the first alternative measures proposed was vocabulary diversity (vocd-D), which relies on the mathematical modeling of the introduction of new words into longer and longer texts (Malvern & Richards, 2002). Then, the measure of textual LD (MTLD), which analyzes the LD of a text without being impacted by its length, was introduced (McCarthy, 2005). The Moving Average Type-to-Token Ratio (MATTR), which calculates TTR through a series of moving windows and takes the average, making the measure unaffected by text length, was also created (Covington & McFall, 2010). The reliability and validity of these measures have been tested several times over, and they have begun to replace TTR as the standard for LD measurement (Authors, xxxx; Fergadiotis, Wright, & Green, 2015; McCarthy & Jarvis, 2010; Jarvis, 2013a).
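To make the contrast between these measures concrete, the short sketch below (ours, not code from any of the studies cited here) computes plain TTR and a moving-average TTR for a toy text; the window size and whitespace tokenization are illustrative choices, and a real analysis would use a validated implementation.

```python
def ttr(tokens):
    """Type-to-Token Ratio: unique words divided by total words."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=10):
    """Moving-Average TTR: average the TTR of every overlapping window,
    which keeps the value from shrinking as the text gets longer."""
    if len(tokens) <= window:          # too short for a moving window
        return ttr(tokens)
    spans = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(s) for s in spans) / len(spans)

tokens = ("the cat sat on the mat and the dog sat on the rug "
          "while the bird watched the cat and the dog sleep").lower().split()
print(f"TTR   = {ttr(tokens):.3f}")    # falls as the text grows and words repeat
print(f"MATTR = {mattr(tokens):.3f}")  # stays comparable across text lengths
```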
Over time, LD measurement has steadily progressed, and currently many LD measures are used for measuring the construct (e.g., Maas, MTLD-MA, MTLD-BID, and many others) (Kyle, Crossley, & Jarvis, 2021; McCarthy & Jarvis, 2010). Many studies have compared the values produced by these different measures to assess how well and how similarly they assess the phenomenon of lexical diversity (e.g., Fergadiotis et al., 2015; McCarthy & Jarvis, 2010). In a validation study of MTLD, McCarthy and Jarvis (2010) compared it to several other metrics: vocd-D, TTR, Maas, Yule's K, and an HD-D index. The study examined several aspects of the different measures using assessments of convergent validity, divergent validity, internal validity, and incremental validity. MTLD performed well across all four types of validity and appeared to capture unique lexical information (i.e., volume, abundance, evenness, dispersion, disparity, and specialness; see Jarvis, 2013b; Kyle et al., 2021), as did Maas and vocd-D (or HD-D). This discovery led McCarthy and Jarvis (2010) to recommend the assessment of LD through all three measures, rather than any one index. In a comparison of vocd-D, Maas, MTLD, and MATTR, Fergadiotis et al. (2015) found that MTLD and MATTR were stronger indicators of LD than Maas and vocd-D, which appeared to be affected by construct-irrelevant confounding sources.
As these measures are all intended to assess the same phenomenon (i.e., LD), one could reasonably expect similar validities and indicative abilities; however, based on these studies, it appears that the different LD measures do not necessarily correlate well with one another and that using multiple measures might be a better approach to measuring LD than using one alone. For example, McCarthy and Jarvis (2010) found that three measures capture unique lexical information. However, this practice has yet to gain universal or even widespread acceptance among researchers. Thus, the relationship between these measures and the construct of LD that they are intended to measure appears to remain unresolved. Further, the exact nature of how the measures correspond to each other also remains unclear.

As the study of LD has progressed, it has frequently been correlated with language proficiency. For example, Jarvis (2002) used best-fitting curves to analyze written texts produced by Finnish and Swedish L2 English learners and native English speakers. He found two curve-fitting formulas that produced accurate models for the type-token curves of 90% of the texts. Further analysis revealed a clear relationship between LD and instruction, which typically correlates with proficiency level. In a study of L2 writing samples, Crossley and Salsbury (2012) divided a selection of texts into beginning, intermediate, and advanced categorizations based on TOEFL and ACT ESL scores. They then performed a discriminant function analysis, using the computational tool Coh-Metrix, to predict the proficiency level based on breadth of lexical knowledge, depth of lexical knowledge, and access to core lexical items. LD, as measured by M, which reports LD on a reverse scale (for more information, see Tweedie & Baayen, 1998), was a predictor of writing proficiency with a medium effect size (partial eta-squared = 0.250). In a later study, Crossley and McNamara (2012) conducted a similar analysis of essay responses written by graduating high school students in Hong Kong. Once again, LD (as measured by vocd-D) was shown to be a significant predictor of writing proficiency (R² = 0.180, p < 0.01). In another study, Crossley, Salsbury, McNamara, and Jarvis (2011) collected 240 writing samples (60 beginning, 60 intermediate, 60 advanced, and 60 from L1 English speakers). Raters then used a standardized lexical proficiency rubric to rate the essays, and the Coh-Metrix computational tool was used to analyze the essays for LD and other measures. LD, along with word hypernymy values and content word frequency, accounted for 44% of the variance in rater-assigned scores. Kyle et al. (2021) further investigated this relationship with an analysis of L1 and L2 argumentative essays that human raters rated for LD and that the researchers analyzed computationally in terms of abundance, variety, and volume. The goal of that study was to see how well and to what extent LD indices reflected perceived LD. After comparing the human ratings and lexical indices, Kyle et al. found that abundance and variety accounted for the majority (74%) of the variance in the human ratings.
While many studies focus on the relationship between LD and proficiency in writing tasks, similar observations can be made for productive speaking tasks as well. Yu (2010) conducted an analysis of speaking and writing task responses in order to determine what differences might exist in LD between the two task types and what relationship exists between LD and response rating. Using vocd-D, Yu found a significant and positive correlation between LD and both overall quality ratings and general language proficiency. Furthermore, the LD of speaking and writing tasks was at similar levels, suggesting that task type does not have a significant effect on LD. Corroborating Yu's results, Loomis (2015) found that LD differed significantly across certain proficiency levels. The study focused on L2 Arabic speakers at different proficiency levels according to the ACTFL guidelines, and an LD analysis revealed significant differences between Intermediate Mid and Advanced Mid speakers. In addition to text length, a couple of studies have also considered the role that timing plays in proficiency tests with respect to LD. Time restrictions constrain test takers, which increases the demand of a task and can affect language production (see Margolis et al., 2020 for a summary of the effect of timing in language production tasks). Matsumoto (2010) looked at the language production of Japanese L2 learners through speaking and writing. As speaking is spontaneous and writing is non-spontaneous, speaking was considered a time-constrained activity, while writing was not. A difference in TTR was found between the two modes, suggesting that time constraints may impact LD. In a study focused on time constraints in writing, Lee (2019) looked at writing responses from 123 L2 English learners under two time constraints (30 and 60 minutes), using vocd-D to assess LD. There appeared to be no significant differences in LD, as measured by D, between the two time constraints. Thus, because of both the paucity of investigations of the issue and the apparently conflicting results of the studies that have examined the influence of timing on LD, it remains unclear whether time constraints impact LD, suggesting a need for further analysis.
3. The present study

In our review of the literature, some potential gaps arose in the measurement of LD in language production, particularly the written production of language learners. First, while many studies use different and sometimes multiple measures of LD, studies have yet to fully explore the relationship between the different LD measures much beyond simple correlations. Even when multiple measures are used, they are used as separate indicators rather than combined into a single LD score. Second, because multiple LD measures are oftentimes used, it is probably useful to determine which measures to use in combination, especially if LD measures are measuring more than one underlying subconstruct of LD. Third, the effect of timing conditions on LD is not fully understood in writing, and as it could be a confounding variable, it is important to account for its possible effects. The present study attempts to address these three issues. This study presents a methodologically innovative approach to the study of LD, as we tested whether multiple measures were more predictive than any single measure. Additionally, this research helps clarify the importance of LD in writing proficiency tests (something that would be of great interest to instructors, students, test makers, etc.) and whether this importance varies depending on timing conditions. The following research questions (RQs) focus on writing production in timed writing proficiency assessments for ESL learners:
1) What is the effect of LD on perceived writing proficiency?
2) Does the predictive power of LD measures improve when using multiple measures?
Table 1
Equivalence table for writing test scores and ACTFL proficiency levels.

Writing Test Score   ACTFL Level
0                    Novice Low or Mid
1                    Novice High
2                    Intermediate Low
3                    Intermediate Mid
4                    Intermediate High
5                    Advanced Low
6                    Advanced Mid
7                    Advanced High
Methodology

4.1. The English Language Center (ELC) Corpus

To answer these RQs, we gathered a corpus of written responses to exam prompts given to ESL students enrolled in the English Language Center (ELC), an intensive English program at Brigham Young University (BYU) in the United States, in the winter, summer, and fall semesters from 2018 to 2021. These responses were expository, descriptive, narrative, and persuasive, depending on the prompt (see Appendix A). The initial corpus included 5150 responses; after cleaning, it included 4207 responses. The data cleaning involved adding spaces around punctuation in appropriate places, correcting errors in reported L1 and sex, excluding participant outliers aged 40 years and older (n = 50), as we wanted the sample to be representative of typical university students, and eliminating any responses with fewer than 50 words, as previous research has shown that LD measures are less valid or invalid for shorter responses (Zenker & Kyle, 2021). Responses of fewer than 50 words accounted for the vast majority of the exclusions from the dataset. We also divided the corpus into two sub-corpora, one with 10-minute responses and one with 30-minute responses.

The 4207 writing responses came from 861 students (421 males and 440 females) from 26 different L1 backgrounds, with a mean age of 24.3 years (SD = 4.97; median = 23; range 17–39). The majority of students had a proficiency level between Intermediate Mid and Advanced, as described by the American Council on the Teaching of Foreign Languages (ACTFL) proficiency guidelines (ACTFL, 2012; Moore, 2018). Most students contributed multiple responses as they spent several semesters in the program (mean number of responses per student = 4.9, SD = 2.37, median = 4, range = 1–14). The total word count for the corpus is 958,603, and the mean word count was 147 (SD = 51.5) for the 10-minute prompts and 308 (SD = 126) for the 30-minute prompts. In order to maintain a direct correspondence between the writing proficiency ratings and LD scores, spelling was not corrected in this dataset, even though it is sometimes corrected in other LD studies. Proficiency scores varied widely across the sample, spanning nearly the whole possible range of scores (mean = 3.74, SD = 1.08, median = 3.82, range = 0–6.93).
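As an illustration of the cleaning rules described above, the hypothetical sketch below applies the word-count and age exclusions and separates punctuation from words; the field layout, function names, and exact punctuation handling are our assumptions, not the authors' cleaning script.

```python
import re

MIN_WORDS = 50   # responses under 50 words were excluded (cf. Zenker & Kyle, 2021)
MAX_AGE = 39     # participants aged 40 and older were excluded

def clean_text(text: str) -> str:
    """Insert spaces around punctuation so tokens separate cleanly."""
    return re.sub(r"\s*([.,!?;:])\s*", r" \1 ", text).strip()

def keep_response(text: str, age: int) -> bool:
    """Apply the word-count and age exclusion rules to one response."""
    return len(clean_text(text).split()) >= MIN_WORDS and age <= MAX_AGE

corpus = [("This response is far too short to keep.", 25),
          ("word " * 60, 23)]            # a 60-word stand-in response
kept = [(text, age) for text, age in corpus if keep_response(text, age)]
print(f"{len(kept)} of {len(corpus)} responses retained")
```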
Writing proficiency exam and scoring

The texts for the corpus come from responses to a regularly administered exam, which is used both for placement and for achievement testing within the ELC program. The exam includes sections on reading, writing, speaking, listening, and grammar, and it is completed by newly admitted students at the beginning of the semester and by all students at the end of the semester. Student scores at the beginning of the semester determine a student's placement into one of the seven course levels offered in the program. Exam scores at the end of the semester allow a student's course level to be properly adjusted according to their proficiency level. The writing portion of the exam contains two essay prompts designed by a PhD language testing specialist: one 10-minute and one 30-minute. Although there are two prompts, the raters give the writing section one composite score for each student rather than individual scores for each essay. The essay prompts vary from semester to semester and from test to test; several example prompts are provided in Appendix A.

Rater calibration occurred every semester. Prior to official rating, raters received a training packet, which helped them familiarize themselves with the rubrics and provided them with essays that received clear scores and essays that were borderline. Twelve days before the test administration began, all raters received a practice sheet and writing samples that they scored within a week. Once all scores were collected, a calibration meeting was held in which raters' anonymized scores were displayed in order to discuss errors and biases. At the meeting, calibration consisted of iterative rounds of all raters coding the same essays, noting and resolving discrepancies in scores, and retraining when necessary. In official scoring, all essays were double-rated. If there was a discrepancy of one point or less between raters, the average of the two scores was taken. If there was a discrepancy larger than one point, a third rater was brought in to arbitrate, and their arbitration decided which score was assigned to the response (Cox, personal communication, 2021). After raters scored all essays, the scores were run through a Many-Faceted Rasch Model (MFRM) using Winsteps to derive a fair average (i.e., expected) score and a fit score (Linacre, 1989). Under the Rasch model, a fair average score is an estimate of the true score for a test that accounts for how raters scored across all tests, thus mitigating individual rater biases (Linacre, 2008). (Additional details about the validity of the tests for examining writing proficiency can be found in Moore, 2018, and Sims et al., 2020.) The fair average score for the writing portion of the exams for a given semester is what we refer to as a student's writing proficiency score. Table 1 shows the correspondence between fair average scores and ACTFL levels (Cox, 2014).
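The double-rating rule above can be summarized in a few lines. The sketch below is our paraphrase of that rule, not the ELC's scoring code, and the function name is hypothetical.

```python
from typing import Optional

def resolve_writing_score(rater1: float, rater2: float,
                          arbitration: Optional[float] = None) -> float:
    """Average two ratings that differ by at most one point; otherwise a
    third rater's arbitration decides which score stands."""
    if abs(rater1 - rater2) <= 1:
        return (rater1 + rater2) / 2
    if arbitration is None:
        raise ValueError("Discrepancy exceeds one point; a third rating is required")
    return arbitration

print(resolve_writing_score(4, 5))                 # -> 4.5
print(resolve_writing_score(3, 5, arbitration=5))  # -> 5
```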
Statistical analyses

First, we used the lexical-diversity Python package to calculate LD measures (TTR, HDD, MTLD, MATTR, MTLD-MA, MTLD-BID, Root TTR, Log TTR, Maas, and MSTTR) for all responses. We operationalized words as flemmas, which are the base or root forms of words with no regard for part of speech (Jarvis & Hashimoto, 2021; McLean, 2018). Then we checked the assumptions for a Principal Components Analysis (PCA) and for the regressions (normality, absence of collinearity, factorability, and homoscedasticity) using Variance Inflation Factors, a Bartlett correlation test, a Kaiser-Meyer-Olkin (KMO) sampling adequacy test, a Shapiro-Wilk test, a correlation matrix, and visualizations of our data in scatterplots. We also determined that we had a sufficient number of observations for the regression using the criterion of 50 + 8k (where k is the number of predictors) (Tabachnick & Fidell, 2001). After checking these assumptions, we removed Log TTR, Root TTR, number of words, TTR, and MTLD-W in a stepwise fashion because they violated the assumptions of collinearity (VIF maximum of 10), sphericity, and/or sampling adequacy (KMO threshold of 0.8). We also checked for flipped sign effects. An alpha level of 0.01 was used as the threshold for significance for tests in this study. Once the final set of LD measures was determined, two-sided U-tests were conducted to compare each measure between the two timing conditions.

Two PCAs were conducted on the corpora. The first PCA included all of the LD measures that we considered for this study; the second considered only those measures that did not violate the assumptions of the regression models that we ran. The first PCA was used to assess the nature of the relationships between variables, and the second was used to compare methods of combining LD measures. Statistics in the factor analysis family, such as PCA, are commonly used to obtain reduced models and to understand the relationships between variables in multivariate models (see, e.g., Biber, 1988; Berber Sardinha & Pinto, 2019). The number of components for the PCA was determined by considering only components whose eigenvalues were greater than 1 (cf. Levshina, 2015, p. 355) as well as by visual examination of a scree plot. After conducting the PCAs, Pearson correlations were run between all of the lexical diversity measures used in this study. We also ran linear mixed-effects regressions on responses in the two timing-condition subcorpora.
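To make the measurement step concrete, the sketch below computes a handful of LD measures for a single toy response with the lexical-diversity package (assuming its documented functions flemmatize, mattr, hdd, mtld, mtld_ma_bid, and maas_ttr) and then illustrates the eigenvalue-greater-than-1 retention rule on simulated scores. The toy text and simulated data are ours and stand in for the corpus; this is an illustration of the pipeline, not the study's code.

```python
import numpy as np
from lexical_diversity import lex_div as ld   # pip install lexical-diversity

def ld_profile(text: str) -> dict:
    """Return several LD measures for one response, computed over flemmas."""
    flt = ld.flemmatize(text)   # base forms, ignoring part of speech
    return {
        "MATTR": ld.mattr(flt),
        "HDD": ld.hdd(flt),
        "MTLD": ld.mtld(flt),
        "MTLD-BID": ld.mtld_ma_bid(flt),
        "Maas": ld.maas_ttr(flt),
    }

response = (
    "Writing assessments often ask students to explain an opinion and support it "
    "with examples, and raters then judge how clearly and how precisely the ideas "
    "are expressed. A response that recycles the same few words tends to read as "
    "less developed than one that draws on a wider and more varied vocabulary, "
    "even when both answers address the prompt and follow a sensible structure."
)
print({name: round(value, 3) for name, value in ld_profile(response).items()})

# Component retention for the PCA: keep components whose eigenvalues exceed 1,
# shown here on simulated measure scores standing in for the full corpus.
rng = np.random.default_rng(0)
sim = rng.normal(size=(200, 5))                            # 200 fake responses x 5 measures
sim[:, 1] = 0.8 * sim[:, 0] + 0.6 * rng.normal(size=200)   # induce some correlation
eigenvalues = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))
print("components with eigenvalue > 1:", int((eigenvalues > 1).sum()))
```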
The first two regressions considered all variables that we found not to be multicollinear as predictors of writing proficiency under the two timing conditions. Both models had random effects (specifically, random intercepts) of participant, L1, and the semester the test was administered, which account for variation due to both time and prompts. Unique and overlapping variance was also calculated for each fixed effect by running a model with only one LD measure at a time. The models were then compared using R² and the Akaike Information Criterion (AIC). Effect sizes were interpreted using Plonsky and Oswald's empirically grounded heuristics for L2 research (i.e., r = .25 is small, r = .40 is medium, and r = .60 is large). Statistics were run using the R programming language and the following packages: olsrr (Hebbali, 2020), psych (Revelle, 2021), FactoMineR (Le, Josse, & Husson, 2008), ggplot2 (Wickham et al., 2019), lme4 (Bates, Maechler, Bolker, & Walker, 2015), and lmerTest (Kuznetsova, Brockhoff, & Christensen, 2017).
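The analysis itself was run in R with lme4 and lmerTest. Purely as an illustration of the model form, the simplified sketch below fits a comparable mixed-effects regression in Python with statsmodels on simulated data, keeping only a random intercept for participant (the study's models also included intercepts for L1 and semester); it is a sketch of the approach under those simplifying assumptions, not a reproduction of the analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_students, n_resp = 80, 4
n = n_students * n_resp

# Simulated responses: LD measures plus a writing score with a per-student offset.
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_students), n_resp),
    "mattr": rng.normal(0.75, 0.05, n),
    "hdd": rng.normal(0.80, 0.04, n),
    "mtld": rng.normal(60, 15, n),
})
student_effect = np.repeat(rng.normal(0, 0.5, n_students), n_resp)
df["writing_score"] = (3.5 + 4.0 * (df["mattr"] - 0.75)
                       + 0.01 * (df["mtld"] - 60)
                       + student_effect + rng.normal(0, 0.4, n))

# Random intercept for participant; the LD measures enter as fixed effects.
model = smf.mixedlm("writing_score ~ mattr + hdd + mtld", df, groups=df["participant"])
result = model.fit(reml=False)   # maximum likelihood, so nested models could be compared
print(result.summary())
# In the study, competing models (one LD measure at a time vs. combined) were
# compared via R-squared and AIC in R; here we simply inspect one fit.
```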