An Introduction to Applied Linguistics
Norbert Schmitt (ed.), An Introduction to Applied Linguistics (Routledge, 2010)
particular language abilities. Figure 15.6 summarizes Douglas’ view, showing language capacity and test method as responsible for test performance. Testing experts differ on how to interpret and deal with the fact of test method influence on performance; however, most agree that it is essential to identify those aspects of test method that may play a role.

Figure 15.6 Factors involved in the relationship between a test method and performance as outlined by Douglas (1998) [figure not reproduced; its components are the characteristics of the test tasks or testing method, the test-taker’s interpretation of the test tasks or method, the test-taker’s goals and plans for participation, the test-taker’s language ability and the test-taker’s performance]

The most encompassing framework for describing test methods has been developed in two stages, first as ‘test method facets’ (Bachman, 1990) and, more recently, as ‘test task characteristics’ (Bachman and Palmer, 1996). Test task characteristics are defined as:

• The test ‘setting’, such as the physical specifications of the room and the participants.
• The testing ‘rubrics’, including the instructions, test structure, allotted time, response evaluation and calculation of scores.
• The ‘input’ to the test-taker, such as test length and grammatical and topical characteristics of test questions.
• The ‘output’ expected from the learner, such as the length and grammatical and topical features of responses.
• The relationship between input and output, such as whether or not the answers to questions the examinee is asked depend on previous responses.

These test task characteristics provide the analytic tools needed for both construction and analysis of language tests, and therefore have played a role in test validation research.

Validation

The term ‘validity’ carries some meaning for almost everyone, but in educational measurement, including language testing, this term has an extensive technical sense about which volumes have been written. Many applied linguists learned at one time that validity was defined as consisting of three sub-types:

• ‘Content validity’ (whether the content of the test questions is appropriate).
• ‘Criterion-related validity’ (whether other tests measuring similar linguistic abilities correlate with the test in question).
• ‘Construct validity’ (whether research shows that the test measures the ‘construct’ discussed above).

In addition, many people think of the validity of a test as being established by measurement experts through statistical analysis of test scores. Although current perspectives retain traces of these ideas, both the theory and practice of validation are now markedly different from this view (Chapelle, 1999). One big change is typically associated with a seminal paper by Messick (1989), which defined validation as the process of constructing an argument about the interpretations and uses made from test scores. Such an argument may draw upon criterion-related evidence, for example, but the goal of validation would be to establish an argument by integrating a variety of evidence to support test score interpretation and use. As an ‘argument’, rather than a black and white ‘proof’, validation may draw upon a number of different types of data. Such an argument is made on the basis of both qualitative and quantitative research, and it relies on the perspectives obtained from technical work on language testing and the perspectives of applied linguists, language teachers and other test users.
An ESL reading test provides an example of how these perspectives worked together (Chapelle, Jamieson and Hegelheimer, 2003). A publishing company contracted testing researchers to develop an ESL test to be delivered on the world-wide web to ESL learners at a wide variety of proficiency levels. Because the test-takers would have a great deal of variation in their reading ability, the test developers decided to include three modules in the test, one with beginning-level texts for the examinees to read, one with somewhat simplified texts and a third with advanced-level texts. Once this decision had been made, however, the test developers needed to be able to show that the tests on the texts actually represented the intended differences in levels, and therefore three types of evidence were used.

One type of evidence was the judgement of ESL teachers. Teams of ESL teachers were formed and they worked together to form an understanding of what they should be looking for in texts of various levels in ESL books. Then each passage that had been selected for its potential as a reading text on the test was evaluated by two members of the team to give it a rating of ‘beginning’, ‘intermediate’ or ‘advanced’. An interesting finding during this part of the work was that the two ESL teachers did not always agree on the level of the text, nor did they always agree with the original author’s assignment of the text to a particular level. This part of the test development process resulted in a pool of texts about which two raters agreed. In other words, if two raters thought that a text was a beginning-level one, it was retained in the pool of possible texts for the test, but if one rater thought it was a beginning-level one and the other rater thought it was intermediate, it was eliminated. The texts agreed upon then proceeded to the next stage of analysis.

The second type of analysis drew on the expertise of a corpus linguist, who did a quantitative analysis of the language of each of the texts. The texts were scanned to copy them into electronic files, which were then tagged and analysed by use of a computer program that quantified characteristics of the texts that signal difficulty, such as word length, sentence length and syntactic complexity. The corpus linguist set cut scores for each of these features and then selected texts that, on the basis of these characteristics, were clear examples of each level (a brief computational sketch of this kind of feature analysis is given at the end of this example). These texts formed the basis of the reading comprehension modules at the three levels of difficulty. Test writers developed questions to test comprehension as well as other aspects of reading, and then the three module tests were given to a group of examinees.

The third type of analysis was quantitative. The researchers wanted to see if the texts that had been so carefully selected as beginning level actually produced test items that were easier than those that had been selected as intermediate and advanced. The question was whether or not the predicted number of examinees got test questions correct for the beginning, intermediate and advanced level tests. As Table 15.1 shows, the researchers predicted that a high percentage of examinees would obtain correct responses on the beginning-level texts and so on. The table also shows the results that were obtained when a group of 47 learners took the tests. In fact, the percentages of correct responses turned out as anticipated.

Table 15.1 Summary of predictions and results in a quantitative validity argument

                                               Intended test level
Predicted and actual results                   Beginning         Intermediate        Advanced
Predicted                                      High percentage   Medium percentage   Low percentage
Actual mean percentage of correct responses    85                74                  68
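The corpus-based stage of this process (the second type of evidence described above) lends itself to a simple computational illustration. The sketch below is not the program used in the study; it is a minimal Python example, with an invented classify function, invented cut scores and an invented sample text, showing how two of the features mentioned above, mean word length and mean sentence length, might be computed for a candidate passage and mapped onto a tentative level.

```python
import re

def text_features(text):
    """Two rough difficulty indicators: mean word length and mean sentence length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sent_len = len(words) / len(sentences)
    return mean_word_len, mean_sent_len

# Invented cut scores for illustration only; the study's cut scores were set by the
# corpus linguist and also drew on syntactic complexity, which is not measured here.
CUTS = [
    (4.2, 12.0, "beginning"),      # at or below both values: beginning
    (4.8, 18.0, "intermediate"),   # at or below both values: intermediate
]

def classify(text):
    """Assign a tentative level; anything above the intermediate cut scores counts as advanced."""
    word_len, sent_len = text_features(text)
    for max_word_len, max_sent_len, level in CUTS:
        if word_len <= max_word_len and sent_len <= max_sent_len:
            return level
    return "advanced"

sample = "The cat sat on the mat. It was warm. The sun was out."
print(classify(sample))  # prints: beginning
```

A fuller implementation would, as in the study, use a tagger to capture syntactic complexity as well; the point here is only the cut-score logic that turns text features into level assignments.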
These three types of evidence about the reading test modules are obviously not all that we would want to know about their validity as tests of reading, but these data form one part of the validity argument. A second important development in validation practices has been evolving over the past several years to help testing researchers to specify the types of evidence that are needed in view of the types of inferences that underlie the score interpretation and use (Kane, 2006). These advances have been influential and useful in language testing (Bachman, 2005; Chapelle, Enright and Jamieson, 2008). The many types of qualitative and quantitative analysis that are used in validity research would be too much to describe in this introduction, but the idea of how testing researchers evaluate test data can be illustrated through the description of two basic test analysis procedures.

Test Analysis

Two types of analysis form the basis for much of the quantitative test analysis: ‘difficulty analysis’ and ‘correlational analysis’. Difficulty analysis refers to the type of analysis that was described above, in which the concern is to determine how difficult the items on the test are. Correlational analysis is a means of obtaining a statistical estimate of the strength of the relationship between two sets of test scores. Computationally, each of these analyses is straightforward. The challenge in language testing research is to design a study in which the results of the analysis can be used to provide information about the questions that are relevant to the validity of test use.

Item Difficulty

In the example above, the researchers were concerned that their intended levels of text difficulty would actually hold true when examinees took the three modules of the reading test. In the description of the results, we summarized the item difficulties of each of the tests. However, in the actual study the researchers also examined the item difficulties of each item on each of the tests. The item difficulty is defined as the percentage of examinees who answered the item correctly. To obtain this percentage, the researchers divided the number who scored correctly by the total number who took the test and multiplied by 100. On the reading test described above, if 40 correct responses were obtained on an item, that would be (40/47 = 0.85, and then 0.85 × 100 = 85). People who write tests professionally use this and other item statistics to decide which items are good and which ones should be revised or deleted from a test during test development. As illustrated above, the concept of difficulty can be used in several different ways, but it is best used in view of the construct that the test is intended to measure, and the use of the test. If all of the items on a test have high values for item difficulty, for example, the person analysing the test knows that the test is very easy. But whether or not this means that the items should be changed depends on the test construct, the examinees tested and the test use.
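To make the arithmetic explicit, here is a minimal Python sketch of the item difficulty calculation just described. The item_difficulty function and the scored responses are invented for illustration; the responses simply reproduce the worked figure above, with 40 of 47 examinees answering the item correctly.

```python
def item_difficulty(responses):
    """Item difficulty: the percentage of examinees who answered the item correctly."""
    return 100 * sum(responses) / len(responses)

# Invented responses to one item: 1 = correct, 0 = incorrect, for 47 examinees.
item_responses = [1] * 40 + [0] * 7

print(round(item_difficulty(item_responses)))  # 40/47 = 0.85, and 0.85 x 100 = 85
```

Applied item by item and then averaged within each module, the same calculation yields mean percentages of correct responses of the kind reported in Table 15.1.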
In this regard, testing researchers distinguish between ‘norm-referenced’ tests, which are intended to make distinctions among examinees, and ‘criterion-referenced’ tests, which are intended to be used to make decisions about an individual’s knowledge of the material reflected on the test. A test that is easy for a group of examinees would not be successful in distinguishing between examinees, but it may have shown correctly that individuals in that group knew the material tested. Moreover, when difficulty is interpreted in view of the construct that an item of a test is intended to measure, it can be used as one part of a validity argument.

Correlation

A second statistical analysis used in validation research is ‘correlation’. When testing researchers or teachers look at how similar two tests are, they are considering the correlation between tests. For example, if a group of students takes two tests at the beginning of the semester, their scores can be lined up next to each other and, if the number of students is small, the degree of relationship between them may be apparent, as shown in Table 15.2. With this small number, it is evident that the student who performed well on the first test also did so on the second. Student 5 scored the lowest on both tests, and the others line up in between. The correlation allows for an exact number to be used to express the observation that the students scored approximately the same on the two tests. The correlation is 0.97.

Table 15.2 The use of correlation in validation research

Examinees    Test 1    Test 2
Student 1    35        35
Student 2    25        26
Student 3    30        26
Student 4    34        32
Student 5    17        16

A correlation can range from 1.00 to –1.00, indicating a perfect positive relationship or a perfect negative relationship. A correlation of 0.00 would indicate no relationship. Table 15.3 illustrates two sets of scores that show a negative relationship. The correlation among the scores in Table 15.3 is –0.79. Typically, in language testing, correlations in the positive range are found when tests of different language skills are correlated. However, like the analysis of difficulty, the analysis of correlations requires an understanding of the constructs and test uses of the tests investigated.

Table 15.3 Two sets of scores that show a negative relationship

Examinees    Test 1    Test 2
Student 1    35        17
Student 2    25        26
Student 3    30        26
Student 4    34        28
Student 5    17        35

The direction and strength of a correlation depend on many factors, including the number of subjects and the distributions of scores, and therefore correlations should be interpreted in view of both the construct that the test is intended to measure and the data used to do the analysis. Correlational techniques are the conceptual building blocks for many of the complex test analyses that are conducted, which also require a clear understanding of the basic principles outlined in the first part of the chapter.
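The correlations reported above can be reproduced with a few lines of code. The sketch below assumes a Pearson correlation, which is consistent with the values of 0.97 and –0.79 given for Tables 15.2 and 15.3; the pearson function is written out in full for transparency, although in practice a spreadsheet or statistics package would normally be used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: the covariance of the two score sets divided by the
    product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

test1 = [35, 25, 30, 34, 17]      # Test 1 scores for Students 1-5 (Tables 15.2 and 15.3)
test2_a = [35, 26, 26, 32, 16]    # Test 2 scores from Table 15.2
test2_b = [17, 26, 26, 28, 35]    # Test 2 scores from Table 15.3

print(round(pearson(test1, test2_a), 2))  # 0.97
print(round(pearson(test1, test2_b), 2))  # -0.79
```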
Language Assessment and Language Teaching

The relationships between assessment and teaching are as multifaceted as the contexts and purposes of assessment; however, some trends are worth noting. The first is an increased interest in social and political influences on assessment (see McNamara and Roever, 2006 for a comprehensive overview). In this context, most professional language testers, under the influence of Messick’s (1989) argument that validation should ‘trace the social consequences’ of a test, have embraced the idea that tests should be designed and used so as to have a positive impact on teaching and learning. In recent years, researchers have begun to study this impact in a range of educational contexts. Another notable shift in the assessment landscape is a loss of faith in the capacity of ‘traditional’ forms of educational measurement such as standardized tests to capture learning outcomes accurately and a corresponding move towards greater alignment of curriculum and instruction through the adoption by teachers of new forms of performance assessment (Leung and Rea-Dickins, 2007). A third aspect of language assessment which is found in recent literature is the way in which governments in many countries, under increasing pressure to demonstrate accountability and measurable outcomes, are using assessment as a policy tool. Let us look at each of these trends in more detail.

Washback

One result of Messick’s (1989) expansion of the concept of validity to include the social consequences of test use has been an increased focus on ‘washback’, a term commonly used by writers on language assessment to denote the influence of testing on teaching (Hughes, 2003: 1). This influence often tends to be presented as harmful – it has been claimed, for example, that tests (particularly high-stakes standardized tests) exercise a negative influence due to the temptation for teachers to spend time on activities that will help students to succeed in the test (for example, learning test-taking strategies) rather than on developing the skills and knowledge which should be the object of instruction (Alderson and Hamp-Lyons, 1996: 280–281). Conversely, it is also believed that ‘positive washback’ can be brought about through the introduction of tests that target the skills needed by language learners in real life (Cheng, 1998: 279). Seen in this way, a test could be considered more or less valid according to how beneficial its washback effects were thought to be. Although some washback studies have identified detrimental effects of standardized testing on teaching practice (see, for example, Fox and Cheng, 2007; Slomp, 2008), Alderson and Wall (1993) reject such a view of washback as simplistic and unsupported by evidence. They argue that ‘washback, if it exists ... is likely to be a complex phenomenon which cannot be related to a test’s validity’ (Alderson and Wall, 1993: 116).

The findings of research into washback in a range of language teaching contexts support Alderson and Wall’s (1993) contention that washback effects are complex. In a study of the impact of two national tests used in Israel, Shohamy, Donitsa-Schmidt and Ferman (1996) found that washback patterns ‘can change over time and that the impact of tests is not necessarily stable’. Wall and Alderson’s (1993) study of the introduction of a new examination into the Sri Lankan educational system showed that a range of constraints may influence the intended effects of an examination, including inadequate communication of information by educational authorities, low levels of teacher awareness and lack of professional development support. These authors conclude that ‘an exam on its own cannot reinforce an approach to teaching the educational system has not adequately prepared its teachers for’ (Wall and Alderson, 1993: 67).
Cheng’s (1998) research into the introduction of a new task-based examination into the Hong Kong examination system suggests that the impact of assessment reform may be limited unless there is genuine change in ‘how teachers teach and how textbooks are designed’. The role of the teacher emerges as a major factor in many washback studies. Alderson and Hamp-Lyons (1996) investigated teacher attitudes and behaviour in TOEFL preparation classes and concluded that washback effects may vary significantly according to individual teacher characteristics. Burrows (2004) reached a similar conclusion in a study of adult ESL teachers’ reactions to the introduction of a new competency-based assessment system in the Adult Migrant English Program in Australia. She concluded that teachers’ responses are related to their attitudes towards and experiences of the implementation of the assessment, their perceptions of the quality of the assessment, the extent to which the assessment represented a departure from their previous practices, and their attitudes to change itself.

All of these findings suggest that the nature and extent of washback are governed by a wide range of individual, educational and social factors. These include the political context in which a test or assessment system is introduced, the time that has elapsed since adoption, the knowledge, attitudes and beliefs of teachers and educational managers, the role of test agencies and publishers, the relationships between participants and the resources available. An adequate model of impact, according to Wall (1997: 297), needs to include all of these influences and to describe the relationships between them.

‘Alternative’ Assessment

The close interrelationship between teaching and assessment which is depicted in many of the washback studies described above has not always been reflected in the language testing literature. In comparison to standardized proficiency testing, the pedagogical role of assessment has until recently received relatively little attention (Rea-Dickins and Gardner, 2000; Brindley, 2007). However, over the last decade there has been a growing acknowledgement of the need for closer links between assessment and instruction (Shohamy, 1992; Genesee and Hamayan, 1994), accompanied by a recognition on the part of educational authorities in many countries that teacher-conducted assessments have an important role to play in determining learners’ achievement. As a result, we have seen the widespread adoption of ‘alternative’ assessment methods which directly reflect learning activities and which are carried out by practitioners in the context in which learning takes place (Brown and Hudson, 1998). Some of the more commonly used methods include the following.

Observation

Informal observation of learners’ language use is one of the most widely used methods of assessment in language classrooms (Brindley, 2001a; Brown, 2004). As Brown (2004: 266–7) notes, on the basis of the information that they build up through observing their students’ behaviour, experienced teachers’ estimates of student ability are frequently highly correlated with more formal test results. Information derived from teacher observations may be used in a variety of ways to inform classroom decision-making (for example, whether learners have achieved the learning objectives for a particular unit of instruction and are ready to progress to the next unit).
Types of observation that can be used to monitor progress and identify individual learning difficulties range from anecdotal records to checklists and rating scales. In some educational systems, teachers’ observations of learner performance may form an important part of the evidence that is used for external reporting to authorities, and may thus require detailed recording of classroom language use. However, when used for this purpose, observation needs to be conducted with a great deal of care and attention if it is to yield valid and reliable information. In this context, Rea-Dickins and Gardner (2000) have identified a number of sources of potential unreliability in teachers’ transcription and interpretation of classroom language samples that may affect the validity of the inferences that are made. They call for more research into the validity and reliability of observational assessment and highlight the need to include classroom observation skills in teacher professional development programmes (Rea-Dickins and Gardner, 2000: 238–239).

Portfolios

A portfolio is a purposeful collection of students’ work over time that contains samples of their language performance at different stages of completion, as well as the student’s own observations on his or her progress. Three types of portfolio have been identified, reflecting different purposes and features (Valencia and Calfee, 1991). The first is the ‘showcase’ portfolio, which represents a collection of a student’s best or favourite work. The entries in the showcase portfolio are selected by the student and thus portray an individual’s learning over time. No comparison with external standards or with other students is involved. Second, there is the ‘documentation’ portfolio, which contains systematic ongoing records of progress. The documentation portfolio may include observations, checklists, anecdotal records, interviews, classroom tests and performance assessments. The selection of entries may be made by either the teacher or the student. According to Valencia and Calfee (1991: 337), ‘the documentation resembles a scrapbook, providing evidence but not judging the quality of the activities’. Finally, the ‘evaluation’ portfolio, which is used as public evidence of learners’ achievement, is more standardized than either the showcase or documentation portfolio because of the need for comparability. The contents of the evaluation portfolio and the assessment criteria used are largely determined by external requirements, although there is some room for individual selection and reflection activities. In the context of language education programmes in the USA, Gottlieb and Nguyen (2007) describe what they call a ‘pivotal portfolio’ that combines the features of the showcase and documentation portfolio. It contains essential evidence of the student’s work, along with common assessments administered by all teachers, and follows the learner for the duration of the programme.

The use of portfolios as a means of recording and assessing progress offers a number of advantages to language teachers and learners. Not only does it provide a way of relating assessment closely to instruction and motivating learners (Fulcher, 1997), but it also offers learners the opportunity to reflect on their learning goals and strategies, thus promoting learner independence (Gottlieb and Nguyen, 2007).
Another claimed advantage of assessment portfolios is that they provide concrete evidence of development that can be used to demonstrate tangible achievement to external stakeholders in language programmes (Genesee and Upshur, 1996: 100). However, the introduction of portfolio assessment has not been without problems. There has been considerable debate in the research literature concerning issues such as the type and amount of student work that should be included in a portfolio, the extent to which students should be involved in selection of the entries and the amount of external assistance they should be allowed (Fulcher, 1997; Brown and Hudson, 1998; Hamp-Lyons and Condon, 2000). In addition, research studies have highlighted both technical and practical difficulties associated with portfolio use. These include:

• Low levels of agreement between assessors on the quality of language samples (Brindley, 2001b).
• Lack of comparability between the samples submitted (Hamp-Lyons and Condon, 2000).
• The time and expense associated with collecting and grading large numbers of student texts on a continuing basis, conducting standard-setting meetings and discussing portfolios with students on an individual basis (Weigle, 2002).

In spite of these potential difficulties, however, it has been argued that the positive impact of portfolios on both teachers and learners is in itself sufficient reason to continue their use, even if it cannot be demonstrated that portfolio assessment is technically more reliable than more traditional means of assessment (Fulcher, 1997; Hamp-Lyons and Condon, 2000). In addition, with the advent of new technology, the practical problems of data management and storage associated with paper-based portfolios do not arise, since the contents can be stored, displayed and transmitted electronically. A wide variety of work samples can now be captured in different electronic formats, ranging from video-recorded speech samples to writing assignments, and used by teachers, learners and relevant third parties.