An Introduction to Applied Linguistics


particular language abilities. Figure 15.6 summarizes Douglas’ view, showing 
language capacity and test method as responsible for test performance. Testing 
experts differ on how to interpret and deal with the fact of test method influence 
on performance; however, most agree that it is essential to identify those aspects 
of test method that may play a role.
[Figure 15.6 shows five linked elements: the characteristics of the test tasks or testing method; the test-taker’s interpretation of the test tasks or method; the test-taker’s goals and plans for participation; the test-taker’s language ability; and the test-taker’s performance.]
Figure 15.6 Factors involved in the relationship between a test method and performance as outlined by Douglas (1998)
The most encompassing framework for describing test methods has been 
developed in two stages, first as ‘test method facets’ (Bachman, 1990) and, 
more recently, ‘test task characteristics’ (Bachman and Palmer, 1996). Test task 
characteristics are defined as:


• The test ‘setting’, such as the physical specifications of the room and the 
participants.
• The testing ‘rubrics’, including the instructions, test structure, allotted time, 
response evaluation and calculation of scores.
• The ‘input’ to the test-taker, such as test length and grammatical and topical 
characteristics of test questions.
• The ‘output’ expected from the learner, such as the length and grammatical and 
topical features of responses.
• The relationship between input and output, such as whether or not the answer 
to a given question depends on the examinee’s previous responses.
These test task characteristics provide the analytic tools needed for both 
construction and analysis of language tests, and therefore have played a role in 
test validation research.
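
For illustration only, the five categories could be recorded for each test task in a simple data structure; the field names and example values below are invented paraphrases for this sketch, not Bachman and Palmer’s (1996) own notation.

# A rough, invented sketch of how the five task-characteristic categories
# listed above might be recorded for a single test task.
from dataclasses import dataclass

@dataclass
class TestTaskCharacteristics:
    setting: str                    # physical specifications of the room, participants
    rubrics: str                    # instructions, test structure, allotted time, scoring
    input: str                      # length, grammatical and topical features of the prompt
    expected_output: str            # length, grammatical and topical features of responses
    input_output_relationship: str  # e.g. whether answers depend on previous responses

reading_task = TestTaskCharacteristics(
    setting="individual administration in a quiet classroom",
    rubrics="written instructions, 30 minutes, one point per correct answer",
    input="300-word passage on an academic topic, multiple-choice questions",
    expected_output="selection of one option per question",
    input_output_relationship="items are independent of one another",
)

Recording tasks in a uniform format of this kind is one way of supporting the comparison of test tasks during construction and validation.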
Validation
The term ‘validity’ carries some meaning for almost everyone, but in educational 
measurement, including language testing, this term has an extensive technical 
sense about which volumes have been written. Many applied linguists learned at 
one time that validity was defined as consisting of three sub-types:
• ‘Content validity’ (whether the content of the test questions is appropriate).
• ‘Criterion-related validity’ (whether other tests measuring similar linguistic 
abilities correlate with the test in question).
• ‘Construct validity’ (whether research shows that the test measures the 
‘construct’ discussed above).
In addition, many people think of the validity of a test as being established by 
measurement experts through statistical analysis of test scores. Although current 
perspectives retain traces of these ideas, both the theory and practice of validation 
are now markedly different from this view (Chapelle, 1999). One big change 
is typically associated with a seminal paper by Messick (1989), which defined 
validation as the process of constructing an argument about the interpretations 
and uses made from test scores. Such an argument may draw upon criterion-
related evidence, for example, but the goal of validation would be to establish an 
argument by integrating a variety of evidence to support test score interpretation 
and use. As an ‘argument’, rather than a black and white ‘proof’, validation may 
draw upon a number of different types of data.
Such an argument is made on the basis of both qualitative and quantitative 
research, and it relies on the perspectives obtained from technical work on 
language testing and the perspectives of applied linguists, language teachers 
and other test users. An ESL reading test provides an example of how these 
perspectives worked together (Chapelle, Jamieson and Hegelheimer, 2003). A 
publishing company contracted testing researchers to develop an ESL test to be 
delivered on the world-wide web to ESL learners at a wide variety of proficiency 
levels. Because the test-takers would have a great deal of variation in their reading 
ability, the test developers decided to include three modules in the test, one with 
beginning level texts for the examinees to read, one with somewhat simplified 
texts and a third with advanced-level texts. Once this decision had been made, 
however, the test developers needed to be able to show that the tests on the texts 
actually represented the intended differences in levels, and therefore three types 
of evidence were used.
One type of evidence was the judgement of ESL teachers. Teams of ESL teachers 
were formed and they worked together to form an understanding of what they 
should be looking for in texts of various levels in ESL books. Then each passage 
that had been selected for its potential as a reading text on the test was evaluated 
by two members of the team to give it a rating of ‘beginning’, ‘intermediate’ or 
‘advanced’. An interesting finding during this part of the work was that the two 
ESL teachers did not always agree on the level of the text, nor did they always agree 
with the original author’s assignment of the text to a particular level. This part of 
the test development process resulted in a pool of texts about which two raters 
agreed. In other words, if two raters thought that a text was a beginning level one, 
it was retained in the pool of possible texts of the test, but if one rater thought it 
was a beginning level one and the other rater thought it was intermediate, it was 
eliminated. The texts agreed upon then proceeded to the next stage of analysis.
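
The screening step just described amounts to a simple agreement filter; a minimal sketch in Python, with invented texts and ratings, is given below.

# Sketch of the two-rater screening step: a text is retained only when both
# raters assign it the same level. The texts and ratings here are invented.
ratings = {
    "text_01": ("beginning", "beginning"),
    "text_02": ("beginning", "intermediate"),   # raters disagree, so dropped
    "text_03": ("advanced", "advanced"),
    "text_04": ("intermediate", "intermediate"),
}

retained_pool = {text: levels[0]
                 for text, levels in ratings.items()
                 if levels[0] == levels[1]}

print(retained_pool)
# {'text_01': 'beginning', 'text_03': 'advanced', 'text_04': 'intermediate'}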
The second type of analysis drew on the expertise of a corpus linguist, who 
did a quantitative analysis of the language of each of the texts. The texts were 
scanned to copy them into electronic files, which were then tagged and analysed 
by use of a computer program that quantified characteristics of the texts that 
signal difficulty, such as word length, sentence length and syntactic complexity. 
The corpus linguist set cut scores for each of these features and then selected texts 
that, on the basis of these characteristics, were clear examples of each level. These 
texts formed the basis of the reading comprehension modules at the three levels 
of difficulty. Test writers developed questions to test comprehension as well as 
other aspects of reading, and then the three module tests were 
given to a group of examinees.
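
The chapter does not report which tagging program or cut scores the corpus linguist used, but the general idea of quantifying surface features of a text and applying cut-offs can be sketched as follows; the features and thresholds here are invented for illustration only.

# Illustrative only: compute two of the surface features mentioned above
# (average word length and average sentence length) and apply invented
# cut scores. This is not the program used in the study.
import re

def text_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_length = sum(len(w) for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return avg_word_length, avg_sentence_length

def guess_level(text):
    avg_word_length, avg_sentence_length = text_features(text)
    if avg_word_length < 4.5 and avg_sentence_length < 12:   # invented cut scores
        return "beginning"
    if avg_word_length < 5.2 and avg_sentence_length < 20:
        return "intermediate"
    return "advanced"

print(guess_level("The cat sat on the mat. It was warm and quiet."))   # beginning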
The third type of analysis was quantitative. The researchers wanted to see if 
the texts that had been so carefully selected as beginning level actually produced 
test items that were easier than those that had been selected as intermediate and 
advanced. The question was whether or not the predicted number of examinees 
got test questions correct for the beginning, intermediate and advanced level tests. 
As Table 15.1 shows, the researchers predicted that a high percentage of examinees 
would obtain correct responses on the beginning level texts and so on. The table 
also shows the results that were obtained when a group of 47 learners took the 
tests. In fact, the percentages of correct responses turned out as anticipated.
Predicted and actual results                   Intended test level
                                               Beginning          Intermediate        Advanced
Predicted                                      High percentage    Medium percentage   Low percentage
Actual mean percentage of correct responses    85                 74                  68

Table 15.1 Summary of predictions and results in a quantitative validity argument
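
The check summarized in Table 15.1 can be sketched in a few lines; the item-score data below are invented, and only the logic of computing a mean percentage correct per module and comparing the three modules is illustrated.

# Sketch of the check in Table 15.1. Each module is represented by a matrix
# of item scores (1 = correct, 0 = incorrect), one row per examinee; the
# scores below are invented examples.
def mean_percent_correct(score_matrix):
    total_items = sum(len(row) for row in score_matrix)
    total_correct = sum(sum(row) for row in score_matrix)
    return 100 * total_correct / total_items

modules = {
    "beginning":    [[1, 1, 1, 0], [1, 1, 1, 1]],
    "intermediate": [[1, 1, 0, 0], [1, 1, 1, 0]],
    "advanced":     [[1, 0, 0, 0], [1, 1, 0, 1]],
}

means = {level: mean_percent_correct(scores) for level, scores in modules.items()}
print(means)   # {'beginning': 87.5, 'intermediate': 62.5, 'advanced': 50.0}

# The prediction in Table 15.1 is that the beginning module is the easiest
# and the advanced module the hardest.
assert means["beginning"] > means["intermediate"] > means["advanced"]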
These three types of evidence about the reading test modules are obviously not 
all that we would want to know about their validity as tests of reading, but these 
data form one part of the validity argument. A second important development in 
validation practices has been evolving over the past several years to help testing 
researchers to specify the types of evidence that are needed in view of the types 
of inferences that underlie the score interpretation and use (Kane, 2006). 
These advances have been influential and useful in language testing (Bachman, 
2005; Chapelle, Enright and Jamieson, 2008). The many types of qualitative and 
quantitative analysis that are used in validity research would be too much to 
describe in this introduction, but the idea of how testing researchers evaluate 
test data can be illustrated through the description of two basic test analysis 
procedures.
Test Analysis
Two types of analysis form the basis for much of the quantitative test analysis: 
‘difficulty analysis’ and ‘correlational analysis’. Difficulty analysis refers to the type 
of analysis that was described above, in which the concern is to determine how 
difficult the items on the test are. Correlational analysis is a means of obtaining 
a statistical estimate of the strength of the relationship between two sets of test 
scores. Computationally, each of these analyses is straightforward. The challenge 
in language testing research is to design a study in which the results of the analysis 
can be used to provide information about the questions that are relevant to the 
validity of test use.
Item Difficulty
In the example above, the researchers were concerned that their intended levels of 
text difficulty would actually hold true when examinees took the three modules 
of the reading test. In the description of the results, we summarized the item 
difficulties of each of the tests. However, in the actual study the researchers also 
examined the item difficulties of each item on each of the tests. The item difficulty 
is defined as the percentage of examinees who answered the item correctly. To 
obtain this percentage, the researchers divided the number who scored correctly 
by the total number who took the test and multiplied by 100. On the reading test 
described above, if 40 correct responses were obtained on an item, that would be 
(40/47 = 0.85, and then 0.85 × 100 = 85). People who write tests professionally 
use this and other item statistics to decide which items are good and which ones 
should be revised or deleted from a test during test development.
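
A minimal sketch of this calculation, using the figures from the example above, is given below.

# Item difficulty = (number of correct responses / number of examinees) x 100.
def item_difficulty(correct_responses, total_examinees):
    return round(100 * correct_responses / total_examinees)

print(item_difficulty(40, 47))   # 85, as in the example above

responses = [1] * 40 + [0] * 7   # 1 = correct, 0 = incorrect, 47 examinees
print(item_difficulty(sum(responses), len(responses)))   # 85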
As illustrated above, the concept of difficulty can be used several different ways, 
but it is best used in view of the construct that the test is intended to measure, 
and the use of the test. If all of the items on a test have high values for item 
difficulty, for example, the person analysing the test knows that the test is very 
easy. But whether or not this means that the items should be changed depends 
on the test construct, the examinees tested and the test use. In this regard, testing 
researchers distinguish between ‘norm-referenced’ tests, which are intended to 
make distinctions among examinees, and ‘criterion-referenced’ tests, which 
are intended to be used to make decisions about an individual’s knowledge of the 
material reflected on the test. A test that is easy for a group of examinees would 
not be successful in distinguishing between examinees, but it may have shown 
correctly that individuals in that group knew the material tested. Moreover, when 
difficulty is interpreted in view of the construct that an item of a test is intended 
to measure, it can be used as one part of a validity argument.


Correlation
A second statistical analysis used in validation research is ‘correlation’. When 
testing researchers or teachers look at how similar two tests are, they are considering 
the correlation between tests. For example, if a group of students takes two tests at 
the beginning of the semester, their scores can be lined up next to each other and, 
if the number of students is small, the degree of relationship between them may 
be apparent, as shown in Table 15.2. With this small number, it is evident that the 
student who performed well on the first test also did so on the second. Student 5 
scored the lowest on both tests, and the others line up in between. The correlation 
allows for an exact number to be used to express the observation that the students 
scored approximately the same on the two tests. The correlation is 0.97.
Examinees      Test 1    Test 2
Student 1      35        35
Student 2      25        26
Student 3      30        26
Student 4      34        32
Student 5      17        16

Table 15.2 The use of correlation in validation research
A correlation can range from 1.00 to –1.00, indicating a perfect positive 
relationship or a perfect negative relationship. A correlation of 0.00 would indicate 
no relationship. Table 15.3 illustrates two sets of scores that show a negative 
relationship. The correlation among the scores in Table 15.3 is –0.79. Typically, 
in language testing, correlations in the positive range are found when tests of 
different language skills are correlated. However, like the analysis of difficulty, the 
analysis of correlations requires an understanding of the constructs and test uses 
of the tests investigated.
Examinees      Test 1    Test 2
Student 1      35        17
Student 2      25        26
Student 3      30        26
Student 4      34        28
Student 5      17        35

Table 15.3 Two sets of scores that show a negative relationship
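
The correlations reported for Tables 15.2 and 15.3 can be reproduced with the standard Pearson formula; the short sketch below uses the scores from the two tables (the chapter does not say what software the researchers used).

# Pearson correlation for the score pairs in Tables 15.2 and 15.3; run as
# written, this reproduces the values reported in the text (0.97 and -0.79).
from math import sqrt

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

test_1 = [35, 25, 30, 34, 17]          # Test 1 scores, Students 1-5
test_2_table_2 = [35, 26, 26, 32, 16]  # Test 2 scores in Table 15.2
test_2_table_3 = [17, 26, 26, 28, 35]  # Test 2 scores in Table 15.3

print(round(pearson(test_1, test_2_table_2), 2))   # 0.97
print(round(pearson(test_1, test_2_table_3), 2))   # -0.79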
The direction and strength of a correlation depend on many factors, including 
the number of subjects and the distributions of scores, and therefore correlations 
should be interpreted in view of both the construct that the test is intended 
to measure and the data used to do the analysis. Correlational techniques are 
the conceptual building blocks for many of the complex test analyses that are 
conducted, which also require a clear understanding of the basic principles 
outlined in the first part of the chapter.


Language Assessment and Language Teaching
The relationships between assessment and teaching are as multifaceted as the 
contexts and purposes of assessment; however, some trends are worth noting. 
The first is an increased interest in social and political influences on assessment 
(see McNamara and Roever, 2006 for a comprehensive overview). In this 
context, most professional language testers, under the influence of Messick’s 
(1989) argument that validation should ‘trace the social consequences’ of a test, 
have embraced the idea that tests should be designed and used so as to have 
a positive impact on teaching and learning. In recent years, researchers have 
begun to study this impact in a range of educational contexts. Another notable 
shift in the assessment landscape is a loss of faith in the capacity of ‘traditional’ 
forms of educational measurement such as standardized tests to capture learning 
outcomes accurately and a corresponding move towards greater alignment of 
curriculum and instruction through the adoption by teachers of new forms of 
performance assessment (Leung and Rea-Dickins, 2007). A third aspect of language 
assessment which is found in recent literature is the way in which governments 
in many countries, under increasing pressure to demonstrate accountability and 
measurable outcomes, are using assessment as a policy tool. Let us look at each 
of these trends in more detail.
Washback
One result of Messick’s (1989) expansion of the concept of validity to include the 
social consequences of test use has been an increased focus on ‘washback’, a term 
commonly used by writers on language assessment to denote the influence of 
testing on teaching (Hughes, 2003: 1). This influence often tends to be presented 
as harmful – it has been claimed, for example, that tests (particularly high-
stakes standardized tests) exercise a negative influence due to the temptation for 
teachers to spend time on activities that will help students to succeed in the test 
(for example, learning test-taking strategies) rather than on developing the skills 
and knowledge which should be the object of instruction (Alderson and Hamp-
Lyons, 1996: 280–281). Conversely, it is also believed that ‘positive washback’ can 
be brought about through the introduction of tests that target the skills needed 
by language learners in real life (Cheng, 1998: 279). Seen in this way, a test could 
be considered more or less valid according to how beneficial its washback effects 
were thought to be.
Although some washback studies have identified detrimental effects of 
standardized testing on teaching practice (see, for example, Fox and Cheng, 
2007; Slomp, 2008), Alderson and Wall (1993) reject such a view of washback 
as simplistic and unsupported by evidence. They argue that ‘washback, if it 
exists ... is likely to be a complex phenomenon which cannot be related to a 
test’s validity’ (Alderson and Wall, 1993: 116). The findings of research into 
washback in a range of language teaching contexts support Alderson and Wall’s 
(1993) contention that washback effects are complex. In a study of the impact of 
two national tests used in Israel, Shohamy, Donitsa-Schmidt and Ferman (1996) 
found that washback patterns ‘can change over time and that the impact of tests 
is not necessarily stable’. Wall and Alderson’s (1993) study of the introduction of 
a new examination into the Sri Lankan educational system showed that a range 
of constraints may influence the intended effects of an examination, including 
inadequate communication of information by educational authorities, low levels 
of teacher awareness and lack of professional development support. These authors 
conclude that ‘an exam on its own cannot reinforce an approach to teaching 
the educational system has not adequately prepared its teachers for’ (Wall and 
Alderson, 1993: 67). Cheng’s (1998) research into the introduction of a new task-
based examination into the Hong Kong examination system suggests that the 
impact of assessment reform may be limited unless there is genuine change in 
‘how teachers teach and how textbooks are designed’.
The role of the teacher emerges as a major factor in many washback studies. 
Alderson and Hamp-Lyons (1996) investigated teacher attitudes and behaviour 
in TOEFL preparation classes and concluded that washback effects may vary 
significantly according to individual teacher characteristics. Burrows (2004) 
reached a similar conclusion in a study of adult ESL teachers’ reactions to the 
introduction of a new competency-based assessment system in the Adult 
Migrant English Program in Australia. She concluded that teachers’ responses are 
related to their attitudes towards and experiences of the implementation of the 
assessment; their perceptions of the quality of the assessment; the extent to which 
the assessment represented a departure from their previous practices; and their 
attitudes to change itself. All of these findings suggest that the nature and extent 
of washback are governed by a wide range of individual, educational and social 
factors. These include the political context in which a test or assessment system 
is introduced, the time that has elapsed since adoption, the knowledge, attitudes 
and beliefs of teachers and educational managers, the role of test agencies and 
publishers, the relationships between participants and the resources available. An 
adequate model of impact, according to Wall (1997: 297), needs to include all of 
these influences and to describe the relationships between them.
‘Alternative’ Assessment
The close interrelationship between teaching and assessment which is depicted in 
many of the washback studies described above has not always been reflected in 
the language testing literature. In comparison to standardized proficiency testing, 
the pedagogical role of assessment has until recently received relatively little 
attention (Rea-Dickins and Gardner, 2000; Brindley, 2007). However, over the last 
decade, there has been a growing acknowledgement of the need for closer links 
between assessment and instruction (Shohamy, 1992; Genesee and Hamayan, 
1994) accompanied by a recognition on the part of educational authorities in 
many countries that teacher-conducted assessments have an important role to play 
in determining learners’ achievement. As a result, we have seen the widespread 
adoption of ‘alternative’ assessment methods which directly reflect learning 
activities and which are carried out by practitioners in the context in which 
learning takes place (Brown and Hudson, 1998). Some of the more commonly 
used methods include the following.
Observation
Informal observation of learners’ language use is one of the most widely used 
methods of assessment in language classrooms (Brindley, 2001a; Brown, 2004). 
As Brown (2004: 266–7) notes, on the basis of the information that they build 
up through observing their students’ behaviour, experienced teachers’ estimates 
of student ability are frequently highly correlated with more formal test results. 
Information derived from teacher observations may be used in a variety of ways to 
inform classroom decision-making (for example, whether learners have achieved 
the learning objectives for a particular unit of instruction and are ready to progress 
to the next unit). Types of observation that can be used to monitor progress and 
identify individual learning difficulties range from anecdotal records to checklists 
and rating scales.
In some educational systems, teachers’ observations of learner performance 
may form an important part of the evidence that is used for external reporting to 
authorities, and may thus require detailed recording of classroom language use. 
However, when used for this purpose, observation needs to be conducted with 
a great deal of care and attention if it is to yield valid and reliable information. 
In this context, Rea-Dickins and Gardner (2000) have identified a number of 
sources of potential unreliability in teachers’ transcription and interpretation of 
classroom language samples that may affect the validity of the inferences that are 
made. They call for more research into the validity and reliability of observational 
assessment and highlight the need to include classroom observation skills in 
teacher professional development programmes (Rea-Dickins and Gardner, 2000: 
238–239).
Portfolios
A portfolio is a purposeful collection of students’ work over time that contains 
samples of their language performance at different stages of completion, as well as 
the student’s own observations on his or her progress.
Three types of portfolio have been identified, reflecting different purposes 
and features (Valencia and Calfee, 1991). These are first, the ‘showcase’ portfolio 
which represents a collection of the student’s best or favourite work. The entries in 
the showcase portfolio are selected by the student and thus portray an individual’s 
learning over time. No comparison with external standards or with other students 
is involved. Second, there is the ‘documentation’ portfolio which contains 
systematic ongoing records of progress. The documentation portfolio may 
include observations, checklists, anecdotal records, interviews, classroom tests 
and performance assessments. The selection of entries may be made by either 
the teacher or the student. According to Valencia and Calfee (1991: 337), ‘the 
documentation resembles a scrapbook, providing evidence but not judging the 
quality of the activities’. Finally, the ‘evaluation’ portfolio, which is used as public 
evidence of learners’ achievement, is more standardized than either the showcase 
or documentation portfolio because of the need for comparability. The contents 
of the evaluation portfolio and the assessment criteria used are largely determined 
by external requirements, although there is some room for individual selection 
and reflection activities. In the context of language education programmes in the 
USA, Gottlieb and Nguyen (2007) describe what they call a ‘pivotal portfolio’ that 
combines the features of the showcase and documentation portfolio. It contains 
essential evidence of the student’s work, along with common assessments 
administered by all teachers, and follows the learner for the duration of the 
programme.
The use of portfolios as a means of recording and assessing progress offers a 
number of advantages to language teachers and learners. Not only does it provide 
a way of relating assessment closely to instruction and motivating learners 
(Fulcher, 1997) but it also offers learners the opportunity to reflect on their 
learning goals and strategies, thus promoting learner independence (Gottlieb 
and Nguyen, 2007). Another claimed advantage of assessment portfolios is that 
they provide concrete evidence of development that can be used to demonstrate 
tangible achievement to external stakeholders in language programmes (Genesee 
and Upshur, 1996: 100).
However, the introduction of portfolio assessment has not been without 
problems. There has been considerable debate in the research literature concerning 
issues such as the type and amount of student work that should be included in 
a portfolio, the extent to which students should be involved in selection of the 
entries and the amount of external assistance they should be allowed (Fulcher, 
1997; Brown and Hudson, 1998; Hamp-Lyons and Condon, 2000). In addition, 
research studies have highlighted both technical and practical difficulties 
associated with portfolio use. These include:
• Low levels of agreement between assessors on the quality of language samples 
(Brindley, 2001b).
• Lack of comparability between the samples submitted (Hamp-Lyons and 
Condon, 2000).
• The time and expense associated with collecting and grading large numbers of 
student texts on a continuing basis, conducting standard-setting meetings and 
discussing portfolios with students on an individual basis (Weigle, 2002).
In spite of these potential difficulties, however, it has been argued that the positive 
impact of portfolios on both teachers and learners is in itself sufficient reason to 
continue their use, even if it cannot be demonstrated that portfolio assessment 
is technically more reliable than more traditional means of assessment (Fulcher, 
1997; Hamp-Lyons and Condon, 2000). In addition, with the advent of new 
technology, the practical problems of data management and storage associated 
with paper-based portfolios do not arise, since the contents can be stored, 
displayed and transmitted electronically. A wide variety of work samples can now 
be captured in different electronic formats, ranging from video-recorded speech 
samples to writing assignments and used by teachers, learners and relevant third 