An Introduction to Applied Linguistics
Norbert Schmitt (ed.), An Introduction to Applied Linguistics (Routledge, 2010)
particular language abilities. Figure 15.6 summarizes Douglas’ view, showing language capacity and test method as responsible for test performance. Testing experts differ on how to interpret and deal with the fact of test method influence on performance; however, most agree that it is essential to identify those aspects of test method that may play a role.

Figure 15.6 Factors involved in the relationship between a test method and performance as outlined by Douglas (1998) [figure not reproduced; its components are the characteristics of the test tasks or testing method, the test-taker’s interpretation of the test tasks or method, the test-taker’s goals and plans for participation, the test-taker’s language ability and the test-taker’s performance]

The most encompassing framework for describing test methods has been developed in two stages, first as ‘test method facets’ (Bachman, 1990) and, more recently, as ‘test task characteristics’ (Bachman and Palmer, 1996). Test task characteristics are defined as:

• The test ‘setting’, such as the physical specifications of the room and the participants.
• The testing ‘rubrics’, including the instructions, test structure, allotted time, response evaluation and calculation of scores.
• The ‘input’ to the test-taker, such as test length and grammatical and topical characteristics of test questions.
• The ‘output’ expected from the learner, such as the length and grammatical and topical features of responses.
• The relationship between input and output, such as whether or not the answers to questions the examinee is asked depend on previous responses.

These test task characteristics provide the analytic tools needed for both construction and analysis of language tests, and therefore have played a role in test validation research.

Validation

The term ‘validity’ carries some meaning for almost everyone, but in educational measurement, including language testing, this term has an extensive technical sense about which volumes have been written. Many applied linguists learned at one time that validity was defined as consisting of three sub-types:

• ‘Content validity’ (whether the content of the test questions is appropriate).
• ‘Criterion-related validity’ (whether other tests measuring similar linguistic abilities correlate with the test in question).
• ‘Construct validity’ (whether research shows that the test measures the ‘construct’ discussed above).

In addition, many people think of the validity of a test as being established by measurement experts through statistical analysis of test scores. Although current perspectives retain traces of these ideas, both the theory and practice of validation are now markedly different from this view (Chapelle, 1999). One big change is typically associated with a seminal paper by Messick (1989), which defined validation as the process of constructing an argument about the interpretations and uses made from test scores. Such an argument may draw upon criterion-related evidence, for example, but the goal of validation would be to establish an argument by integrating a variety of evidence to support test score interpretation and use. As an ‘argument’, rather than a black and white ‘proof’, validation may draw upon a number of different types of data. Such an argument is made on the basis of both qualitative and quantitative research, and it relies on the perspectives obtained from technical work on language testing and the perspectives of applied linguists, language teachers and other test users.
An ESL reading test provides an example of how these perspectives worked together (Chapelle, Jamieson and Hegelheimer, 2003). A publishing company contracted testing researchers to develop an ESL test to be delivered on the world-wide web to ESL learners at a wide variety of proficiency levels. Because the test-takers would have a great deal of variation in their reading ability, the test developers decided to include three modules in the test, one with beginning-level texts for the examinees to read, one with somewhat simplified texts and a third with advanced-level texts. Once this decision had been made, however, the test developers needed to be able to show that the tests on the texts actually represented the intended differences in levels, and therefore three types of evidence were used.

One type of evidence was the judgement of ESL teachers. Teams of ESL teachers were formed and they worked together to form an understanding of what they should be looking for in texts of various levels in ESL books. Then each passage that had been selected for its potential as a reading text on the test was evaluated by two members of the team to give it a rating of ‘beginning’, ‘intermediate’ or ‘advanced’. An interesting finding during this part of the work was that the two ESL teachers did not always agree on the level of the text, nor did they always agree with the original author’s assignment of the text to a particular level. This part of the test development process resulted in a pool of texts about which two raters agreed. In other words, if two raters thought that a text was a beginning-level one, it was retained in the pool of possible texts for the test, but if one rater thought it was a beginning-level one and the other rater thought it was intermediate, it was eliminated. The texts agreed upon then proceeded to the next stage of analysis.

The second type of analysis drew on the expertise of a corpus linguist, who did a quantitative analysis of the language of each of the texts. The texts were scanned to copy them into electronic files, which were then tagged and analysed by use of a computer program that quantified characteristics of the texts that signal difficulty, such as word length, sentence length and syntactic complexity. The corpus linguist set cut scores for each of these features and then selected texts that, on the basis of these characteristics, were clear examples of each level (a brief computational sketch of this kind of feature analysis is given at the end of this example). These texts formed the basis of the reading comprehension modules at the three levels of difficulty. Test writers developed questions to test comprehension as well as other aspects of reading, and then the three module tests were given to a group of examinees.

The third type of analysis was quantitative. The researchers wanted to see if the texts that had been so carefully selected as beginning level actually produced test items that were easier than those that had been selected as intermediate and advanced. The question was whether or not the predicted number of examinees got test questions correct for the beginning, intermediate and advanced level tests. As Table 15.1 shows, the researchers predicted that a high percentage of examinees would obtain correct responses on the beginning-level texts and so on. The table also shows the results that were obtained when a group of 47 learners took the tests. In fact, the percentages of correct responses turned out as anticipated.

Table 15.1 Summary of predictions and results in a quantitative validity argument

                                               Intended test level
Predicted and actual results                   Beginning         Intermediate        Advanced
Predicted                                      High percentage   Medium percentage   Low percentage
Actual mean percentage of correct responses    85                74                  68
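The corpus-based stage of this process (the second type of evidence described above) lends itself to a simple computational illustration. The sketch below is not the program used in the study; it is a minimal Python example, with an invented classify function, invented cut scores and an invented sample text, showing how two of the features mentioned above, mean word length and mean sentence length, might be computed for a candidate passage and mapped onto a tentative level.

```python
import re

def text_features(text):
    """Two rough difficulty indicators: mean word length and mean sentence length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sent_len = len(words) / len(sentences)
    return mean_word_len, mean_sent_len

# Invented cut scores for illustration only; the study's cut scores were set by the
# corpus linguist and also drew on syntactic complexity, which is not measured here.
CUTS = [
    (4.2, 12.0, "beginning"),      # at or below both values: beginning
    (4.8, 18.0, "intermediate"),   # at or below both values: intermediate
]

def classify(text):
    """Assign a tentative level; anything above the intermediate cut scores counts as advanced."""
    word_len, sent_len = text_features(text)
    for max_word_len, max_sent_len, level in CUTS:
        if word_len <= max_word_len and sent_len <= max_sent_len:
            return level
    return "advanced"

sample = "The cat sat on the mat. It was warm. The sun was out."
print(classify(sample))  # prints: beginning
```

A fuller implementation would, as in the study, use a tagger to capture syntactic complexity as well; the point here is only the cut-score logic that turns text features into level assignments.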
These three types of evidence about the reading test modules are obviously not all that we would want to know about their validity as tests of reading, but these data form one part of the validity argument. A second important development in validation practices has been evolving over the past several years to help testing researchers to specify the types of evidence that are needed in view of the types of inferences that underlie the score interpretation and use (Kane, 2006). These advances have been influential and useful in language testing (Bachman, 2005; Chapelle, Enright and Jamieson, 2008). The many types of qualitative and quantitative analysis that are used in validity research would be too much to describe in this introduction, but the idea of how testing researchers evaluate test data can be illustrated through the description of two basic test analysis procedures.

Test Analysis

Two types of analysis form the basis for much of the quantitative test analysis: ‘difficulty analysis’ and ‘correlational analysis’. Difficulty analysis refers to the type of analysis that was described above, in which the concern is to determine how difficult the items on the test are. Correlational analysis is a means of obtaining a statistical estimate of the strength of the relationship between two sets of test scores. Computationally, each of these analyses is straightforward. The challenge in language testing research is to design a study in which the results of the analysis can be used to provide information about the questions that are relevant to the validity of test use.

Item Difficulty

In the example above, the researchers were concerned that their intended levels of text difficulty would actually hold true when examinees took the three modules of the reading test. In the description of the results, we summarized the item difficulties of each of the tests. However, in the actual study the researchers also examined the item difficulties of each item on each of the tests. The item difficulty is defined as the percentage of examinees who answered the item correctly. To obtain this percentage, the researchers divided the number who scored correctly by the total number who took the test and multiplied by 100. On the reading test described above, if 40 correct responses were obtained on an item, that would be (40/47 = 0.85, and then 0.85 × 100 = 85). People who write tests professionally use this and other item statistics to decide which items are good and which ones should be revised or deleted from a test during test development. As illustrated above, the concept of difficulty can be used in several different ways, but it is best used in view of the construct that the test is intended to measure, and the use of the test. If all of the items on a test have high values for item difficulty, for example, the person analysing the test knows that the test is very easy. But whether or not this means that the items should be changed depends on the test construct, the examinees tested and the test use.
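To make the arithmetic explicit, here is a minimal Python sketch of the item difficulty calculation just described. The item_difficulty function and the scored responses are invented for illustration; the responses simply reproduce the worked figure above, with 40 of 47 examinees answering the item correctly.

```python
def item_difficulty(responses):
    """Item difficulty: the percentage of examinees who answered the item correctly."""
    return 100 * sum(responses) / len(responses)

# Invented responses to one item: 1 = correct, 0 = incorrect, for 47 examinees.
item_responses = [1] * 40 + [0] * 7

print(round(item_difficulty(item_responses)))  # 40/47 = 0.85, and 0.85 x 100 = 85
```

Applied item by item and then averaged within each module, the same calculation yields mean percentages of correct responses of the kind reported in Table 15.1.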
In this regard, testing researchers distinguish between ‘norm-referenced’ tests, which are intended to make distinctions among examinees, and ‘criterion-referenced’ tests, which are intended to be used to make decisions about an individual’s knowledge of the material reflected on the test. A test that is easy for a group of examinees would not be successful in distinguishing between examinees, but it may have shown correctly that individuals in that group knew the material tested. Moreover, when difficulty is interpreted in view of the construct that an item of a test is intended to measure, it can be used as one part of a validity argument.

Correlation

A second statistical analysis used in validation research is ‘correlation’. When testing researchers or teachers look at how similar two tests are, they are considering the correlation between tests. For example, if a group of students takes two tests at the beginning of the semester, their scores can be lined up next to each other and, if the number of students is small, the degree of relationship between them may be apparent, as shown in Table 15.2. With this small number, it is evident that the student who performed well on the first test also did so on the second. Student 5 scored the lowest on both tests, and the others line up in between. The correlation allows for an exact number to be used to express the observation that the students scored approximately the same on the two tests. The correlation is 0.97.

Table 15.2 The use of correlation in validation research

Examinees    Test 1    Test 2
Student 1    35        35
Student 2    25        26
Student 3    30        26
Student 4    34        32
Student 5    17        16

A correlation can range from 1.00 to –1.00, indicating a perfect positive relationship or a perfect negative relationship. A correlation of 0.00 would indicate no relationship. Table 15.3 illustrates two sets of scores that show a negative relationship. The correlation among the scores in Table 15.3 is –0.79. Typically, in language testing, correlations in the positive range are found when tests of different language skills are correlated. However, like the analysis of difficulty, the analysis of correlations requires an understanding of the constructs and test uses of the tests investigated.

Table 15.3 Two sets of scores that show a negative relationship

Examinees    Test 1    Test 2
Student 1    35        17
Student 2    25        26
Student 3    30        26
Student 4    34        28
Student 5    17        35

The direction and strength of a correlation depend on many factors, including the number of subjects and the distributions of scores, and therefore correlations should be interpreted in view of both the construct that the test is intended to measure and the data used to do the analysis. Correlational techniques are the conceptual building blocks for many of the complex test analyses that are conducted, which also require a clear understanding of the basic principles outlined in the first part of the chapter.
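The correlations reported above can be reproduced with a few lines of code. The sketch below assumes a Pearson correlation, which is consistent with the values of 0.97 and –0.79 given for Tables 15.2 and 15.3; the pearson function is written out in full for transparency, although in practice a spreadsheet or statistics package would normally be used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: the covariance of the two score sets divided by the
    product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

test1 = [35, 25, 30, 34, 17]      # Test 1 scores for Students 1-5 (Tables 15.2 and 15.3)
test2_a = [35, 26, 26, 32, 16]    # Test 2 scores from Table 15.2
test2_b = [17, 26, 26, 28, 35]    # Test 2 scores from Table 15.3

print(round(pearson(test1, test2_a), 2))  # 0.97
print(round(pearson(test1, test2_b), 2))  # -0.79
```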
Language Assessment and Language Teaching

The relationships between assessment and teaching are as multifaceted as the contexts and purposes of assessment; however, some trends are worth noting. The first is an increased interest in social and political influences on assessment (see McNamara and Roever, 2006 for a comprehensive overview). In this context, most professional language testers, under the influence of Messick’s (1989) argument that validation should ‘trace the social consequences’ of a test, have embraced the idea that tests should be designed and used so as to have a positive impact on teaching and learning. In recent years, researchers have begun to study this impact in a range of educational contexts. Another notable shift in the assessment landscape is a loss of faith in the capacity of ‘traditional’ forms of educational measurement such as standardized tests to capture learning outcomes accurately and a corresponding move towards greater alignment of curriculum and instruction through the adoption by teachers of new forms of performance assessment (Leung and Rea-Dickins, 2007). A third aspect of language assessment which is found in recent literature is the way in which governments in many countries, under increasing pressure to demonstrate accountability and measurable outcomes, are using assessment as a policy tool. Let us look at each of these trends in more detail.

Washback

One result of Messick’s (1989) expansion of the concept of validity to include the social consequences of test use has been an increased focus on ‘washback’, a term commonly used by writers on language assessment to denote the influence of testing on teaching (Hughes, 2003: 1). This influence often tends to be presented as harmful – it has been claimed, for example, that tests (particularly high-stakes standardized tests) exercise a negative influence due to the temptation for teachers to spend time on activities that will help students to succeed in the test (for example, learning test-taking strategies) rather than on developing the skills and knowledge which should be the object of instruction (Alderson and Hamp-Lyons, 1996: 280–281). Conversely, it is also believed that ‘positive washback’ can be brought about through the introduction of tests that target the skills needed by language learners in real life (Cheng, 1998: 279). Seen in this way, a test could be considered more or less valid according to how beneficial its washback effects were thought to be. Although some washback studies have identified detrimental effects of standardized testing on teaching practice (see, for example, Fox and Cheng, 2007; Slomp, 2008), Alderson and Wall (1993) reject such a view of washback as simplistic and unsupported by evidence. They argue that ‘washback, if it exists ... is likely to be a complex phenomenon which cannot be related to a test’s validity’ (Alderson and Wall, 1993: 116).

The findings of research into washback in a range of language teaching contexts support Alderson and Wall’s (1993) contention that washback effects are complex. In a study of the impact of two national tests used in Israel, Shohamy, Donitsa-Schmidt and Ferman (1996) found that washback patterns ‘can change over time and that the impact of tests is not necessarily stable’. Wall and Alderson’s (1993) study of the introduction of a new examination into the Sri Lankan educational system showed that a range of constraints may influence the intended effects of an examination, including inadequate communication of information by educational authorities, low levels of teacher awareness and lack of professional development support. These authors conclude that ‘an exam on its own cannot reinforce an approach to teaching the educational system has not adequately prepared its teachers for’ (Wall and Alderson, 1993: 67).
Cheng’s (1998) research into the introduction of a new task-based examination into the Hong Kong examination system suggests that the impact of assessment reform may be limited unless there is genuine change in ‘how teachers teach and how textbooks are designed’. The role of the teacher emerges as a major factor in many washback studies. Alderson and Hamp-Lyons (1996) investigated teacher attitudes and behaviour in TOEFL preparation classes and concluded that washback effects may vary significantly according to individual teacher characteristics. Burrows (2004) reached a similar conclusion in a study of adult ESL teachers’ reactions to the introduction of a new competency-based assessment system in the Adult Migrant English Program in Australia. She concluded that teachers’ responses are related to their attitudes towards and experiences of the implementation of the assessment, their perceptions of the quality of the assessment, the extent to which the assessment represented a departure from their previous practices, and their attitudes to change itself.

All of these findings suggest that the nature and extent of washback are governed by a wide range of individual, educational and social factors. These include the political context in which a test or assessment system is introduced, the time that has elapsed since adoption, the knowledge, attitudes and beliefs of teachers and educational managers, the role of test agencies and publishers, the relationships between participants and the resources available. An adequate model of impact, according to Wall (1997: 297), needs to include all of these influences and to describe the relationships between them.

‘Alternative’ Assessment

The close interrelationship between teaching and assessment which is depicted in many of the washback studies described above has not always been reflected in the language testing literature. In comparison to standardized proficiency testing, the pedagogical role of assessment has until recently received relatively little attention (Rea-Dickins and Gardner, 2000; Brindley, 2007). However, over the last decade there has been a growing acknowledgement of the need for closer links between assessment and instruction (Shohamy, 1992; Genesee and Hamayan, 1994), accompanied by a recognition on the part of educational authorities in many countries that teacher-conducted assessments have an important role to play in determining learners’ achievement. As a result, we have seen the widespread adoption of ‘alternative’ assessment methods which directly reflect learning activities and which are carried out by practitioners in the context in which learning takes place (Brown and Hudson, 1998). Some of the more commonly used methods include the following.

Observation

Informal observation of learners’ language use is one of the most widely used methods of assessment in language classrooms (Brindley, 2001a; Brown, 2004). As Brown (2004: 266–7) notes, on the basis of the information that they build up through observing their students’ behaviour, experienced teachers’ estimates of student ability are frequently highly correlated with more formal test results. Information derived from teacher observations may be used in a variety of ways to inform classroom decision-making (for example, whether learners have achieved the learning objectives for a particular unit of instruction and are ready to progress to the next unit).
Types of observation that can be used to monitor progress and identify individual learning difficulties range from anecdotal records to checklists and rating scales. In some educational systems, teachers’ observations of learner performance may form an important part of the evidence that is used for external reporting to authorities, and may thus require detailed recording of classroom language use. However, when used for this purpose, observation needs to be conducted with a great deal of care and attention if it is to yield valid and reliable information. In this context, Rea-Dickins and Gardner (2000) have identified a number of sources of potential unreliability in teachers’ transcription and interpretation of classroom language samples that may affect the validity of the inferences that are made. They call for more research into the validity and reliability of observational assessment and highlight the need to include classroom observation skills in teacher professional development programmes (Rea-Dickins and Gardner, 2000: 238–239).

Portfolios

A portfolio is a purposeful collection of students’ work over time that contains samples of their language performance at different stages of completion, as well as the student’s own observations on his or her progress. Three types of portfolio have been identified, reflecting different purposes and features (Valencia and Calfee, 1991). The first is the ‘showcase’ portfolio, which represents a collection of a student’s best or favourite work. The entries in the showcase portfolio are selected by the student and thus portray an individual’s learning over time. No comparison with external standards or with other students is involved. Second, there is the ‘documentation’ portfolio, which contains systematic ongoing records of progress. The documentation portfolio may include observations, checklists, anecdotal records, interviews, classroom tests and performance assessments. The selection of entries may be made by either the teacher or the student. According to Valencia and Calfee (1991: 337), ‘the documentation resembles a scrapbook, providing evidence but not judging the quality of the activities’. Finally, the ‘evaluation’ portfolio, which is used as public evidence of learners’ achievement, is more standardized than either the showcase or documentation portfolio because of the need for comparability. The contents of the evaluation portfolio and the assessment criteria used are largely determined by external requirements, although there is some room for individual selection and reflection activities. In the context of language education programmes in the USA, Gottlieb and Nguyen (2007) describe what they call a ‘pivotal portfolio’ that combines the features of the showcase and documentation portfolio. It contains essential evidence of the student’s work, along with common assessments administered by all teachers, and follows the learner for the duration of the programme.

The use of portfolios as a means of recording and assessing progress offers a number of advantages to language teachers and learners. Not only does it provide a way of relating assessment closely to instruction and motivating learners (Fulcher, 1997), but it also offers learners the opportunity to reflect on their learning goals and strategies, thus promoting learner independence (Gottlieb and Nguyen, 2007).
Another claimed advantage of assessment portfolios is that they provide concrete evidence of development that can be used to demonstrate tangible achievement to external stakeholders in language programmes (Genesee and Upshur, 1996: 100). However, the introduction of portfolio assessment has not been without problems. There has been considerable debate in the research literature concerning issues such as the type and amount of student work that should be included in a portfolio, the extent to which students should be involved in selection of the entries and the amount of external assistance they should be allowed (Fulcher, 1997; Brown and Hudson, 1998; Hamp-Lyons and Condon, 2000). In addition, research studies have highlighted both technical and practical difficulties associated with portfolio use. These include:

• Low levels of agreement between assessors on the quality of language samples (Brindley, 2001b).
• Lack of comparability between the samples submitted (Hamp-Lyons and Condon, 2000).
• The time and expense associated with collecting and grading large numbers of student texts on a continuing basis, conducting standard-setting meetings and discussing portfolios with students on an individual basis (Weigle, 2002).

In spite of these potential difficulties, however, it has been argued that the positive impact of portfolios on both teachers and learners is in itself sufficient reason to continue their use, even if it cannot be demonstrated that portfolio assessment is technically more reliable than more traditional means of assessment (Fulcher, 1997; Hamp-Lyons and Condon, 2000). In addition, with the advent of new technology, the practical problems of data management and storage associated with paper-based portfolios do not arise, since the contents can be stored, displayed and transmitted electronically. A wide variety of work samples can now be captured in different electronic formats, ranging from video-recorded speech samples to writing assignments, and used by teachers, learners and relevant third parties.