Teaching English as a Foreign Language, Second Edition



Test qualities
There remains one other important question to ask about
any assessment of knowledge of the English language—‘Does
it work?’ Here again there may be at least four different ways
in which this question may be interpreted. The first of these
is revealed by the question ‘Does it measure consistently?’ A
metre stick measures the same distance each time because it
is rigid and accurately standardised against a given norm. A
piece of elastic with a metre marked on it is very unlikely to


Assessment and Examinations
160
measure the same every time. In this case the metre stick can
be said to be a reliable measure. In the same way reliability in
instruments for measuring language ability is obviously
highly desirable, but very difficult to achieve. Among the
reasons for this are the effects of variation in pupil
motivation, and of the range of tasks set in making an
assessment. A pupil who is just not interested in doing a test
will be unlikely to score highly on it. Generally speaking, the
more instances of pupil language behaviour that can be
incorporated into a test, the better. It is for this reason that
testing specialists have tended to prefer discrete item test
batteries in which a large number of different instances of
language activity are used, to essay type examinations where
the tasks set are seen as more limited in kind and number.
Variations in the conditions under which tests are taken can
also affect reliability: small variations in timing where
precise time limits are required, for example, or a stuffy room,
the time of day when the test is taken, or other equally
trivial-seeming factors may all distort test results. Perhaps most
important of all in its consequences on test results is the
reliability of the marker. This reliability may be high in
objectively marked tests—like multiple-choice tests—but can
be low in free response tests—like essays—if a structured
approach or multiple marking are not used. Determining test
reliability requires a certain amount of technical know-how
and familiarity with the statistical techniques which permit
the calculation of a reliability coefficient. Guidance on these
will be found in the books referred to for further reading at
the end of this chapter.
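As a rough illustration of what such a calculation involves, the sketch below estimates a split-half reliability coefficient in Python. It assumes that each pupil's answers are recorded as 1 (correct) or 0 (incorrect) for every item, correlates each pupil's total on the odd-numbered items with the total on the even-numbered items, and then applies the Spearman-Brown correction to estimate the reliability of the full-length test. The function names and the data layout are illustrative assumptions, and this is only one of several accepted ways of estimating reliability.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores.

    Assumes both lists show some variation; if every score in a list
    is identical the denominator is zero and the division fails.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Estimate test reliability from one row of 0/1 item scores per pupil.

    The layout (a list of rows, one row per pupil) is an assumption
    made for this sketch.
    """
    # Total each pupil's score on the two halves of the test.
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    # Spearman-Brown correction: step up from half-length to full-length test.
    return 2 * r_half / (1 + r_half)
```

A coefficient close to 1 suggests that the two halves rank the pupils in much the same order; a value well below that suggests the test is measuring inconsistently.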
The second way in which the question ‘Does it work?’ can
be made more precise is by rephrasing it as ‘Does it
distinguish between one pupil and another?’ A metre stick
may be a suitable instrument for measuring the dimensions
of an ordinary room, but it would not be suitable for
measuring a motorway or the gap of a spark plug for a car. In
one case the scale of the object to be measured is too great, in
the other it is too small. Not only should the instrument
which is used be appropriate to the thing being measured but
the scale on the instrument should be right too. A micrometer
marked only in centimetres would not permit accurate
measurement of watch parts; the scale needs to be in fractions
of millimetres. Tests which have the right sort of scale may be
said to discriminate well. Tests which are on the whole too
easy or too difficult for the pupils who do them do not
discriminate well: they do not spread the pupils out, since
virtually all pupils score high marks or all pupils score low
marks. Ideally the test should give a distribution which
comes close to that of the normal distribution curve.
One needs to be careful in reading the literature on testing
when the term discrimination index is encountered. This has
little to do with discrimination in the sense discussed above.
It refers rather to the product of statistical procedures which
measure the extent to which any single item in a test
measures the same thing as the whole of the test. By
calculating a discrimination index for each item in a test it is
possible to select those items which are most efficient in
distinguishing between the top one-third and the bottom
one-third of any group for whom the test as a whole is about
right. In other words, it will help to establish the measuring
scale within the limits of the instrument itself and ensure that
this scale is about right, giving a proper distribution of easy and
difficult questions within the test. But a discrimination index
has no absolute value; to get the overall level of difficulty of
the test right requires a pragmatic approach with repeated
retrials of the test items, accepting some and rejecting others
until the correct combination has been achieved. Again,
details of these technical matters will be found in the books
for further reading.
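One simple form of the calculation can be sketched as follows. The Python function below ranks the pupils by total score, takes the top third and the bottom third, and compares how many pupils in each group answered a given item correctly. The formula used here (the difference between the two counts divided by the size of one group) is an assumption chosen for illustration; other formulations, including correlation-based ones, are also in use. The function name and the 0/1 data layout are likewise hypothetical.

```python
def discrimination_index(item_scores, item):
    """Upper-lower discrimination index for one item.

    item_scores: one row of 0/1 item scores per pupil (assumed layout).
    item: zero-based position of the item within each row.
    Returns a value between -1 and 1; values near 1 mean the item is
    answered correctly by strong pupils and incorrectly by weak ones.
    """
    # Rank pupils from highest to lowest total score.
    ranked = sorted(item_scores, key=sum, reverse=True)
    group_size = len(ranked) // 3
    if group_size == 0:
        raise ValueError("need at least three pupils to form the groups")
    top = ranked[:group_size]
    bottom = ranked[-group_size:]
    # Count correct answers to this item in each group.
    top_correct = sum(row[item] for row in top)
    bottom_correct = sum(row[item] for row in bottom)
    return (top_correct - bottom_correct) / group_size
```

Items with an index near zero, or a negative one, do little to separate stronger pupils from weaker ones and are candidates for rejection during the retrials described above.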
The third way in which the ‘Does it work?’ question may
be more fully specified is by asking ‘Does it measure what it
is supposed to measure?’ A metre stick is very practical for
measuring cloth but it is irrelevant for measuring language
ability. ‘What it is supposed to measure’ in the case of English
language tests is presumably ability in English language, and
the only way that the extent to which a test actually does this
can be determined is by comparing the test results with some
other outside measurement, some other way of estimating
pupil ability, a way which ought to be at least as reliable and
accurate as the test itself. Where the results of the outside
measure match the results of the test reasonably closely the
test can be said to have empirical validity. Suitable outside
measures are difficult to come by. So far the best criterion


