Teaching English as a Foreign Language, Second Edition
Assessment and Examinations
Test qualities
There remains one other important question to ask about any assessment of knowledge of the English language: ‘Does it work?’ Here again there may be at least four different ways in which this question may be interpreted. The first of these is revealed by the question ‘Does it measure consistently?’ A metre stick measures the same distance each time because it is rigid and accurately standardised against a given norm. A piece of elastic with a metre marked on it is very unlikely to measure the same every time. In this case the metre stick can be said to be a reliable measure. In the same way, reliability in instruments for measuring language ability is obviously highly desirable, but very difficult to achieve. Among the reasons for this are the effects of variation in pupil motivation, and of the range of tasks set in making an assessment. A pupil who is just not interested in doing a test will be unlikely to score highly on it. Generally speaking, the more instances of pupil language behaviour that can be incorporated into a test the better. It is for this reason that testing specialists have tended to prefer discrete item test batteries, in which a large number of different instances of language activity are used, to essay-type examinations, where the tasks set are seen as more limited in kind and number.
Variations in the conditions under which tests are taken can also affect reliability: small variations in timing where precise time limits are required, a stuffy room, the time of day when the test is taken, or other equally trivial-seeming factors may all distort test results. Perhaps most important of all in its consequences for test results is the reliability of the marker. This reliability may be high in objectively marked tests, such as multiple-choice tests, but can be low in free response tests, such as essays, if neither a structured approach nor multiple marking is used.
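One simple way to put a number on marker reliability is to have two markers score the same scripts independently and correlate their marks. The sketch below is only an illustration under stated assumptions: the Pearson correlation is one common choice and is not a procedure named in the text, and the function name `pearson` and the marks themselves are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical essay marks (out of 20) given by two markers to the same six scripts.
marker_a = [12, 15, 9, 18, 14, 7]
marker_b = [11, 16, 10, 17, 12, 8]

agreement = pearson(marker_a, marker_b)
print(f"inter-marker correlation: {agreement:.2f}")
```

A value close to 1 suggests the two markers rank the scripts in much the same way; a low value suggests that the marks depend heavily on who happened to do the marking.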
Determining test reliability requires a certain amount of technical know-how and familiarity with the statistical techniques which permit the calculation of a reliability coefficient. Guidance on these will be found in the books referred to for further reading at the end of this chapter.
The second way in which the question ‘Does it work?’ can be made more precise is by rephrasing it as ‘Does it distinguish between one pupil and another?’ A metre stick may be a suitable instrument for measuring the dimensions of an ordinary room, but it would not be suitable for measuring a motorway or the gap of a spark plug for a car. In one case the scale of the object to be measured is too great; in the other it is too small. Not only should the instrument which is used be appropriate to the thing being measured, but the scale on the instrument should be right too. A micrometer marked only in centimetres would not permit accurate measurement of watch parts; the scale needs to be in fractions of millimetres. Tests which have the right sort of scale may be said to discriminate well. Tests which are on the whole too easy or too difficult for the pupils who do them do not discriminate well: they do not spread the pupils out, since virtually all pupils score high marks or all pupils score low marks. Ideally the test should give a distribution which comes close to that of the normal distribution curve.
One needs to be careful in reading the literature on testing when the term discrimination index is encountered. This has little to do with discrimination in the sense discussed above. It refers rather to the product of statistical procedures which measure the extent to which any single item in a test measures the same thing as the whole of the test.
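As a rough illustration of an item discrimination index, the sketch below compares how often the top-scoring third and the bottom-scoring third of a group answer a single item correctly. It assumes the widely used upper-lower method; the text does not say which statistical procedure is intended, and the function name `discrimination_index` and all the scores are hypothetical.

```python
def discrimination_index(item_correct, total_scores):
    """Upper-lower discrimination index for a single test item.

    item_correct: 1 if the pupil answered this item correctly, else 0
    total_scores: each pupil's total score on the whole test (same order)
    Returns a value from -1 to 1: the proportion correct in the top third
    minus the proportion correct in the bottom third.
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("one item result and one total score per pupil")
    group_size = len(total_scores) // 3
    if group_size < 1:
        raise ValueError("need at least three pupils")
    # Rank pupil positions by whole-test score, highest first.
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    top, bottom = ranked[:group_size], ranked[-group_size:]
    p_top = sum(item_correct[i] for i in top) / group_size
    p_bottom = sum(item_correct[i] for i in bottom) / group_size
    return p_top - p_bottom


# Hypothetical results for nine pupils on a whole test and on two of its items.
total_scores = [28, 25, 24, 20, 19, 17, 12, 10, 8]
item_a = [1, 1, 1, 1, 0, 1, 0, 0, 1]  # mostly answered correctly by stronger pupils
item_b = [1, 0, 1, 1, 1, 0, 1, 1, 0]  # answered equally well by both groups

print(discrimination_index(item_a, total_scores))
print(discrimination_index(item_b, total_scores))
```

On these invented figures, item A separates the stronger pupils from the weaker ones, while item B does not and would be a candidate for revision or rejection.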
By calculating a discrimination index for each item in a test it is possible to select those items which are most efficient in distinguishing between the top one-third and the bottom one-third of any group for whom the test as a whole is about right. In other words, it will help to establish the measuring scale within the limits of the instrument itself and ensure that it is about right, giving a proper distribution of easy and difficult questions within the test. But a discrimination index has no absolute value; to get the overall level of difficulty of the test right requires a pragmatic approach with repeated retrials of the test items, accepting some and rejecting others until the correct combination has been achieved. Again, details of these technical matters will be found in the books for further reading.
The third way in which the ‘Does it work?’ question may be more fully specified is by asking ‘Does it measure what it is supposed to measure?’ A metre stick is very practical for measuring cloth but it is irrelevant for measuring language ability. ‘What it is supposed to measure’ in the case of English language tests is presumably ability in English language. The only way to determine how far a test actually does this is to compare the test results with some other outside measurement, some other way of estimating pupil ability, which ought to be at least as reliable and accurate as the test itself. Where the results of the outside measure match the results of the test reasonably closely, the test can be said to have empirical validity. Suitable outside measures are difficult to come by. So far the best criterion