Inter-rater Reliability
Once test development is complete and the tests/assessment procedures have been finalized, teachers have to score the test results and/or assess the learners in whatever form has been chosen. It is important that the scoring rules are as ‘objective’ as possible and, at a minimum, are applied in the same way to all test-takers. In this scoring role, teachers act as raters.
If there is a single teacher, s/he can score a test once (scoring 1) and, after some time (or perhaps after scoring another test), score that same test again (scoring 2). The consistency between these two scorings is called intra-rater reliability. By comparing scoring 1 and scoring 2, the intra-rater reliability can be established: does the rater apply the same criteria, with the same degree of severity or leniency, on both occasions? If all rating scores are the same, the intra-rater reliability is 1 (perfect agreement). If the two scorings differ, there is cause to discuss and/or adapt the way the scoring criteria are applied.
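As a minimal illustration of this comparison, the sketch below computes the proportion of items that received the same score on both occasions; the score lists and the 1–5 scale are assumptions for the example only, not taken from the text.

```python
# Minimal sketch: one rater scores the same performances twice.
# Scores and the 1-5 scale are illustrative assumptions.

def exact_agreement(scoring_1, scoring_2):
    """Proportion of items that received the same score on both occasions."""
    if len(scoring_1) != len(scoring_2):
        raise ValueError("Both scorings must cover the same items.")
    matches = sum(a == b for a, b in zip(scoring_1, scoring_2))
    return matches / len(scoring_1)

# Example: ten learner performances, scored twice by the same rater.
first_pass  = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
second_pass = [3, 4, 3, 5, 3, 4, 4, 2, 2, 5]

print(f"Exact agreement: {exact_agreement(first_pass, second_pass):.2f}")  # 0.80
```

A value of 1.00 corresponds to the case described above in which all rating scores are the same; lower values signal that the criteria were applied differently on the two occasions.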
If several teachers have to score the test results, inter-rater reliability (IRR) should be established. IRR is the degree of agreement between the raters. All raters should be trained to use the criteria of the scoring instrument, for example, a right-or-wrong decision as in a vocabulary test or the criteria in the SLPI examples, and to practice applying these criteria to actual sign language performances of learners. Rater training is crucial for achieving IRR: do all teachers have the same understanding, do they follow the same procedures, and is the scoring fair to the learner/test-taker? If no standards are available (e.g. when no description of the relevant sign language exists), it is still very important that the raters’ ‘gut feelings’ and language intuitions are made as explicit as possible, by and to all scorers or raters.
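The text does not prescribe a particular agreement statistic; one common choice for two raters is Cohen’s kappa, which corrects the observed agreement for agreement expected by chance. The sketch below is only an illustration under that assumption, with made-up scores on a 1–5 scale.

```python
# Minimal sketch: Cohen's kappa for two raters scoring the same performances.
# Cohen's kappa is one possible agreement statistic; scores are illustrative.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same items.")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Expected chance agreement from each rater's marginal score distribution.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)

# Example: two trained raters score the same ten performances.
rater_1 = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
rater_2 = [3, 5, 2, 5, 3, 4, 3, 2, 3, 5]

print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")  # 0.73
```

Low values would suggest that the raters do not yet share the same understanding of the criteria and that further rater training or discussion is needed.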