Inter-rater reliability

In statistics, inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters. It gives a score of how much , or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.

There are a number of statistics which can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint-probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient and intra-class correlation.

For any task in which multiple raters are useful, raters are expected to disagree about the observed target. By contrast, situations involving unambiguous measurement, such as simple counting tasks (e.g. number of potential customers entering a store), often do not require more than one person performing the measurement. Measurement involving ambiguity in characteristics of interest in the rating target are generally improved with multiple trained raters. Such measurement tasks often involve subjective judgment of quality (examples include ratings of physician 'bedside manner', evaluation of witness credibility by a jury, and presentation skill of a speaker).

Variation across raters in the measurement procedures and variability in interpretation of measurement results are two examples of sources of error variance in rating measurements. Clearly stated guidelines for rendering ratings are necessary for reliability in ambiguous or challenging measurement scenarios. Without scoring guidelines, ratings are increasingly affected by experimenter's bias, that is, a tendency of rating values to drift towards what is expected by the rater. During processes involving repeated measurements, correction of rater drift can be addressed through periodic retraining to ensure that raters understand guidelines and measurement goals.

There are several operational definitions of "inter-rater reliability" in use by Examination Boards, reflecting different viewpoints about what is reliable agreement between raters.

...
Wikipedia