Cohen's Kappa

in #statistics, 8 years ago (edited)

Cohen's kappa is a statistic that measures the agreement between two raters who each sort the same items into categories (two categories in the simplest case). It can be useful for checking how closely two raters, or two different measuring techniques, agree about a certain dataset.

In this article I'm going to discuss, visualize and provide Python code for Cohen's kappa in an attempt to give you an intuitive understanding of how it works.

Thinking naively about agreement among raters

Imagine you are at a fair and there is a pie contest with two judges.
The judges decide which pies are delicious, and hence worthy of going on to the next round and continuing in the competition.
In the first round, they rate each of the 10 pies in the contest as either mediocre or delicious, and we would like to know how much the judges agreed with one another. Did Judge A and Judge B find the same pies delicious, and did they agree about the mediocre ones? The following table shows how the judges rated the pies.

                      Judge A
                      mediocre   delicious
Judge B   mediocre    2          1
          delicious   3          4
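
As a rough sketch of where such a table comes from, the snippet below tallies two lists of per-pie ratings into the same 2x2 layout (rows are Judge B, columns are Judge A). The individual ratings and their order are hypothetical, chosen only so that the counts match the table; `contingency_table` is an illustrative helper, not part of any library.

```python
import numpy as np

labels = ["mediocre", "delicious"]

# Hypothetical per-pie ratings: the order is made up, only the counts
# are taken from the table above.
judge_a = ["mediocre"] * 2 + ["delicious"] * 1 + ["mediocre"] * 3 + ["delicious"] * 4
judge_b = ["mediocre"] * 3 + ["delicious"] * 7

def contingency_table(row_rater, col_rater, labels):
    """Count how often each pair of labels was given to the same item."""
    index = {label: i for i, label in enumerate(labels)}
    table = np.zeros((len(labels), len(labels)), dtype=int)
    for row_label, col_label in zip(row_rater, col_rater):
        table[index[row_label], index[col_label]] += 1
    return table

# Rows are Judge B, columns are Judge A, as in the table.
pies = contingency_table(judge_b, judge_a, labels)
print(pies)
```

The diagonal cells hold the pies on which both judges gave the same label.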

Now you might be tempted to look at that data and say, "Well, easy: they agreed on 6 out of the 10 pies, so there is 60% agreement between them." This naive notion is known as percent agreement. It is simple enough to grasp and useful for quickly getting a rough idea about agreement. We can think about the previous table abstractly as follows:

            mediocre   delicious
mediocre    a          b
delicious   c          d

Then percent agreement is:

$$\text{percent agreement} = \frac{a + d}{a + b + c + d}$$

Or in words:

The sum of coinciding ratings divided by the total ratings.

The problem with percent agreement is that it ignores how much agreement could have occurred purely due to random chance.

And that is where Cohen's kappa comes into play. It builds on the idea of percent agreement but takes random chance agreement into account.
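
To get a feel for how much agreement chance alone can produce, here is a small simulation of two judges who label pies independently and at random. The probabilities 0.5 and 0.7 are assumptions taken from the share of pies each judge called delicious in the table above; the function name, the number of simulated pies and the seed are made up for illustration.

```python
import random

def simulate_chance_agreement(p_a_delicious, p_b_delicious, n_pies=100_000, seed=0):
    """Fraction of pies on which two independent random judges agree."""
    rng = random.Random(seed)
    agreements = 0
    for _ in range(n_pies):
        # Each judge decides on their own, ignoring the pie and the other judge.
        a_says_delicious = rng.random() < p_a_delicious
        b_says_delicious = rng.random() < p_b_delicious
        if a_says_delicious == b_says_delicious:
            agreements += 1
    return agreements / n_pies

rate = simulate_chance_agreement(0.5, 0.7)
print(round(rate, 2))
```

Even though these simulated judges never look at the pies, they should agree on roughly half of them, which is why a raw 60% is less impressive than it sounds.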

Before we move on, let's build a Python function for percent agreement and make it work for N possible rating values (e.g. mediocre, average, delicious).

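import numpy as np

def percent_agreement(x):
  # x is a square matrix of counts: rows are one rater, columns the other
  agreement_sum = 0
  for i in range(len(x)):
    # diagonal cells are the items both raters put in the same category
    agreement_sum += x[i, i]
  return agreement_sum / np.sum(x)

# or in the much shorter NumPy form
def percent_agreement(x):
  return x.diagonal().sum() / x.sum()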
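
A quick usage sketch, assuming the ratings are stored as a NumPy array with Judge B in the rows and Judge A in the columns. The function is repeated so the snippet runs on its own, and the 3x3 matrix is a hypothetical example added only to show that more than two rating values work the same way.

```python
import numpy as np

def percent_agreement(x):
    # Agreements sit on the diagonal of the count matrix.
    return x.diagonal().sum() / x.sum()

# The pie contest: rows are Judge B, columns are Judge A.
pies = np.array([[2, 1],
                 [3, 4]])

# Hypothetical three-level ratings (mediocre, average, delicious).
three_levels = np.array([[3, 1, 0],
                         [1, 4, 1],
                         [0, 2, 3]])

pies_agreement = percent_agreement(pies)
three_levels_agreement = percent_agreement(three_levels)
print(round(pies_agreement, 3), round(three_levels_agreement, 3))
```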

Taking random chance into account

So let's look at that table again:

Judge A
mediocre delicious
Judge B mediocre 2 1
delicious 3 4

We know we have an observed agreement of 0.6 or 60%. But we also know some percentage of that agreement is due to random chance.
So intuitively we know our equation has to look something like the following:

$$\text{agreement beyond chance} = \text{observed agreement} - \text{expected agreement}$$

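Let's start our search for expected agreement by using classical probability to look at each rater on his own. If the judges rated independently of one another, the probability that both give a pie the same label is the product of their individual probabilities for that label. By the law of large numbers, each judge's observed rating proportions are a reasonable estimate of those probabilities, so we can use them to estimate the agreement we should expect due to random chance.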

So here's what we know about the judges.

Judge A: He rated 5 pies as mediocre and 5 pies as delicious.

Judge B: He rated 3 pies as mediocre and 7 pies as delicious.

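So the expected probability that both judges rate a pie as mediocre is:

$$P(\text{both mediocre}) = \frac{5}{10} \times \frac{3}{10} = 0.15$$

And the expected probability that both judges rate a pie as delicious is:

$$P(\text{both delicious}) = \frac{5}{10} \times \frac{7}{10} = 0.35$$

When we add these up we get the expected agreement:

$$0.15 + 0.35 = 0.5$$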

Finally, we could just subtract 0.5 from 0.6 and get 0.1. But this raw difference doesn't tell us much on its own, and it is hard to compare from one case to the next.
One thing we might want to know is:

What percentage of the observed agreement is not due to random chance?

To answer this question we could do:

$$\frac{0.6 - 0.5}{0.6} \approx 0.17$$

But this still doesn't answer our primary question which is:

What is the level of agreement, taking chance into account?

Before we answer that let's code up everything we've done thus far.


def expected_agreement(x):
    """
    calculates the expected agreement of a 2x2 square matrix.
    """
    sum_total = x.sum()

    # calculate the probability of agreement on factor 1
    factor_b1 = (x[0,0] + x[0,1]) / sum_total
    factor_a1 = (x[0,0] + x[1,0]) / sum_total
    p1 = factor_b1 * factor_a1

    # calculate the probability of agreement on factor 2
    factor_b2 = (x[1,0] + x[1,1]) / sum_total
    factor_a2 = (x[0,1] + x[1,1]) / sum_total
    p2 = factor_b2 * factor_a2

    # total probability of agreement
    return p1 + p2
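
The function above is written for exactly two rating values. A possible generalization to N values, sketched here under the same independence assumption, multiplies the two raters' marginal proportions category by category and sums the products. The name `expected_agreement_n` is an illustrative choice.

```python
import numpy as np

def expected_agreement_n(x):
    """Expected chance agreement for a square N x N matrix of counts."""
    x = np.asarray(x, dtype=float)
    total = x.sum()
    row_probs = x.sum(axis=1) / total   # row rater's share per category
    col_probs = x.sum(axis=0) / total   # column rater's share per category
    # Chance of both picking category k is row_probs[k] * col_probs[k];
    # the dot product sums that over all categories.
    return float(np.dot(row_probs, col_probs))

pies = np.array([[2, 1],
                 [3, 4]])
pies_expected = expected_agreement_n(pies)
print(round(pies_expected, 3))
```

For the pie table this reproduces the hand calculation: 0.3 x 0.5 for mediocre plus 0.7 x 0.5 for delicious.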

Cohen's Kappa formula

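Finally, let's look at the formula for Cohen's kappa:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Here $p_o$ is the observed agreement (our percent agreement) and $p_e$ is the expected agreement due to random chance.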

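The formula expresses the agreement between the raters relative to the maximum agreement possible, which is obviously 1 (if you don't believe me, just look at our percent agreement equation again and imagine b and c are both zero). The expected agreement is subtracted from both the observed agreement and that maximum, which controls for the agreement due to random chance: the result is the agreement the raters achieved beyond chance, as a fraction of the most they could have achieved beyond chance.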
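
As a rough illustration of why the division matters, the sketch below applies the chance-corrected ratio, (observed - expected) / (1 - expected), to two cases with the same raw difference of 0.1. The first uses the pie contest numbers; the second case (0.95 observed, 0.85 expected) is hypothetical, and `chance_corrected` is just an illustrative name.

```python
def chance_corrected(observed, expected):
    """Agreement beyond chance as a share of the possible agreement beyond chance."""
    return (observed - expected) / (1 - expected)

# Pie contest: 0.1 above chance, out of 0.5 that was still available.
pie_case = chance_corrected(0.60, 0.50)

# Hypothetical case: also 0.1 above chance, but only 0.15 was available.
other_case = chance_corrected(0.95, 0.85)

print(round(pie_case, 3), round(other_case, 3))
```

The same raw difference counts for much more when chance already explains most of the possible agreement.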

This is the idea behind Cohen's kappa. Let's code that up in Python.


def cohens_kappa(x):
  observed_agreement_value = percent_agreement(x)
  expected_agreement_value = expected_agreement(x)
  max_agreement_possible = 1

  return ((observed_agreement_value - expected_agreement_value) /
          (max_agreement_possible - expected_agreement_value))

If we now apply that function to our dataset we get:

$$\kappa = \frac{0.6 - 0.5}{1 - 0.5} = 0.2$$

So 0.2 is our agreement value after controlling for random chance, quite different from 60% or 0.6, wouldn't you say?
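
For readers who want a single self-contained function, here is a minimal sketch that combines the steps for any square matrix of counts and checks it against the pie table and two extreme cases. The name `cohens_kappa_n` and the two extra matrices are hypothetical additions for illustration.

```python
import numpy as np

def cohens_kappa_n(x):
    """Cohen's kappa for a square N x N matrix of counts."""
    x = np.asarray(x, dtype=float)
    total = x.sum()
    observed = np.trace(x) / total
    # Product of the two raters' marginal counts, summed over categories.
    expected = np.dot(x.sum(axis=1), x.sum(axis=0)) / total ** 2
    return float((observed - expected) / (1 - expected))

pies = [[2, 1], [3, 4]]        # the pie contest
perfect = [[5, 0], [0, 5]]     # hypothetical: judges always agree
opposite = [[0, 5], [5, 0]]    # hypothetical: judges always disagree

kappa_pies = cohens_kappa_n(pies)
kappa_perfect = cohens_kappa_n(perfect)
kappa_opposite = cohens_kappa_n(opposite)
print(round(kappa_pies, 3), kappa_perfect, kappa_opposite)
```

Note that the sketch does not guard against an expected agreement of exactly 1 (both raters putting every item in the same single category), where the denominator is zero and kappa is undefined.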

And now we come to the question of how we should interpret that value. McHugh (2012) reports that Cohen suggested the following interpretation:

  • values <= 0 = no agreement
  • 0.01-0.20 = no to slight agreement
  • 0.21-0.40 = fair agreement
  • 0.41-0.60 = moderate agreement
  • 0.61-0.80 = substantial agreement
  • 0.81-1.00 = almost perfect agreement

This means that our initial percent agreement calculation, which produced 60% agreement and seemed substantial, actually represents no to merely slight agreement once run through Cohen's kappa.
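
The interpretation bands listed above can be turned into a small lookup, sketched here. One assumption to flag: the listed ranges leave small gaps (for example between 0.20 and 0.21), so this sketch treats each upper bound as inclusive and lets the next band start right after it. `interpret_kappa` is an illustrative name.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the interpretation bands listed in the article."""
    if kappa <= 0:
        return "no agreement"
    # Upper bounds are treated as inclusive (an assumption, see above).
    bands = [
        (0.20, "no to slight agreement"),
        (0.40, "fair agreement"),
        (0.60, "moderate agreement"),
        (0.80, "substantial agreement"),
        (1.00, "almost perfect agreement"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    raise ValueError("kappa cannot be greater than 1")

pie_label = interpret_kappa(0.2)
print(pie_label)
```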

That's the end, and thank you very much.

References

  1. Cohen's kappa. (2017, October 24). Retrieved November 12, 2017, from https://en.wikipedia.org/wiki/Cohen%27s_kappa
  2. Inter-rater reliability. (2017, October 20). Retrieved November 12, 2017, from https://en.wikipedia.org/wiki/Inter-rater_reliability
  3. Law of large numbers. (2017, October 09). Retrieved November 12, 2017, from https://en.wikipedia.org/wiki/Law_of_large_numbers
  4. McHugh, M. L. (2012, October). Interrater reliability: the kappa statistic. Retrieved November 12, 2017, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/