Introduction

This Notebook includes the data cleaning and analysis for "A mixed method study of self and peer assessment: Implications of grading online writing assignments on scientific news literacy". Specifically, it addresses research question two:

What is the inter-rater reliability among peer assessors in the four writing assignments?

The analyses for the other research questions are not documented here.

Terminology

Subject: An individual paper from a student for a given assignment.
Rater: A student peer reviewer.

Importantly, in most research on inter-rater reliability (IRR), the raters are subject experts and each rater rates every subject. This is very different from the current situation, where the raters are students, have a definite stake in the results, and rate only a fraction of the subjects.

Agreement

Agreement can be viewed in two ways. From the perspective of subjects, agreement refers to subjects that have a unanimous score among raters. This type of agreement will be termed simple agreement in this analysis. It is not uncommon to allow some tolerance when agreement is calculated, so that adjacent scores may be considered in agreement; e.g. a subject scored 8, 9, 9 might be considered unanimous among raters if tolerance = 1.

Note that tolerance should be applied judiciously to avoid "chaining" a series of non-adjacent scores; e.g. 7, 8, 9, 10 could be considered unanimous under tolerance algorithms that measure the distance between partially paired scores. In this analysis, tolerance is determined by the distance between the maximum and minimum scores, i.e. the range, so that the scores 6, 6, 7, 7 are unanimous with tolerance = 1, but 6, 6, 6, 8 are not.

Note that the final example illustrates the weakness of simple agreement: if you consider exact pairwise agreement among raters, there was actually better agreement in the second example than in the first:

Example 1: 6, 6, 7, 7
Example 2: 6, 6, 6, 8

Pair      Distance Ex 1   Distance Ex 2
1 vs 2    0               0
1 vs 3    1               0
1 vs 4    1               2
2 vs 3    1               0
2 vs 4    1               2
3 vs 4    0               2

In Example 1, 2 out of 6 (33%) ratings agree if tolerance = 0 whereas in Example 2, 3 out of 6 (50%) agree at the same level of tolerance. If tolerance = 1, then pairwise agreement is 100% for Example 1 and still 50% for Example 2.
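
For reference, the range-based and pairwise measures above can be computed directly with base R; a minimal sketch, separate from the analysis code:

# Simple (range-based) agreement: spread between the maximum and minimum score
diff(range(c(6, 6, 7, 7)))  # 1 -> unanimous at tolerance = 1
diff(range(c(6, 6, 6, 8)))  # 2 -> not unanimous at tolerance = 1
# Pairwise distances between raters, as in the table above
dist(c(6, 6, 7, 7))  # 2 of 6 pairs are 0 (33% exact agreement)
dist(c(6, 6, 6, 8))  # 3 of 6 pairs are 0 (50% exact agreement)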

Which type of agreement is appropriate for this project? The R package irr (for inter-rater reliability) implements simple agreement, but with a key weakness: subjects with missing ratings are dropped. For this project, few subjects have all 4 scores, so the majority of the data would be dropped. Pairwise agreement is not implemented in irr, but the idea is central to more sophisticated methods of assessing IRR.

Krippendorff's Alpha

IRR considers the pairwise agreement among raters for a given subject. A weakness of simple IRR metrics is that they do not account for the possibility that raters assigned their scores at random. Krippendorff's alpha (Kripp's alpha, hereafter) is a metric that examines the observed distribution of ratings and compares it to the distribution expected if scores were assigned at random by raters. Kripp's alpha ranges from -1 to 1, with 1 being perfect agreement among raters and 0 indicating that scores were assigned by chance. Importantly, Kripp's alpha is an estimate of the underlying parameter, and a 95% confidence interval can be determined with the kripp.boot package.

In addition, Kripp's alpha can be used with various types of scores, including binary, nominal, quantitative interval, and quantitative ratio data. For this study, the scores were treated as interval data. The algorithm supports multiple raters and missing data.
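
For illustration only (the calculation for this study appears later), Kripp's alpha can be computed with kripp.alpha from the irr package on a small rater-by-subject matrix; the scores below are invented for the example:

# Hypothetical example: 3 raters (rows) x 5 subjects (columns), one rating missing
toy_ratings <- rbind(c( 9,  8, 10, 7, 9),
                     c( 9,  8,  9, 7, 8),
                     c(NA,  9, 10, 7, 9))
kripp.alpha(toy_ratings, method="interval")  # printed result includes the alpha estimate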

Import, Format and Clean Data

Anonymous student scores for each assignment were imported into R from csv files.

assign_files <- list.files(path="data", pattern="Assignment", full.names=TRUE)
assign_files
## [1] "data/Assignment1.csv" "data/Assignment2.csv" "data/Assignment3.csv"
## [4] "data/Assignment4.csv"

The data are three-column comma-separated files. The first column is an integer score that should range from 5 to 10. The last two columns are numeric identifiers. The data will be imported with column types integer, character, and character ("icc"). This converts non-numeric values in the first column to NA, and reading the identifiers as character prevents R from treating these nominal variables as numeric.

A function will handle the import, drop rows with no data, and add a column identifying the assignment so that combined analyses across assignments are straightforward.

importAssignment <- function(assign_file){
  # Read the score as integer and the two IDs as character ("icc")
  assign_data <- read_delim(assign_file, delim=",", col_types="icc")
  # Drop rows with no score
  assign_data <- filter(assign_data, !is.na(PeerReviewScore))
  # Derive a readable assignment name from the file name, e.g. "Assignment 1"
  assign_name <- str_split(assign_file, pattern="/", simplify=TRUE)[2]
  assign_name <- str_replace(assign_name, ".csv", "")
  assign_name <- str_c(str_sub(assign_name, 1, 10), " ", str_sub(assign_name, 11, 11))
  assign_data <- mutate(assign_data, AssignName = assign_name)
  return(assign_data)
}

Import all files and create a single tibble that can be grouped by AssignName for subsequent analyses.

assigns <- lapply(assign_files, function(x)(importAssignment(x)))
## Warning: 36 parsing failures.
## row             col   expected actual                   file
##   2 PeerReviewScore an integer   NULL 'data/Assignment1.csv'
##  28 PeerReviewScore an integer   NULL 'data/Assignment1.csv'
##  40 PeerReviewScore an integer   NULL 'data/Assignment1.csv'
##  41 PeerReviewScore an integer   NULL 'data/Assignment1.csv'
##  65 PeerReviewScore an integer   NULL 'data/Assignment1.csv'
## ... ............... .......... ...... ......................
## See problems(...) for more details.
## Warning: 42 parsing failures.
## row             col   expected actual                   file
##   1 PeerReviewScore an integer   NULL 'data/Assignment2.csv'
##   4 PeerReviewScore an integer   NULL 'data/Assignment2.csv'
##  11 PeerReviewScore an integer   NULL 'data/Assignment2.csv'
##  36 PeerReviewScore an integer   NULL 'data/Assignment2.csv'
##  37 PeerReviewScore an integer   NULL 'data/Assignment2.csv'
## ... ............... .......... ...... ......................
## See problems(...) for more details.
## Warning: 36 parsing failures.
## row             col   expected actual                   file
##   3 PeerReviewScore an integer   NULL 'data/Assignment3.csv'
##  23 PeerReviewScore an integer   NULL 'data/Assignment3.csv'
##  30 PeerReviewScore an integer   NULL 'data/Assignment3.csv'
##  34 PeerReviewScore an integer   NULL 'data/Assignment3.csv'
##  36 PeerReviewScore an integer   NULL 'data/Assignment3.csv'
## ... ............... .......... ...... ......................
## See problems(...) for more details.
## Warning: Missing column names filled in: 'X4' [4], 'X5' [5], 'X6' [6],
## 'X7' [7]
## Warning: Unnamed `col_types` should have the same length as `col_names`.
## Using smaller of the two.
## Warning: 363 parsing failures.
## row             col   expected    actual                   file
##   1 NA              3 columns  7 columns 'data/Assignment4.csv'
##   2 PeerReviewScore an integer NULL      'data/Assignment4.csv'
##   2 NA              3 columns  7 columns 'data/Assignment4.csv'
##   3 NA              3 columns  7 columns 'data/Assignment4.csv'
##   4 NA              3 columns  7 columns 'data/Assignment4.csv'
## ... ............... .......... ......... ......................
## See problems(...) for more details.
assigns <- bind_rows(assigns)
assign_names <- pull(assigns, AssignName) %>%
  unique()
assigns %<>% 
  mutate(AssignName = parse_factor(AssignName, levels=assign_names))
assigns
## # A tibble: 1,125 x 4
##    PeerReviewScore ReviewerID StudentID AssignName  
##              <int> <chr>      <chr>     <fct>       
##  1               9 17255      17250     Assignment 1
##  2               7 17490      17250     Assignment 1
##  3              10 17492      17250     Assignment 1
##  4               9 17493      17457     Assignment 1
##  5               8 17465      17457     Assignment 1
##  6               8 16868      17457     Assignment 1
##  7               9 16658      17253     Assignment 1
##  8              10 17475      17253     Assignment 1
##  9               8 17476      17253     Assignment 1
## 10               9 17486      17253     Assignment 1
## # ... with 1,115 more rows

For the four assignments, there are 1125 scores in total.
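
The parsing warnings above come from rows whose PeerReviewScore was not an integer; those values were converted to NA and the rows were then dropped in importAssignment. If needed, the failures for a given file can be inspected on the raw read result (before filtering drops the problems attribute); a minimal sketch:

# Sketch: re-read one file and list its parsing problems
raw <- read_delim("data/Assignment1.csv", delim=",", col_types="icc")
problems(raw)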

Distribution of Peer Review Scores

Get a quick count of scores by assignment and visualize as a barplot.

assigns %>%
  group_by(PeerReviewScore) %>%
  ggplot() +
  aes(x=PeerReviewScore, fill=AssignName) +
  geom_bar(position="dodge", col="black") +
  scale_x_continuous(breaks=1:10) +
  scale_fill_viridis_d(labels=1:4) +
  ylab("Count") +
  xlab("Peer Review Assessment") +
  labs(fill="Assignment") +
  ggtitle("Distribution of Peer Review Scores")

There is one score of 4 which is not valid. The row will be removed.

assigns %<>%
  filter(PeerReviewScore > 4)

Confirm the modification.

assigns %>%
  group_by(PeerReviewScore) %>%
  ggplot() +
  aes(x=PeerReviewScore, fill=AssignName) +
  geom_bar(position="dodge", col="black") +
  scale_x_continuous(breaks=1:10) +
  scale_fill_viridis_d(labels=1:4) +
  ylab("Count") +
  xlab("Peer Review Assessment") +
  labs(fill="Assignment") +
  ggtitle("Distribution of Peer Review Scores")

Note that there is no score of 5 for Assignment 1.

An alternate way to visualize the data. This figure was included in the manuscript.

dist_plot <- assigns %>%
  count(AssignName, PeerReviewScore) %>%
  bind_rows(tibble(AssignName="Assignment 1", PeerReviewScore=5, n=0), .) %>%
  mutate(PeerReviewScore = as.factor(PeerReviewScore)) %>%
  ggplot() +
  aes(x=AssignName, y=n, fill=PeerReviewScore) +
  geom_col(position="dodge", color="black") +
  scale_fill_viridis_d() +
  labs(fill="Score", 
       y="Count", 
       x="", 
       title="Distribution of Peer Review Scores") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm"), 
        plot.title=element_text(hjust=0.5)) +
  guides(fill=guide_legend(nrow=1))
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
dist_plot

Save the plot.

ggsave("DistOfScore.png", dist_plot, width=6, height=4, units="in", dpi=300)

Another way to visualize the same data, as proportions.

assigns %>%
  count(AssignName, PeerReviewScore) %>%
  mutate(PeerReviewScore = as.factor(PeerReviewScore)) %>%
  ggplot() +
  aes(x=AssignName, y=n, fill=PeerReviewScore) +
  geom_col(position="fill", color="black") +
  scale_fill_viridis_d() +
  scale_y_continuous(labels=scales::percent, breaks=seq(from=0, to=1, by=0.1)) +
  labs(fill="Score", 
       y="Count", 
       x="", 
       title="Distribution of Peer Review Scores") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm"), 
        plot.title=element_text(hjust=0.5)) +
  guides(fill=guide_legend(nrow=1))

Both visualizations show that students tended to grade quite high, with 9 clearly the most common score regardless of assignment.

Distribution of Student Assignment Scores

The above visualizations show the distribution of scores with as many as four scores per subject. The distribution is similar if we determine the mean for each subject (rounding to a whole number).

assigns %>%
  group_by(StudentID, AssignName) %>%
  summarize(mean_score = round(mean(PeerReviewScore), digits=0)) %>%
  ungroup() %>%
  mutate(mean_score = factor(mean_score, levels=5:10)) %>%
  count(AssignName, mean_score) %>%
  ggplot() +
  aes(x=AssignName, y=n, fill=mean_score) +
  geom_col(position="dodge", color="black") +
  scale_fill_viridis_d() +
  labs(fill="Score", 
       y="Papers", 
       x="", 
       title="Distribution of Mean Peer Review Scores") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm")) +
  guides(fill=guide_legend(nrow=1))

As a proportions plot.

assigns %>%
  group_by(StudentID, AssignName) %>%
  summarize(mean_score = round(mean(PeerReviewScore), digits=0)) %>%
  ungroup() %>%
  mutate(mean_score = factor(mean_score, levels=5:10)) %>%
  count(AssignName, mean_score) %>%
  ggplot() +
  aes(x=AssignName, y=n, fill=mean_score) +
  geom_col(position="fill", color="black") +
  scale_fill_viridis_d() +
  scale_y_continuous(labels=scales::percent, breaks=seq(from=0, to=1, by=0.1)) +
  labs(fill="Score", 
       y="Papers (%)", 
       x="", 
       title="Distribution of Mean Peer Review Scores") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm")) +
  guides(fill=guide_legend(nrow=1))

Analysis of Variance of Student Mean Scores by Assignment

It seems that students' scores were similar across the four assignments, although it would not be unreasonable to expect students to improve over time. We can check this with an analysis of variance.

mean_scores <- assigns %>%
  filter(AssignName != "Assignment5") %>%
  group_by(StudentID, AssignName) %>%
  summarize(mean_score = round(mean(PeerReviewScore), digits=0)) %>%
  ungroup()
mean_scores
## # A tibble: 379 x 3
##    StudentID AssignName   mean_score
##    <chr>     <fct>             <dbl>
##  1 10119     Assignment 1         10
##  2 10119     Assignment 2          8
##  3 10119     Assignment 3         10
##  4 10119     Assignment 4          9
##  5 11105     Assignment 1          7
##  6 11105     Assignment 2          9
##  7 11105     Assignment 3          9
##  8 11105     Assignment 4          9
##  9 11149     Assignment 1          9
## 10 11149     Assignment 2         10
## # ... with 369 more rows

Perform Analysis of Variance.

aov(mean_score ~ AssignName, data=mean_scores) %>%
          summary()
##              Df Sum Sq Mean Sq F value Pr(>F)
## AssignName    3   1.67  0.5571   0.719  0.541
## Residuals   375 290.44  0.7745

There is no indication that scores are affected by assignment. Regardless of how the data are viewed, the peer review scores are quite high from the beginning.

Summary Table

Create a summary table for the assignments.

paper_count <- assigns %>%
  group_by(AssignName) %>%
  count(StudentID) %>%
  count(AssignName) %>%
  rename(subjects=n)
paper_count
## # A tibble: 4 x 2
## # Groups:   AssignName [4]
##   AssignName   subjects
##   <fct>           <int>
## 1 Assignment 1       96
## 2 Assignment 2       93
## 3 Assignment 3       96
## 4 Assignment 4       94

Get the number of scores by assignment.

score_count <- assigns %>%
  group_by(AssignName) %>%
  count(AssignName) %>%
  rename(scores=n)
score_count
## # A tibble: 4 x 2
## # Groups:   AssignName [4]
##   AssignName   scores
##   <fct>         <int>
## 1 Assignment 1    290
## 2 Assignment 2    279
## 3 Assignment 3    282
## 4 Assignment 4    273

Get mean and standard deviation of scores and join to above.

summary_tib <- assigns %>%
  group_by(AssignName) %>%
  summarize(mean_score=mean(PeerReviewScore), 
            sd_score=sd(PeerReviewScore)) %>% 
  left_join(score_count, ., by="AssignName") %>%
  left_join(paper_count, ., by="AssignName") %>%
  mutate(mean_score=signif(mean_score, digits=3), 
         sd_score=signif(sd_score, digits=3))
summary_tib
## # A tibble: 4 x 5
## # Groups:   AssignName [4]
##   AssignName   subjects scores mean_score sd_score
##   <fct>           <int>  <int>      <dbl>    <dbl>
## 1 Assignment 1       96    290       8.57     1.14
## 2 Assignment 2       93    279       8.67     1.11
## 3 Assignment 3       96    282       8.77     1.04
## 4 Assignment 4       94    273       8.64     1.03

The standard deviation is quite high. Look at this more closely by determining the score range for each paper, aggregated by assignment.

assigns %>% 
  group_by(AssignName, StudentID) %>%
  summarize(mean_score=mean(PeerReviewScore), 
            sd_score=sd(PeerReviewScore), 
            range_score=diff(range(PeerReviewScore))) %>%
  mutate(range_score=factor(range_score, levels=c(4, 3, 2, 1, 0))) %>%
  ggplot() +
  aes(x=AssignName, fill=range_score) +
  geom_bar(position="fill", color="black") +
  scale_y_continuous(labels=scales::percent, breaks=seq(from=0, to=1, by=0.1)) +
  scale_fill_viridis_d(option="cividis") +
  labs(fill="Score Range", 
       y="Number of Papers (%)", 
       x="", 
       title="Range of Peer Review Scores by Assignment") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm")) +
  guides(fill=guide_legend(nrow=1))

This is an intuitive way to look at agreement between raters for each subject. The majority of subjects had a score range of 1 or less. However, a sizable number have a larger range, which pushes the standard deviation above 1.

Measures of Inter-Rater Reliability

The data must be reformatted to determine IRR. Currently, the scores are linked to StudentID, the student who submitted the assignment. Duplicate entries for a student occur because there was more than one peer reviewer (ReviewerID). We need a column that clearly indicates that each StudentID has more than one PeerReviewScore. ReviewerID will not work because not all subjects were scored by all raters.

We can add a column that joins "Score" with an integer to index the multiple scores for each subject. Care is needed here, otherwise scores could be associated with the wrong student.

Use arrange to sort by StudentID and AssignName.

assigns %<>%
  arrange(StudentID, AssignName)

Count the number of times each student appears for each assignment.

score_count <- assigns %>%
  count(StudentID, AssignName)

Create the "label" for each student score.

# For each StudentID/AssignName combination, generate the sequence 1:n,
# flatten it, and prefix with "Score"
score_index <- mapply(`:`, 1, pull(score_count, n)) %>%
  c(recursive=TRUE) %>%
  str_c("Score", ., sep="")
head(score_index)
## [1] "Score1" "Score2" "Score1" "Score2" "Score3" "Score1"

Add this to tibble.

assigns %<>% 
  mutate(ScoreIndex = score_index)
assigns
## # A tibble: 1,124 x 5
##    PeerReviewScore ReviewerID StudentID AssignName   ScoreIndex
##              <int> <chr>      <chr>     <fct>        <chr>     
##  1               9 17469      10119     Assignment 1 Score1    
##  2              10 17256      10119     Assignment 1 Score2    
##  3               9 14642      10119     Assignment 2 Score1    
##  4               8 16905      10119     Assignment 2 Score2    
##  5               8 17483      10119     Assignment 2 Score3    
##  6              10 17463      10119     Assignment 3 Score1    
##  7               9 17502      10119     Assignment 3 Score2    
##  8              10 17480      10119     Assignment 3 Score3    
##  9              10 17265      10119     Assignment 4 Score1    
## 10              10 14169      10119     Assignment 4 Score2    
## # ... with 1,114 more rows
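
For reference, an equivalent way to build ScoreIndex entirely within the pipeline (not used here) would be to number the scores within each StudentID and AssignName group; a sketch:

# Sketch: equivalent construction of ScoreIndex with grouped row numbers
assigns %>%
  group_by(StudentID, AssignName) %>%
  mutate(ScoreIndex = str_c("Score", row_number())) %>%
  ungroup()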

The next step will convert the data to wide format using ScoreIndex to form the new columns.

assigns_wide <- assigns %>%
  select(PeerReviewScore, StudentID, AssignName, ScoreIndex) %>%
  pivot_wider(names_from=ScoreIndex, values_from=PeerReviewScore)
assigns_wide
## # A tibble: 379 x 6
##    StudentID AssignName   Score1 Score2 Score3 Score4
##    <chr>     <fct>         <int>  <int>  <int>  <int>
##  1 10119     Assignment 1      9     10     NA     NA
##  2 10119     Assignment 2      9      8      8     NA
##  3 10119     Assignment 3     10      9     10     NA
##  4 10119     Assignment 4     10     10      8     NA
##  5 11105     Assignment 1      7      7      7     NA
##  6 11105     Assignment 2     10      8      8     NA
##  7 11105     Assignment 3     10      8      8     NA
##  8 11105     Assignment 4      9      9      8     NA
##  9 11149     Assignment 1      9      9     10     NA
## 10 11149     Assignment 2     10     10      9     NA
## # ... with 369 more rows

Functions to determine IRR vary in how the matrix must be formatted. Some require subjects in rows and raters in columns. For others, the matrix is transposed.

Make subject x rater matrix.

scores_mat <- assigns_wide %>%
  select(Score1:Score4) %>%
  as.matrix()
rownames(scores_mat) <- pull(assigns_wide, StudentID)
head(scores_mat, n=10)
##       Score1 Score2 Score3 Score4
## 10119      9     10     NA     NA
## 10119      9      8      8     NA
## 10119     10      9     10     NA
## 10119     10     10      8     NA
## 11105      7      7      7     NA
## 11105     10      8      8     NA
## 11105     10      8      8     NA
## 11105      9      9      8     NA
## 11149      9      9     10     NA
## 11149     10     10      9     NA

Make rater x subject matrix for Kripp's alpha calculation.

kripps_mat <- t(scores_mat)
kripps_mat[, 1:10]
##        10119 10119 10119 10119 11105 11105 11105 11105 11149 11149
## Score1     9     9    10    10     7    10    10     9     9    10
## Score2    10     8     9    10     7     8     8     9     9    10
## Score3    NA     8    10     8     7     8     8     8    10     9
## Score4    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Split each matrix by assignment to create a list of matrices. A logical vector will be used to extract the rows for each assignment.

assign_list <- pull(assigns, AssignName) %>%
  levels()
scores_mat_list <- lapply(assign_list, function(x)(scores_mat[pull(assigns_wide, AssignName) == x, ]))
names(scores_mat_list) <- assign_list
head(scores_mat_list[["Assignment 1"]])
##       Score1 Score2 Score3 Score4
## 10119      9     10     NA     NA
## 11105      7      7      7     NA
## 11149      9      9     10     NA
## 12183      8      9      9     NA
## 12386      8      8      7     NA
## 12690      8      9     NA     NA

The Kripp's matrix must be split by column.

kripps_mat_list <- lapply(assign_list, function(x)(kripps_mat[, pull(assigns_wide, AssignName) == x]))
names(kripps_mat_list) <- assign_list
kripps_mat_list[["Assignment 1"]][, 1:6]
##        10119 11105 11149 12183 12386 12690
## Score1     9     7     9     8     8     8
## Score2    10     7     9     9     8     9
## Score3    NA     7    10     9     7    NA
## Score4    NA    NA    NA    NA    NA    NA

Agreement Among Raters

The first analysis of the reliability of student scores uses the simple idea of agreement, i.e. how often raters give the same score for an individual subject. For quantitative scores, you can allow a certain tolerance between scores; the most rigorous tolerance is zero. For these data, the scores range from 5 to 10, so a tolerance of 1 is quite permissive.

The function agree in the irr package can determine agreement, but it is not suitable for these data because of how it treats missing data. Every subject must have a score from every rater or the subject is dropped. For example, Assignment 1 has 96 subjects and as many as 4 raters. The agree function will only calculate agreement for the 9 subjects that have scores from all 4 raters. It drops most of the data!

I have written a similar function to determine agreement that allows you to specify the minimum number of raters and a tolerance. There must be at least 2 raters to determine agreement!

agreeSimple <- function(ratings, raters=2, tolerance=0){
  # At least 2 raters are required to determine agreement
  if(raters < 2){
    raters <- 2
  }
  # Keep only subjects (rows) with at least `raters` non-missing scores
  valid_mat <- !apply(ratings, 2, is.na)
  valid_count <- apply(valid_mat, 1, sum)
  valid_filter <- valid_count >= raters
  ratings <- ratings[valid_filter, ]
  # A subject is in agreement if its score range (max - min) is within the tolerance
  rating_spread <- apply(ratings, 1, function(x)(diff(range(x, na.rm=TRUE))))
  rating_agree <- sum(rating_spread <= tolerance)/length(rating_spread) * 100
  return(rating_agree)
}

Agreement Between Raters

The above method determines percent agreement per paper. It does not account for pairwise agreement between raters; e.g. if a paper has 3 raters (scores), then there are 3 pairwise possibilities: rater 1 vs rater 2, rater 1 vs rater 3, and rater 2 vs rater 3.

Create new function to determine pairwise agreement.

agreePairwise <- function(ratings, raters=2, tolerance=0){
  # At least 2 raters are required to determine agreement
  if(raters < 2){
    raters <- 2
  }
  # Keep only subjects (rows) with at least `raters` non-missing scores
  valid_mat <- !apply(ratings, 2, is.na)
  valid_count <- apply(valid_mat, 1, sum)
  valid_filter <- valid_count >= raters
  ratings <- ratings[valid_filter, ]
  # All pairwise distances between scores within each subject (NA where a score is missing)
  ratings_dist <- t(apply(ratings, 1, dist))
  ratings_agree <- ratings_dist <= tolerance
  # Percent of rater pairs, pooled across subjects, that fall within the tolerance
  agreement <- apply(ratings_agree, 1, sum, na.rm=TRUE)
  totals <- t(!apply(ratings_agree, 1, is.na))
  totals <- apply(totals, 1, sum)
  agreement_percent <- sum(agreement) / sum(totals) * 100
  return(agreement_percent)
}
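
As a quick sanity check (not part of the original analysis), both functions can be applied to the two worked examples from the Agreement section:

# Two subjects (rows) with the example ratings from above
toy <- rbind(c(6, 6, 7, 7),
             c(6, 6, 6, 8))
agreeSimple(toy, raters=2, tolerance=1)    # 50: only the first subject has a range <= 1
agreePairwise(toy, raters=2, tolerance=0)  # ~41.7: 5 of the 12 rater pairs match exactly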

Create tibble to hold results for raters and tolerance cutoffs.

rater_col <- rep(c(2, 2, 3, 3), 4)
tol_col <- rep(c(0, 1), 8)
assign_col <- rep(assign_list, each=4)
new_tib <- tibble(raters=rater_col, tolerance=tol_col, assignment=assign_col)
new_tib
## # A tibble: 16 x 3
##    raters tolerance assignment  
##     <dbl>     <dbl> <chr>       
##  1      2         0 Assignment 1
##  2      2         1 Assignment 1
##  3      3         0 Assignment 1
##  4      3         1 Assignment 1
##  5      2         0 Assignment 2
##  6      2         1 Assignment 2
##  7      3         0 Assignment 2
##  8      3         1 Assignment 2
##  9      2         0 Assignment 3
## 10      2         1 Assignment 3
## 11      3         0 Assignment 3
## 12      3         1 Assignment 3
## 13      2         0 Assignment 4
## 14      2         1 Assignment 4
## 15      3         0 Assignment 4
## 16      3         1 Assignment 4

Determine simple agreement and add to tibble.

my_agree <- vector("numeric", length=nrow(new_tib))
for(i in 1:nrow(new_tib)){
  ratings <- scores_mat_list[[pull(new_tib, assignment)[i]]]
  my_agree[i] <- agreeSimple(ratings, 
                       raters=pull(new_tib, raters)[i], 
                       tolerance=pull(new_tib, tolerance)[i])
}
new_tib %<>% 
  mutate(simple_agreement=my_agree)
new_tib
## # A tibble: 16 x 4
##    raters tolerance assignment   simple_agreement
##     <dbl>     <dbl> <chr>                   <dbl>
##  1      2         0 Assignment 1             11.5
##  2      2         1 Assignment 1             57.3
##  3      3         0 Assignment 1             11.2
##  4      3         1 Assignment 1             56.2
##  5      2         0 Assignment 2             15.1
##  6      2         1 Assignment 2             60.2
##  7      3         0 Assignment 2             14.4
##  8      3         1 Assignment 2             60  
##  9      2         0 Assignment 3             20.8
## 10      2         1 Assignment 3             67.7
## 11      3         0 Assignment 3             19.5
## 12      3         1 Assignment 3             66.7
## 13      2         0 Assignment 4             20.2
## 14      2         1 Assignment 4             59.6
## 15      3         0 Assignment 4             21.2
## 16      3         1 Assignment 4             56.5

Determine pairwise agreement and add to tibble.

my_agree <- vector("numeric", length=nrow(new_tib))
for(i in 1:nrow(new_tib)){
  ratings <- scores_mat_list[[pull(new_tib, assignment)[i]]]
  my_agree[i] <- agreePairwise(ratings, 
                       raters=pull(new_tib, raters)[i], 
                       tolerance=pull(new_tib, tolerance)[i])
}
new_tib %<>% 
  mutate(pairwise_agreement=my_agree)
new_tib
## # A tibble: 16 x 5
##    raters tolerance assignment   simple_agreement pairwise_agreement
##     <dbl>     <dbl> <chr>                   <dbl>              <dbl>
##  1      2         0 Assignment 1             11.5               33.2
##  2      2         1 Assignment 1             57.3               77.4
##  3      3         0 Assignment 1             11.2               33.7
##  4      3         1 Assignment 1             56.2               77.6
##  5      2         0 Assignment 2             15.1               32.6
##  6      2         1 Assignment 2             60.2               79.1
##  7      3         0 Assignment 2             14.4               32.6
##  8      3         1 Assignment 2             60                 79.2
##  9      2         0 Assignment 3             20.8               39.8
## 10      2         1 Assignment 3             67.7               82.8
## 11      3         0 Assignment 3             19.5               40  
## 12      3         1 Assignment 3             66.7               83.0
## 13      2         0 Assignment 4             20.2               37.1
## 14      2         1 Assignment 4             59.6               78.8
## 15      3         0 Assignment 4             21.2               38.0
## 16      3         1 Assignment 4             56.5               78.4

Whether the minimum is 2 or 3 raters makes little difference; the final visualization below will use a minimum of 2 raters.

Make a plot that summarizes the results: column plots with tolerance split across the panel columns and the minimum number of raters split across the panel rows.

my_colors <- brewer.pal(n=9, "Greys")[c(3, 7)]
new_tib %>%
  pivot_longer(cols=simple_agreement:pairwise_agreement, names_to="agreement", values_to="percent") %>%
  mutate(agreement=replace(agreement, agreement=="simple_agreement", "simple"),
         agreement=replace(agreement, agreement=="pairwise_agreement", "pairwise")) %>%
  mutate(assignment=as.factor(assignment),
         agreement=factor(agreement, levels=c("simple", "pairwise"))) %>%
  ggplot() +
  aes(fill=agreement, x=assignment, y=percent) +
  geom_col(position="dodge", color="black") +
  scale_y_continuous(limits=c(0, 100), breaks=seq(from=0, to=100, by=10)) +
  scale_fill_manual(values=my_colors) +
  facet_grid(raters ~ tolerance, 
             labeller=labeller(tolerance=c("0" = "tolerance 0", "1" = "tolerance 1"), 
                               raters=c("2" = "2 raters", "3" = "3 raters"))) +
  theme(axis.text.x=element_text(angle = -90)) +
  labs(fill="Agreement", 
       y="Agreement (%)", 
       x="", 
       title="Agreement on Peer Review Scores") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm"),
        legend.text=element_text()) +
  guides(fill=guide_legend(nrow=1))

The number of raters has minimal influence. It may be best to present only the pairwise agreement, since it is most similar to how Kripp's alpha is calculated in that it takes all ratings into account.

my_colors <- brewer.pal(n=9, "Greys")[c(3, 7)]
new_tib %>%
  pivot_longer(cols=simple_agreement:pairwise_agreement, names_to="agreement", values_to="percent") %>%
  mutate(agreement=replace(agreement, agreement=="simple_agreement", "simple"),
         agreement=replace(agreement, agreement=="pairwise_agreement", "pairwise")) %>%
  mutate(agreement=factor(agreement, levels=c("simple", "pairwise")), 
         tolerance=factor(tolerance)) %>%
  filter(agreement=="pairwise") %>%
  filter(raters==2) %>%
  ggplot() +
  aes(x=assignment, fill=tolerance, y=percent) +
  geom_col(position="dodge", color="black") +
  scale_fill_manual(values=my_colors) +
  labs(fill="Tolerance", 
       y="Agreement (%)", 
       x="", 
       title="Agreement on Peer Review Scores") +
  theme(legend.position="bottom", 
        legend.key.size=unit(0.6, "cm"),
        legend.text=element_text()) +
  guides(fill=guide_legend(nrow=1))

Krippendorff's Alpha

The function kripp.alpha in the irr package calculates Kripp's alpha. However, this statistic is a point estimate with some uncertainty, since it compares the observed distribution of ratings to the expected distribution. To account for this, a bootstrap method exists.

I used kripp.boot from the kripp.boot package to determine the mean alpha and the lower and upper bounds of a 95% confidence interval.

kripps_boot <- lapply(kripps_mat_list, kripp.boot, iter=5000, probs=c(0.025, 0.975), method="interval")
names(kripps_boot) <- assign_list

Convert to a tibble.

kripps_tib <- tibble(AssignName=names(kripps_boot), 
                     mean_alpha=sapply(kripps_boot, function(x)(x$mean.alpha)), 
                     upper_alpha=sapply(kripps_boot, function(x)(x$upper)), 
                     lower_alpha=sapply(kripps_boot, function(x)(x$lower)))
kripps_tib
## # A tibble: 4 x 4
##   AssignName   mean_alpha upper_alpha lower_alpha
##   <chr>             <dbl>       <dbl>       <dbl>
## 1 Assignment 1      0.297       0.392       0.195
## 2 Assignment 2      0.292       0.397       0.176
## 3 Assignment 3      0.402       0.484       0.314
## 4 Assignment 4      0.298       0.394       0.196

Visualize the result.

kripp_fig <- kripps_tib %>%
  filter(AssignName != "Assignment 5") %>%
  ggplot() +
  aes(x=AssignName, y=mean_alpha, ymin=lower_alpha, ymax=upper_alpha) +
  geom_point(size=4) +
  geom_errorbar(width=0.25) +
  labs(y="Alpha", 
       x="", 
       title="Krippendorf's Alpha with 95% Confidence Intervals") +
  theme(plot.title=element_text(hjust=0.5))
kripp_fig

ggsave("KrippAlpha.png", kripp_fig, width=6, height=4, units="in")

How should this result be interpreted? Kripp's alpha equals zero if scores were assigned by raters merely by chance and equals one if there is perfect agreement. In this case, the scores were not assigned by chance, but the low values suggest that the scores are inconsistent among raters. This is expected here because the students are not experts. In addition, each rater evaluated only a subset of subjects, so they were not exposed to a wide range of possibilities; e.g. some raters likely had both excellent and poor subjects, making the evaluation easier for them.

Kripp's alpha should be greater than 0.8 in high-stakes situations where the raters are experts, e.g. clinicians diagnosing a patient. For grading assignments that would be impossible without peer assessment, Kripp's alpha can aid instructors in implementing peer review, especially in optimizing the grading rubric or determining the number of peer reviews. In support of this, recent research suggests that an IRR between 0.21 and 0.40 can be considered fair reliability.

Session Information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kripp.boot_1.0.0   irr_0.84.1         lpSolve_5.6.13.3  
##  [4] magrittr_1.5       RColorBrewer_1.1-2 forcats_0.4.0     
##  [7] stringr_1.4.0      dplyr_0.8.3        purrr_0.3.2       
## [10] readr_1.3.1        tidyr_1.0.0        tibble_2.1.3      
## [13] ggplot2_3.2.1      tidyverse_1.2.1   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5  xfun_0.10         reshape2_1.4.3   
##  [4] haven_2.1.1       lattice_0.20-38   colorspace_1.4-1 
##  [7] vctrs_0.2.0       generics_0.0.2    viridisLite_0.3.0
## [10] htmltools_0.4.0   yaml_2.2.0        utf8_1.1.4       
## [13] rlang_0.4.0       pillar_1.4.2      glue_1.3.1       
## [16] withr_2.1.2       modelr_0.1.5      readxl_1.3.1     
## [19] plyr_1.8.4        lifecycle_0.1.0   munsell_0.5.0    
## [22] gtable_0.3.0      cellranger_1.1.0  rvest_0.3.4      
## [25] evaluate_0.14     labeling_0.3      knitr_1.25       
## [28] fansi_0.4.0       broom_0.5.2       Rcpp_1.0.2       
## [31] scales_1.0.0      backports_1.1.4   jsonlite_1.6     
## [34] hms_0.5.1         digest_0.6.21     stringi_1.4.3    
## [37] grid_3.6.1        cli_1.1.0         tools_3.6.1      
## [40] lazyeval_0.2.2    crayon_1.3.4      pkgconfig_2.0.3  
## [43] zeallot_0.1.0     xml2_1.2.2        lubridate_1.7.4  
## [46] assertthat_0.2.1  rmarkdown_1.16    httr_1.4.1       
## [49] rstudioapi_0.10   R6_2.4.0          nlme_3.1-140     
## [52] compiler_3.6.1