# Function to process exams
process_exam <- function(exam_path, question_id) {
  # Read the exam export and keep the answers for the requested question
  exam <- read.csv(file.path("~", exam_path))
  exam_q1 <- as.data.table(exam)[Question.ID == question_id, c("Username", "Answer")]
  # Strip HTML tags from the answers
  exam_q1$Answer <- gsub("<.*?>", "", exam_q1$Answer)
  # Build a corpus and a document-feature matrix, one document per student
  exam_q1_corpus <- corpus(exam_q1, docid_field = "Username", text_field = "Answer")
  exam_q1_dfm <- dfm(exam_q1_corpus %>% tokens(), tolower = TRUE, remove_padding = TRUE)
  # All pairwise cosine similarities, used later for the density plots
  tstat_hist <- textstat_simil(exam_q1_dfm, method = "cosine", margin = "documents")
  tstat_hist_vector <- as.vector(tstat_hist)
  # Only the pairs with cosine similarity of at least 0.9
  tstat1 <- textstat_simil(exam_q1_dfm, method = "cosine", margin = "documents", min_simil = 0.9)
  potential_cheater <- as.data.frame(as.table(as.matrix(tstat1)))
  potential_cheater <- na.omit(potential_cheater)
  names(potential_cheater) <- c("id1", "id2", "cosine")
  # Drop self-pairs and identical answers (cosine == 1)
  potential_cheater <- potential_cheater %>%
    mutate(cheaters = ifelse((cosine != 1 & (id1 != id2)), 1, 0)) %>%
    filter(cheaters == 1)
  # Each pair appears twice (A-B and B-A); keep one row per similarity value
  potential_cheater <- potential_cheater[which(!duplicated(potential_cheater$cosine)), ]
  #list(tstat_hist_vector = tstat_hist_vector, potential_cheater = potential_cheater)
  # Create a list to keep track of the source file and question ID
  result <- list(
    file = exam_path,
    question_id = question_id,
    data = tstat_hist_vector,
    potential_cheater = potential_cheater)
}
path_files <- list.files(pattern = ".csv", path = "~/Dropbox/welda_website/cheating/csv")
online_files <- path_files[1:3]
offline_files <- path_files[4]
online_questions <- c(1, 2)
offline_questions <- c(18, 19)
Data Science in Action: Detecting Cheaters with Cosine Similarity
Introduction
In this blog, we will write code that identifies potential cheaters using the cosine similarity of their answers. A friend, who is an educator, asked me for a quick and efficient way to flag potential cheaters and review their submissions. Because the answers contain mathematical expressions that can be written in ambiguous ways, the features built into Blackboard were of little help. I therefore wrote some R code to help my friend pinpoint students who might be cheating. The student IDs were encrypted by my friend, so they carry no meaning for me. The following steps outline the procedure to detect potential cheaters:
Setting Up the libraries
Here, the libraries needed for the analysis are loaded. These include libraries for data manipulation (dplyr, data.table, tidyr), text analysis (quanteda, quanteda.textstats), visualization (ggplot2, kableExtra, igraph, networkD3, network, tidygraph, ggraph), and the knitr library for neat output formatting.
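The loading step itself is not shown in the post. A minimal sketch of it, assuming every package named above is available from CRAN, could check for missing packages first and attach them only when all are present:

```r
# Packages named in the text above; assumed to be installable from CRAN.
pkgs <- c("dplyr", "data.table", "tidyr",            # data manipulation
          "quanteda", "quanteda.textstats",          # text analysis
          "ggplot2", "kableExtra", "igraph",         # visualization
          "networkD3", "network", "tidygraph", "ggraph",
          "knitr")                                   # output formatting

# Report anything that is not installed instead of failing midway.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
} else {
  # character.only = TRUE lets library() accept the name as a string.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```

Several of these packages export functions with the same names (igraph and network, for example), so the order of attachment decides which version is found first.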
The process_exam function: This function performs the main cheating detection analysis. It takes an exam file and a question ID as inputs. It reads the exam file, selects the answers for the specified question, removes HTML tags, and constructs a document-feature matrix. Cosine similarities between answers are computed and potential cheaters are identified as those pairs of answers with cosine similarity above 0.9 (excluding identical answers).
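To make the similarity measure concrete, here is a small base R sketch that computes cosine similarity by hand. The three answers are made up for illustration, and the whitespace tokenizer is a simplification of what quanteda does, so treat it as a picture of the idea rather than a replica of process_exam:

```r
# Three made-up answers (hypothetical data, not from the exams).
answers <- c(s1 = "x equals two plus three",
             s2 = "<p>so x equals two plus three</p>",
             s3 = "the derivative is zero")

# Strip HTML tags with the same pattern process_exam uses, then split on spaces.
clean  <- gsub("<.*?>", "", answers)
tokens <- strsplit(tolower(clean), "\\s+")
names(tokens) <- names(answers)

# Build a small document-term count matrix: one row per answer.
vocab  <- sort(unique(unlist(tokens)))
counts <- t(vapply(tokens,
                   function(tk) as.numeric(table(factor(tk, levels = vocab))),
                   numeric(length(vocab))))

# Cosine similarity: dot product divided by the product of the vector norms.
norms  <- sqrt(rowSums(counts^2))
cosine <- (counts %*% t(counts)) / (norms %o% norms)
round(cosine, 2)
```

Here s1 and s2 share five of their words, giving 5 / sqrt(5 * 6), roughly 0.91, which would clear the 0.9 threshold. s3 shares no words with the others, so its similarity to them is 0.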
Online class and Offline class: These sections call the process_exam function for each exam file (online and offline) and each question. The result is a list of results for each file and question.
Online class
online_files <- paste0("Dropbox/welda_website/cheating/csv/", online_files)
# Process online exams
online_results <- lapply(online_files, function(file) {
  # Loop over the range of online question IDs
  lapply(1:2, function(question_id) {
    process_exam(file, question_id = question_id)
  })
})
Offline class
offline_files <- paste0("Dropbox/welda_website/cheating/csv/", offline_files)
# Process offline exams
offline_results <- lapply(offline_files, function(file) {
  # Loop over the range of offline question IDs
  lapply(18:19, function(question_id) {
    process_exam(file, question_id = question_id)
  })
})
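The nested lapply calls return a list of lists, which is easy to index incorrectly. The sketch below shows the shape with a hypothetical stand-in for process_exam (fake_process and the file names are placeholders, not part of the original analysis):

```r
# Hypothetical stand-in for process_exam that skips file reading.
fake_process <- function(file, question_id) {
  list(file = file, question_id = question_id, data = c(0.1, 0.5, 0.95))
}

files <- c("section_a.csv", "section_b.csv")  # placeholder file names

# Outer lapply walks the files, inner lapply walks the question IDs.
results <- lapply(files, function(file) {
  lapply(18:19, function(question_id) fake_process(file, question_id))
})

# First index picks the file; second index picks the question by position.
results[[2]][[1]]$file         # "section_b.csv"
results[[2]][[1]]$question_id  # 18
```

The second index is a position, not a question ID, which is why the offline results for questions 18 and 19 are later accessed with [[1]] and [[2]].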
Density of the cosine similarities
These sections create density plots of the cosine similarities between student answers. The plots show how answer similarity is distributed in each section, with a vertical line at 0.9 marking the threshold for flagging potential cheating.
# Combine the tstat vectors for each question in the online exams
online_tstats_q1 <- lapply(online_results, function(file_result) {
  file_result[[1]]$data
})
online_tstats_q2 <- lapply(online_results, function(file_result) {
  file_result[[2]]$data
})
# Combine the tstat vectors for each question in the offline exams
offline_tstats_q1 <- lapply(offline_results, function(file_result) {
  file_result[[1]]$data
})
offline_tstats_q2 <- lapply(offline_results, function(file_result) {
  file_result[[2]]$data
})
# Create a long data frame: one row per similarity value, labelled by section
tstat_data_q1 <- data.frame(
  tstats = c(unlist(online_tstats_q1[[1]]), unlist(online_tstats_q1[[2]]),
             unlist(online_tstats_q1[[3]]), unlist(offline_tstats_q1)),
  section = rep(c('online_1', 'online_2', 'online_3', 'offline'),
                times = c(length(unlist(online_tstats_q1[[1]])),
                          length(unlist(online_tstats_q1[[2]])),
                          length(unlist(online_tstats_q1[[3]])),
                          length(unlist(offline_tstats_q1))))
)
# Same layout for question 2 (the offline values must come from offline_tstats_q2)
tstat_data_q2 <- data.frame(
  tstats = c(unlist(online_tstats_q2[[1]]), unlist(online_tstats_q2[[2]]),
             unlist(online_tstats_q2[[3]]), unlist(offline_tstats_q2)),
  section = rep(c('online_1', 'online_2', 'online_3', 'offline'),
                times = c(length(unlist(online_tstats_q2[[1]])),
                          length(unlist(online_tstats_q2[[2]])),
                          length(unlist(online_tstats_q2[[3]])),
                          length(unlist(offline_tstats_q2))))
)
library(ggplot2)
library(tidyr)
# Plot the densities
ggplot(tstat_data_q1, aes(x = tstats, fill = section)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.9, linetype = "dashed", color = "red", size = 1) +
  labs(title = "Density of cosine similarity between questions from different exams - question 1",
       x = "Cosine Similarity",
       y = "Density") +
  scale_fill_manual(values = c("online_1" = "#FF6347", "online_2" = "#ADD8E6", "online_3" = "#90EE90", "offline" = "#800080")) +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  annotate("text", x = 0.9, y = 5, label = "Potential Cheaters", hjust = 1.2, color = "red")
When examining the distribution of cosine similarities, an intriguing observation arises: the offline class has more of its mass at low similarity values, while the online sections show a higher frequency of large cosine similarities. This suggests that students in the online class may be cheating more often than those in the offline class, although similar answers alone do not prove it. My friend noted that students tend to be more engaged in the offline class than in the online one, so this outcome is not entirely surprising; in fact, it aligns with our expectations.
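A density plot can be backed up with a single number per section: the share of answer pairs at or above the 0.9 threshold. The sketch below uses made-up values in the same layout as tstat_data_q1 (the toy data frame is an assumption for illustration, not the exam data):

```r
# Made-up similarity values for illustration only (not the exam data).
toy <- data.frame(
  tstats  = c(0.20, 0.35, 0.92, 0.95, 0.10, 0.30, 0.40, 0.91),
  section = rep(c("online_1", "offline"), each = 4)
)

# Share of answer pairs at or above the 0.9 threshold, per section.
share_above <- tapply(toy$tstats >= 0.9, toy$section, mean)
share_above
```

In this toy data, half of the online_1 pairs and a quarter of the offline pairs clear the threshold. The same tapply call should work on tstat_data_q1 or tstat_data_q2, since they have the same two columns.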
# Plot the densities
ggplot(tstat_data_q2, aes(x = tstats, fill = section)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.9, linetype = "dashed", color = "red", size = 1) +
  labs(title = "Density of cosine similarity between questions from different exams - question 2",
       x = "Cosine Similarity",
       y = "Density") +
  scale_fill_manual(values = c("online_1" = "#FF6347", "online_2" = "#ADD8E6", "online_3" = "#90EE90", "offline" = "#800080")) +
  theme_minimal() +
  theme(legend.title = element_blank()) +
  annotate("text", x = 0.9, y = 5, label = "Potential Cheaters", hjust = 1.2, color = "red")
Investigating the possible network of cheaters in one of the online sections
This section visualizes potential cheaters as a network graph, where nodes represent students and edges connect pairs of students whose answers were flagged as suspiciously similar. The graph helps show how potential cheaters are related to one another.
# Combine the cheaters vectors for each question in the online exams
online_cheaters <- lapply(online_results, function(file_result) {
  rbind(file_result[[1]]$potential_cheater,
        file_result[[2]]$potential_cheater)
})
# Combine the cheaters vectors for each question in the offline exams
offline_cheaters <- lapply(offline_results, function(file_result) {
  rbind(file_result[[1]]$potential_cheater,
        file_result[[2]]$potential_cheater)
})
# cheaters for section one
potential_cheater <- online_cheaters[[1]]
potential_cheater <- potential_cheater %>% arrange(desc(cosine))
knitr::kable(head(potential_cheater), booktabs = T) %>%
  kable_styling(position = "center", latex_options = "HOLD_position")
row | id1 | id2 | cosine | cheaters |
---|---|---|---|---|
10 | XDQ114120107 | BDY118120115 | 0.9426680 | 1 |
8 | YDS115120111 | TDQ110120107 | 0.9404906 | 1 |
14 | YDS115120110 | XDQ114120107 | 0.9400602 | 1 |
61 | YDH115120099 | GDI097120099 | 0.9389956 | 1 |
12 | ITT099110110 | BDY118120115 | 0.9311809 | 1 |
29 | RDA108120117 | HDJ098120100 | 0.9302268 | 1 |
# Create a data frame with potential cheaters
data <- data.frame(
  Source = potential_cheater$id1,
  Target = potential_cheater$id2
)
# Plot the network graph
p <- simpleNetwork(data, height = "150px", width = "100px",
  Source = 1,              # Column number of source
  Target = 2,              # Column number of target
  linkDistance = 10,       # Distance between nodes
  charge = -900,           # Strength of node repulsion
  fontSize = 24,           # Size of node names
  fontFamily = "serif",    # Font of node names
  linkColour = "#666",     # Colour of edges
  nodeColour = "#609960",  # Colour of nodes
  opacity = 0.9,           # Opacity of nodes
  zoom = TRUE              # Enable zoom
)
p
Upon studying the network graph of potential cheaters, we observe some intriguing patterns. There’s a clear network of students who have strikingly similar answers, indicative of potential cheating.
What’s particularly interesting is how this network might be able to tell us where the cheating originated. By examining the centrality of this network, we can identify which students are most connected to others - these are students whose answers bear similarity to many others’. In network theory, such nodes are often considered influential within the network, because they interact with a large number of other nodes.
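The simplest centrality measure, degree, can be computed without any graph library by counting how often each student appears in the flagged pairs. The sketch below uses hypothetical placeholder IDs in the same id1/id2 layout as potential_cheater. It is one possible way to rank students, not something the original analysis ran:

```r
# Hypothetical flagged pairs (placeholder IDs, not real students).
pairs <- data.frame(
  id1 = c("A", "A", "A", "B"),
  id2 = c("B", "C", "D", "C")
)

# Degree = number of flagged pairs each student appears in.
# as.character() guards against id columns stored as factors.
degree <- sort(table(c(as.character(pairs$id1), as.character(pairs$id2))),
               decreasing = TRUE)
degree
```

Student A appears in three pairs and therefore ranks first, which marks A as the most connected node in this toy network.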
However, it’s crucial to approach this with caution. While a student being at the ‘center’ of this network could imply that they are the source of the copied answers, there could be other explanations as well. It’s also possible that this student’s work was shared without their consent. As such, while this method can identify suspicious patterns, further investigation will be required to definitively determine any cases of academic dishonesty. Nonetheless, by scrutinizing such networks, we have a powerful tool to identify potential instances of cheating, and further probe into the specifics.