This post is a companion for the short paper submitted to EC-TEL 2018 conference. It contains details of the whole study and some more information such as figures, which does not fit in to the paper. For the analysis we used Open University Learning Analytics dataset.
This is draft version of companion, if you spot an error, typo or you have any comment, please let me know.
Setup
First we will load the required packages:
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(oulad)
library(matlib)
library(DiagrammeR)
Data
We will be using data from OULAD and we will stick with only one course presentation, namely FFF/2014J. At first, the student Virtual Learning Environment (VLE) activity logs and list of available VLE “resources”" is extracted:
# available VLE resources
filtered_vle <- vle %>%
filter(code_module == "FFF",
code_presentation == "2014J")
str(filtered_vle)
## 'data.frame': 452 obs. of 6 variables:
## $ id_site : int 883082 883294 882713 883163 882561 882932 883195 882597 882985 883228 ...
## $ code_module : chr "FFF" "FFF" "FFF" "FFF" ...
## $ code_presentation: chr "2014J" "2014J" "2014J" "2014J" ...
## $ activity_type : chr "subpage" "url" "questionnaire" "url" ...
## $ week_from : int NA NA 19 14 9 NA 18 29 NA 23 ...
## $ week_to : int NA NA 19 14 9 NA 18 29 NA 23 ...
# student activity logs
filtered_student_vle <- student_vle %>%
filter(code_module == "FFF",
code_presentation == "2014J")
str(filtered_student_vle)
## 'data.frame': 1210359 obs. of 6 variables:
## $ code_module : chr "FFF" "FFF" "FFF" "FFF" ...
## $ code_presentation: chr "2014J" "2014J" "2014J" "2014J" ...
## $ id_student : int 2398260 2398260 2398260 2398260 2398260 2398260 2398260 2398260 2398260 2398260 ...
## $ id_site : int 883041 882602 882587 883142 883092 883064 883037 883060 882670 882663 ...
## $ date : int -18 -18 -18 -18 -18 -18 -18 -18 -18 -18 ...
## $ sum_click : int 1 1 2 1 1 1 1 1 1 5 ...
It can be observed that there are 452 resources available to study for the course and more than 1.2 million records of student activity in VLE.
Since the first assessment will be considered in part of our research, the information
is extracted from data.frame assessment
and week, in which the assessment has cut-off date is computed:
assessment_day <- assessment %>%
filter(code_module == "FFF",
code_presentation == "2014J",
assessment_type == "TMA") %>%
select(id_assessment, date) %>%
rename(assessment_day = date) %>%
mutate(assessment_week = ceiling(assessment_day/7))
assessment_day
## id_assessment assessment_day assessment_week
## 1 34899 24 4
## 2 34900 52 8
## 3 34901 94 14
## 4 34902 136 20
## 5 34903 199 29
We will also need the results of the first assessment:
assessments <- assessment %>%
filter(code_module == "FFF",
code_presentation == "2014J",
assessment_type == "TMA")
student_assessment_results <- student_assessment %>%
inner_join(assessments, by="id_assessment") %>%
select(id_student, id_assessment, date_submitted, is_banked, score)
head(student_assessment_results)
## id_student id_assessment date_submitted is_banked score
## 1 560047 34899 24 FALSE 84
## 2 560459 34899 29 FALSE 48
## 3 560645 34899 23 FALSE 100
## 4 560765 34899 23 FALSE 78
## 5 561231 34899 22 FALSE 80
## 6 562754 34899 21 FALSE 76
Finally we will need student final outcome of the course to be able those who withdrawn:
student_final_results <- student %>%
filter(code_module == "FFF",
code_presentation == "2014J") %>%
select(id_student, final_result)
student_unregistration <- student_registration %>%
filter(code_module == "FFF",
code_presentation == "2014J") %>%
select(id_student, date_unregistration)
student_final_results <- student_final_results %>%
left_join(student_unregistration, by = "id_student") %>%
mutate(date_unregistration = ceiling(date_unregistration/7))
Next the study plan information is refined:
# study plan
plan <- filtered_vle %>%
mutate(week = week_from) %>%
select(id_site,
activity_type,
week) %>%
arrange(week)
head(plan)
## id_site activity_type week
## 1 882707 page 1
## 2 883183 url 1
## 3 883177 url 1
## 4 883233 url 1
## 5 882669 oucontent 1
## 6 883040 subpage 1
# count how many planned resources are scheduled for the week
planed_resources <- plan %>%
group_by(week) %>%
summarise(max_resources = n()) %>%
ungroup() %>%
filter(!is.na(week))
head(planed_resources)
## # A tibble: 6 x 2
## week max_resources
## <int> <int>
## 1 1 13
## 2 2 5
## 3 3 5
## 4 4 1
## 5 5 6
## 6 7 2
The real student behaviour was prepared as follows. The activities from before the start of
the course were excluded; for every log entry week, in which student did the activity,
is computed from the date
and finally only columns of interest were extracted:
reality <- filtered_student_vle %>%
filter(date >= 0L) %>%
mutate(date = if_else(date == 0L, 1L, date),
week = ceiling(date/7)) %>%
select(id_student,
id_site,
week)
head(reality)
## id_student id_site week
## 1 489493 882537 1
## 2 489553 882545 1
## 3 489493 883040 1
## 4 489493 882587 1
## 5 489493 882545 1
## 6 489493 882545 1
Now we can combine real student behaviour with the plan:
data_combined <- reality %>%
rename(student_week = week) %>%
left_join(plan, by=c("id_site")) %>%
unique() %>%
rename(planned_week = week)
# example
data_combined %>%
filter(id_student == 35747) %>%
head()
## id_student id_site student_week activity_type planned_week
## 1 35747 882551 1 forumng NA
## 2 35747 882609 1 oucontent NA
## 3 35747 882545 1 forumng NA
## 4 35747 882547 1 forumng NA
## 5 35747 883092 1 subpage NA
## 6 35747 882537 1 homepage NA
If we look at plan, we can observe that there are planned activities up to week 29:
plan %>%
select(week) %>%
max(na.rm = TRUE)
## [1] 29
Thus we will focus only on first 29 weeks of studies.
We can now do activity categorization for the purpose of creating the Markov chain:
# find out if student did something in the week according to the plan
student_activities <- data_combined %>%
select(-id_site) %>%
mutate(planned_week = if_else(is.na(planned_week),-1L,planned_week),
planned = if_else(student_week == planned_week, "Y","N")) %>%
group_by(id_student, student_week, planned) %>%
count() %>%
ungroup() %>%
rename(week = student_week)
# get maximum week in plan
maximal_planned_week <- planed_resources %>%
summarise(max(week)) %>%
as.integer()
# classify student activities:
# for planned activity we have 0 for none, E for some and A for all
# for nonplanned activity we have 0 for none and E for did something
# format PlannedNonplanned, example: A0 - all from the plan, nothing else
student_activities_classified <- student_activities %>%
filter(week <= maximal_planned_week) %>%
left_join(planed_resources, by = "week") %>%
mutate(max_resources = if_else(planned == "N", n, max_resources),
percentage = n/max_resources # proportion of planned activities in week
) %>%
select(-n, -max_resources) %>%
group_by(id_student, week) %>%
# spread proportion of planned and nonplanned activity proportion for each week
spread(planned, percentage, fill = 0) %>%
mutate(class = if_else(Y == 0, # if no planned activity
if_else(N == 0, # if no other activity
"00",
"0E"),
# if all planned activity else part of planned activity
if_else(Y == 1,
if_else(N == 0, # if no other activity
"A0",
"AE"),
if_else(N == 0, # if no other activity
"E0",
"EE")
)
)
) %>%
select(-N, -Y)
head(student_activities_classified)
## # A tibble: 6 x 3
## # Groups: id_student, week [6]
## id_student week class
## <int> <dbl> <chr>
## 1 31296 2 0E
## 2 31296 3 0E
## 3 31296 4 0E
## 4 31296 5 0E
## 5 31296 6 0E
## 6 31296 8 0E
The problem we need to work on are students who do not engage with the VLE. We need to include them and we can combine all the data with the final student outcome:
unique_students <- student_final_results %>%
select(id_student) %>%
unique() %>%
unlist() %>%
unname()
weeks <- rep(1:maximal_planned_week,
length(unique_students))
students_with_all_weeks <- tibble(id_student = sort(rep(unique_students,
maximal_planned_week)
),
week = weeks)
students_activities_classified_with_final_result <- students_with_all_weeks %>%
left_join(student_activities_classified, by=c("id_student","week")) %>%
mutate(class = if_else(is.na(class), "00", class)) %>%
left_join(student_final_results, by="id_student") %>%
mutate(class = if_else(week >= date_unregistration & !is.na(date_unregistration),
"Withdrawn",
class)
) %>%
select(-date_unregistration) %>%
mutate(final_result = if_else(final_result == "Distinction", "Pass", final_result))
head(students_activities_classified_with_final_result)
## # A tibble: 6 x 4
## id_student week class final_result
## <int> <dbl> <chr> <chr>
## 1 31296 1 00 Fail
## 2 31296 2 0E Fail
## 3 31296 3 0E Fail
## 4 31296 4 0E Fail
## 5 31296 5 0E Fail
## 6 31296 6 0E Fail
Analysis
Whole cohort
At first we will check how many samples we got for each activity cathegory (state):
students_activities_classified_with_final_result %>%
group_by(class) %>%
count()
## # A tibble: 6 x 2
## # Groups: class [6]
## class n
## <chr> <int>
## 1 00 12146
## 2 0E 28155
## 3 AE 1434
## 4 E0 7
## 5 EE 7782
## 6 Withdrawn 19061
We can observe that A0
is not presented in data and E0
has only 7 representatives,
thus we will exclude them:
students_to_remove <- students_activities_classified_with_final_result %>%
filter(class == "A0" | class == "E0") %>%
select(id_student) %>%
unique() %>%
.[[1]]
students_activities_classified_with_final_result <- students_activities_classified_with_final_result %>%
filter(!id_student %in% students_to_remove)
Since we will be constructing more than one chain, we define the function to do that:
create_transition_matrix <- function(students_activities,
states){
# template
transition_matrix <- matrix(data = numeric(),
nrow = length(states),
ncol = length(states),
dimnames = list(states, states))
# prepare data
data_for_markov_chain <- students_activities %>%
select(-final_result) %>%
arrange(id_student, week)
state_vector <- data_for_markov_chain %>%
select(class) %>%
unlist() %>%
unname()
state_vector <- c(state_vector[2:length(state_vector)],
state_vector[1])
data_for_markov_chain$state_2 <- state_vector
data_for_markov_chain <- data_for_markov_chain %>%
rename(state_1 = class) %>%
filter(week != max(week)) %>%
select(state_1, state_2)
# count transitions
for(state1 in states){
for(state2 in states){
n <- data_for_markov_chain %>%
filter(state_1 == state1,
state_2 == state2) %>%
count() %>%
as.numeric()
transition_matrix[state1,state2] <- n
}
}
# normalize be row sums
normalization_matrix <- transition_matrix %>%
apply(1,sum) %>%
rep(length(states)) %>%
matrix(ncol = length(states), nrow = length(states))
transition_matrix <- transition_matrix/normalization_matrix
# return
transition_matrix
}
Finally we can compute our first Markov chain transition matrix:
# PLAN | OTHER ACTIVITIES OUTSIDE PLAN
# possible values:
# PLAN: 0, E, A; OTHER: 0, E
states_P_1 <- c("00",
"0E",
# "E0", # filtered out
"EE",
#"A0", #f iltered out
"AE",
"Withdrawn")
transition_matrix_P_1 <- create_transition_matrix(students_activities_classified_with_final_result,
states_P_1)
# show transition matrix
(P_1 <- round(transition_matrix_P_1, 2)*100)
## 00 0E EE AE Withdrawn
## 00 66 29 2 0 2
## 0E 13 75 9 2 1
## EE 5 46 37 11 1
## AE 3 25 63 9 0
## Withdrawn 0 0 0 0 100
We can now visualize the Markov chain. We will exlude transition edges with probability less than 10 percent
to achive better readability of graph (with exception of Withdrawn
column):
P_1_tmp <- P_1
P_1_tmp[P_1 <= 10] <- 0
P_1_tmp[,5] <- P_1[,5]
graph <- from_adj_matrix(P_1_tmp, mode = "directed", weighted = T)
x <- graph$edges_df
x$label <- x$weight
graph$edges_df <- x
graph %>% render_graph()
We can now compute time to absorption in Withdrawn
state, this is done using
computing of the fundamental matrix and again we will write the function to do that:
fundamental_matrix <- function(transition_matrix,
states) {
n <- length(states)
Q <- transition_matrix[-n,-n]
I <- diag(n-1)
fundamental_matrix <- inv(I-Q)
states[-n]
# fundamental matrix
list(fundamental_matrix = fundamental_matrix,
# time to absorption
absorption_time = fundamental_matrix %*% matrix(data=rep(1,n-1), nrow = n-1),
# probability of absorption
prob_of_absorption = fundamental_matrix %*% matrix(transition_matrix[-n,n],nrow=n-1)
)
}
And the result for the whole cohort is:
fundamental_matrix(transition_matrix_P_1,
states_P_1)
## $fundamental_matrix
##
## [1,] 22.29533 45.10336 9.349230 2.010790
## [2,] 20.83701 48.66435 9.959898 2.141060
## [3,] 20.44635 47.07235 11.447280 2.280842
## [4,] 20.58766 47.36192 10.950323 3.324548
##
## $absorption_time
## [,1]
## [1,] 78.75872
## [2,] 81.60232
## [3,] 81.24682
## [4,] 82.22446
##
## $prob_of_absorption
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
Students divided to those who submitted and who do not submitted first assessment
First we need to combine student activities with first assessment results:
# select first assessment id
assessment_id <- assessments %>%
select(id_assessment) %>%
.[[1]] %>%
.[1]
# select results
assessment_results <- student_assessment_results %>%
filter(id_assessment == assessment_id) %>%
select(id_student, score)
# combine activities and results
students_activities_classified_with_assessment_result <- students_activities_classified_with_final_result %>%
left_join(assessment_results, by = "id_student") %>%
mutate(submitted = if_else(is.na(score), "N", "Y")) %>%
select(-score)
head(students_activities_classified_with_assessment_result)
## # A tibble: 6 x 5
## id_student week class final_result submitted
## <int> <dbl> <chr> <chr> <chr>
## 1 31296 1 00 Fail Y
## 2 31296 2 0E Fail Y
## 3 31296 3 0E Fail Y
## 4 31296 4 0E Fail Y
## 5 31296 5 0E Fail Y
## 6 31296 6 0E Fail Y
Thanks to added information we can now split the students and perform the analysis on two groups - submitters and non-submitters, we will also filter out activities from first 4 weeks, since this is before the submission of the first assessment:
submitted_students <- students_activities_classified_with_assessment_result %>%
filter(submitted == "Y", week > 4)
head(submitted_students)
## # A tibble: 6 x 5
## id_student week class final_result submitted
## <int> <dbl> <chr> <chr> <chr>
## 1 31296 5 0E Fail Y
## 2 31296 6 0E Fail Y
## 3 31296 7 00 Fail Y
## 4 31296 8 0E Fail Y
## 5 31296 9 0E Fail Y
## 6 31296 10 00 Fail Y
non_submitted_students <- students_activities_classified_with_assessment_result %>%
filter(submitted == "N", week > 4)
head(non_submitted_students)
## # A tibble: 6 x 5
## id_student week class final_result submitted
## <int> <dbl> <chr> <chr> <chr>
## 1 54772 5 Withdrawn Withdrawn N
## 2 54772 6 Withdrawn Withdrawn N
## 3 54772 7 Withdrawn Withdrawn N
## 4 54772 8 Withdrawn Withdrawn N
## 5 54772 9 Withdrawn Withdrawn N
## 6 54772 10 Withdrawn Withdrawn N
Again we can check how many samples we have for each activity:
submitted_students %>%
group_by(class) %>%
count()
## # A tibble: 5 x 2
## # Groups: class [5]
## class n
## <chr> <int>
## 1 00 9257
## 2 0E 25378
## 3 AE 361
## 4 EE 4576
## 5 Withdrawn 5928
non_submitted_students %>%
group_by(class) %>%
count()
## # A tibble: 4 x 2
## # Groups: class [4]
## class n
## <chr> <int>
## 1 00 1920
## 2 0E 144
## 3 EE 8
## 4 Withdrawn 11403
We can observe that some activities are not represented in the data, thus we will not be using them.
Now, we need to compute transtion matrices. At first for submitters:
states_submitted <- c("00",
"0E",
"EE",
"AE",
"Withdrawn")
transition_matrix_P_2 <- create_transition_matrix(submitted_students,
states_submitted)
# show transition matrix
(P_2 <- round(transition_matrix_P_2, 2)*100)
## 00 0E EE AE Withdrawn
## 00 63 34 2 0 1
## 0E 13 78 8 1 0
## EE 6 60 34 1 0
## AE 1 32 60 7 0
## Withdrawn 0 0 0 0 100
P_2_tmp <- P_2
P_2_tmp[P_2 <= 10] <- 0
P_2_tmp[,5] <- P_2[,5]
graph <- from_adj_matrix(P_2_tmp, mode = "directed", weighted = T)
x <- graph$edges_df
x$label <- x$weight
graph$edges_df <- x
graph %>% render_graph()
And now - non-submitters:
states_non_submitted <- c("00",
"0E",
"EE",
"Withdrawn")
transition_matrix_P_3 <- create_transition_matrix(non_submitted_students,
states_non_submitted)
# show transition matrix
(P_3 <- round(transition_matrix_P_3, 2)*100)
## 00 0E EE Withdrawn
## 00 95 3 0 2
## 0E 51 41 3 5
## EE 38 38 0 25
## Withdrawn 0 0 0 100
P_3_tmp <- P_3
P_3_tmp[P_3 <= 10] <- 0
P_3_tmp[,4] <- P_3[,4]
graph <- from_adj_matrix(P_3_tmp, mode = "directed", weighted = T)
x <- graph$edges_df
x$label <- x$weight
graph$edges_df <- x
graph %>% render_graph()
We can again compute fundamental matrix and time to absorption for both matrices:
fundamental_matrix(transition_matrix_P_2,
states_submitted)
## $fundamental_matrix
##
## [1,] 37.14889 91.87871 12.59693 1.252732
## [2,] 35.82263 95.50298 13.02503 1.299411
## [3,] 35.81776 94.85128 14.45870 1.299243
## [4,] 35.72444 94.76524 13.90671 2.372618
##
## $absorption_time
## [,1]
## [1,] 142.8773
## [2,] 145.6500
## [3,] 146.4270
## [4,] 146.7690
##
## $prob_of_absorption
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
## [4,] 1
fundamental_matrix(transition_matrix_P_3,
states_non_submitted)
## $fundamental_matrix
##
## [1,] 47.64451 2.795167 0.1559150
## [2,] 43.02444 4.245332 0.1892808
## [3,] 34.00086 2.640187 1.1294484
##
## $absorption_time
## [,1]
## [1,] 50.59560
## [2,] 47.45905
## [3,] 37.77049
##
## $prob_of_absorption
## [,1]
## [1,] 1
## [2,] 1
## [3,] 1
Conclusions
We analysed one course from the OULAD data, namely FFF/2014J, using Markov chains. From constructed graph representation we can observe several interesting facts:
- students tends to do less next week
- those who did nothing have larger probability to withdraw
- if we analyse students who did not submitted first assessment those who did nothing tends to do nothing and are passive for the whole course
For more details see the paper submitted for ECTEL conference. This is its online companion.