Student drop-out modelling using Virtual Learning Environment behaviour data

2018-04-12 16 min read research

This post is a companion for the short paper submitted to EC-TEL 2018 conference. It contains details of the whole study and some more information such as figures, which does not fit in to the paper. For the analysis we used Open University Learning Analytics dataset.

This is draft version of companion, if you spot an error, typo or you have any comment, please let me know.

Setup

First we will load the required packages:

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(stringr)
library(oulad)
library(matlib)
library(DiagrammeR)

Data

We will be using data from OULAD and we will stick with only one course presentation, namely FFF/2014J. At first, the student Virtual Learning Environment (VLE) activity logs and list of available VLE “resources”" is extracted:

# available VLE resources
filtered_vle <- vle %>% 
  filter(code_module == "FFF", 
         code_presentation == "2014J")

str(filtered_vle)

## 'data.frame':    452 obs. of  6 variables:
##  $ id_site          : int  883082 883294 882713 883163 882561 882932 883195 882597 882985 883228 ...
##  $ code_module      : chr  "FFF" "FFF" "FFF" "FFF" ...
##  $ code_presentation: chr  "2014J" "2014J" "2014J" "2014J" ...
##  $ activity_type    : chr  "subpage" "url" "questionnaire" "url" ...
##  $ week_from        : int  NA NA 19 14 9 NA 18 29 NA 23 ...
##  $ week_to          : int  NA NA 19 14 9 NA 18 29 NA 23 ...

# student activity logs
filtered_student_vle <- student_vle %>% 
  filter(code_module == "FFF", 
         code_presentation == "2014J")

str(filtered_student_vle)

## 'data.frame':    1210359 obs. of  6 variables:
##  $ code_module      : chr  "FFF" "FFF" "FFF" "FFF" ...
##  $ code_presentation: chr  "2014J" "2014J" "2014J" "2014J" ...
##  $ id_student       : int  2398260 2398260 2398260 2398260 2398260 2398260 2398260 2398260 2398260 2398260 ...
##  $ id_site          : int  883041 882602 882587 883142 883092 883064 883037 883060 882670 882663 ...
##  $ date             : int  -18 -18 -18 -18 -18 -18 -18 -18 -18 -18 ...
##  $ sum_click        : int  1 1 2 1 1 1 1 1 1 5 ...

It can be observed that there are 452 resources available to study for the course and more than 1.2 million records of student activity in VLE.

Since the first assessment will be considered in part of our research, the information is extracted from data.frame assessment and week, in which the assessment has cut-off date is computed:

assessment_day <- assessment %>% 
  filter(code_module == "FFF", 
         code_presentation == "2014J", 
         assessment_type == "TMA") %>% 
  select(id_assessment, date) %>% 
  rename(assessment_day = date) %>% 
  mutate(assessment_week = ceiling(assessment_day/7))

assessment_day

##   id_assessment assessment_day assessment_week
## 1         34899             24               4
## 2         34900             52               8
## 3         34901             94              14
## 4         34902            136              20
## 5         34903            199              29

We will also need the results of the first assessment:

assessments <- assessment %>% 
  filter(code_module == "FFF", 
         code_presentation == "2014J",
         assessment_type == "TMA")

student_assessment_results <- student_assessment %>% 
  inner_join(assessments, by="id_assessment") %>% 
  select(id_student, id_assessment, date_submitted, is_banked, score) 

head(student_assessment_results)

##   id_student id_assessment date_submitted is_banked score
## 1     560047         34899             24     FALSE    84
## 2     560459         34899             29     FALSE    48
## 3     560645         34899             23     FALSE   100
## 4     560765         34899             23     FALSE    78
## 5     561231         34899             22     FALSE    80
## 6     562754         34899             21     FALSE    76

Finally we will need student final outcome of the course to be able those who withdrawn:

student_final_results <- student %>% 
  filter(code_module == "FFF", 
         code_presentation == "2014J") %>% 
  select(id_student, final_result) 

student_unregistration <- student_registration %>% 
  filter(code_module == "FFF", 
         code_presentation == "2014J") %>% 
  select(id_student, date_unregistration)

student_final_results <- student_final_results %>% 
  left_join(student_unregistration, by = "id_student") %>% 
  mutate(date_unregistration = ceiling(date_unregistration/7))

Next the study plan information is refined:

# study plan
plan <- filtered_vle %>% 
  mutate(week = week_from) %>% 
  select(id_site, 
         activity_type,
         week) %>% 
  arrange(week)

head(plan)

##   id_site activity_type week
## 1  882707          page    1
## 2  883183           url    1
## 3  883177           url    1
## 4  883233           url    1
## 5  882669     oucontent    1
## 6  883040       subpage    1

# count how many planned resources are scheduled for the week
planed_resources <- plan %>% 
  group_by(week) %>% 
  summarise(max_resources = n()) %>% 
  ungroup() %>% 
  filter(!is.na(week))

head(planed_resources)

## # A tibble: 6 x 2
##    week max_resources
##   <int>         <int>
## 1     1            13
## 2     2             5
## 3     3             5
## 4     4             1
## 5     5             6
## 6     7             2

The real student behaviour was prepared as follows. The activities from before the start of the course were excluded; for every log entry week, in which student did the activity, is computed from the date and finally only columns of interest were extracted:

reality <- filtered_student_vle %>% 
  filter(date >= 0L) %>% 
  mutate(date = if_else(date == 0L, 1L, date), 
         week = ceiling(date/7)) %>% 
  select(id_student,
         id_site,
         week)

head(reality)

##   id_student id_site week
## 1     489493  882537    1
## 2     489553  882545    1
## 3     489493  883040    1
## 4     489493  882587    1
## 5     489493  882545    1
## 6     489493  882545    1

Now we can combine real student behaviour with the plan:

data_combined <- reality %>%
  rename(student_week = week) %>% 
  left_join(plan, by=c("id_site")) %>% 
  unique() %>% 
  rename(planned_week = week)

# example
data_combined %>% 
  filter(id_student == 35747) %>% 
  head()

##   id_student id_site student_week activity_type planned_week
## 1      35747  882551            1       forumng           NA
## 2      35747  882609            1     oucontent           NA
## 3      35747  882545            1       forumng           NA
## 4      35747  882547            1       forumng           NA
## 5      35747  883092            1       subpage           NA
## 6      35747  882537            1      homepage           NA

If we look at plan, we can observe that there are planned activities up to week 29:

plan %>% 
  select(week) %>% 
  max(na.rm = TRUE)

## [1] 29

Thus we will focus only on first 29 weeks of studies.

We can now do activity categorization for the purpose of creating the Markov chain:

# find out if student did something in the week according to the plan
student_activities <- data_combined %>% 
  select(-id_site) %>% 
  mutate(planned_week = if_else(is.na(planned_week),-1L,planned_week),
         planned = if_else(student_week == planned_week, "Y","N")) %>% 
  group_by(id_student, student_week, planned) %>%
  count() %>% 
  ungroup() %>% 
  rename(week = student_week)

# get maximum week in plan
maximal_planned_week <- planed_resources %>% 
  summarise(max(week)) %>% 
  as.integer()

# classify student activities:
# for planned activity we have 0 for none, E for some and A for all
# for nonplanned activity we have 0 for none and E for did something
# format PlannedNonplanned, example: A0 - all from the plan, nothing else
student_activities_classified <- student_activities %>% 
  filter(week <= maximal_planned_week) %>%
  left_join(planed_resources, by = "week") %>% 
  mutate(max_resources = if_else(planned == "N", n, max_resources),
         percentage = n/max_resources # proportion of planned activities in week
  ) %>% 
  select(-n, -max_resources) %>% 
  group_by(id_student, week) %>% 
  # spread proportion of planned and nonplanned activity proportion for each week 
  spread(planned, percentage, fill = 0) %>% 
  mutate(class = if_else(Y == 0, # if no planned activity
                         if_else(N == 0, # if no other activity
                                 "00",
                                 "0E"),
                          # if all planned activity else part of planned activity
                         if_else(Y == 1,
                                 if_else(N == 0, # if no other activity
                                         "A0",
                                         "AE"),
                                 if_else(N == 0, # if no other activity
                                         "E0",
                                         "EE")
                                 )
                         )
        ) %>% 
  select(-N, -Y)

head(student_activities_classified)

## # A tibble: 6 x 3
## # Groups:   id_student, week [6]
##   id_student  week class
##        <int> <dbl> <chr>
## 1      31296     2 0E   
## 2      31296     3 0E   
## 3      31296     4 0E   
## 4      31296     5 0E   
## 5      31296     6 0E   
## 6      31296     8 0E

The problem we need to work on are students who do not engage with the VLE. We need to include them and we can combine all the data with the final student outcome:

unique_students <- student_final_results %>% 
  select(id_student) %>% 
  unique() %>% 
  unlist() %>% 
  unname()

weeks <- rep(1:maximal_planned_week, 
            length(unique_students)) 

students_with_all_weeks <- tibble(id_student = sort(rep(unique_students, 
                                                        maximal_planned_week)
                                                    ),
                                  week = weeks)

students_activities_classified_with_final_result <- students_with_all_weeks %>% 
  left_join(student_activities_classified, by=c("id_student","week")) %>% 
  mutate(class = if_else(is.na(class), "00", class)) %>% 
  left_join(student_final_results, by="id_student") %>% 
  mutate(class = if_else(week >= date_unregistration & !is.na(date_unregistration), 
                         "Withdrawn", 
                         class)
         ) %>% 
  select(-date_unregistration) %>% 
  mutate(final_result = if_else(final_result == "Distinction", "Pass", final_result))

head(students_activities_classified_with_final_result)

## # A tibble: 6 x 4
##   id_student  week class final_result
##        <int> <dbl> <chr> <chr>       
## 1      31296     1 00    Fail        
## 2      31296     2 0E    Fail        
## 3      31296     3 0E    Fail        
## 4      31296     4 0E    Fail        
## 5      31296     5 0E    Fail        
## 6      31296     6 0E    Fail

Analysis

Whole cohort

At first we will check how many samples we got for each activity cathegory (state):

students_activities_classified_with_final_result %>% 
  group_by(class) %>% 
  count()

## # A tibble: 6 x 2
## # Groups:   class [6]
##   class         n
##   <chr>     <int>
## 1 00        12146
## 2 0E        28155
## 3 AE         1434
## 4 E0            7
## 5 EE         7782
## 6 Withdrawn 19061

We can observe that A0 is not presented in data and E0 has only 7 representatives, thus we will exclude them:

students_to_remove <- students_activities_classified_with_final_result %>% 
  filter(class == "A0" | class == "E0") %>% 
  select(id_student) %>% 
  unique() %>% 
  .[[1]]

students_activities_classified_with_final_result <- students_activities_classified_with_final_result %>% 
  filter(!id_student %in% students_to_remove)

Since we will be constructing more than one chain, we define the function to do that:

create_transition_matrix <- function(students_activities,
                                     states){

  # template
  transition_matrix <- matrix(data = numeric(), 
                              nrow = length(states),
                              ncol = length(states),
                              dimnames = list(states, states))
  
  # prepare data
  data_for_markov_chain <- students_activities %>% 
    select(-final_result) %>% 
    arrange(id_student, week) 
  
  
  state_vector <- data_for_markov_chain %>% 
    select(class) %>% 
    unlist() %>% 
    unname()
  
  state_vector <- c(state_vector[2:length(state_vector)],
                    state_vector[1])
  

  data_for_markov_chain$state_2 <- state_vector
  
  data_for_markov_chain <- data_for_markov_chain %>% 
    rename(state_1 = class) %>% 
    filter(week != max(week)) %>% 
    select(state_1, state_2)
  
  # count transitions
  for(state1 in states){
    for(state2 in states){
      n <- data_for_markov_chain %>%
        filter(state_1 == state1,
               state_2 == state2) %>%
        count() %>%
        as.numeric()
      
      transition_matrix[state1,state2] <- n
    }
  }
  
  # normalize be row sums
  normalization_matrix <- transition_matrix %>% 
    apply(1,sum) %>% 
    rep(length(states)) %>% 
    matrix(ncol = length(states), nrow = length(states))
  
  transition_matrix <- transition_matrix/normalization_matrix
  
  # return
  transition_matrix
}

Finally we can compute our first Markov chain transition matrix:

# PLAN | OTHER ACTIVITIES OUTSIDE PLAN
# possible values: 
# PLAN: 0, E, A; OTHER: 0, E
states_P_1 <- c("00",
                "0E",
                # "E0", # filtered out
                "EE",
                #"A0", #f iltered out
                "AE",
                "Withdrawn")

transition_matrix_P_1 <- create_transition_matrix(students_activities_classified_with_final_result,
                                                  states_P_1)

# show transition matrix
(P_1 <- round(transition_matrix_P_1, 2)*100)

##           00 0E EE AE Withdrawn
## 00        66 29  2  0         2
## 0E        13 75  9  2         1
## EE         5 46 37 11         1
## AE         3 25 63  9         0
## Withdrawn  0  0  0  0       100

We can now visualize the Markov chain. We will exlude transition edges with probability less than 10 percent to achive better readability of graph (with exception of Withdrawn column):

P_1_tmp <- P_1
P_1_tmp[P_1 <= 10] <- 0
P_1_tmp[,5] <- P_1[,5]
graph <- from_adj_matrix(P_1_tmp, mode = "directed", weighted = T)
x <- graph$edges_df
x$label <- x$weight
graph$edges_df <- x

graph %>% render_graph()

We can now compute time to absorption in Withdrawn state, this is done using computing of the fundamental matrix and again we will write the function to do that:

fundamental_matrix <- function(transition_matrix,
                               states) {

n <- length(states)
Q <- transition_matrix[-n,-n]

I <- diag(n-1) 

fundamental_matrix <- inv(I-Q)

states[-n]

# fundamental matrix
list(fundamental_matrix = fundamental_matrix,
     # time to absorption
     absorption_time = fundamental_matrix %*% matrix(data=rep(1,n-1), nrow = n-1),
     # probability of absorption
     prob_of_absorption = fundamental_matrix %*% matrix(transition_matrix[-n,n],nrow=n-1)
)
}

And the result for the whole cohort is:

fundamental_matrix(transition_matrix_P_1,
                   states_P_1)

## $fundamental_matrix
##                                          
## [1,] 22.29533 45.10336  9.349230 2.010790
## [2,] 20.83701 48.66435  9.959898 2.141060
## [3,] 20.44635 47.07235 11.447280 2.280842
## [4,] 20.58766 47.36192 10.950323 3.324548
## 
## $absorption_time
##          [,1]
## [1,] 78.75872
## [2,] 81.60232
## [3,] 81.24682
## [4,] 82.22446
## 
## $prob_of_absorption
##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1
## [4,]    1

Students divided to those who submitted and who do not submitted first assessment

First we need to combine student activities with first assessment results:

# select first assessment id
assessment_id <- assessments %>% 
  select(id_assessment) %>% 
  .[[1]] %>% 
  .[1]

# select results
assessment_results <- student_assessment_results %>% 
  filter(id_assessment == assessment_id) %>% 
  select(id_student, score)

# combine activities and results
students_activities_classified_with_assessment_result <- students_activities_classified_with_final_result %>%
  left_join(assessment_results, by = "id_student") %>% 
  mutate(submitted = if_else(is.na(score), "N", "Y")) %>% 
  select(-score) 

head(students_activities_classified_with_assessment_result)

## # A tibble: 6 x 5
##   id_student  week class final_result submitted
##        <int> <dbl> <chr> <chr>        <chr>    
## 1      31296     1 00    Fail         Y        
## 2      31296     2 0E    Fail         Y        
## 3      31296     3 0E    Fail         Y        
## 4      31296     4 0E    Fail         Y        
## 5      31296     5 0E    Fail         Y        
## 6      31296     6 0E    Fail         Y

Thanks to added information we can now split the students and perform the analysis on two groups - submitters and non-submitters, we will also filter out activities from first 4 weeks, since this is before the submission of the first assessment:

submitted_students <- students_activities_classified_with_assessment_result %>%
  filter(submitted == "Y", week > 4) 

head(submitted_students)

## # A tibble: 6 x 5
##   id_student  week class final_result submitted
##        <int> <dbl> <chr> <chr>        <chr>    
## 1      31296     5 0E    Fail         Y        
## 2      31296     6 0E    Fail         Y        
## 3      31296     7 00    Fail         Y        
## 4      31296     8 0E    Fail         Y        
## 5      31296     9 0E    Fail         Y        
## 6      31296    10 00    Fail         Y

non_submitted_students <- students_activities_classified_with_assessment_result %>%
  filter(submitted == "N", week > 4) 

head(non_submitted_students)

## # A tibble: 6 x 5
##   id_student  week class     final_result submitted
##        <int> <dbl> <chr>     <chr>        <chr>    
## 1      54772     5 Withdrawn Withdrawn    N        
## 2      54772     6 Withdrawn Withdrawn    N        
## 3      54772     7 Withdrawn Withdrawn    N        
## 4      54772     8 Withdrawn Withdrawn    N        
## 5      54772     9 Withdrawn Withdrawn    N        
## 6      54772    10 Withdrawn Withdrawn    N

Again we can check how many samples we have for each activity:

submitted_students %>% 
  group_by(class) %>% 
  count()

## # A tibble: 5 x 2
## # Groups:   class [5]
##   class         n
##   <chr>     <int>
## 1 00         9257
## 2 0E        25378
## 3 AE          361
## 4 EE         4576
## 5 Withdrawn  5928

non_submitted_students %>% 
  group_by(class) %>% 
  count()

## # A tibble: 4 x 2
## # Groups:   class [4]
##   class         n
##   <chr>     <int>
## 1 00         1920
## 2 0E          144
## 3 EE            8
## 4 Withdrawn 11403

We can observe that some activities are not represented in the data, thus we will not be using them.

Now, we need to compute transtion matrices. At first for submitters:

states_submitted <- c("00",
                      "0E",
                      "EE",
                      "AE",
                      "Withdrawn")

transition_matrix_P_2 <- create_transition_matrix(submitted_students, 
                                                  states_submitted)

# show transition matrix
(P_2 <- round(transition_matrix_P_2, 2)*100)

##           00 0E EE AE Withdrawn
## 00        63 34  2  0         1
## 0E        13 78  8  1         0
## EE         6 60 34  1         0
## AE         1 32 60  7         0
## Withdrawn  0  0  0  0       100

P_2_tmp <- P_2
P_2_tmp[P_2 <= 10] <- 0
P_2_tmp[,5] <- P_2[,5]
graph <- from_adj_matrix(P_2_tmp, mode = "directed", weighted = T)
x <- graph$edges_df
x$label <- x$weight
graph$edges_df <- x

graph %>% render_graph()

And now - non-submitters:

states_non_submitted <- c("00",
                          "0E",
                          "EE",
                          "Withdrawn")

transition_matrix_P_3 <- create_transition_matrix(non_submitted_students, 
                                                 states_non_submitted)

# show transition matrix
(P_3 <- round(transition_matrix_P_3, 2)*100)

##           00 0E EE Withdrawn
## 00        95  3  0         2
## 0E        51 41  3         5
## EE        38 38  0        25
## Withdrawn  0  0  0       100

P_3_tmp <- P_3
P_3_tmp[P_3 <= 10] <- 0
P_3_tmp[,4] <- P_3[,4]
graph <- from_adj_matrix(P_3_tmp, mode = "directed", weighted = T)
x <- graph$edges_df
x$label <- x$weight
graph$edges_df <- x

graph %>% render_graph()

We can again compute fundamental matrix and time to absorption for both matrices:

fundamental_matrix(transition_matrix_P_2,
                   states_submitted)

## $fundamental_matrix
##                                         
## [1,] 37.14889 91.87871 12.59693 1.252732
## [2,] 35.82263 95.50298 13.02503 1.299411
## [3,] 35.81776 94.85128 14.45870 1.299243
## [4,] 35.72444 94.76524 13.90671 2.372618
## 
## $absorption_time
##          [,1]
## [1,] 142.8773
## [2,] 145.6500
## [3,] 146.4270
## [4,] 146.7690
## 
## $prob_of_absorption
##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1
## [4,]    1

fundamental_matrix(transition_matrix_P_3,
                   states_non_submitted)

## $fundamental_matrix
##                                 
## [1,] 47.64451 2.795167 0.1559150
## [2,] 43.02444 4.245332 0.1892808
## [3,] 34.00086 2.640187 1.1294484
## 
## $absorption_time
##          [,1]
## [1,] 50.59560
## [2,] 47.45905
## [3,] 37.77049
## 
## $prob_of_absorption
##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1

Conclusions

We analysed one course from the OULAD data, namely FFF/2014J, using Markov chains. From constructed graph representation we can observe several interesting facts:

students tends to do less next week
those who did nothing have larger probability to withdraw
if we analyse students who did not submitted first assessment those who did nothing tends to do nothing and are passive for the whole course

For more details see the paper submitted for ECTEL conference. This is its online companion.

academic learning-analytics markov-chain-modelling VLE