Github Source Code and Up-to-date Nanodegree Projects are available.

Introduction

edX is a nonprofit massive open online course (MOOC) provider. It is available worldwide, mostly at no charge. Along with Coursera and Udacity, it is one of three major MOOC providers. edX was founded by MIT and Harvard in May 2012 and, as of October 22, 2014, had over 3 million users taking 300 courses online. [1]

On May 30, 2014, MIT and Harvard released de-identified data from 13 HarvardX and MITx courses offered in the inaugural 2012-13 time period.[2]

This data contains records from 476,532 students covering 641,138 separate course registrations.[3]

Initial Exploration and Data Cleanup

This data set provides demographic and anonymized online activity data, including final grades.

When exploring a new dataset, it’s always a good idea to start with a look at the structure of the data with str().

library(dplyr)
library(ggplot2)
library(tidyr)
library(scales)
library(gridExtra)
library(extrafont)

setwd("data")
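# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0),
# which the levels() calls below rely on; as_tibble() replaces the deprecated tbl_df()
mooc_data <- as_tibble(read.csv("HMXPC13_DI_v2_5-14-14.csv", na.strings = c("NA", ""), stringsAsFactors = TRUE))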

dim(mooc_data)

## [1] 641138     20

str(mooc_data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    641138 obs. of  20 variables:
##  $ course_id        : Factor w/ 16 levels "HarvardX/CB22x/2013_Spring",..: 1 2 1 2 3 4 5 1 1 2 ...
##  $ userid_DI        : Factor w/ 476532 levels "MHxPC130000002",..: 353042 353042 220054 220054 220054 220054 220054 430083 70410 70410 ...
##  $ registered       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ viewed           : int  0 1 0 0 0 1 0 1 1 1 ...
##  $ explored         : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ certified        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ final_cc_cname_DI: Factor w/ 34 levels "Australia","Bangladesh",..: 33 33 33 33 33 33 33 8 33 33 ...
##  $ LoE_DI           : Factor w/ 5 levels "Bachelor's","Doctorate",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ YoB              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ gender           : Factor w/ 3 levels "f","m","o": NA NA NA NA NA NA NA NA NA NA ...
##  $ grade            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ start_time_DI    : Factor w/ 413 levels "2012-07-23","2012-07-24",..: 150 85 201 57 150 57 201 163 211 90 ...
##  $ last_event_DI    : Factor w/ 404 levels "2012-07-24","2012-07-25",..: 404 NA 404 NA NA 296 NA 287 229 NA ...
##  $ nevents          : int  NA NA NA NA NA 502 NA 42 70 NA ...
##  $ ndays_act        : int  9 9 16 16 16 16 16 6 3 12 ...
##  $ nplay_video      : int  NA NA NA NA NA 50 NA NA NA NA ...
##  $ nchapters        : num  NA 1 NA NA NA 12 NA 3 3 3 ...
##  $ nforum_posts     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ roles            : logi  NA NA NA NA NA NA ...
##  $ incomplete_flag  : int  1 1 1 1 1 NA 1 NA NA 1 ...

The exact meanings of the variables can be looked up in the “Person Course Documentation.pdf”. First, let’s tidy up the data a bit by separating course_id, which contains 3 different types of information (institution, course code, and semester), into 3 separate columns. We can also add Age using the user-provided birth year. [3]
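As a small illustration of this tidying step, the sketch below splits a few course_id strings on "/" and derives an age from a birth year using base R only. The data frame `toy` and its birth years are made up for illustration; the column names mirror the ones used in this post, and 2014 is the reference year used for Age.

```r
# Toy records; the birth years are illustrative, not taken from the dataset.
toy <- data.frame(
  course_id = c("HarvardX/CS50x/2012", "MITx/6.002x/2013_Spring"),
  YoB = c(1988, NA),
  stringsAsFactors = FALSE
)

# Split each course_id on "/" and bind the pieces into a 3-column matrix.
parts <- do.call(rbind, strsplit(toy$course_id, "/", fixed = TRUE))
colnames(parts) <- c("Institution", "Course_Code", "Semester")

# Combine the new columns with the birth year and add Age.
toy <- cbind(as.data.frame(parts, stringsAsFactors = FALSE), toy["YoB"])
toy$Age <- 2014 - toy$YoB  # a missing birth year stays NA

print(toy)
```

The dplyr/tidyr pipeline used in this post does the same thing in one step with separate() and mutate().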

We will then add the full title of the courses based on the Course Codes from the “Person Course Documentation.pdf”.

MITx/HarvardX Course Descriptions

levels(mooc_data$course_id)

##  [1] "HarvardX/CB22x/2013_Spring"  "HarvardX/CS50x/2012"        
##  [3] "HarvardX/ER22x/2013_Spring"  "HarvardX/PH207x/2012_Fall"  
##  [5] "HarvardX/PH278x/2013_Spring" "MITx/14.73x/2013_Spring"    
##  [7] "MITx/2.01x/2013_Spring"      "MITx/3.091x/2012_Fall"      
##  [9] "MITx/3.091x/2013_Spring"     "MITx/6.002x/2012_Fall"      
## [11] "MITx/6.002x/2013_Spring"     "MITx/6.00x/2012_Fall"       
## [13] "MITx/6.00x/2013_Spring"      "MITx/7.00x/2013_Spring"     
## [15] "MITx/8.02x/2013_Spring"      "MITx/8.MReV/2013_Summer"

levels(mooc_data$gender) <- c("Female", "Male", "Other")

mooc_data <- mooc_data %>%
  separate(course_id, into = c("Institution", "Course_Code", "Semester"), sep="/", convert=TRUE) %>%
  rename(Country = final_cc_cname_DI,
         Level_of_Edu = LoE_DI,
         Registration_Date = start_time_DI,
         Last_Interaction = last_event_DI) %>%
    mutate(Age = 2014 - YoB,
         certified = factor(certified, labels=c("No", "Yes")),
         Level_of_Edu = factor(Level_of_Edu, levels = c("Less than Secondary", "Secondary", "Bachelor's", "Master's", "Doctorate")))

## Add in full course titles from codebook.[3]
course_names <- data.frame("Course_Code" = factor(c("CB22x", "CS50x", "ER22x", "PH207x", "PH278x", "14.73x",
                                             "2.01x", "3.091x", "6.002x", "6.00x", "7.00x", "8.02x", "8.MReV")),
                           "Full_Title" = c("The Ancient Greek Hero", "Intro to Computer Science I",
                                            "Justice", "Health in #'s: Quant. Methods", 
                                            "Human Hlth & Glob Envir Chng",
                                            "Challenges of Glob. Poverty", "Elements of Structures",
                                            "Intro to Solid State Chem.", "Circuits and Electronics",
                                            "Intro to C.S. & Programming", "Intro to Biology",
                                            "Electricity and Magnetism", "Mechanics Review"))

mooc_data <- left_join(mooc_data, course_names, by="Course_Code")

mooc_data <- mooc_data %>%
  mutate(letter_grade = cut(grade, breaks=c(0.00, 0.01, .49, .59, .69, .79, .89, 1.01), include.lowest = TRUE, ordered_result = FALSE))



# Add in School Colors for plots
# http://web.mit.edu/graphicidentity/colors.html
# http://www.hbs.edu/marketing/color.html

mooc_data <- mooc_data %>%
  mutate(School_Color = ifelse(Institution == "HarvardX", "#A41034", "#000000"))

# Named vector with one color per institution, for use in scale_*_manual()
school_colors <- c("HarvardX" = "#A41034", "MITx" = "#000000")

Univariate Exploration

This dataset contains 641,138 course registrations from 476,532 unique students, which shows that some students were registered for multiple courses.
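A quick way to confirm that some students hold several registrations is to compare the number of rows with the number of distinct user IDs. The IDs below are invented; on the real data the same idea would apply to `userid_DI`.

```r
# Invented user IDs, one entry per course registration.
user_id <- c("u1", "u1", "u2", "u3", "u3", "u3")

n_registrations <- length(user_id)          # one row per registration
n_students <- length(unique(user_id))       # distinct students

# Registrations per student, and how many students have more than one.
per_student <- table(user_id)
n_multi <- sum(per_student > 1)

print(c(registrations = n_registrations, students = n_students, multi = n_multi))
```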

table(mooc_data$registered)

## 
##      1 
## 641138

str(levels(mooc_data$userid_DI))

##  chr [1:476532] "MHxPC130000002" "MHxPC130000003" ...

400,262 (62.4%) of registered students accessed the Courseware tab, which contains the videos, problem sets, and exams. This leaves 240,876 students who never did anything in the course after registering.

prop.table(table(mooc_data$viewed))

## 
##      0      1 
## 0.3757 0.6243

table(mooc_data$viewed)

## 
##      0      1 
## 240876 400262

ggplot(mooc_data, aes(x=factor(viewed))) + geom_bar() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-3

Only 39,686 (6.2%) of students accessed at least half of the chapters for a given course.

prop.table(table(mooc_data$explored))

## 
##      0      1 
## 0.9381 0.0619

table(mooc_data$explored)

## 
##      0      1 
## 601452  39686

ggplot(mooc_data, aes(x=factor(explored))) + geom_bar() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-4

Only 17,687 (2.8%) of registered students completed a given course and received a certificate. The grade required for a certificate ranges from 50% to 80% depending on the course.

prop.table(table(mooc_data$certified))

## 
##      No     Yes 
## 0.97241 0.02759

table(mooc_data$certified)

## 
##     No    Yes 
## 623451  17687

ggplot(mooc_data, aes(x=factor(certified))) + geom_bar() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-5

The US and India have the largest numbers of registered students.

ggplot(mooc_data, aes(x=factor(Country))) + geom_bar() + scale_y_continuous(labels=comma) + coord_flip()

plot of chunk unnamed-chunk-6

The most common level of education is a bachelor's degree.

prop.table(table(mooc_data$Level_of_Edu))

## 
## Less than Secondary           Secondary          Bachelor's 
##             0.02633             0.31711             0.41068 
##            Master's           Doctorate 
##             0.22086             0.02502

table(mooc_data$Level_of_Edu)

## 
## Less than Secondary           Secondary          Bachelor's 
##               14092              169694              219768 
##            Master's           Doctorate 
##              118189               13387

ggplot(mooc_data, aes(x=Level_of_Edu)) + geom_bar() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-7

Classes all start on different dates, so this information doesn’t seem too useful by itself.

ggplot(mooc_data, aes(x=Registration_Date)) + geom_bar() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-8

199,151 (31%) students had 0 interactions beyond registering (nevents is NA for these records). A further 171,714 (26.8%) students had 10 or fewer interactions with the website beyond registration.
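These counts depend on how missing values are tallied: a comparison such as `nevents <= 10` yields NA wherever `nevents` is NA, and this post reads those rows as registrations with no interactions. A minimal sketch with a made-up vector (`toy_events` is hypothetical) shows the three groups:

```r
# Made-up event counts; NA stands for "no events recorded".
toy_events <- c(NA, 3, 250, 10, NA, 42)

# The comparison propagates NA, so useNA = "always" is needed to count those rows.
counts <- table(toy_events <= 10, useNA = "always")
print(counts)

# Proportions over all rows, including the NA group.
props <- prop.table(counts)
print(round(props, 3))
```

Without `useNA = "always"`, table() would silently drop the NA rows and the proportions would be computed over the remaining rows only.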

prop.table(table(mooc_data$nevents<=10, useNA="always"))

## 
##  FALSE   TRUE   <NA> 
## 0.4216 0.2678 0.3106

table(mooc_data$nevents<=10, useNA="always")

## 
##  FALSE   TRUE   <NA> 
## 270273 171714 199151

g1 <- ggplot(mooc_data, aes(x=nevents)) + geom_histogram() + scale_y_continuous(labels=comma) + scale_x_log10() + xlab("Log10 Scale")
g2 <- ggplot(mooc_data, aes(x=nevents)) + geom_histogram(binwidth=10) + coord_cartesian(xlim=c(0, 500)) + scale_y_continuous(labels=comma) + xlab("Binwidth = 10; limit 500")
grid.arrange(g1, g2)

plot of chunk unnamed-chunk-9

457,530 (71.4%) did not play any videos at all within courses. 11.1% played videos 10 or fewer times, while 17.5% played videos more than 10 times. This is slightly startling because most of the instruction in these courses is delivered through videos. One reason for the large proportion may be that some courses allow video downloads, which work better on slow or intermittent internet connections.

prop.table(table(mooc_data$nplay_video<=10, useNA="always"))

## 
##  FALSE   TRUE   <NA> 
## 0.1750 0.1114 0.7136

table(mooc_data$nplay_video<=10, useNA="always")

## 
##  FALSE   TRUE   <NA> 
## 112203  71405 457530

g1 <- ggplot(mooc_data, aes(x=nplay_video)) + geom_histogram() + scale_y_continuous(labels=comma) + scale_x_log10(labels=comma) + xlab("Log10 Scale")
g2 <- ggplot(mooc_data, aes(x=nplay_video)) + geom_histogram(binwidth=10) + coord_cartesian(xlim=c(0, 500)) + scale_y_continuous(labels=comma) + xlab("Binwidth = 10; limit 500")
grid.arrange(g1, g2)

plot of chunk unnamed-chunk-10

162,743 (25.4%) did not interact with the course on any day. 422,441 (65.9%) interacted on 10 or fewer unique days, which isn’t much considering courses typically run 2-3 months. 55,954 (8.7%) interacted with the courses on more than 10 unique days. This may be a good indicator of how well a student ultimately does.

prop.table(table(mooc_data$ndays_act<=10, useNA="always"))

## 
##   FALSE    TRUE    <NA> 
## 0.08727 0.65889 0.25383

table(mooc_data$ndays_act<=10, useNA="always")

## 
##  FALSE   TRUE   <NA> 
##  55954 422441 162743

ggplot(mooc_data, aes(x=ndays_act)) + geom_histogram() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-11

40.4% of students didn’t interact with any chapters of the course. Number of chapters varies per course so this statistic may be difficult to interpret. 19% of students only interacted with 1 chapter.

prop.table(table(mooc_data$nchapters<=1, useNA="always"))

## 
##  FALSE   TRUE   <NA> 
## 0.4064 0.1900 0.4036

table(mooc_data$nchapters<=1, useNA="always")

## 
##  FALSE   TRUE   <NA> 
## 260548 121837 258753

ggplot(mooc_data, aes(x=nchapters)) + geom_histogram() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-12

Very few students make any forum posts. Only 7,461 (1.2%) made any posts at all. 633,677 (98.8%) were either lurkers or did not visit the forums.

prop.table(table(mooc_data$nforum_posts==0, useNA="always"))

## 
##   FALSE    TRUE    <NA> 
## 0.01164 0.98836 0.00000

table(mooc_data$nforum_posts<=0, useNA="always")

## 
##  FALSE   TRUE   <NA> 
##   7461 633677      0

ggplot(mooc_data, aes(x=nforum_posts)) + geom_histogram() + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-13

100,161 (15.6%) records showed some inconsistency between log tables of different variables. This just shows that there is a bit of noise in the data, but overall it looks to be pretty consistent.

prop.table(table(mooc_data$incomplete_flag, useNA="always"))

## 
##      1   <NA> 
## 0.1562 0.8438

table(mooc_data$incomplete_flag, useNA="always")

## 
##      1   <NA> 
## 100161 540977

Mean age across the dataset is 28.75. A minimum age of 1 and 592 people aged 5 or younger show there may be some noise in the data. There are also 96,605 who did not enter a birth year. Age shows a right skew with the center of the distribution at 26-28, consistent with the largest group having a Bachelor’s degree.
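One possible way to handle implausible ages like these is to set them to NA before summarizing. This is only a sketch and not a step taken in this analysis: the cutoff of 5 follows the check used in this post, and the birth years below are invented.

```r
# Invented birth years, including one implausibly recent value and one NA.
yob <- c(1988, 1991, 2013, 1975, NA, 1990)
age <- 2014 - yob

# Count ages at or below the cutoff; na.rm skips rows with no birth year.
n_suspect <- sum(age <= 5, na.rm = TRUE)

# Replace suspect ages with NA so they do not distort the summary.
age_clean <- ifelse(!is.na(age) & age <= 5, NA, age)

print(n_suspect)
print(summary(age_clean))
```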

summary(mooc_data$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1      23      26      29      32      83   96605

table(mooc_data$Age<=5)

## 
##  FALSE   TRUE 
## 543941    592

ggplot(mooc_data, aes(x=Age)) + geom_histogram(binwidth=1, fill="orange", color="black") + scale_x_continuous(breaks=seq(0, 85, 5)) + scale_y_continuous(labels=comma) + ggtitle("Distribution of Ages for Registered Students")

plot of chunk unnamed-chunk-15

Computer Science courses seem to be the most popular courses.

ggplot(mooc_data, aes(x=Full_Title)) + geom_bar() + scale_y_continuous(labels=comma) + coord_flip()

plot of chunk unnamed-chunk-16

MITx and HarvardX courses have roughly similar numbers of registered students in the dataset.

ggplot(mooc_data, aes(x=Institution)) + geom_bar() + scale_y_continuous(labels=comma) + coord_flip()

plot of chunk unnamed-chunk-17

Flagging Active Students as a feature.

Distributions may be very different for students who are at least somewhat active when compared to those who register but do little to nothing after.

ggplot(mooc_data, aes(x=grade)) + geom_freqpoly()

plot of chunk unnamed-chunk-18

summary(mooc_data$grade)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0       0       0       1   57400

# Limit to students who viewed the Courseware section at least once; viewed == 1 and
# had at least 1 interaction with the course beyond registration; nevents > 0
mooc_data_active <- mooc_data %>%
  filter(viewed == 1 & nevents > 0)

ggplot(mooc_data_active, aes(x=grade)) + geom_freqpoly()

plot of chunk unnamed-chunk-18

# Add the constraint that students had a final grade greater than 0.0%; grade > 0.0
mooc_data_active <- mooc_data %>%
  filter(viewed == 1 & nevents > 0 & grade > 0.0)

ggplot(mooc_data_active, aes(x=grade)) + geom_freqpoly()

plot of chunk unnamed-chunk-18

# Add the constraint that students explored at least 1/2 of the chapters of the course in the Courseware; explored == 1
mooc_data_active <- mooc_data %>%
  filter(viewed == 1 & nevents > 0 & grade > 0.0 & explored == 1)

ggplot(mooc_data_active, aes(x=grade)) + geom_freqpoly()

plot of chunk unnamed-chunk-18

Based on this exploration, it might be a good idea to create a new variable, Active_User, based on course activity.

Active Users will be defined as registered users who:

1. Viewed the Courseware section at least once; viewed == 1
2. Had at least 1 interaction with the course beyond registration; nevents > 0
3. Explored at least 1/2 of the chapters of the course in the Courseware; explored == 1
4. Had a final grade greater than 0.0%; grade > 0.0

Note: #4 has some implications for HarvardX Intro to CS I, since that course is pass/fail (0.0%/100.0%) with a threshold of 50% to receive passing certificate.
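Because several of these columns contain missing values, it helps to recall how R combines NA with `&`: the result is FALSE as soon as one condition is FALSE, and NA only when no condition is FALSE and at least one is NA. The sketch below applies the four criteria to a made-up data frame (`toy` and its values are hypothetical).

```r
# Made-up records covering the interesting cases.
toy <- data.frame(
  viewed   = c(1, 1, 0, 1),
  nevents  = c(120, NA, NA, NA),
  explored = c(1, 1, 0, 1),
  grade    = c(0.8, 0.0, NA, 0.6)
)

# Row 1: all TRUE -> TRUE. Rows 2 and 3: at least one FALSE -> FALSE.
# Row 4: no FALSE but nevents is NA -> NA.
flag <- with(toy, viewed == 1 & nevents > 0 & explored == 1 & grade > 0.0)

active <- factor(ifelse(flag, 1, 0), levels = c(0, 1),
                 labels = c("Not Active", "Active"))
print(table(active, useNA = "always"))
```

So a record ends up unclassified (NA) only when every criterion that is present is satisfied and at least one is missing.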

# Add Active_User as a feature
mooc_data <- mooc_data %>%
  mutate(Active_User = ifelse( (viewed == 1 & nevents > 0 & explored == 1 & grade > 0.0), 1, 0),
         Active_User = factor(Active_User, labels=c("Not Active", "Active")))

A relatively small group has missing information for these features, but most students can be categorized as Active or Not Active. There are 26,463 active students and 613,744 who were not active. 931 students had too much missing information across the 4 criteria to be classified.

table(mooc_data$Active_User, useNA="always")

## 
## Not Active     Active       <NA> 
##     613744      26463        931

About 4.1% of online students meet the criteria for being an active student.

prop.table(table(mooc_data$Active_User))

## 
## Not Active     Active 
##    0.95866    0.04134

ggplot(subset(mooc_data, !is.na(Active_User)), aes(x=Active_User)) + geom_bar()

plot of chunk unnamed-chunk-21

Multivariate Exploration

# Certified/Active
ggplot(subset(mooc_data, !is.na(Active_User)), aes(x=certified)) + geom_bar() + facet_wrap(~Active_User, scales = "free") + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-22

# Grades of Active Users

ggplot(subset(mooc_data, Active_User == "Active"), aes(x=grade)) + geom_freqpoly(aes(color=Institution), binwidth=0.01)

plot of chunk unnamed-chunk-22

ggplot(subset(mooc_data, Active_User == "Active"), aes(x=grade)) + geom_freqpoly(aes(color=Full_Title), binwidth=0.01)

plot of chunk unnamed-chunk-22

ggplot(subset(mooc_data, Active_User == "Active"), aes(x=grade)) + geom_freqpoly(aes(color=Country), binwidth=0.01)

plot of chunk unnamed-chunk-22

ggplot(subset(mooc_data, Active_User == "Active"), aes(x=grade)) + geom_freqpoly(aes(color=gender), binwidth=0.01)

plot of chunk unnamed-chunk-22

ggplot(subset(mooc_data, Active_User == "Active"), aes(x=grade)) + geom_freqpoly(aes(color=Level_of_Edu), binwidth=0.01)

plot of chunk unnamed-chunk-22

A small number of users below 10 years old look suspect. There’s an interesting split at 50%.

ggplot(mooc_data, aes(x=Age, y=grade)) + geom_jitter(aes(color=Institution), alpha=1/10) + scale_y_continuous(labels=percent, breaks=seq(0, 1, 0.1)) + scale_x_continuous(breaks=seq(0, 80, 10)) + scale_color_manual(values = school_colors)

plot of chunk unnamed-chunk-23

# Popularity
ggplot(mooc_data, aes(x=Full_Title)) + geom_bar(aes(fill = Institution)) + theme(axis.text.x=element_text(angle=90, size=9)) + scale_fill_manual(values = school_colors) + coord_flip() + theme_minimal()

plot of chunk unnamed-chunk-23

# Gender Enrollment
ggplot(subset(mooc_data, (!is.na(gender) & gender != "Other")), aes(x=Full_Title)) + geom_bar(aes(fill=gender), position="dodge", color="black") + theme_minimal() + theme(axis.text.x=element_text(angle=90, size=9)) + coord_flip()

plot of chunk unnamed-chunk-23

# Active User/Certification/Edu
ggplot(subset(mooc_data, !is.na(Active_User)), aes(x=Level_of_Edu)) + geom_bar(aes(fill=certified), position="dodge") + facet_wrap(~Active_User, scales = "free") + scale_y_continuous(labels=comma)

plot of chunk unnamed-chunk-23

CS50 is a pass/fail 0%/100% course. Average scores are very low when all registered students are included. It may be better to look at active students to get a better feel for student performance since so many registered but didn’t do anything in the course.

mooc_summary_any <- mooc_data %>%
  filter(!is.na(grade)) %>%
  group_by(Full_Title, Institution) %>%
  summarise(avg_grade_any = mean(grade),
            n_any = n())

mooc_summary_active <- mooc_data %>%
  filter(!is.na(grade),
         Active_User == "Active") %>%
  group_by(Full_Title, Institution) %>%
  summarise(avg_grade_active = mean(grade),
            n_active = n())

mooc_summary <- left_join(mooc_summary_any, mooc_summary_active, by = c("Full_Title", "Institution")) %>%
  mutate(active_perc = n_active/n_any)

ggplot(mooc_summary, aes(x=reorder(Full_Title, avg_grade_any), y=avg_grade_any)) + geom_bar(aes(fill=Institution), stat="identity") + theme(axis.text.x=element_text(angle=90, size=9)) + scale_fill_manual(values = school_colors) + coord_flip() + theme_minimal() + scale_y_continuous(labels=percent) + ggtitle("Among All Registered Users")

plot of chunk unnamed-chunk-24

What about among Active Users? Since Intro to CS1 is pass/fail, anyone who is active does not have 0.0 as a grade, and must have full 100% causing the average to be 100%.
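The pass/fail effect can be reproduced on a tiny invented example with base R's `aggregate()`: a pass/fail course only has grades of 0 or 1, so once zero grades are filtered out its average is exactly 1. The course names and grades below are made up, and the filter is simplified to `grade > 0` rather than the full Active_User definition.

```r
# Invented grades for one pass/fail course and one conventionally graded course.
toy <- data.frame(
  course = c("PassFail", "PassFail", "PassFail", "Graded", "Graded", "Graded"),
  grade  = c(0, 0, 1, 0, 0.6, 0.9),
  stringsAsFactors = FALSE
)

# Average grade over everyone, then over records with a grade above zero.
avg_any <- aggregate(grade ~ course, data = toy, FUN = mean)
avg_pos <- aggregate(grade ~ course, data = subset(toy, grade > 0), FUN = mean)

print(avg_any)
print(avg_pos)
```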

ggplot(mooc_summary, aes(x=reorder(Full_Title, avg_grade_active), y=avg_grade_active)) + geom_bar(aes(fill=Institution), stat="identity") + theme(axis.text.x=element_text(angle=90, size=9)) + scale_fill_manual(values = school_colors) + coord_flip() + theme_minimal() + scale_y_continuous(labels=percent) + ggtitle("Among Active Users")

plot of chunk unnamed-chunk-25

An interesting thing to note here is that the most popular course for registrations, Intro to Computer Science I, had by far the lowest percentage of active users.

ggplot(mooc_summary, aes(x=reorder(Full_Title, active_perc), y=active_perc)) + geom_bar(aes(fill=Institution), stat="identity") + theme(axis.text.x=element_text(angle=90, size=9)) + scale_fill_manual(values = school_colors) + coord_flip() + theme_minimal() + scale_y_continuous(labels=percent) + ggtitle("% of Active Users per Course")

plot of chunk unnamed-chunk-26

Number of events and video plays don’t seem to have a very high correlation with overall grades.

ggplot(mooc_data, aes(x=nevents, y=grade)) + geom_jitter(alpha=1/10)

plot of chunk unnamed-chunk-27

ggplot(mooc_data, aes(x=nplay_video, y=grade)) + geom_jitter(alpha=1/10)

plot of chunk unnamed-chunk-27

But number of unique days logged in and number of chapters viewed show some correlation.
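The scatterplots in this section give a visual impression; a rank correlation is one way to put a number on it. The sketch below uses `cor()` with `method = "spearman"` and `use = "complete.obs"` on invented values, so its output says nothing about the actual dataset.

```r
# Invented activity and grade values; one row has a missing activity count.
toy <- data.frame(
  ndays_act = c(1, 2, 5, 10, 20, NA, 40),
  grade     = c(0, 0, 0.1, 0.3, 0.6, 0.2, 0.9)
)

# Spearman's rho works on ranks, so it tolerates the heavy skew in activity counts.
# use = "complete.obs" drops rows where either value is NA.
rho <- cor(toy$ndays_act, toy$grade, method = "spearman", use = "complete.obs")
print(rho)
```

On the real data the same call could be repeated for nevents, nplay_video, nchapters, and nforum_posts to compare them on a common scale.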

ggplot(mooc_data, aes(x=ndays_act, y=grade)) + geom_jitter(alpha=1/10)

ggplot(mooc_data, aes(x=nchapters, y=grade)) + geom_jitter(alpha=1/10)

plot of chunk unnamed-chunk-28

ggplot(mooc_data, aes(x=nforum_posts, y=grade)) + geom_jitter(alpha=1/10)

plot of chunk unnamed-chunk-28

# Age and gender
# Summary Statistics for Age by gender are very similar.
by(mooc_data$Age, mooc_data$gender, summary)

## mooc_data$gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1      23      27      30      33      78    3555 
## -------------------------------------------------------- 
## mooc_data$gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1      23      26      28      32      83    6232 
## -------------------------------------------------------- 
## mooc_data$gender: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      26      26      26      26      26      26      12

# Certified percentages
ggplot(mooc_data, aes(x=certified, y=..count../sum(..count..))) + geom_bar() + scale_y_continuous(labels=percent)