
2 PREVIOUS RESEARCH
[20] is one of the most influential studies on using CV and ML
technologies for tracking students’ engagement. In that study, 34
undergraduate students participated in a “Cognitive Skills Training”
experiment and played three types of games on an iPad. While they
played, their front-face videos were recorded. A 4-level engagement
rating scale was used, where 1 stands for lacking engagement and 4
stands for highly engaged. The human ratings were done on two time
scales: long clips (60 seconds) vs. short clips (10 seconds). [20]
found that human raters agreed more on short clips. Regarding
the features, they used both low-level features (e.g., Box Filter (BF)
and Gabor Energy Filter features) and high-level ones based on
facial expression analysis. In particular, 3D face pose and 20 action
units (AUs) were extracted using the CERT toolkit [16]. In some
binary classification tasks on short video clips, they found that the
machine learning models achieved accuracy comparable to that of
humans.
[6] focused on students’ video-watching learning sessions. The
video data from 12 students were recorded in a laboratory setup
using a Microsoft Kinect V2 sensor. Regarding human rating, [6]
used a continuous rating scale: human raters moved a slider in
custom-designed rating software to provide continuous
measurements. Each session was rated by 9 raters, and their results
were combined to form the ground truth. A novel approach was
proposed to convert continuous ratings to trinary labels by using a
clustering method that considers the variation among human raters’
ratings. In their ML experiments, diverse types of features were used,
including facial landmarks, Facial Action Coding System (FACS) AUs,
FACS eye movement codes, emotion, optical-flow-related features, and
head size and pose.
[17] used both videos and heart rate data collected from 22 students
while they were completing a structured writing activity. Both
concurrent and retrospective self-reporting were used to generate
engagement ratings. Three sets of features, namely animation units
(from Microsoft Kinect), pixel-level LBP-TOP features,
and heart rates measured by a device, were jointly used to build
ML models to predict engagement. The face tracking features from
Kinect were found to yield the best performance among the individual
feature inputs. The overall best performance was achieved by using
a fusion of the different feature sets.
[14] described the baseline engagement prediction system developed
by the Challenge’s organizers. Because the number of labeled videos
is limited, multiple instance learning (MIL) [3]
was utilized. For each instance, i.e., a short video segment, both
DNN and Long Short-Term Memory (LSTM) [12] recurrent neural
networks (RNN) were used for modeling. However, the system’s
performance is not yet satisfactory: the Pearson correlation between
machine-predicted engagement and human-rated engagement is
less than 0.25.
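As a reminder of what the reported agreement figure measures, the following is a minimal sketch of the Pearson correlation between human-rated and machine-predicted engagement values. The function name and the sample values are hypothetical and are not taken from the baseline system or from the Challenge data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, for illustration only (not data from the Challenge).
human = [0.0, 0.33, 0.66, 1.0, 0.66]
predicted = [0.4, 0.5, 0.6, 0.7, 0.5]
r = pearson_correlation(human, predicted)
```

A value near 1 indicates that predictions rise and fall with the human ratings; a value below 0.25, as reported for the baseline, indicates only a weak linear relationship.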
3 CORPUS
The Challenge organizers selected educational videos (“learn the
Korean Language in 5 minutes”) as the learning material for students
to watch. Each recorded video is about 5 minutes long. The data
collection was designed to highlight its “wild” nature. First, a large
number of students were involved. In the Challenge, the training and
validation sets together contain 197 videos, far more than the number
of videos used in previous research. Second, the data set was collected
in unconstrained environments, i.e., at different locations such as
computer labs, hostel rooms, open ground, and a video conference
setup.
In the Challenge, the training set (N = 149) and the validation
set (N = 48) were released in March 2018 for all participants to
develop their models. The label distributions of the training set
and validation set are shown in Table 1. Note that the label
distributions of the provided official dataset are slightly different from
those described in the baseline paper [14], and our experiments were
conducted on the provided dataset. After June 2nd, the results of
processing the videos with OpenFace version 0.23 were released. The
test set (N = 67) was released on June 14th with the corresponding
OpenFace outputs.
Table 1: Label distributions of the dataset.
Label         0    1    2    3    Σ
training      5   35   81   28  149
validation    4   10   19   15   48
Σ             9   45  100   43  197
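The counts in Table 1 are unevenly distributed across the four levels, which is worth keeping in mind when training and evaluating models. The short sketch below recomputes the row and column totals from the table and derives per-level proportions; the variable and function names are illustrative only.

```python
# Counts copied from Table 1, ordered by engagement level 0..3.
counts = {
    "training": [5, 35, 81, 28],
    "validation": [4, 10, 19, 15],
}


def proportions(row):
    """Fraction of samples at each engagement level."""
    total = sum(row)
    return [c / total for c in row]


# Row totals (per split) and column totals (per level).
totals = {name: sum(row) for name, row in counts.items()}
per_level = [sum(col) for col in zip(*counts.values())]

train_props = proportions(counts["training"])
for level, p in enumerate(train_props):
    print(f"level {level}: {p:.1%} of the training set")
```

In the training set, level 2 accounts for 81 of 149 videos, while level 0 accounts for only 5.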
The engagement categories follow the 4-level scale defined in
[20]. In particular, intensity 0 means that the subject is completely
disengaged; 1 means barely engaged, such as moving restlessly in
the chair; 2 means the subject seems engaged, e.g., showing interest;
and 3 means the subject is highly engaged, e.g., being “glued” to
the MOOC content. Five annotators viewed the video content only
(without playing audio channels) and rated the subject’s engagement
level. Annotators’ labels with weighted Cohen’s κ less than
0.4 were dropped, and the final ground truth labels were formed on
a discrete 4-level scale. For evaluation, the intensity levels are linearly
scaled to the range [0, 1] as 0.0, 0.33, 0.66 and 1.0 by the Challenge
organizers.
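The two computations mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the organizers' procedure: the source does not say whether linear or quadratic weights were used for Cohen's κ, nor against which reference each annotator was compared, so the sketch assumes linear weights and a simple two-rater comparison. The function names are hypothetical. Note also that exact linear scaling gives 1/3 and 2/3, which the organizers list to two decimals as 0.33 and 0.66.

```python
def scale_level(level, num_levels=4):
    """Map a discrete level 0..num_levels-1 linearly onto [0, 1]."""
    if not 0 <= level < num_levels:
        raise ValueError("level out of range")
    return level / (num_levels - 1)


def weighted_kappa(rater_a, rater_b, num_levels=4):
    """Linearly weighted Cohen's kappa for two raters (assumed weighting)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty rating lists of equal length")
    k = num_levels
    # Observed joint distribution of the two raters' labels.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    marg_a = [sum(observed[i]) for i in range(k)]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Disagreement weight grows linearly with the distance between levels.
    obs_dis = sum(abs(i - j) / (k - 1) * observed[i][j]
                  for i in range(k) for j in range(k))
    exp_dis = sum(abs(i - j) / (k - 1) * marg_a[i] * marg_b[j]
                  for i in range(k) for j in range(k))
    if exp_dis == 0:
        raise ValueError("kappa is undefined when both raters use one level")
    return 1.0 - obs_dis / exp_dis


scaled = [scale_level(level) for level in range(4)]
# Hypothetical ratings, for illustration only.
kappa = weighted_kappa([0, 1, 2, 3, 2, 2], [0, 2, 2, 3, 1, 2])
keep_annotator = kappa >= 0.4  # threshold stated in the text
```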
4 METHODS
In this paper, we propose four different models for the engagement
detection task and build an ensemble model consisting of the four
individual models. In this section, we first introduce how we
preprocess the raw videos and extract segment-level features. Then we
present the details of the three cluster-based conventional models,
the end-to-end Neural Network (NN) models, and the posture-based
heuristic rules developed to further boost the machine-learning-based
results.
4.1 Feature extraction
From reviewing the previous research on engagement detection,
we found that many methods used face analysis results rather
than low-level pixel-based features. Also, in recent years,
high-quality open-source face analysis implementations have become
available. Therefore, we used OpenFace [5] to track head pose,
gaze directions [21], and AUs [4], which are intuitively strong cues
about the attention of the subjects. In our experiments, we used
OpenFace version 1.0, which was the latest version in April 2018,
to process all of the released videos.
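To make the idea of working from face analysis outputs concrete, the sketch below reads a small per-frame CSV and averages the head pose, gaze, and AU columns over fixed-length segments. It is not the authors' pipeline. The column names (`pose_Rx`, `gaze_0_x`, `AU01_r`) are assumed to follow the naming used in OpenFace CSV output and may differ between versions; the sample rows, the segment length, and the use of a simple mean are all hypothetical choices made for illustration.

```python
import csv
import io
from statistics import mean

# Assumed column-name prefixes for head pose, gaze, and action units.
PREFIXES = ("pose_", "gaze_", "AU")


def load_frames(csv_text):
    """Parse per-frame rows, keeping only pose, gaze, and AU columns."""
    reader = csv.reader(io.StringIO(csv_text))
    # Strip whitespace because some outputs separate headers with ", ".
    header = [name.strip() for name in next(reader)]
    keep = [i for i, name in enumerate(header) if name.startswith(PREFIXES)]
    frames = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        frames.append({header[i]: float(row[i]) for i in keep})
    return frames


def segment_means(frames, segment_len):
    """Average each kept column over consecutive non-overlapping segments."""
    segments = []
    for start in range(0, len(frames), segment_len):
        chunk = frames[start:start + segment_len]
        segments.append({key: mean(f[key] for f in chunk) for key in chunk[0]})
    return segments


# Synthetic rows, for illustration only (not real OpenFace output).
SAMPLE = """frame, timestamp, gaze_0_x, pose_Rx, AU01_r
1, 0.00, 0.10, 0.20, 0.0
2, 0.03, 0.30, 0.40, 1.0
3, 0.07, 0.50, 0.60, 2.0
4, 0.10, 0.70, 0.80, 3.0
"""

frames = load_frames(SAMPLE)
segments = segment_means(frames, segment_len=2)
```

Filtering by prefix keeps the example independent of the exact set of columns a given OpenFace version emits, at the cost of also picking up any other column that happens to share those prefixes.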
ICMI’18, October 16-20, 2018, Boulder, CO, USA