Engagement detection is crucial for monitoring learner interaction in educational settings. This research paper presents an ensemble model that utilizes face and body tracking technologies to assess engagement levels effectively. The authors propose innovative methods, including a cluster-based framework and neural networks with attention pooling, to enhance prediction accuracy. The study demonstrates significant improvements in engagement detection performance, achieving a low Mean Squared Error (MSE) on test sets. This work is valuable for researchers and educators interested in leveraging computer vision and machine learning for engagement monitoring.

Key Points

  • Proposes an ensemble model for engagement detection using face and body tracking.
  • Implements a cluster-based framework for fast engagement level predictions.
  • Utilizes neural networks with attention pooling for improved accuracy.
  • Achieves a low Mean Squared Error (MSE) of 0.081 on test sets.
Sharanya Kamath
Author:Cheng Chang, Cheng Zhang, Lei Chen, Yang Liu
7 pages
Language:English
Type:Research Paper
An Ensemble Model Using Face and Body Tracking for
Engagement Detection
Cheng Chang
Liulishuo
Shanghai, China
cheng.chang@liulishuo.com
Cheng Zhang
TongJi University
Shanghai, China
zhangcheng0219@qq.com
Lei Chen
Liulishuo Silicon Valley AI Lab
San Mateo, CA, USA
lei.chen@liulishuo.com
Yang Liu
Liulishuo Silicon Valley AI Lab
San Mateo, CA, USA
yang.liu@liulishuo.com
ABSTRACT
Precise detection and localization of learners’ engagement levels
are useful for monitoring their learning quality. In the EmotiW
Challenge’s engagement detection task, we proposed a series of
novel improvements, including (a) a cluster-based framework for
fast engagement level predictions, (b) a neural network using the
attention pooling mechanism, (c) heuristic rules using body posture
information, and (d) model ensemble for more accurate and robust
predictions. Our experimental results suggest that our proposed
methods effectively improved engagement detection performance.
On the validation set, our system can reduce the baseline Mean
Squared Error (MSE) by about 56%. On the final test set, our system
yielded a competitively low MSE of 0.081.
CCS CONCEPTS
• Computing methodologies → Neural networks; Activity recognition
and understanding;
KEYWORDS
Engagement detection, face tracking, body tracking, machine learning,
ensemble model
ACM Reference Format:
Cheng Chang, Cheng Zhang, Lei Chen, and Yang Liu. 2018. An Ensemble
Model Using Face and Body Tracking for Engagement Detection. In 2018
International Conference on Multimodal Interaction (ICMI ’18), October 16–20,
2018, Boulder, CO, USA. ACM, New York, NY, USA, 7 pages.
https://doi.org/10.1145/3242969.3264986
1 INTRODUCTION
Engagement is defined as a connection between a user and a resource,
and is emotional, cognitive, and behavioral in nature at any point
in time [22]. Tracking users’ engagement levels is important in many
applications, e.g., making sure vehicle drivers focus on road
conditions, or helping workers to improve manufacturing quality and
avoid accidents.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
ICMI ’18, October 16–20, 2018, Boulder, CO, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5692-3/18/10... $15.00
https://doi.org/10.1145/3242969.3264986
In education, tracking learners’ engagement levels is particularly
important for gauging how well learners have mastered the instructed
material. Typically, in traditional classroom-based education, human
teachers spend a significant amount of time monitoring the engagement
levels of all the students. Recently, Massive Open Online Courses
(MOOCs) have been widely adopted, and many students study by watching
MOOC instructional videos. However, in this setting, MOOC learners’
engagement levels are seldom tracked, and the massive number of
students who enroll in MOOCs makes it impractical for humans to track
engagement levels. Therefore, it is important to develop automatic
engagement-tracking methods based on rapidly developing computer
vision (CV) and machine learning (ML) technologies. With the spread
of online education businesses, several companies have announced that
CV-based engagement tracking systems have been used to help retain
the attention of online learners. However, the tracking accuracy of
these systems has not been thoroughly verified. Given this
technology’s potentially broad usage and the importance of evaluating
students, we feel that the multimodal sensing research community
needs to do more rigorous research and benchmark tests in engagement
tracking.
In the 2018 EmotiW challenge [2], one of the tasks is predicting
students’ engagement intensity while they watch MOOC videos under
diverse recording conditions and platforms. This task helps the
research community in three ways: (a) providing an opportunity to
investigate how accurate an engagement tracking system can be built
using existing CV and ML technologies, (b) providing a common data
set to compare different methods, and (c) testing various developed
algorithms in “wild” conditions that are close to real application
scenarios of future tracking systems.
The rest of this paper is organized as follows: Section 2 briefly
reviews exemplary research works in tracking learners’ engagement
levels; Section 3 introduces the engagement detection task and the
provided corpus; Section 4 describes the methods we have explored;
Section 5 reports our experimental results; finally, Section 6
summarizes our conclusions.
EmotiW Grand Challenge
ICMI’18, October 16-20, 2018, Boulder, CO, USA
2 PREVIOUS RESEARCH
[20] is one of the most influential studies on using CV and ML
technologies for tracking students’ engagement. 34 undergraduate
students participated in a “Cognitive Skills Training” experiment
and played three types of games on an iPad. While they played, their
front-face videos were recorded. A 4-level engagement rating scale
was used, where 1 stands for lacking engagement and 4 stands for
highly engaged. The human ratings were done on two time scales: long
clips (60 seconds) vs. short clips (10 seconds). [20] found that
human raters agreed more on short clips. Regarding the features, they
used both low-level features (e.g., Box Filter (BF) and Gabor Energy
Filters features) and high-level ones based on facial expression
analysis. In particular, 3D face pose and 20 action units (AUs) were
extracted by using the CERT toolkit [16]. In some binary
classification tasks on short video clips, they found that the
machine learning models performed with comparable accuracy to humans.
[6] focused on students’ video watching learning sessions. The video
data from 12 students were recorded in a laboratory setup by using a
Microsoft Kinect V2 sensor. Regarding human rating, [6] used a
continuous rating scale by allowing human raters to move a slider in
a custom-designed rating software to provide continuous measurements.
Each session was rated by 9 raters and their results were combined to
form the ground truth. A novel approach was proposed to convert
continuous ratings to trinary labels by using a clustering method
that considers human raters’ rating variations. In their ML
experiments, diverse types of features were used, including facial
landmarks, Facial Action Coding System (FACS) AUs, FACS eye movement
codes, emotion, optical flow related features, head size and pose.
[17] used both videos and heart rate data collected from 22 students
while they were completing a structured writing activity. Both
concurrent and retrospective self-reporting were used to generate
engagement ratings. Three sets of features, namely animation units
(from Microsoft Kinect), pixel-level LBP-TOP features, and heart
rates measured by a device, were jointly used to build ML models to
predict engagement. The face tracking features from Kinect were found
to yield the best performance among the different feature inputs. The
overall best performance was achieved by using a fusion of the
different sets of features.
[14] described the baseline engagement prediction system developed by
the Challenge’s organizers. Since the labeled videos are limited,
multiple instance learning (MIL) [3] was utilized to address this
issue. For each instance, i.e., a short video segment, both DNN and
Long Short-Term Memory (LSTM) [12] recurrent neural networks (RNN)
were used for modeling. However, the system’s performance is not yet
satisfactory: the Pearson correlation between machine-predicted
engagement and human-rated engagement is less than 0.25.
3 CORPUS
The challenge organizers selected educational videos titled “Learn
the Korean Language in 5 Minutes” as the learning material for
students to watch. Each recorded video is about 5 minutes long. The
data collection was designed to highlight its “wild” nature. First, a
large number of students were recorded. In the Challenge, the
training and validation sets together contain 197 videos, much larger
than the number of videos used in previous research. Second, the data
set was collected in non-constrained environments, i.e., at different
locations such as computer labs, hostel rooms, open ground, and a
video conference setup.
In the Challenge, the training set (N = 149) and the validation set
(N = 48) were released in March 2018 for all participants to develop
their models. The label distributions of the training set and
validation set are shown in Table 1. Note that the label
distributions of the provided official dataset are slightly different
from those described in the baseline paper [14], and our experiments
were conducted based on the provided dataset. After June 2nd, the
processing results of using OpenFace version 0.23 were released. The
test set (N = 67) was released on June 14th with the corresponding
OpenFace outputs.
Table 1: Label distributions of the dataset.
Label        0    1    2    3    Σ
training     5   35   81   28  149
validation   4   10   19   15   48
Σ            9   45  100   43  197
The engagement categories follow the 4-level scale defined in [20].
In particular, intensity 0 means that the subject is completely
disengaged. 1 means barely engaged, such as moving restlessly in the
chair. 2 means the subject seems engaged, e.g., showing interest. 3
means the subject is highly engaged, e.g., being “glued” to the MOOC
content. Five annotators viewed the video content only (without
playing audio channels) and rated the subject’s engagement level.
Annotators’ labels with weighted Cohen’s κ less than 0.4 were
dropped, and the final ground truth labels were formed on a discrete
4-level scale. For evaluation, the intensity levels are linearly
scaled to the range [0, 1] as 0.0, 0.33, 0.66 and 1.0 by the
Challenge organizers.
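As a minimal sketch of the evaluation described above, the snippet below maps the four discrete levels to the scaled targets and computes the MSE of a set of predictions. The function names and the example predictions are illustrative assumptions and are not part of the Challenge's tooling.

```python
# Scaled targets for the four engagement levels, as listed in the text.
LEVEL_TO_TARGET = {0: 0.0, 1: 0.33, 2: 0.66, 3: 1.0}


def scale_labels(levels):
    """Map discrete engagement levels (0-3) to their scaled targets."""
    return [LEVEL_TO_TARGET[level] for level in levels]


def mean_squared_error(targets, predictions):
    """Average squared difference between targets and predictions."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have equal length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical example: four videos, one per engagement level.
targets = scale_labels([0, 1, 2, 3])
predictions = [0.1, 0.33, 0.56, 0.9]  # made-up model outputs
mse = mean_squared_error(targets, predictions)
```

Because the targets lie in [0, 1], an error of 0.1 on every video would already give an MSE of 0.01, which helps put reported MSE values in perspective.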
4 METHODS
In this paper, we propose four different models for the engagement
detection task and build an ensemble model consisting of the four
individual models. In this section, we first introduce how we
preprocess the raw videos and extract segment-level features. Then we
present the details of the three cluster-based conventional models,
the end-to-end Neural Network (NN) models, and posture-based
heuristic rules developed to further boost the machine learning
based results.
4.1 Feature extraction
From reviewing the previous research on engagement detection, we
found that many methods used face analysis results rather than
low-level pixel-based features. Also, in recent years, high-quality
open-source face analysis implementations have become available.
Therefore, we used OpenFace [5] to track head pose, gaze directions
[21], and AUs [4], which are intuitively strong cues about the
attention of the subjects. In our experiments, we used OpenFace
version 1.0, which was the latest version in April 2018, when
processing all of the released videos.
By observing the raw video data, we noticed that subjects’ body
poses, especially their hand movements and body fidgets, have a
noticeable correlation with human-rated engagement intensities. When
a learner has more restless hand movements, he or she tends to have
lower engagement ratings. Therefore, beyond tracking faces, we also
performed pose and hand tracking. In fact, body tracking has been
used in previous engagement detection research by using depth
cameras, e.g., Microsoft Kinect devices. In this dataset, which
contains only 2D videos, we used a recent breakthrough in CV,
OpenPose [7], to directly track head, body, and hands. Figure 1 shows
example output from the two CV toolkits.
Figure 1: OpenFace analysis outputs in (a) and OpenPose analysis outputs in (b).
Before running feature extraction, all of the videos were
down-sampled to 10 frames per second (FPS) in order to reduce
computational cost. To eliminate possible unexpected effects from the
beginning and end of each video, when subjects are preparing either
to start watching or to leave the MOOC, only the video portions from
00:30 to 04:30 were used for engagement prediction. OpenFace [5] was
used to extract a 31-dimensional feature set for each frame, which
contains eye gaze movement, head movement, and facial action unit
features:
(1) Eye Gaze: It contains two 3-dimensional eye gaze direction
vectors for both eyes and one 2-dimensional eye gaze direction in
radians averaged for both eyes. The total dimension for the eye gaze
features is 8.
(2) Head Pose: It is a 6-dimensional feature set related to the
location of the head and the rotation of the head in radians around
the x, y and z axes.
(3) Facial Action Units (AUs): As OpenFace supports 17 different AUs,
we use the detected intensity (from 0 to 5) of these extracted AUs to
form a 17-dimensional feature vector.
Next, segment-level features for each video were constructed based on
the extracted 31-dimensional frame-level features. First, the 1st and
2nd order delta coefficients were computed and concatenated with the
original features, resulting in a 93-dimensional feature vector for
each frame. Second, we defined a sliding window of k frames and a
stride of l frames to group frames into segments and applied 6 moment
functions to each segment to further capture the dynamics among the
frames in the segment, i.e., min, max, mean, std, kurtosis, and
skewness. Therefore, the total feature vector per segment contains
31 · 3 · 6 = 558 attributes. Then we normalized all the segment-level
features to zero mean and unit variance by applying a normalization
scaler fitted on the training set. Finally, for computational
efficiency, we employed Principal Component Analysis (PCA) to reduce
the normalized features to a lower dimension of d.
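The construction above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the delta coefficients are approximated with finite differences (`np.gradient`), kurtosis is computed as excess kurtosis, and the helper names `add_deltas`, `six_moments`, and `segment_features` are hypothetical.

```python
import numpy as np


def add_deltas(frames):
    """Append 1st- and 2nd-order deltas to (n_frames, n_dims) features.

    Assumption: deltas are finite differences along the time axis;
    the text does not specify the exact delta formula.
    """
    delta1 = np.gradient(frames, axis=0)
    delta2 = np.gradient(delta1, axis=0)
    return np.concatenate([frames, delta1, delta2], axis=1)


def six_moments(window):
    """min, max, mean, std, kurtosis, skewness for each column of a window."""
    mean = window.mean(axis=0)
    std = window.std(axis=0)
    safe_std = np.where(std > 0, std, 1.0)  # avoid division by zero
    z = (window - mean) / safe_std
    skewness = (z ** 3).mean(axis=0)
    kurtosis = (z ** 4).mean(axis=0) - 3.0  # excess kurtosis (assumed convention)
    return np.concatenate(
        [window.min(axis=0), window.max(axis=0), mean, std, kurtosis, skewness]
    )


def segment_features(frames, k, l):
    """Slide a window of k frames with stride l and stack the moment vectors."""
    with_deltas = add_deltas(frames)
    starts = range(0, len(with_deltas) - k + 1, l)
    return np.stack([six_moments(with_deltas[s:s + k]) for s in starts])


# Hypothetical input: 200 frames of random 31-dimensional features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 31))
segments = segment_features(frames, k=20, l=20)  # one 558-dim row per segment
```

The normalization and PCA steps would follow on the stacked segment matrix, for example with scikit-learn's `StandardScaler` and `PCA`, fitted on training segments only and then applied to validation and test segments.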
For the sliding window size k and the stride l, we chose between two
different configurations based on their corresponding performances on
the validation set: (a) k = 20, l = 20, which means there is no
overlap between adjacent segments; and (b) k = 40, l = 20, which
means a larger segment and an overlap of 50% between adjacent
segments. Both strategies result in about 120 segments for each
video. The hyperparameter d was set to 20 by cross-validation. These
segment-level features were used for building machine learning models
in Sections 4.2 and 4.3 to predict engagement intensity.
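The segment counts quoted above can be checked with a short calculation. The sketch assumes that only full windows are kept (no padding at the end of the video); `count_segments` is a hypothetical helper name.

```python
def count_segments(n_frames, k, l):
    """Number of full windows of k frames taken every l frames."""
    if n_frames < k:
        return 0
    return (n_frames - k) // l + 1


# 00:30 to 04:30 is 240 seconds; at 10 FPS that is 2400 frames.
n_frames = (4 * 60 + 30 - 30) * 10

non_overlapping = count_segments(n_frames, k=20, l=20)   # configuration (a)
half_overlapping = count_segments(n_frames, k=40, l=20)  # configuration (b)
overlap_ratio = (40 - 20) / 40  # fraction of a window shared with the next one
```

Under this assumption, configuration (a) gives 120 segments and configuration (b) gives 119, which matches the "about 120 segments" stated in the text.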
4.2 Conventional method
Inspired by the baseline method introduced in [14], we designed
several conventional methods that share a common baseline pipeline as
depicted in Figure 2: (I) Use K-means clustering to group all
segments into K clusters. (II) Re-assign each segment’s engagement
intensity to its corresponding cluster’s average intensity. (III) For
the entire video, generate a video-level feature. In this paper, we
explored three different approaches to construct the video-level
features, as described in 4.2.1, 4.2.2 and 4.2.3. (IV) Build an
AdaBoost [11] regressor to estimate intensity based on the
video-level features obtained above. In our experiments, we varied K
over the set of values K ∈ {10, 20, 30, 40, 50} and, based on
cross-validation results, selected K = 20 for all the cluster-based
methods.
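Step (II) of the pipeline can be illustrated with a small sketch. It assumes the cluster ids have already been produced by a K-means implementation, and that each segment initially carries the label of the video it came from; the latter is an assumption, since the text does not state how segments get their initial intensity. The function names are hypothetical.

```python
from collections import defaultdict


def cluster_mean_intensity(cluster_ids, intensities):
    """Average intensity of the segments assigned to each cluster."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cid, value in zip(cluster_ids, intensities):
        totals[cid] += value
        counts[cid] += 1
    return {cid: totals[cid] / counts[cid] for cid in totals}


def reassign_intensities(cluster_ids, cluster_means):
    """Step (II): replace each segment's intensity with its cluster's mean."""
    return [cluster_means[cid] for cid in cluster_ids]


# Hypothetical toy data: six segments from two videos, three clusters.
# Assumption: each segment starts with the label of the video it came from.
cluster_ids = [0, 1, 1, 2, 0, 2]
initial = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
means = cluster_mean_intensity(cluster_ids, initial)
reassigned = reassign_intensities(cluster_ids, means)
```

In this toy case, cluster 0 contains one segment from each video, so its segments are re-assigned the average of the two video labels, while clusters 1 and 2 keep the label of the single video they draw from.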
4.2.1 Doc2vec. Inspired by the Doc2vec method introduced in [15] and
its successful applications in both NLP and multimodal tasks [8], we
designed a method to generate video-level features using it. Doc2vec
extends from learning words’ representation (word2vec) to learning
paragraph or document level representations. By treating each video
as a paragraph consisting of the “visual words” that are clusters’
ids, we used the Gensim Python package [19] to compute a distributed
Doc2vec representation to represent the video. In this work, we set
the dimension of the Doc2vec representation to 5 through
cross-validation experiments.
4.2.2 Moment functions. The quality of the Doc2vec feature
representations is heavily impacted by data size. Given that the
engagement detection dataset is quite limited, we also tried a much
simpler feature representation than the Doc2vec method. We applied
the six moment functions, i.e., min, max, mean, std, kurtosis, and
skewness, to the re-assigned segmental labels of each video to
generate a 6-dimensional representation of the entire video.
4.2.3 Moment functions and cluster distribution. Considering that the
moment functions may be too coarse to comprehensively represent a
video, we also explored adding the cluster distribution information.
Specifically, we calculated the frequency distribution of the K
clusters over the whole video and generated a K-dimensional vector.
The cluster distribution vector is then concatenated to the original
6-dimensional representation to form a (K + 6)-dimensional
video-level feature. In this work, this approach resulted in a
26-dimensional representation of the entire video (K = 20).
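The video-level representations of 4.2.2 and 4.2.3 can be sketched as follows. This is a hedged illustration: relative frequencies are assumed for the cluster distribution (raw counts would be equally consistent with the text), the moment vector is placed before the distribution, kurtosis is taken as excess kurtosis, and all function names are hypothetical.

```python
import numpy as np


def moment_vector(values):
    """min, max, mean, std, kurtosis, skewness of a 1-D sequence."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    z = (values - mean) / std if std > 0 else np.zeros_like(values)
    return np.array([
        values.min(), values.max(), mean, std,
        (z ** 4).mean() - 3.0,  # excess kurtosis (assumed convention)
        (z ** 3).mean(),        # skewness
    ])


def cluster_distribution(cluster_ids, n_clusters):
    """Relative frequency of each cluster id over one video."""
    counts = np.bincount(cluster_ids, minlength=n_clusters)
    return counts / counts.sum()


def video_feature(cluster_ids, reassigned, n_clusters):
    """Concatenate the 6-dim moment vector with the K-dim distribution."""
    return np.concatenate([
        moment_vector(reassigned),
        cluster_distribution(cluster_ids, n_clusters),
    ])


# Hypothetical video with 120 segments and K = 20 clusters.
rng = np.random.default_rng(0)
ids = rng.integers(0, 20, size=120)
labels = rng.uniform(0.0, 1.0, size=120)  # stand-in for re-assigned labels
feature = video_feature(ids, labels, n_clusters=20)
```

With K = 20 the resulting vector has 26 entries, matching the dimensionality stated above; using only `moment_vector` gives the 6-dimensional variant of 4.2.2.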

FAQs

What methods were proposed for engagement detection in the study?
The study proposed an ensemble model that incorporates four distinct methods for engagement detection. These methods include a cluster-based framework for fast engagement level predictions, a neural network utilizing an attention pooling mechanism, heuristic rules based on body posture information, and a model ensemble to enhance prediction accuracy. The combination of these approaches aims to improve the robustness and precision of engagement detection in learners.
How did the proposed model perform compared to the baseline?
The proposed ensemble model achieved a significant improvement over the baseline model. On the validation set, it reduced the Mean Squared Error (MSE) by about 56%, achieving an MSE of 0.0441. This performance indicates the effectiveness of combining different models and methodologies in accurately predicting engagement levels during learning sessions.
What features were extracted for engagement detection?
The study utilized OpenFace and OpenPose to extract a variety of features for engagement detection. From OpenFace, a 31-dimensional feature set was derived, including eye gaze movement, head pose, and facial action units. Additionally, OpenPose was used to track body and hand movements, which were found to correlate with engagement levels. These features were then processed to create segment-level representations for machine learning models.
What were the engagement levels defined in the study?
The engagement levels in the study were categorized into a four-level scale. Level 0 indicates complete disengagement, Level 1 represents barely engaged behavior, Level 2 signifies that the subject seems engaged, and Level 3 indicates high engagement, where the subject is fully attentive to the content. This classification helps in quantifying the engagement intensity during learning sessions.
What is the significance of the attention mechanism in the neural network model?
The attention mechanism in the neural network model is crucial for enhancing engagement detection accuracy. It allows the model to focus on salient segments of the video, which may indicate varying levels of engagement over time. By utilizing this mechanism, the model can better capture the dynamics of engagement, leading to improved predictions as it emphasizes the most relevant parts of the input data.
How were the heuristic rules based on body posture applied?
Heuristic rules based on body posture were applied to adjust engagement intensity predictions. Specifically, the study implemented two rules: one for hand fidgeting, which penalizes engagement levels if excessive hand movements are detected, and another for body fidgeting, which rewards higher engagement levels if minimal body movement is observed. These rules were designed to refine the predictions made by the machine learning models.
What dataset was used for the engagement detection task?
The dataset used for the engagement detection task consisted of educational videos focused on learning the Korean language, each approximately five minutes long. The data collection emphasized 'wild' conditions, capturing a diverse set of environments such as computer labs and open grounds. This dataset included 197 videos for training and validation, providing a substantial basis for developing and testing the engagement detection models.