An Ensemble Model Using Face and Body Tracking for Engagement Detection

Cheng Chang

Liulishuo

Shanghai, China

[email protected]

Cheng Zhang

TongJi University

Shanghai, China

[email protected]

Lei Chen

Liulishuo Silicon Valley AI Lab

San Mateo, CA, USA

[email protected]

Yang Liu

Liulishuo Silicon Valley AI Lab

San Mateo, CA, USA

[email protected]

ABSTRACT

Precise detection and localization of learners’ engagement levels

are useful for monitoring their learning quality. In the EmotiW

Challenge’s engagement detection task, we proposed a series of

novel improvements, including (a) a cluster-based framework for

fast engagement level predictions, (b) a neural network using the

attention pooling mechanism, (c) heuristic rules using body posture

information, and (d) model ensemble for more accurate and robust

predictions. Our experimental results suggest that our proposed

methods effectively improved engagement detection performance.

On the validation set, our system can reduce the baseline Mean

Squared Error (MSE) by about 56%. On the final test set, our system

yielded a competitively low MSE of 0.081.

CCS CONCEPTS

• Computing methodologies → Neural networks; Activity recognition and understanding;

KEYWORDS

Engagement detection, face tracking, body tracking, machine learning, ensemble model

ACM Reference Format:

Cheng Chang, Cheng Zhang, Lei Chen, and Yang Liu. 2018. An Ensemble

Model Using Face and Body Tracking for Engagement Detection. In 2018

International Conference on Multimodal Interaction (ICMI ’18), October 16–20,

2018, Boulder, CO, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3242969.3264986

1 INTRODUCTION

Engagement is defined as a connection between a user and a resource, comprising emotional, cognitive, and behavioral aspects at any point in time [ ]. Tracking users’ engagement levels is


important in many applications, e.g., making sure vehicle drivers focus on road conditions and helping workers improve manufacturing quality and avoid accidents.

In education, tracking learners’ engagement levels is particularly

important for tracking how learners have mastered the instructed

material. Typically, in traditional classroom-based education, hu-

man teachers spend a significant amount of time on monitoring

the engagement levels of all the students. Recently, Massive Open Online Courses (MOOCs) have been widely adopted, and many students study by watching MOOC instructional videos. However, in this setting, MOOC learners’ engagement levels are seldom tracked, and the massive number of students who enroll in MOOCs makes it impractical for humans to track engagement levels. Therefore, it is important to develop automatic engagement-tracking methods based on rapidly developing computer vision

(CV) and machine learning (ML) technologies. With the widespread

online education business, several companies have announced that

CV-based engagement tracking systems have been used to help re-

tain attention of online learners. However, the tracking accuracy of

these systems has not been thoroughly verified. Given this technology’s potentially broad usage and the importance of evaluating students, we feel that the multimodal sensing research community needs to do more rigorous research and benchmark testing in engagement tracking.

In the 2018 EmotiW challenge [

], one of the tasks is predicting

students’ engagement intensity during their MOOC video watch-

ing processes under diverse recording conditions and platforms.

Clearly, this task will help the research community in three ways:

(a) providing an opportunity to investigate how accurate an en-

gagement tracking system can be built using existing CV and ML

technologies, (b) providing a common data set to compare different

methods, and (c) testing various developed algorithms on “wild”

conditions that are close to real application scenarios of the future

tracking systems.

The rest of this paper is organized as follows: Section 2 briefly re-

views exemplary research works in tracking learners’ engagement

levels; Section 3 introduces the engagement detection task and

the provided corpus; Section 4 describes the methods we have explored; Section 5 reports our experimental results; finally, Section 6 summarizes our conclusions.

EmotiW Grand Challenge

ICMI’18, October 16-20, 2018, Boulder, CO, USA


2 PREVIOUS RESEARCH

[

] is one of the most influential studies of using CV and ML

technologies for tracking students’ engagement. Thirty-four undergraduate students participated in a “Cognitive Skills Training” experiment and played three types of games on an iPad. While playing games,

their front-face videos were recorded. A 4-level engagement rating

scale was used, where 1 stands for lacking engagement and 4 stands

for highly engaged. The human ratings were done on two time

scales: long clips (60 seconds) vs. short clips (10 seconds). [

]

found that human raters agreed more on short clips. Regarding

the features, they used both low-level features (e.g., Box Filter (BF)

and Gabor Energy Filters features) and high-level ones based on

facial expression analysis. In particular, 3D face pose and 20 action

units (AUs) were extracted by using the CERT toolkit [

]. In some

binary classification tasks on short video clips, they found that the

machine learning models performed with comparable accuracy to

humans.

[

] focused on students’ video watching learning sessions. The

video data from 12 students were recorded in a laboratory setup

by using a Microsoft Kinect V2 sensor. Regarding human rating, [

]

used a continuous rating scale by allowing human raters to move a

slider in a custom-designed rating software to provide continuous

measurements. Each session was rated by 9 raters and their results

were combined to form the ground truth. A novel approach was

proposed to convert continuous ratings to trinary labels by using a

clustering method that considers human raters’ rating variations. In

their ML experiments, diverse types of features were used, including

facial landmarks, Facial Action Coding System (FACS) AUs, FACS

eye movement codes, emotion, optical flow related features, head

size and pose.

[

] used both videos and heart rate data collected from 22 stu-

dents when they were completing a structured writing activity. Both

concurrent and retrospective self-reporting were used to generate

engagement ratings. Three sets of features, namely animation units (from Microsoft Kinect), pixel-level LBP-TOP features, and heart rates measured by a device, were jointly used to build

ML models to predict engagement. The face tracking features from

Kinect were found to yield the best performance among different

feature inputs. The overall best performance was achieved by using

a fusion of different sets of features.

[ ] described the baseline engagement prediction system developed by the Challenge’s organizers. Because labeled videos are limited, multiple instance learning (MIL) [ ] was utilized. For each instance, i.e., a short video segment, both DNN and Long Short-Term Memory (LSTM) [ ] recurrent neural networks (RNN) were used for modeling. However, the system’s performance is not yet satisfactory: the Pearson correlation between machine-predicted and human-rated engagement is less than 0.25.

3 CORPUS

The challenge organizers selected educational videos, “Learn the Korean Language in 5 Minutes”, as the learning material for students to watch. Each recorded video is about 5 minutes long. The data collection was designed to highlight its “wild” nature. First, a large number of students participated. In the Challenge, the training and validation sets together contain 197 videos, much larger than the number of

videos used in previous research. Second, the data set was collected

in non-constrained environments, i.e., at different locations such as

computer labs, hostel rooms, open ground, and a video conference

setup.

In the Challenge, the training set (N = 149) and the validation set (N = 48) were released in March 2018 for all participants to

develop their models. The label distributions of the training set

and validation set are shown in Table 1. Note that the label distri-

butions of the provided official dataset are slightly different from

that described in the baseline paper [

], and our experiments were

conducted based on the provided dataset. After June 2nd, the processing results of using OpenFace version 0.23 were released. The test set (N = 67) was released on June 14th with the corresponding

OpenFace outputs.

Table 1: Label distributions of the dataset.

Label        0    1    2    3   Total
training     5   35   81   28     149
validation   4   10   19   15      48
total        9   45  100   43     197

The engagement categories follow the 4-level scale defined in

[

]. In particular, intensity 0 means that the subject is completely

disengaged. 1 means barely engaged, such as moving restlessly in

the chair. 2 means the subject seems engaged, e.g., showing interest. 3 means the subject is highly engaged, e.g., being “glued” to

the MOOC content. Five annotators viewed the video content only

(without playing audio channels) and rated the subject’s engagement level. Annotators’ labels with weighted Cohen’s κ less than 0.4 were dropped, and the final ground truth labels were formed on a discrete 4-level scale. For evaluation, the intensity levels are linearly scaled to the range [0, 1] as 0.0, 0.33, 0.66 and 1.0 by the Challenge organizers.
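As a minimal sketch of the evaluation described above, the following maps the discrete labels to the scaled values and computes the MSE. The function names and the example predictions are illustrative assumptions, not part of the Challenge tooling.

```python
# Scaled targets for labels 0..3, as listed by the Challenge organizers.
SCALED_LEVELS = {0: 0.0, 1: 0.33, 2: 0.66, 3: 1.0}


def scale_labels(labels):
    """Map discrete engagement labels (0-3) to the [0, 1] evaluation scale."""
    return [SCALED_LEVELS[label] for label in labels]


def mean_squared_error(targets, predictions):
    """Plain MSE between two equal-length sequences of floats."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical example: three videos labeled 0, 2 and 3.
example_targets = scale_labels([0, 2, 3])
example_mse = mean_squared_error(example_targets, [0.1, 0.66, 0.9])
```

Because the targets take only four values, a constant prediction near the most frequent level already gives a fairly low MSE, which is worth keeping in mind when comparing systems.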

4 METHODS

In this paper, we propose four different models for the engagement

detection task and build an ensemble model consisting of the four

individual models. In this section, we first introduce how we pre-

process the raw videos and extract segment-level features. Then we

present the details of the three cluster-based conventional models,

the end-to-end Neural Network (NN) models, and posture-based

heuristic rules developed to further boost the machine learning

based results.

4.1 Feature extraction

From reviewing the previous research on engagement detection,

we found that many methods used face analysis results rather

than low-level pixel-based features. Also, in recent years, high-

quality face analysis open-source implementations have become

available. Therefore, we used OpenFace [

] to track head pose,

gaze directions [

], and AUs [

], which are intuitively strong cues

about the attention of the subjects. In our experiments, we used OpenFace version 1.0, which was the latest version in April 2018, to process all of the released videos.


By observing the raw video data, we noticed that subjects’ body

poses, especially their hand movements and body fidgets, have a

noticeable correlation with human-rated engagement intensities.

When a learner has more restless hand movements, he or she tends

to have lower engagement ratings. Therefore, beyond tracking faces,

we also performed pose and hand tracking. In fact, body tracking

has been used in previous engagement detection research by using

depth cameras, e.g., Microsoft Kinect devices. In this dataset with only 2D videos, we used a recent breakthrough in CV, OpenPose [ ], to directly track head, body, and hands. Figure 1 shows example outputs from the two CV toolkits.

Figure 1: OpenFace analysis outputs in (a) and OpenPose analysis outputs in (b).

Before running feature extraction, all of the videos were down-sampled to 10 frames per second (FPS) in order to reduce computational cost. To eliminate possible unexpected effects from the initial and ending parts of the video, when subjects are preparing either to start watching the MOOC video or to leave, only the video portions from 0:30 to 4:30 were used for engagement prediction.
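The down-sampling and trimming step can be expressed as a selection of source-frame indices. The sketch below is only an illustration: the function name, the nearest-frame sampling strategy, and the 30 FPS source rate in the example are assumptions, since the paper does not state how the down-sampling was implemented.

```python
def select_frame_indices(source_fps, target_fps=10, start_s=30, end_s=270):
    """Return source-frame indices sampled at target_fps within [start_s, end_s).

    Nearest-frame sampling is an assumption made for this sketch.
    """
    n_out = int((end_s - start_s) * target_fps)
    return [
        int(round((start_s + i / target_fps) * source_fps))
        for i in range(n_out)
    ]


# Hypothetical 30 FPS recording: 4 minutes at 10 FPS gives 2400 frames.
indices = select_frame_indices(source_fps=30)
```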

OpenFace [

] was used to extract a 31-dimensional feature set for

each frame, which contains eye gaze movement, head movement,

and facial action units features:

(1) Eye Gaze: It contains two 3-dimensional eye gaze direction vectors, one for each eye, and one 2-dimensional eye gaze direction in radians averaged for both eyes. The total dimension for the eye gaze features is 8.

(2) Head Pose: It is a 6-dimensional feature set related to the location of the head and the rotation of the head in radians around the x, y and z axes.

(3) Facial Action Units (AUs): As OpenFace supports 17 different AUs, we use the detected intensity (from 0 to 5) of these extracted AUs to form a 17-dimensional feature vector.
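To make the 8 + 6 + 17 = 31 breakdown concrete, the sketch below assembles one frame-level vector from a row of OpenFace output. The column names are assumptions based on the CSV headers of recent OpenFace releases and should be checked against the actual output files; `frame_vector` is a hypothetical helper.

```python
# Assumed OpenFace CSV column names; verify against your OpenFace output.
GAZE_COLS = (
    [f"gaze_{eye}_{axis}" for eye in (0, 1) for axis in "xyz"]
    + ["gaze_angle_x", "gaze_angle_y"]
)
POSE_COLS = [f"pose_{kind}{axis}" for kind in "TR" for axis in "xyz"]
AU_IDS = [1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 45]
AU_COLS = [f"AU{au:02d}_r" for au in AU_IDS]  # "_r" = intensity columns
FRAME_FEATURE_COLS = GAZE_COLS + POSE_COLS + AU_COLS


def frame_vector(row):
    """Build the 31-dimensional frame feature from a dict-like CSV row."""
    return [float(row[name]) for name in FRAME_FEATURE_COLS]
```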

Next, segment-level features for each video were constructed based on the extracted 31-dimensional frame-level features. First, the 1st and 2nd order delta coefficients were computed and concatenated with the original features, resulting in a 93-dimensional feature vector for each frame. Second, we defined a sliding window of k frames and a stride of l frames to group frames into segments and applied 6 moment functions to each segment to further capture the dynamics among the frames in the segment, i.e., min, max, mean, std, kurtosis, and skewness. Therefore, the total feature vector per segment contains 31 × 3 × 6 = 558 attributes. Then we normalized all the segment-level features to zero mean and unit variance by applying a normalization scaler fitted on the training set. Finally, for computational efficiency, we employed Principal Component Analysis (PCA) to reduce the normalized features to a lower dimension of d.
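The construction of the 558-dimensional segment vectors can be sketched as follows. This is an illustration on synthetic data, not the authors' code: `np.gradient` stands in for the delta computation (the exact delta formula is not given in the paper), and the function names and default window parameters are assumptions.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def add_deltas(frames):
    """Append 1st- and 2nd-order deltas to (n_frames, n_feats) features.

    np.gradient is an assumed, simple delta estimate.
    """
    d1 = np.gradient(frames, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([frames, d1, d2])


def segment_features(frames, k=20, l=20):
    """Slide a window of k frames with stride l; apply six moment functions."""
    feats = add_deltas(np.asarray(frames, dtype=float))
    segments = []
    for start in range(0, len(feats) - k + 1, l):
        w = feats[start:start + k]
        segments.append(np.concatenate([
            w.min(axis=0), w.max(axis=0), w.mean(axis=0),
            w.std(axis=0), kurtosis(w, axis=0), skew(w, axis=0),
        ]))
    return np.vstack(segments)


# Synthetic stand-in for 4 minutes of 31-dimensional frames at 10 FPS.
rng = np.random.default_rng(0)
demo_segments = segment_features(rng.normal(size=(2400, 31)))
```

With 31 input features, each segment row has 31 × 3 × 6 = 558 columns. Note that kurtosis and skewness are undefined for a feature that is constant within a window, so real data may need a guard for that case.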

For the sliding window size k and the stride l, we chose between two different configurations based on their corresponding performances on the validation set: (a) k = 20, l = 20, which means there is no overlap between adjacent segments; and (b) k = 40, l = 20, which means a larger segment and an overlap of 50% between adjacent segments. Both strategies result in about 120 segments for each video. The hyperparameter d was set to 20 by cross-validation.

These segment-level features were used for building machine learning models in Sections 4.2 and 4.3 to predict engagement intensity.
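A minimal sketch of the normalization and PCA step, assuming scikit-learn and synthetic 558-dimensional segment features. The helper names are hypothetical; the key point illustrated is that both the scaler and the PCA are fitted on the training set only and then applied to other splits. The `n_segments` helper shows why a stride of 20 frames yields about 120 segments for a 2400-frame video.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def fit_reducer(train_segments, d=20):
    """Fit the scaler and PCA on training segments only."""
    scaler = StandardScaler().fit(train_segments)
    pca = PCA(n_components=d, random_state=0)
    pca.fit(scaler.transform(train_segments))
    return scaler, pca


def reduce_segments(segments, scaler, pca):
    """Apply the training-set scaler and PCA to any split."""
    return pca.transform(scaler.transform(segments))


def n_segments(n_frames, k, l):
    """Number of full windows of k frames taken with stride l."""
    return (n_frames - k) // l + 1


# Synthetic stand-ins for 558-dimensional segment-level features.
rng = np.random.default_rng(0)
train = rng.normal(size=(300, 558))
valid = rng.normal(size=(60, 558))
scaler, pca = fit_reducer(train)
valid_reduced = reduce_segments(valid, scaler, pca)
```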

4.2 Conventional method

Inspired by the baseline method introduced in [ ], we designed several conventional methods that share a common pipeline, as depicted in Figure 2: (I) Use K-means clustering to group all segments into K clusters. (II) Re-assign each segment’s engagement intensity to its corresponding cluster’s average intensity. (III) For the entire video, generate a video-level feature. In this paper, we explored three different approaches to construct the video-level features, as described in 4.2.1, 4.2.2 and 4.2.3. (IV) Build an AdaBoost [ ] regressor to estimate intensity based on the video-level features obtained above. In our experiments, we varied K over a set of candidate values and, based on cross-validation results, selected K = 20 for all the cluster-based methods.
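The four pipeline steps can be sketched with scikit-learn on synthetic data. This is one possible reading rather than the authors' implementation: it assumes each segment initially inherits its video's label, so that a cluster's average intensity is the mean label of its member segments, and it uses a single placeholder video-level feature for step (III). All variable names and data sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
K = 20
n_videos, n_seg, d = 30, 120, 20

# Synthetic stand-ins: reduced segment features and scaled video labels.
segments = rng.normal(size=(n_videos, n_seg, d))
video_labels = rng.choice([0.0, 0.33, 0.66, 1.0], size=n_videos)

# (I) Cluster all segments from all videos into K clusters.
flat = segments.reshape(-1, d)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(flat)
cluster_ids = kmeans.labels_.reshape(n_videos, n_seg)

# (II) Assumed reading: segments inherit their video's label, and each
# cluster's intensity is the mean label of its member segments.
seg_labels = np.repeat(video_labels, n_seg)
cluster_intensity = np.array(
    [seg_labels[kmeans.labels_ == c].mean() for c in range(K)]
)
reassigned = cluster_intensity[cluster_ids]  # shape (n_videos, n_seg)

# (III) Placeholder video-level feature: mean re-assigned intensity.
video_feats = reassigned.mean(axis=1, keepdims=True)

# (IV) AdaBoost regressor on the video-level features.
model = AdaBoostRegressor(random_state=0).fit(video_feats, video_labels)
predictions = model.predict(video_feats)
```

In practice the clustering and cluster intensities would be estimated on the training videos only and then applied to validation and test videos via `kmeans.predict`.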

4.2.1 Doc2vec. Inspired by the Doc2vec method introduced in [

]

and its successful applications in both NLP and multimodal tasks

[

], we designed a method to generate video-level features using it.

Doc2vec extends from learning words’ representation (word2vec) to

learning paragraph or document level representations. By treating

each video as a paragraph consisting of the “visual words” that are

clusters’ ids, we used the Gensim Python package [

] to compute

a distributed Doc2vec representation to represent the video. In this

work, we set the dimension of Doc2vec representation to 5 through

cross-validation experiments.

4.2.2 Moment functions. The quality of the Doc2vec feature rep-

resentations is heavily impacted by data sizes. Given the fact that

the engagement detection dataset is quite limited, we also tried

a much simpler feature representation than the Doc2vec method.

We applied the six moment functions, i.e., min, max, mean, std, kurtosis, and skewness, to the re-assigned segmental labels of each video to generate a 6-dimensional representation of the entire video.
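As a small illustration, the six moment functions can be applied to one video's re-assigned segment labels as follows. The function name and the toy input are assumptions made for this sketch.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def moment_features(reassigned_labels):
    """Six-dimensional video feature from re-assigned segment labels."""
    x = np.asarray(reassigned_labels, dtype=float)
    return np.array(
        [x.min(), x.max(), x.mean(), x.std(), kurtosis(x), skew(x)]
    )


# Toy example with four segments.
example = moment_features([0.2, 0.5, 0.5, 0.8])
```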

4.2.3 Moment functions and cluster distribution. Considering that

the moment functions may be too coarse to comprehensively represent a video, we also explored adding the cluster distribution information. Specifically, we calculated the frequency distribution of the clusters over the whole video and generated a K-dimensional vector. The cluster distribution vector is then concatenated to the original 6-dimensional representation to form a (K + 6)-dimensional video-level feature. In this work, this approach resulted in a 26-dimensional representation of the entire video (K = 20).
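A sketch of the combined feature, assuming the cluster distribution is a relative-frequency histogram over the K cluster ids (the paper does not say whether raw counts or normalized frequencies were used). The function name and the random inputs are illustrative.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def video_feature(cluster_ids, cluster_intensity):
    """Concatenate six label moments with the K-bin cluster frequencies."""
    cluster_ids = np.asarray(cluster_ids)
    cluster_intensity = np.asarray(cluster_intensity, dtype=float)
    n_clusters = len(cluster_intensity)
    x = cluster_intensity[cluster_ids]  # re-assigned segment labels
    moments = [x.min(), x.max(), x.mean(), x.std(), kurtosis(x), skew(x)]
    # Relative frequency of each cluster within this video (assumption).
    freq = np.bincount(cluster_ids, minlength=n_clusters) / len(cluster_ids)
    return np.concatenate([moments, freq])


# Synthetic video: 120 segments assigned to K = 20 clusters.
rng = np.random.default_rng(0)
demo_feature = video_feature(
    rng.integers(0, 20, size=120), rng.uniform(size=20)
)
```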
