Conference Agenda

DLM2-O: Deep Learning for Multimedia 2
Tuesday, 22/Sept/2020:
4:15pm - 5:15pm

Session Chair: Patrick Le Callet
Location: Virtual platform

4:15pm - 4:30pm

Emotion Dependent Facial Animation from Affective Speech

Rizwan Sadiq, Sasan Asadiabadi, Engin Erzin

Koç University, Turkey

In human-to-computer interaction, facial animation in synchrony with affective speech can deliver more naturalistic conversational agents. In this paper, we present a two-stage deep learning approach for affective speech driven facial shape animation. In the first stage, we classify affective speech into seven emotion categories. In the second stage, we train separate deep estimators within each emotion category to synthesize facial shape from the affective speech. Objective and subjective evaluations are performed over the SAVEE dataset. The proposed emotion-dependent facial shape model performs better in terms of Mean Squared Error (MSE) loss and in generating landmark animations, as compared to training a universal model regardless of the emotion.
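The two-stage pipeline the abstract describes can be sketched in toy form. This is not the paper's deep model: the emotion classifier is replaced by a nearest-centroid stand-in, the per-emotion deep estimators by linear least-squares fits, and all data shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

# Toy stand-ins: speech feature vectors (e.g., MFCCs) and facial landmark targets
X = rng.normal(size=(700, 13))
labels = rng.integers(0, len(EMOTIONS), size=700)
Y = rng.normal(size=(700, 10))

# Stage 1: classify affective speech into one of seven emotion categories
# (nearest-centroid classifier as a stand-in for the paper's deep classifier)
centroids = np.stack([X[labels == e].mean(axis=0) for e in range(len(EMOTIONS))])

def classify(x):
    return int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

# Stage 2: one facial-shape estimator per emotion category
# (linear least squares as a stand-in for the per-emotion deep estimators)
weights = {
    e: np.linalg.lstsq(X[labels == e], Y[labels == e], rcond=None)[0]
    for e in range(len(EMOTIONS))
}

def predict_shape(x):
    """Route a speech feature vector through the predicted emotion's estimator."""
    return x @ weights[classify(x)]

print(predict_shape(X[0]).shape)  # (10,)
```

The key design point is the routing: the emotion predicted in stage 1 selects which stage-2 estimator synthesizes the facial shape, rather than a single universal model serving all emotions.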

Sadiq-Emotion Dependent Facial Animation from Affective Speech-104.pdf

4:30pm - 4:45pm

DEMI: Deep Video Quality Estimation Model using Perceptual Video Quality Dimensions

Saman Zadtootaghaj2, Nabajeet Barman1, Rakesh Rao3, Steve Goring3, Maria Martini1, Alexander Raake3, Sebastian Möller2,4

1Kingston University, United Kingdom; 2Quality and Usability Lab, TU Berlin, Germany; 3Technische Universität Ilmenau, Germany; 4DFKI Projektbüro Berlin, Germany

With the advent and integration of gaming video streaming on traditional platforms such as YouTube Gaming and Facebook Gaming, it is imperative that the proposed quality estimation metrics work for both gaming and non-gaming content. Existing works in the field of quality assessment focus separately on gaming and non-gaming content. Along with traditional modeling approaches, deep-learning-based approaches have been used to develop quality models, due to their high prediction accuracy. Hence, we present in this paper a deep-learning-based quality estimation model that considers both gaming and non-gaming videos. The model is developed in three phases. First, a convolutional neural network (CNN) is trained on an objective metric, which allows the CNN to learn video artifacts such as blurriness and blockiness. Next, the model is fine-tuned on a small image quality dataset using blockiness and blurriness ratings. Finally, a Random Forest is used to pool frame-level predictions and temporal information of videos in order to predict the overall video quality.
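The third phase, pooling frame-level predictions with temporal information into a clip-level score, can be sketched as follows. This is a toy illustration, not DEMI itself: the Random Forest is replaced by a linear least-squares regressor, and the frame-level CNN scores and MOS labels are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def pool_frame_scores(frame_scores):
    """Pool per-frame quality predictions into a clip-level feature vector.

    Mean/std capture the overall level and variability; the fitted slope is a
    crude temporal feature (quality drift across the clip)."""
    t = np.arange(len(frame_scores))
    slope = np.polyfit(t, frame_scores, 1)[0]
    return np.array([frame_scores.mean(), frame_scores.std(), slope])

# Toy clips: 100 frames each of CNN frame-level quality predictions on a 1-5 scale
clips = [rng.uniform(1, 5, size=100) for _ in range(50)]
mos = np.array([c.mean() + rng.normal(scale=0.1) for c in clips])  # toy MOS labels

# Stand-in for the Random Forest pooler: linear least squares on pooled features
F = np.stack([pool_frame_scores(c) for c in clips])
F1 = np.hstack([F, np.ones((len(F), 1))])  # add a bias column
w = np.linalg.lstsq(F1, mos, rcond=None)[0]
pred = F1 @ w
print(round(float(np.abs(pred - mos).mean()), 3))
```

The point of the pooling stage is that clip-level quality is not just the mean of frame scores; variability and temporal trends also carry perceptual weight, which is why a learned pooler sits on top of the frame-level model.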

Zadtootaghaj-DEMI Deep Video Quality Estimation Model using Perceptual Video Quality Dimensions-277.pdf

4:45pm - 5:00pm

Variational Bound of Mutual Information for Fairness in Classification

Zahir Alsulaimawi

Oregon State University, United States of America

Machine learning applications have emerged in many aspects of our lives, such as credit lending, insurance rates, and employment applications. Consequently, such systems are required to be non-discriminatory and fair with respect to sensitive user features, e.g., race, sexual orientation, and religion. To address this issue, this paper develops a minimax adversarial framework, called the features protector (FP) framework, to achieve an information-theoretic trade-off between minimizing the distortion of target data and ensuring that sensitive features have similar distributions. We evaluate the performance of the proposed framework on two real-world datasets. Preliminary empirical evaluation shows that our framework provides both accurate and fair decisions.
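One way to read the minimax trade-off is as a single objective with two opposing terms; the sketch below illustrates that reading with synthetic numbers. The function name `fp_objective`, the cross-entropy adversary loss, and all data are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(3)

def fp_objective(y_hat, y, adv_prob, s, lam):
    """Minimax trade-off sketch: the protector wants low distortion on the
    target data but a *high* adversary loss (so the representation leaks
    little about the sensitive feature); the adversary wants the opposite."""
    distortion = ((y_hat - y) ** 2).mean()
    # cross-entropy of the adversary's guess at the sensitive bit s
    adv_loss = -np.mean(s * np.log(adv_prob) + (1 - s) * np.log(1 - adv_prob))
    return distortion - lam * adv_loss

y = rng.normal(size=100)
y_hat = y + rng.normal(scale=0.1, size=100)      # good reconstruction of target data
s = rng.integers(0, 2, size=100).astype(float)   # sensitive binary feature
chance = np.full(100, 0.5)                       # adversary stuck at chance level
leaky = np.clip(0.9 * s + 0.05, 0.05, 0.95)      # adversary that has learnt s

# A chance-level adversary (high adversary loss) gives the protector a lower
# objective than a leaky representation the adversary can exploit.
print(fp_objective(y_hat, y, chance, s, 1.0) < fp_objective(y_hat, y, leaky, s, 1.0))  # True
```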

Alsulaimawi-Variational Bound of Mutual Information for Fairness-203.pdf

5:00pm - 5:15pm

Profiling Actions for Sport Video Summarization: An attention signal analysis

Melissa Sanabria1,2,3, Frédéric Precioso1,2,3, Thomas Menguy4

1Université Côte d'Azur; 2Maasai, Inria Sophia Antipolis; 3Laboratoire d'Informatique, Signaux, et Systèmes de Sophia-Antipolis (I3S); 4Wildmoka

Analyzing video content to produce summaries and extract highlights has been challenging for decades. One of the biggest challenges for automatic sports video summarization is to produce summaries almost immediately after the match has ended, reflecting its course while preserving its emotions. Currently, in broadcast companies, many human operators select which actions should belong to the summary based on multiple rules they have built from their own experience using different sources of information. These rules define the different profiles of actions of interest that help the operator generate better customized summaries. Most of these profiles do not directly rely on broadcast video content but rather exploit metadata describing the course of the match. In this paper, we show how the signals produced by the attention layer of a recurrent neural network can be seen as a learnt representation of these action profiles and provide a new tool to support operators' work. The results on soccer matches show the capacity of our approach to transfer knowledge between datasets from different broadcasting companies and different leagues, and the ability of the attention layer to learn meaningful action profiles.
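The attention signal the abstract refers to can be illustrated with a minimal sketch: a softmax over scores computed from a sequence of hidden states yields a normalized weight per event, and the shape of that weight profile is what can be read as an action profile. Everything here is a toy stand-in; the real model is a trained recurrent network over match metadata.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sequence of match events (e.g., goals, fouls) encoded as feature vectors
T, d = 20, 8
H = rng.normal(size=(T, d))   # stand-in for the RNN's hidden states
v = rng.normal(size=d)        # stand-in for the learned attention vector

scores = H @ v
alpha = softmax(scores)       # the attention signal over the event sequence
print(alpha.shape)            # (20,)
```

Peaks in `alpha` mark the events the model attends to when deciding what belongs in a summary; inspecting how this profile varies across event types is what lets it be read as a learnt action profile that an operator can compare against their own rules.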

Sanabria-Profiling Actions for Sport Video Summarization-206.pdf