DLM2-O: Deep Learning for Multimedia 2
4:15pm - 4:30pm
Emotion Dependent Facial Animation from Affective Speech
Koc University, Turkey
In human-to-computer interaction, facial animation
in synchrony with affective speech can deliver more naturalistic
conversational agents. In this paper, we present a two-stage
deep learning approach for affective speech driven facial shape
animation. In the first stage, we classify affective speech into
seven emotion categories. In the second stage, we train separate
deep estimators within each emotion category to synthesize
facial shape from the affective speech. Objective and subjective
evaluations are performed over the SAVEE dataset. The proposed
emotion dependent facial shape model performs better in terms
of the Mean Squared Error (MSE) loss and in generating the
landmark animations, as compared to training a universal model
regardless of the emotion.
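
The two-stage idea described above can be sketched as a simple routing step: classify the emotion first, then hand the speech features to the estimator trained for that emotion. This is a minimal illustration, not the authors' implementation. The function names, the toy classifier and estimators, and the choice to classify once per utterance are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]   # one frame of speech features (hypothetical layout)
Shape = List[float]          # one frame of facial shape parameters (hypothetical layout)


def animate(
    frames: Sequence[Features],
    classify: Callable[[Sequence[Features]], str],
    estimators: Dict[str, Callable[[Features], Shape]],
) -> Tuple[str, List[Shape]]:
    """Stage 1: pick one emotion label for the utterance.
    Stage 2: run the estimator trained for that emotion on every frame."""
    emotion = classify(frames)
    estimator = estimators[emotion]  # KeyError if no estimator exists for the label
    return emotion, [estimator(frame) for frame in frames]


# Toy stand-ins for the trained models (assumptions for illustration only).
def toy_classifier(frames: Sequence[Features]) -> str:
    mean_energy = sum(frame[0] for frame in frames) / len(frames)
    return "angry" if mean_energy > 0.5 else "neutral"


toy_estimators: Dict[str, Callable[[Features], Shape]] = {
    "angry": lambda frame: [2.0 * frame[0]],
    "neutral": lambda frame: [0.5 * frame[0]],
}
```

A universal model, by contrast, would correspond to a single estimator applied to every utterance regardless of the label.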
4:30pm - 4:45pm
DEMI: Deep Video Quality Estimation Model using Perceptual Video Quality Dimensions
1Kingston University, United Kingdom; 2Quality and Usability Lab, TU Berlin, Germany; 3Technische Universität Ilmenau, Germany; 4DFKI Projektbüro Berlin, Germany
With the advent and integration of gaming video streaming on traditional platforms such as YouTube Gaming and Facebook Gaming, it is imperative that proposed quality estimation metrics work for both gaming and non-gaming content. Existing works in the field of quality assessment focus separately on gaming and non-gaming content. Along with the traditional modeling approaches, deep learning based approaches have been used to develop quality models because of their high prediction accuracy. Hence, we present in this paper a deep learning based quality estimation model considering both gaming and non-gaming videos. The model is developed in three phases. First, a convolutional neural network (CNN) is trained based on an objective metric, which allows the CNN to learn video artifacts such as blurriness and blockiness. Next, the model is fine-tuned on a small image quality dataset using blockiness and blurriness ratings. Finally, a Random Forest is used to pool frame-level predictions and temporal information of videos in order to predict the overall video quality.
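
As a rough illustration of the final pooling phase, the sketch below turns a variable-length list of frame-level quality scores into a fixed-length set of summary statistics, the kind of input a regressor such as a Random Forest could consume. The regressor itself is not shown. The function name and the particular statistics chosen are assumptions for the example, not the features used in the paper.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def pool_frame_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize per-frame quality scores of one video as fixed-length statistics."""
    if not scores:
        raise ValueError("at least one frame-level score is required")
    return {
        "mean": mean(scores),    # overall quality level
        "std": pstdev(scores),   # how much quality fluctuates over time
        "min": min(scores),      # worst frame
        "max": max(scores),      # best frame
    }
```

Pooling in this way gives every video the same number of features, whatever its frame count.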
4:45pm - 5:00pm
Variational Bound of Mutual Information for Fairness in Classification
Oregon State University, United States of America
Machine learning applications have emerged in many aspects of our lives, such as credit lending, insurance rates, and employment applications.
Consequently, such systems are required to be nondiscriminatory and fair with respect to users' sensitive features, e.g., race, sexual orientation, and religion.
To address this issue, this paper develops a minimax adversarial framework, called the features protector (FP) framework, to achieve the information-theoretical trade-off between minimizing distortion of target data and ensuring that sensitive features have similar distributions.
We evaluate the performance of the proposed framework on two real-world datasets. Preliminary empirical evaluation shows that our framework provides both accurate and fair decisions.
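
For orientation, one standard variational lower bound on mutual information is given below. The symbols are assumptions introduced here: $Z$ is a learned representation, $S$ the sensitive feature, and $q(s \mid z)$ any auxiliary predictor such as an adversary. Whether the paper uses this particular bound is also an assumption. The bound holds because the cross-entropy of $q$ is never smaller than the conditional entropy $H(S \mid Z)$.

```latex
\begin{align}
I(Z;S) &= H(S) - H(S \mid Z) \\
       &\ge H(S) + \mathbb{E}_{p(z,s)}\bigl[\log q(s \mid z)\bigr]
\end{align}
```

An adversary that maximizes the right-hand side over $q$ tightens the bound, while an encoder that minimizes it reduces the information about $S$ retained in $Z$, which gives the minimax structure its shape.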
5:00pm - 5:15pm
Profiling Actions for Sport Video Summarization: An attention signal analysis
1Université Côte d'Azur; 2Maasai, Inria Sophia Antipolis; 3Laboratoire d'Informatique, Signaux, et Systèmes de Sophia-Antipolis (I3S); 4Wildmoka
Analyzing video content to produce summaries and extract highlights has been challenging for decades. One of the biggest challenges for automatic sports video summarization is to produce summaries almost immediately after the match has ended, reflecting the course of the match while preserving emotions. Currently, in broadcast companies, many human operators select which actions should belong to the summary based on multiple rules they have built upon their own experience using different sources of information. These rules define the different profiles of actions of interest that help the operator to generate better customized summaries. Most of these profiles do not directly rely on broadcast video content but rather exploit metadata describing the course of the match. In this paper, we show how the signals produced by the attention layer of a recurrent neural network can be seen as a learnt representation of these action profiles and provide a new tool to support operators' work. The results in soccer matches show the capacity of our approach to transfer knowledge between datasets from different broadcasting companies and from different leagues, and the ability of the attention layer to learn meaningful action profiles.
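
An attention signal of the kind mentioned above can be pictured as one weight per time step, normalized so that the weights sum to one. The sketch below assumes softmax normalization over raw per-step scores, which is a common choice but not confirmed by the abstract. The function name is hypothetical, and the scoring network that would produce the raw scores is not shown.

```python
import math
from typing import List, Sequence


def attention_weights(scores: Sequence[float]) -> List[float]:
    """Softmax over per-time-step scores; larger weight means more attention."""
    if not scores:
        raise ValueError("at least one score is required")
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]
```

Plotted over the time steps around an action, such weights form a curve whose shape can be compared across actions.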