Uncertainty-aware Label Distribution Learning for Facial Expression Recognition

Page 1 (0s)

Uncertainty-aware Label Distribution Learning for Facial Expression Recognition.

Page 2 (23s)

[Audio] Ambiguity is still a key challenge in Facial Expression Recognition, or FER for short. In real-world scenarios, facial expressions are highly complex, and people tend to express multiple emotions at once. However, most existing datasets only provide a single label as ground truth, meaning that each facial image is labeled with only one emotion category. Unfortunately, people from different backgrounds may perceive and interpret facial expressions differently. This can lead to inconsistency, in which similar expressions are labeled differently by different annotators, posing a big challenge for training many FER models. Moreover, using only a single label can lead to insufficient supervision during training, as we do not have a comprehensive description of each facial image. Our proposed solution is to construct emotion distributions for training images from the provided single labels and use them to optimize the model.

Page 3 (1m 26s)

[Audio] We assume that facial images should have emotions similar to those of their neighbors in an auxiliary space. We want the auxiliary space to be highly correlated with the emotion space so that we can exploit as much information as possible. Although information such as facial landmarks and action units can be useful, we find that valence-arousal is more suitable as it is more closely associated with discrete emotions. Valence and arousal are continuous values, with valence describing how positive or negative an expression is and arousal indicating its intensity. They have been widely used to represent the human emotional spectrum in many circumstances. Therefore, we choose the valence-arousal or VA space as the auxiliary space to help us construct the label distribution for the main instance.

First, we use the K-Nearest Neighbors algorithm to identify neighbors for each main instance. We also utilize the valence-arousal values to calculate a local similarity score between each neighbor and the main instance, where a higher similarity score indicates that their emotions are more similar. However, since valence-arousal annotations are often hard to obtain in practice, we use an existing method to generate pseudo valence-arousal values. Consequently, these values can be inaccurate and may lead to noisy similarity calculations. To handle this, we leverage the features extracted by the CNN backbone and calculate a calibration score for each pair of a neighbor and the main instance. The calibration score is computed by a multilayer perceptron with a sigmoid activation at the last layer. We call this adaptive similarity: the local similarity is multiplied by the calibration score to adaptively correct its errors, yielding the contribution degree (or weight) of each neighbor.

To predict the emotion distribution of each image, we use a classifier with a softmax activation stacked on top of the CNN features. We aggregate the neighbors as a linear combination of their predicted distributions weighted by their corresponding contribution degrees. We finally generate the target label distribution for the main instance from the provided one-hot label vector and the aggregated neighbor distribution. We also associate with each instance a learnable uncertainty factor lambda to balance between the provided label and the neighbor information. The uncertainty value indicates whether the model is more confident in the provided label or in more agreement with the neighbor information.
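A minimal PyTorch sketch of this construction is shown below. The names (calibration_mlp, classifier, lambda_u), the Gaussian kernel for the local similarity, and the weight normalization are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def build_target_distribution(features, va, onehot, lambda_u,
                              calibration_mlp, classifier, k=8, sigma=0.5):
    """Sketch of target label-distribution construction (names illustrative).

    features : (N, D) CNN backbone features of the batch
    va       : (N, 2) pseudo valence-arousal values
    onehot   : (N, C) provided one-hot labels
    lambda_u : (N,)   learnable per-instance uncertainty, assumed in [0, 1]
    """
    # 1. K nearest neighbors of each instance in the (pseudo) VA space.
    dist = torch.cdist(va, va)                          # (N, N) pairwise distances
    dist.fill_diagonal_(float('inf'))                   # exclude self
    nn_dist, nn_idx = dist.topk(k, largest=False)       # (N, k)

    # 2. Local similarity from VA distance (Gaussian kernel is an assumption).
    local_sim = torch.exp(-nn_dist ** 2 / (2 * sigma ** 2))        # (N, k)

    # 3. Calibration score for each (instance, neighbor) pair from backbone
    #    features; calibration_mlp ends with a sigmoid, so outputs lie in (0, 1).
    pair_feat = torch.cat([features.unsqueeze(1).expand(-1, k, -1),
                           features[nn_idx]], dim=-1)              # (N, k, 2D)
    calib = calibration_mlp(pair_feat).squeeze(-1)                 # (N, k)

    # 4. Adaptive similarity -> contribution degrees; normalized to sum to one
    #    (the normalization is an assumption).
    w = local_sim * calib
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)             # (N, k)

    # 5. Aggregate the neighbors' predicted emotion distributions.
    probs = F.softmax(classifier(features), dim=1)                 # (N, C)
    neighbor_dist = (w.unsqueeze(-1) * probs[nn_idx]).sum(dim=1)   # (N, C)

    # 6. Blend the one-hot label with the neighbor distribution using the
    #    per-instance uncertainty lambda; here a larger lambda means less
    #    trust in the given label (the paper's parameterization may differ).
    lam = lambda_u.unsqueeze(1)                                    # (N, 1)
    return (1 - lam) * onehot + lam * neighbor_dist
```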

Page 4 (4m 4s)

[Audio] To train our model, we use the standard cross-entropy loss between the constructed target label distribution and the predicted distribution of the network. To further enhance the discriminative power of the model, and thus mitigate the ambiguity, we also incorporate an extra discriminative loss inspired by the popular center loss. Intuitively, the discriminative loss encourages the feature vectors of each class to be close to their corresponding class center while improving inter-class discrimination by pushing the cluster centers far away from each other.
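For illustration, here is a hedged sketch of such a center-loss-style term; the margin-based inter-class penalty is an assumption, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def discriminative_loss(features, labels, centers, margin=1.0):
    """Center-loss-style sketch (illustrative): pull features toward their
    class center, push distinct class centers apart.

    features : (N, D) feature vectors
    labels   : (N,)   integer class labels
    centers  : (C, D) learnable class centers
    """
    # Intra-class term: squared distance of each feature to its class center.
    intra = ((features - centers[labels]) ** 2).sum(dim=1).mean()

    # Inter-class term: hinge penalty when two centers are closer than `margin`.
    c_dist = torch.cdist(centers, centers)                         # (C, C)
    num_classes = centers.size(0)
    off_diag = ~torch.eye(num_classes, dtype=torch.bool, device=centers.device)
    inter = F.relu(margin - c_dist[off_diag]).mean()

    return intra + inter
```

The total objective would then combine the two terms, e.g. L = L_CE + beta * L_D for some weighting factor beta (the symbol beta is an assumption here).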

Page 5 (4m 36s)

[Audio] We evaluate our method on three popular datasets for in-the-wild facial expression recognition. These datasets contain images collected from the internet and real-world scenarios, manually annotated with one of seven discrete emotions. Following previous works, we use accuracy as the evaluation metric. First, we validate the robustness of models against noise by injecting label noise into the original datasets and comparing with recent noise-tolerant FER methods. Next, we test our method under a cross-dataset evaluation protocol to verify its effectiveness against label inconsistency between datasets. We also perform experiments on the original datasets to evaluate the robustness of our method to the ambiguity that unavoidably exists in real-world FER datasets.
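As a rough illustration of the noise-injection setup, the snippet below flips a fraction of labels uniformly at random to a different class (symmetric noise); the exact protocol used in the experiments may differ:

```python
import numpy as np

def inject_label_noise(labels, noise_ratio, num_classes=7, seed=0):
    """Sketch of symmetric label-noise injection (illustrative): flip a
    `noise_ratio` fraction of labels uniformly to a *different* class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_noisy = int(noise_ratio * len(labels))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    for i in idx:
        # Draw uniformly from the other num_classes - 1 classes.
        wrong = rng.integers(num_classes - 1)
        labels[i] = wrong if wrong < labels[i] else wrong + 1
    return labels
```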

Page 6 (5m 29s)

[Audio] In the noisy-label setting, as we can see, our method improves the performance of the baseline significantly and also outperforms recent works by a large margin.

Page 7 (6m 30s)

[Audio] Under cross-dataset evaluation, our model achieves the best performance on all three datasets and surpasses the current state-of-the-art methods.

Page 8 (6m 50s)

[Audio] On the original "clean" datasets, our model achieves state-of-the-art performance as well.

Page 9 (7m 10s)

[Audio] At the top left, we show some emotion distributions recovered by our method on mislabeled images. Despite the incorrect annotations, our approach is able to construct plausible distributions and discover the correct labels. At the top right, we visualize the estimated uncertainty values of some training images. Highly uncertain labels can be caused by low-quality inputs or ambiguous facial expressions, as shown in the first row. In contrast, when the emotions can be easily recognized, as in the last row, the uncertainty factors are assigned low values. At the bottom, we present a user study in which we asked participants to choose the most clearly expressed emotion on random test images and compared the predictions of our model with the survey results. We can see that our method gives consistent results that agree with human perception.

Page 10 (8m 5s)

[Audio] To summarize, we proposed a new method to handle the ambiguity problem in FER by constructing target emotion distributions for training images instead of relying only on the provided single labels. The experimental results also verify the effectiveness of our method over previous approaches under label inconsistency and ambiguity in facial expression recognition.