This post gives an overview of the paper Addressing Missing Labels in Large-scale Sound Event Recognition using a Teacher-student Framework with Loss Masking, which is part of the work I conducted during my fall 2019 internship at Google Research, NYC.

---
[EDIT August 4th, 2020]: Our paper has been accepted to IEEE Signal Processing Letters! It is now also available open access from IEEE Xplore: https://ieeexplore.ieee.org/document/9130823
Note that the supplemental material is a bit hidden at the bottom of the page :)

---

Do you use large audio datasets? Do they suffer from missing labels? We all like large datasets to train our models, but they inevitably bring label noise issues, since exhaustively annotating massive amounts of audio is intractable. This is especially the case for sound event datasets, where several events often co-occur in the same clip but only some of them get annotated. But does it really matter? And if so, what can we do about it? We wanted to answer those questions using AudioSet.

Why the missing labels in AudioSet?

AudioSet is a great resource for audio recognition, yet it suffers from a number of label noise issues. The existence of missing labels in AudioSet is due to the dataset curation process, which is based on the human validation of previously nominated candidate labels. The nomination system can be sub-optimal at times, failing to nominate some existing sound events, which leads to missing labels.

We call the nominated labels that have received human validation explicit labels. The remaining labels, which are not proposed by the nomination system and have received no human validation (the vast majority), are referred to as implicit negative labels. Hence, it is likely that some of the implicit negative labels are indeed missing labels. We focus on these.

Proposed method

To address this issue, we propose a simple method based on a teacher-student framework with loss masking: first identify the most critical missing label candidates, then ignore their contribution during learning. The process is illustrated in the figure below.


First, a teacher model is trained using the original AudioSet labels, y, and we compute the teacher's predictions on the train set, which we use to decide on the veracity of the labels. Our hypothesis is that the top-scored implicit negative labels (a.k.a. top-scored negatives) are likely to correspond mostly to missing “Present” labels. So we rank the implicit negative labels based on the teacher's predictions and create a new, enhanced label set, ŷ, by flagging a given percentage of the top-scored negatives per class. Ours is therefore a particular kind of teacher, which we call a skeptical teacher, since it prefers to highlight flaws in the current ground truth rather than directly guide the student's learning.
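To make the flagging step concrete, here is a minimal numpy sketch of how it could look. It assumes the labels form a binary clip-by-class matrix (1 = explicit “Present”, 0 = implicit negative) and that teacher scores of the same shape are available; the -1 sentinel for “ignore this label” and the discard_pct parameter are my own conventions for illustration, not necessarily the paper's encoding.

```python
import numpy as np

def flag_top_scored_negatives(labels, teacher_scores, discard_pct=1.0):
    """Flag the top-scored implicit negatives per class as missing-label candidates.

    labels:         (num_clips, num_classes) binary matrix of original labels
                    (1 = explicit "Present", 0 = implicit negative).
    teacher_scores: (num_clips, num_classes) teacher predictions on the train set.
    discard_pct:    percentage of implicit negatives to flag per class.

    Returns an enhanced label matrix where flagged entries are set to -1,
    so the student's loss can mask them out later.
    """
    enhanced = labels.astype(np.int64)
    num_clips, num_classes = labels.shape
    for c in range(num_classes):
        neg_idx = np.where(labels[:, c] == 0)[0]          # implicit negatives for class c
        n_flag = int(len(neg_idx) * discard_pct / 100.0)  # how many of them to flag
        if n_flag == 0:
            continue
        # Rank implicit negatives by teacher score, highest first, and flag the top ones.
        top_neg = neg_idx[np.argsort(-teacher_scores[neg_idx, c])[:n_flag]]
        enhanced[top_neg, c] = -1                         # mark as "ignore in the loss"
    return enhanced
```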

The second step consists of using ŷ to train a student model, where the previously flagged labels are ignored via a mask applied to the binary cross-entropy loss function, using the information encoded in ŷ.
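Below is a sketch of what the loss masking could look like, again in plain numpy for readability; in practice this would live inside the training framework as a per-batch loss. It assumes the -1 sentinel from the snippet above, and the normalization by the number of unmasked entries is an assumption on my part.

```python
import numpy as np

def masked_binary_cross_entropy(y_enhanced, predictions, eps=1e-7):
    """Binary cross-entropy that ignores entries flagged as missing-label candidates.

    y_enhanced:  enhanced label matrix (1 = Present, 0 = negative, -1 = ignore).
    predictions: student model outputs after the sigmoid, same shape.
    """
    mask = (y_enhanced != -1).astype(np.float32)        # 0 where the label is flagged
    y = np.clip(y_enhanced, 0, 1).astype(np.float32)    # flagged entries become dummies
    p = np.clip(predictions, eps, 1.0 - eps)            # numerical stability
    bce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    # Flagged entries contribute nothing to the loss.
    return (bce * mask).sum() / np.maximum(mask.sum(), 1.0)
```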

Experiments

We analyse the effect of the method using two models of different capacity (ResNet-50 and MobileNetV1) and two AudioSet train sets of different size (0.5M and 2.5M clips). We use d’ and lwlrap as evaluation metrics: d’ tells you whether a given classifier ranks things correctly, whereas lwlrap tells you whether the right classifier gives the highest score. They complement each other!
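For intuition on d’, here is a small sketch of how it is commonly derived from per-class ROC AUC under the equal-variance Gaussian assumption, then averaged across classes. This is not the paper's evaluation code, just an illustration; it assumes every class has at least one positive and one negative in the eval set.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def d_prime_from_auc(auc):
    """Map a per-class ROC AUC to d' (equal-variance Gaussian assumption)."""
    return np.sqrt(2.0) * norm.ppf(auc)

def mean_d_prime(y_true, y_score):
    """Average d' over classes. y_true: (clips, classes) binary; y_score: predictions."""
    aucs = [roc_auc_score(y_true[:, c], y_score[:, c]) for c in range(y_true.shape[1])]
    return float(np.mean([d_prime_from_auc(a) for a in aucs]))
```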

The next figure illustrates the impact of missing labels by plotting performance as a function of the amount of labels discarded. We experiment with progressively discarding 0, 0.1, 0.2, …, 1, 2, 3, … up to 20% of the labels (each percentage is a point on each curve). The leftmost point, marked with a square, corresponds to using all the labels (i.e., our baseline).

In all curves we observe three regions from left to right: a steep increase at the beginning of the curve, followed by a sweet spot, and a final decay. Why does this happen? Our hypothesis:

  • The top-scored negatives correspond either to missing “Present” labels (i.e., false negatives (FNs)) or to difficult true negatives (TNs) that are very informative and useful for learning.
  • First, we remove some critical FNs that damage the learning process, hence the sudden performance increase.
  • As we continue discarding more top-scored negatives, we keep removing FNs, but we also start to remove some TNs. So performance increases more slowly, until it reaches a sweet spot.
  • Finally, if we keep ignoring more top-scored negatives, performance degrades.

Insights

  • We see improvements in all cases, which is in itself relevant, as AudioSet training examples are often treated as if their labels were complete.
  • Most of the improvement comes from filtering out around 1% of the estimated missing labels; in most cases, removing just a critical tiny percentage (<=0.2%) already yields approximately half of the total boost!
  • The damage done by missing labels (and the performance boost obtained by discarding them) grows as the training set gets smaller, yet it can still be observed even when training with massive amounts of audio (almost 7000 h).

We carried out a small listening test in which we inspected some of the clips associated with the discarded labels for a few classes. As expected, most clips were missing “Present” labels!

Final thoughts

The proposed method, while simple, is effective at identifying missing labels in a human-annotated dataset like AudioSet, and it improves training that would otherwise be hurt by unnoticed missing labels, without additional compute. It can also complement other approaches focused on improving network architectures, and it can be useful for dataset cleaning, labeling refinement, or active learning. We believe these insights generalize to other large-scale datasets, since the problem of missing labels is, unfortunately, endemic.


Wanna learn more? Please check our paper for more details!
E. Fonseca, S. Hershey, M. Plakal, D. P. W. Ellis, A. Jansen, R. C. Moore, and X. Serra.
Addressing Missing Labels in Large-scale Sound Event Recognition using a Teacher-student Framework with Loss Masking.
arXiv preprint, 2020

Thanks!

Thanks to all the folks on the Sound Understanding team at Google Research who contributed to this work, especially my supervisors, DAn Ellis and Manoj Plakal, for their great supervision, and Shawn Hershey, who initiated this line of work.