Last week I attended ICASSP virtually. It’s a pity we couldn’t welcome so many of you here in Barcelona. What follows is a personal selection of papers related to environmental sound recognition/processing that caught my attention.

Jansen et al., Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision.
From childhood, human learning relies on unsupervised learning and active learning. Why not our deep nets? This paper presents a framework for sound representation and recognition that combines (i) a self-supervised objective based on a notion of coincidence (temporal proximity), (ii) a clustering objective to impose categorical structure, and (iii) an active learning method that queries weak supervision to consolidate categories into semantic classes. By training a combined sound embedding/clustering/classification net according to these criteria, they obtain impressive classification results with minimal supervision. Another amazing work by Aren & folks from Sound Understanding.
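For intuition, here is a toy PyTorch sketch of how such a three-part objective might be wired together; the loss forms and weights are my own guesses, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

# Toy sketch of a combined objective in the spirit of Jansen et al.
# All loss forms and weights below are my own assumptions.

def coincidence_loss(emb_a, emb_b, coincident):
    # Self-supervised term: predict whether two segment embeddings
    # come from temporally close audio.
    logits = (emb_a * emb_b).sum(dim=1)          # similarity score
    return F.binary_cross_entropy_with_logits(logits, coincident.float())

def clustering_loss(emb, centroids):
    # Categorical structure: pull each embedding toward its nearest centroid.
    d = torch.cdist(emb, centroids)              # (batch, n_clusters)
    return d.min(dim=1).values.mean()

def classification_loss(logits, labels, labeled_mask):
    # Supervised term, only on the few actively queried examples.
    if labeled_mask.any():
        return F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    return logits.new_zeros(())

def total_loss(emb_a, emb_b, coincident, centroids, logits, labels,
               labeled_mask, w_coin=1.0, w_clus=0.1, w_cls=1.0):
    return (w_coin * coincidence_loss(emb_a, emb_b, coincident)
            + w_clus * clustering_loss(emb_a, centroids)
            + w_cls * classification_loss(logits, labels, labeled_mask))
```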

A few papers incorporate taxonomic information into networks, pretty interesting IMO:

Cramer et al., Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers.
The authors propose TaxoNet, a deep net that incorporates taxonomic information for structured classification of bioacoustic signals. TaxoNet is trained as a multitask, multilabel model following the principle of hierarchical composition: shallow layers extract a shared representation to predict higher nodes, while deeper layers specialize to lower-rank nodes. They provide an interesting discussion and results. Plus, they introduce two new datasets for bird audio recognition. Code available.
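To make the hierarchical-composition idea concrete, here is a minimal PyTorch sketch of a shared trunk with per-level heads (layer sizes and class counts are made up, not TaxoNet's actual architecture):

```python
import torch
import torch.nn as nn

# Minimal sketch of hierarchical composition: shallow layers feed a coarse
# (higher taxonomic rank) head, deeper layers feed a fine (lower rank) head.

class HierarchicalNet(nn.Module):
    def __init__(self, n_feats=128, n_orders=10, n_species=100):
        super().__init__()
        self.shallow = nn.Sequential(nn.Linear(n_feats, 256), nn.ReLU())
        self.deep = nn.Sequential(nn.Linear(256, 256), nn.ReLU())
        self.order_head = nn.Linear(256, n_orders)    # higher node, shared rep
        self.species_head = nn.Linear(256, n_species) # lower node, specialized rep

    def forward(self, x):
        h_shallow = self.shallow(x)
        h_deep = self.deep(h_shallow)
        return self.order_head(h_shallow), self.species_head(h_deep)

# Multitask loss: sum of cross-entropies, one per taxonomic level.
model = HierarchicalNet()
x = torch.randn(4, 128)
order_logits, species_logits = model(x)
loss = (nn.functional.cross_entropy(order_logits, torch.randint(10, (4,)))
        + nn.functional.cross_entropy(species_logits, torch.randint(100, (4,))))
```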

Shrivastava et al., MT-GCN for Multi-Label Audio Tagging with Noisy Labels.
This work deals with sound event tagging under noisy labels, incorporating label relationships from the AudioSet ontology into the learning pipeline. The pipeline is based on two modules: (i) a multitask learning module to learn from the clean and noisy data of the FSDKaggle2019 dataset, and (ii) a Graph Convolution module that exploits the graphical structure among labels. They show how using label co-occurrence relationships to regularize the network adds value. This work is based on this recent paper for image recognition. If you are interested in Graph Convolutional Nets for audio, this other paper is also interesting!
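Roughly, the label-graph idea works like this (a loose sketch of the GCN-over-labels recipe from the image recognition work they build on; dimensions and normalization details are my assumptions):

```python
import torch
import torch.nn as nn

# Rough sketch: a GCN over the label co-occurrence graph maps label
# embeddings to classifier weights, which are then applied to the
# audio embedding. All shapes and the normalization are my assumptions.

n_labels, emb_dim, audio_dim = 80, 300, 512
A = torch.rand(n_labels, n_labels)               # label co-occurrence adjacency
A = A / A.sum(dim=1, keepdim=True)               # simple row normalization
label_emb = torch.randn(n_labels, emb_dim)       # e.g., word embeddings per label

W1 = nn.Linear(emb_dim, 256)
W2 = nn.Linear(256, audio_dim)

h = torch.relu(W1(A @ label_emb))                # GCN layer 1: propagate + transform
classifier_weights = W2(A @ h)                   # GCN layer 2 -> (n_labels, audio_dim)

audio_emb = torch.randn(16, audio_dim)           # from the audio branch
logits = audio_emb @ classifier_weights.t()      # (batch, n_labels) tag scores
```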

Some works also related to label noise:

Iqbal et al., Learning with Out-of-Distribution Data for Audio Classification.
Another work focused on learning from noisy labels, in this case approached from the perspective of out-of-distribution (or out-of-vocabulary) data. The approach consists of first detecting these instances using an auxiliary classifier trained on clean data, and then relabelling some of them (rather than discarding them), as they can still help learning. This is possible because some OOD samples are actually quite close to the training distribution in the FSDnoisy18k dataset. Code available.
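A toy sketch of the detect-then-relabel idea (the auxiliary-classifier scores are simulated here, and the confidence threshold is mine, not the paper's):

```python
import numpy as np

# Toy sketch: an auxiliary classifier trained on clean data scores the noisy
# set; samples it confidently disagrees on get relabelled with its prediction
# instead of being discarded. Scores and threshold are illustrative.

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)    # simulated aux-classifier posteriors
noisy_labels = rng.integers(0, 10, size=1000)    # simulated noisy label set

confidence = probs.max(axis=1)
predicted = probs.argmax(axis=1)

# Candidate OOD samples close enough to the training distribution to reuse.
relabel = (predicted != noisy_labels) & (confidence > 0.8)
new_labels = np.where(relabel, predicted, noisy_labels)
print(f"relabelled {relabel.sum()} of {len(noisy_labels)} samples")
```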

Kumar and Ithapu, SeCoST: Sequential Co-Supervision for Weakly Labeled Audio Event Detection.
This is one of the few works dealing with AudioSet classification from the perspective of label noise. It brings together ideas from sequential learning and knowledge distillation. The system is based on a cascade of learners where the training label set at every stage is a combination of the original label set and the combined supervision from the previous classifiers. Plus, the title is pronounced “Sequest” - awesome :)
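If I understood the cascade correctly, the per-stage targets could look something like this (the blending scheme and alpha are my assumptions, not the paper's exact recipe):

```python
import numpy as np

# Sketch of the cascade's label construction: each stage trains on a blend of
# the original weak labels and the soft predictions of the previous stages.
# The averaging scheme and alpha are my assumptions.

def stage_targets(original_labels, prev_predictions, alpha=0.5):
    """original_labels: (n, classes) binary; prev_predictions: list of (n, classes) soft."""
    if not prev_predictions:
        return original_labels.astype(float)     # first stage: original labels only
    soft = np.mean(prev_predictions, axis=0)     # combined supervision so far
    return alpha * original_labels + (1.0 - alpha) * soft

y = (np.random.rand(100, 527) > 0.99).astype(float)   # sparse AudioSet-like labels
p1 = np.random.rand(100, 527)                         # stage-1 soft predictions
targets_stage2 = stage_targets(y, [p1])
```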

And a variety of works:

Ebrahimpour et al., End-to-end Auditory Object Recognition Via Inception Nucleus.
This work proposes a fully convolutional net for sound event classification that combines 1D conv modules, inspired by those in the Inception architecture, with subsequent 2D convs. A comprehensive discussion of design choices and analysis of results is provided. The system attains SoTA on UrbanSound8K with faaaar fewer weights than other systems.
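Here is a minimal sketch of what an Inception-style 1D module on raw waveform could look like (kernel sizes and channel counts are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Sketch of an Inception-style 1D module: parallel convs with different
# kernel sizes over the raw waveform, concatenated along channels.

class InceptionNucleus1D(nn.Module):
    def __init__(self, in_ch=1, out_ch=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (9, 19, 39)                 # multiple receptive fields
        ])

    def forward(self, x):                        # x: (batch, 1, samples)
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

waveform = torch.randn(2, 1, 16000)              # 1 s of audio at 16 kHz
features = InceptionNucleus1D()(waveform)        # (2, 96, 16000), then 2D convs
```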

Kong et al., Source separation with weakly labelled data: An approach to computational auditory scene analysis.
This work deals with source separation of environmental sounds, a.k.a. universal sound separation, a topic that is becoming rather popular. The authors propose a source separation framework trained only with weak labels (AudioSet), as opposed to using mixture-clean paired data. The framework is based on regressing from a mixture of two random segments to a target segment, suitably conditioned. For those interested, there is a DCASE Task covering the applicability of separation to improve sound event recognition systems!
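A toy sketch of the setup, as I understand it: mix two segments, condition on a class vector, and regress the target segment (shapes and the conditioning mechanism are my assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Toy sketch: regress the target spectrogram from a two-segment mixture,
# conditioned on a class vector. The gating-style conditioning is my guess.

class ConditionedSeparator(nn.Module):
    def __init__(self, n_bins=64, n_classes=527):
        super().__init__()
        self.film = nn.Linear(n_classes, n_bins)      # condition -> per-bin gate
        self.net = nn.Sequential(nn.Linear(n_bins, 256), nn.ReLU(),
                                 nn.Linear(256, n_bins))

    def forward(self, mix_spec, condition):           # (b, t, bins), (b, classes)
        gate = torch.sigmoid(self.film(condition)).unsqueeze(1)
        return self.net(mix_spec * gate)              # estimated target spectrogram

seg_a, seg_b = torch.rand(4, 100, 64), torch.rand(4, 100, 64)
cond = torch.zeros(4, 527); cond[:, 0] = 1.0          # "separate class 0"
est = ConditionedSeparator()(seg_a + seg_b, cond)
loss = nn.functional.mse_loss(est, seg_a)             # regress to the target segment
```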

Turpault et al., Limitations of weak labels for embedding and tagging.
The authors study the impact of label weakness on performance for embedding learning and tagging tasks. They create synthetic datasets to conduct the study in a controlled scenario, isolated from other factors, and provide insights into which applications are most sensitive to weakly labeled data. Insight into the weak labels that many of us use is always appreciated. Dataset and code available.

Bilen et al., A Framework for the Robust Evaluation of Sound Event Detection.
They introduce a new evaluation metric for polyphonic SED. Based on more robust ways of defining TPs and FPs, they leverage the concepts of ROC curves and then reduce them to a single polyphonic sound detection score (PSDS). This framework has some nice features, like decoupling the evaluation of a model’s goodness from the tuning of the detection operating point. I do mostly classification rather than detection, but I think that coming up with new metrics to improve systems’ benchmarking is fundamental, in all research fields.
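To give a flavor of the idea, here is a toy computation in the PSDS spirit: summarize a TPR-vs-FPR curve by its normalized area up to a maximum FPR, rather than scoring a single operating point (the real metric is more involved, handling e.g. cross-triggers and per-class variance):

```python
import numpy as np

# Toy computation in the PSDS spirit: area under a (FPR, TPR) curve up to a
# maximum FPR, normalized to [0, 1]. The actual metric adds robust TP/FP
# definitions, cross-trigger costs and across-class penalties.

def psds_like_score(fpr, tpr, max_fpr=100.0):
    """fpr in events/hour, tpr in [0, 1]; both sorted by ascending fpr."""
    fpr = np.clip(fpr, 0, max_fpr)
    return np.trapz(tpr, fpr) / max_fpr          # normalized area

fpr = np.array([0.0, 10.0, 30.0, 60.0, 100.0])   # swept detection thresholds
tpr = np.array([0.0, 0.45, 0.62, 0.71, 0.78])
print(psds_like_score(fpr, tpr))
```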


And finally, this year I was not presenting anything at ICASSP because I spent last fall interning at Google Research. Have a look at the preprint covering part of our work, or check this blog post. The work is titled “Addressing Missing Labels in Large-scale Sound Event Recognition using a Teacher-student Framework with Loss Masking”. We identify missing labels as one of the most frequent labeling errors in AudioSet, and we propose a skeptical teacher-student framework with loss masking to identify the most critical missing labels and ignore them during learning. Spoiler: missing labels do matter.
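For the curious, the masking boils down to something like this sketch (the threshold is illustrative):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of loss masking: labels annotated negative that a teacher
# model scores highly are likely *missing* labels, so their contribution to
# the student's loss is ignored. The threshold is illustrative.

def masked_bce(student_logits, labels, teacher_scores, thresh=0.7):
    # Candidate missing labels: annotated negative, but the teacher disagrees.
    suspect = (labels == 0) & (teacher_scores > thresh)
    loss = F.binary_cross_entropy_with_logits(student_logits, labels.float(),
                                              reduction="none")
    loss = loss * (~suspect).float()             # mask out suspect negatives
    return loss.sum() / (~suspect).float().sum().clamp(min=1.0)

logits = torch.randn(8, 527)
labels = (torch.rand(8, 527) > 0.99).long()      # sparse AudioSet-like labels
teacher = torch.rand(8, 527)                     # teacher posteriors
print(masked_bce(logits, labels, teacher))
```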

Thanks for reading!!