Last week I attended ICASSP 2019 in a rather sunny Brighton, UK. I think there were a bunch of great contributions. Below, I briefly summarize some that caught my attention, (mostly) related to environmental sound recognition/processing. Many of them also release source code; see the note at the end.

Works related to architectures and loss functions:

Phaye et al. SUBSPECTRALNET – USING SUB-SPECTROGRAM BASED CONVOLUTIONAL NEURAL NETWORKS FOR ACOUSTIC SCENE CLASSIFICATION. First they show that some bands of mel-spectrograms carry more discriminative information than others for certain scenes. Based on that observation, SubSpectralNet splits the input spectrograms into subbands, performs classification band-wise, and merges the band-level predictions for the global classification. The whole model is optimized jointly and it is lightweight. The interesting takeaway is that fitting separate subnetworks to separate bands learns more salient features than training a single CNN directly on the entire spectrogram. Code available.
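
To make the idea concrete, here is a minimal PyTorch sketch of the subband scheme (my own illustration, not the authors' exact SubSpectralNet): split the mel axis into bands, classify each band with a small CNN, and merge the band-level logits with a global classifier.

```python
# Minimal sketch of the sub-spectrogram idea (illustrative; not the authors' exact model).
# Assumes mel-spectrogram inputs of shape (batch, 1, n_mels, time).
import torch
import torch.nn as nn

class SubBandCNN(nn.Module):
    """Small CNN that classifies a single mel subband."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))   # band-level logits

class SubSpectralSketch(nn.Module):
    """Splits the mel axis into equal subbands, classifies each band,
    then merges the band-level logits with a global linear layer."""
    def __init__(self, n_mels=40, n_bands=4, n_classes=10):
        super().__init__()
        assert n_mels % n_bands == 0
        self.band_size = n_mels // n_bands
        self.bands = nn.ModuleList([SubBandCNN(n_classes) for _ in range(n_bands)])
        self.merge = nn.Linear(n_bands * n_classes, n_classes)

    def forward(self, x):                                      # x: (batch, 1, n_mels, time)
        chunks = torch.split(x, self.band_size, dim=2)
        band_logits = [band(c) for band, c in zip(self.bands, chunks)]
        return self.merge(torch.cat(band_logits, dim=1))       # global logits

logits = SubSpectralSketch()(torch.randn(8, 1, 40, 128))       # -> (8, 10)
```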


Jati et al. HIERARCHY-AWARE LOSS FUNCTION ON A TREE STRUCTURED LABEL SPACE FOR AUDIO EVENT DETECTION. They propose two hierarchy-aware loss functions for training deep networks in a setting where labels are organized in a (two-level) hierarchy and the data is singly-labeled. Training is performed as multi-task learning, jointly optimizing a cross-entropy loss and one of the hierarchy-aware loss functions. The proposed method outperforms training with standard cross entropy at both levels of the hierarchy. Extending this approach to the multilabel setting and to an arbitrary number of hierarchical levels could be promising.
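
A hedged sketch of the general multi-task recipe (the paper's hierarchy-aware losses differ in detail): combine a fine-level cross entropy with a coarse-level term obtained by aggregating fine-class probabilities under each parent node. The hierarchy below is a made-up toy example.

```python
# Sketch of joint fine/coarse training on a two-level label tree (illustrative only).
import torch
import torch.nn.functional as F

# Toy 2-level hierarchy: 6 fine classes grouped into 2 parent classes.
parent_of = torch.tensor([0, 0, 0, 1, 1, 1])            # fine index -> parent index
n_parents = 2

def hierarchical_loss(fine_logits, fine_targets, alpha=0.5):
    """Fine-level cross entropy plus a coarse-level term obtained by summing
    fine-class probabilities under each parent."""
    fine_ce = F.cross_entropy(fine_logits, fine_targets)
    fine_probs = F.softmax(fine_logits, dim=1)
    child_to_parent = F.one_hot(parent_of, n_parents).float()   # (n_fine, n_parents)
    coarse_probs = fine_probs @ child_to_parent                 # (batch, n_parents)
    coarse_ce = F.nll_loss(torch.log(coarse_probs + 1e-8), parent_of[fine_targets])
    return fine_ce + alpha * coarse_ce

loss = hierarchical_loss(torch.randn(4, 6), torch.tensor([0, 2, 3, 5]))
```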


Wang et al., A COMPARISON OF FIVE MULTIPLE INSTANCE LEARNING POOLING FUNCTIONS FOR SOUND EVENT DETECTION WITH WEAK LABELING. They formulate the task of sound event tagging and detection as a multiple instance learning (MIL) problem. After a thorough theoretical and empirical comparison of several pooling functions, the linear softmax pooling function is found to perform best, and they propose a network called TALNet. They claim it is the first system to reach state-of-the-art audio tagging performance on AudioSet while exhibiting strong detection performance on the DCASE2017 challenge at the same time. Code available.
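
For reference, the linear softmax pooling they end up recommending weights each frame-level probability by itself when aggregating to clip level; a minimal sketch (my own, simplified):

```python
# Linear softmax pooling for MIL-style sound event detection:
# clip_prob = sum_t p_t^2 / sum_t p_t, so confident frames dominate the aggregation.
import torch

def linear_softmax_pool(frame_probs, eps=1e-8):
    """frame_probs: (batch, time, n_classes) frame-level probabilities in [0, 1].
    Returns clip-level probabilities of shape (batch, n_classes)."""
    return (frame_probs ** 2).sum(dim=1) / (frame_probs.sum(dim=1) + eps)

clip_probs = linear_softmax_pool(torch.rand(4, 100, 17))    # -> (4, 17)
```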


Imoto and Kyochi, SOUND EVENT DETECTION USING GRAPH LAPLACIAN REGULARIZATION BASED ON EVENT CO-OCCURRENCE. This work puts emphasis on detecting overlapping sound events, for which sound event co-occurrence is modeled with graph Laplacian regularization. Nodes in the graph indicate the frequency of event occurrence, while edges represent the co-occurrence of sound events. The graph structure is translated into an additional regularization term added to the binary cross-entropy loss to be minimized.
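
The following is a rough sketch of the general idea (not the paper's exact formulation): build a Laplacian from a class co-occurrence matrix and add a quadratic penalty to the binary cross-entropy loss so that frequently co-occurring classes get similar activations.

```python
# Sketch of adding a graph Laplacian penalty to binary cross entropy (illustrative).
import torch
import torch.nn.functional as F

def laplacian_from_cooccurrence(A):
    """Unnormalized graph Laplacian L = D - A from a class co-occurrence matrix A."""
    return torch.diag(A.sum(dim=1)) - A

def regularized_bce(pred, target, L, lam=0.1):
    """BCE plus lam * sum_b p_b^T L p_b, which penalizes dissimilar activations
    for classes that frequently co-occur. pred/target: (batch, n_classes)."""
    bce = F.binary_cross_entropy(pred, target)
    reg = torch.einsum('bi,ij,bj->', pred, L, pred) / pred.size(0)
    return bce + lam * reg

A = torch.tensor([[0., 3., 1.], [3., 0., 0.], [1., 0., 0.]])   # toy co-occurrence counts
loss = regularized_bce(torch.rand(4, 3), torch.randint(0, 2, (4, 3)).float(),
                       laplacian_from_cooccurrence(A))
```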


I was presenting LEARNING SOUND EVENT CLASSIFIERS FROM WEB AUDIO WITH NOISY LABELS, where we introduce FSDnoisy18k, a dataset containing 42.5 hours of audio across 20 sound classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. We characterize the label noise and provide a CNN baseline. We show that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data. We also conduct experiments with noise-robust loss functions and show that they can be effective in mitigating the effect of label noise. Code and the FSDnoisy18k dataset available.
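
As a pointer to what a noise-robust loss function can look like, here is a minimal sketch of the Lq (generalized cross entropy) loss, which interpolates between standard cross entropy (as q → 0) and MAE (at q = 1); see the paper for the exact losses and settings we used.

```python
# Minimal sketch of the Lq (generalized cross entropy) loss:
# L_q = (1 - p_y^q) / q, with q in (0, 1]. Compared to cross entropy, it
# down-weights the contribution of low-probability (possibly mislabeled) examples.
import torch
import torch.nn.functional as F

def lq_loss(logits, targets, q=0.7):
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # prob of labeled class
    return ((1.0 - p_true.clamp_min(1e-8) ** q) / q).mean()

loss = lq_loss(torch.randn(8, 20), torch.randint(0, 20, (8,)))
```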


Works related to semi-supervised learning, leveraging unlabeled data or few data:

Kothinti et al., JOINT ACOUSTIC AND CLASS INFERENCE FOR WEAKLY SUPERVISED SOUND EVENT DETECTION. This work describes a segmentation and recognition approach for sound event detection that combines unsupervised and semi-supervised methods. For event boundary detection they use a generative framework to track changes over time in the audio embedding space, flagging deviations that correspond to new events. Recognition is done in a supervised fashion with a DNN. The former system provides guidance on the windows of interest, while the latter predicts the labels.
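
As a very loose illustration of the "flag deviations in embedding space" idea (the paper itself uses a generative model, not this heuristic), one could threshold frame-to-frame embedding change:

```python
# Illustrative boundary flagging by thresholding frame-to-frame embedding change
# (a simple stand-in; the paper uses a generative framework instead).
import numpy as np

def flag_boundaries(embeddings, k=2.0):
    """embeddings: (time, dim) array of frame embeddings.
    Flags frames whose change w.r.t. the previous frame exceeds mean + k*std."""
    delta = np.linalg.norm(np.diff(embeddings, axis=0), axis=1)
    return np.where(delta > delta.mean() + k * delta.std())[0] + 1   # candidate boundaries

boundaries = flag_boundaries(np.random.randn(500, 128))
```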


Wang et al., ACTIVE LEARNING FOR EFFICIENT AUDIO ANNOTATION AND CLASSIFICATION WITH A LARGE AMOUNT OF UNLABELED DATA. The authors develop a binary sound classification model that recognizes a target artifact noise, starting from unlabeled data and using a pool-based active-learning framework with human annotators in the loop. They propose a certainty-based query-sampling strategy that outperforms other conventional methods in their setting. They show that active learning can improve training efficiency and significantly reduce annotation effort, a promising direction when labeled data is limited, and also for the future of large-scale datasets.
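
A schematic pool-based active-learning loop is sketched below, with classic least-confidence sampling standing in for the paper's certainty-based criterion (their sampling strategy and annotation protocol differ in detail):

```python
# Schematic pool-based active learning, with human annotators simulated by an
# oracle label array (simplified stand-in for the paper's setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

def certainty(model, X):
    """Model certainty as the max predicted class probability per sample."""
    return model.predict_proba(X).max(axis=1)

def active_learning_loop(X_pool, oracle_labels, n_rounds=5, batch=20, seed_size=20):
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), seed_size, replace=False))
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X_pool[labeled], oracle_labels[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        # Query the least-certain samples (uncertainty sampling as a stand-in
        # for the paper's certainty-based strategy) and "annotate" them.
        ranked = unlabeled[np.argsort(certainty(model, X_pool[unlabeled]))]
        labeled.extend(ranked[:batch])
    return model

X = np.random.randn(500, 16)
y = (X[:, 0] > 0).astype(int)          # toy binary "target noise" labels
model = active_learning_loop(X, y)
```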


Meng et al., CONDITIONAL TEACHER-STUDENT LEARNING. This work addresses the problem of domain adaptation through a new variant of the teacher-student paradigm. In the proposed approach, the student model becomes smart enough to criticize the knowledge imparted by the teacher model. As the student grows stronger, it begins to selectively choose the learning source from either the teacher or the ground truth, conditioned on whether the teacher’s predictions are correct; an elegant solution, plus the paper is very easy to read. Although not explicitly related to environmental sounds, the method could be applicable to the relevant problem of domain adaptation in sound recognition.
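
The conditioning can be written compactly; here is a hedged sketch of the core loss (temperature and loss weighting omitted; not the authors' exact implementation):

```python
# Sketch of conditional teacher-student learning: distill from the teacher only
# where the teacher is correct, otherwise learn from the ground-truth labels.
import torch
import torch.nn.functional as F

def conditional_ts_loss(student_logits, teacher_logits, targets):
    teacher_correct = teacher_logits.argmax(dim=1).eq(targets)           # (batch,)
    log_p_student = F.log_softmax(student_logits, dim=1)
    kl = F.kl_div(log_p_student, F.softmax(teacher_logits, dim=1),
                  reduction='none').sum(dim=1)                           # per-sample KL
    ce = F.cross_entropy(student_logits, targets, reduction='none')      # per-sample CE
    return torch.where(teacher_correct, kl, ce).mean()

loss = conditional_ts_loss(torch.randn(8, 10), torch.randn(8, 10),
                           torch.randint(0, 10, (8,)))
```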


Pons et al., TRAINING NEURAL AUDIO CLASSIFIERS WITH FEW DATA. The paper studies whether a naive regularization of the solution space, prototypical networks, transfer learning, or their combination can help deep learning models better leverage a small number of training examples. Experiments are conducted on sound events and acoustic scenes, showing that prototypical networks can be promising under constraints of very limited amounts of data. Code available.
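
For context, the prototypical-network classification step is simple enough to sketch in a few lines (generic version, not the paper's audio front-end): class prototypes are mean support embeddings, and queries are scored by negative squared distance to each prototype.

```python
# Minimal prototypical-network classification step (generic sketch).
import torch

def prototypical_logits(support_emb, support_labels, query_emb, n_classes):
    """support_emb: (n_support, dim), query_emb: (n_query, dim)."""
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0)
                              for c in range(n_classes)])         # (n_classes, dim)
    return -torch.cdist(query_emb, prototypes).pow(2)             # logits = -squared distance

support_labels = torch.arange(5).repeat(4)                        # 4 support examples per class
logits = prototypical_logits(torch.randn(20, 64), support_labels,
                             torch.randn(8, 64), n_classes=5)     # -> (8, 5)
```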


Other works:

Kong et al., ACOUSTIC SCENE GENERATION WITH CONDITIONAL SAMPLERNN. The authors present a conditional SampleRNN model for generating acoustic scene waveforms conditioned on the classes of the DCASE 2016 Task 1 dataset, which consists of 15 acoustic scenes. Model evaluation, inspired by the inception score, is done in terms of generation quality and diversity. Code and generated audio examples available. It’s great to see works generating audio beyond speech and music!


Cramer et al. LOOK, LISTEN, AND LEARN MORE: DESIGN CHOICES FOR DEEP AUDIO EMBEDDINGS. This paper revisits some of the key design choices of Look, Listen, and Learn (L3-Net), assessing their impact on the performance of downstream audio classifiers trained with the learnt embeddings. The embeddings are learnt in a self-supervised fashion, through the audio-visual correspondence task on AudioSet soundtracks and their accompanying videos. The learnt embeddings outperform VGGish and SoundNet on UrbanSound8K, ESC-50 and DCASE2013 SCD. Pre-trained versions of the proposed L3-Net variants are openly available.
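
Using the released embeddings for a downstream task looks roughly like the sketch below, via the authors' openl3 Python package (parameter values are illustrative; check the package documentation for the exact API):

```python
# Hedged sketch: extract L3-Net embeddings with the openl3 package and pool them
# into a clip-level feature for a downstream classifier.
import openl3
import soundfile as sf

audio, sr = sf.read('example.wav')        # placeholder path
emb, ts = openl3.get_audio_embedding(audio, sr, content_type='env',
                                     input_repr='mel256', embedding_size=512)
clip_feature = emb.mean(axis=0)           # simple clip-level summary, e.g. for an SVM
```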


Martín-Morató et al. SOUND EVENT ENVELOPE ESTIMATION IN POLYPHONIC MIXTURES. The idea is to estimate the amplitude envelopes of target sound events and subsequently perform detection with them. Instead of using the traditional binary activity of events as ground truth during training, the ground truth used here is the computed envelope of the mixture, which amounts to formulating the task as frame-based regression instead of classification. This creative approach could be a workaround to the troublesome task of deciding where to place the onset/offset of events that naturally fade in and out.
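
A toy sketch of the regression reformulation (my own illustration; the envelope definition and model in the paper differ): train frame-level predictions against a continuous envelope target with MSE instead of binary activity with BCE.

```python
# Frame-based regression against an envelope target (RMS energy used here as a
# simple stand-in for the paper's envelope ground truth).
import torch
import torch.nn.functional as F

def rms_envelope(waveform, frame=1024, hop=512):
    """Frame-wise RMS energy of a mono waveform, used as a soft target."""
    frames = waveform.unfold(0, frame, hop)              # (n_frames, frame)
    return frames.pow(2).mean(dim=1).sqrt()

waveform = torch.randn(16000)
target = rms_envelope(waveform)                          # (n_frames,)
pred = torch.rand_like(target, requires_grad=True)       # stand-in model output
loss = F.mse_loss(pred, target)                          # regression instead of BCE
```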


Wu et al., AUDIO CAPTION: LISTEN AND TELL. This paper introduces AudioCaption, a manually-annotated dataset for audio captioning, a task that has received a lot of attention in computer vision but not so much in audio processing. AudioCaption consists of 10 hours of audio from hospital environments, contains three annotations per clip, and is labelled in Mandarin, with translated English annotations included. A GRU encoder-decoder model is provided as a baseline system. Code and data available.
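
For a feel of what such a baseline involves, here is a compact, generic GRU encoder-decoder captioner (not the paper's exact model): encode the audio feature sequence, then decode caption tokens autoregressively with teacher forcing.

```python
# Compact generic GRU encoder-decoder captioner (illustrative sketch).
import torch
import torch.nn as nn

class GRUCaptioner(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, audio_feats, tokens):
        _, h = self.encoder(audio_feats)                    # audio_feats: (B, T, feat_dim)
        dec_out, _ = self.decoder(self.embed(tokens), h)    # teacher forcing on tokens
        return self.out(dec_out)                            # (B, L, vocab) token logits

logits = GRUCaptioner()(torch.randn(2, 100, 64), torch.randint(0, 1000, (2, 12)))
```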