This week I attended my second DCASE Workshop. It was great to see some old faces and many new ones. It was also a bit different this time, as I attended as a DCASE Challenge Task organizer (Task 2) rather than as a participant, like last year. I think there were a bunch of great contributions in this edition of the Workshop. Next, I briefly highlight some of them that caught my attention. Spoiler: these are not necessarily the top-performing systems in a given Challenge task, just papers that, from my perspective, are worth highlighting for various reasons.


Mesaros et al. A multi-device dataset for urban acoustic scene classification. This paper mainly introduces the dataset used for Task 1 of the DCASE Challenge, called TUT Urban Acoustic Scenes 2018. This openly available dataset includes recordings made in six different cities, carried out with several devices of varying audio quality. These aspects increase acoustic variability and allow studying the mismatch produced by different recording devices, which can be a very critical issue. Currently, the dataset consists of ten classes and 24 hours of audio, but the authors keep expanding it. In all probability, this dataset will become the reference resource for urban acoustic scene classification in the coming years.


Jeong and Lim. Audio tagging system using densely connected convolutional networks. In this work, the authors leverage a number of approaches that led them to win the DCASE Challenge Task 2 (General-purpose tagging of Freesound audio with AudioSet labels), including a simplified DenseNet architecture, mixup-based data augmentation, and a novel multi-head classifier module. One aspect that I particularly liked is their simple and elegant batch-wise loss masking to deal with label noise, an unavoidable aspect of large-scale data collection that is emerging as a pressing issue for the future of sound event classification.
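For a flavour of what such masking can look like, here is a minimal PyTorch sketch of my reading of the idea: within each batch, the samples with the largest losses are assumed to be the ones most likely to carry wrong labels, so they are simply excluded from the gradient. The mask_ratio parameter and the function name are mine, not the authors'.

```python
import torch
import torch.nn.functional as F

def masked_loss(logits, targets, mask_ratio=0.1):
    """Batch-wise loss masking sketch: drop the highest-loss samples,
    assuming they are the most likely to have noisy labels.
    mask_ratio (fraction of the batch to drop) is a hypothetical
    hyperparameter, not a value from the paper."""
    # Per-sample cross-entropy, no reduction yet.
    per_sample = F.cross_entropy(logits, targets, reduction='none')
    batch_size = per_sample.size(0)
    n_keep = batch_size - int(mask_ratio * batch_size)
    # Keep only the n_keep smallest losses; the rest are masked out.
    kept, _ = torch.topk(per_sample, n_keep, largest=False)
    return kept.mean()
```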


Koutini et al. Iterative knowledge distillation in R-CNNs for weakly-labeled semi-supervised sound event detection. The idea behind knowledge distillation is to distill the knowledge of a more complex teacher model into a simpler, shallower student model that performs similarly to (or even better than) its teacher. Following this idea, the authors first obtain predictions on unlabeled data using a more complex CRNN. Then, shallower CRNNs are trained using a conveniently smoothed version of the predictions of the complex models, and this process is repeated iteratively. It is nice to see how, for some event classes, the shallow models generalize better than their teachers.
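To make the smoothing step concrete, here is a small PyTorch sketch of classic temperature-based distillation, which is in the same spirit as (though not necessarily identical to) the smoothing used in the paper; the temperature value below is a placeholder of mine.

```python
import torch
import torch.nn.functional as F

def distillation_targets(teacher_logits, temperature=2.0):
    """Smooth the teacher's predictions with a softmax temperature;
    higher temperatures yield softer targets for the student."""
    return F.softmax(teacher_logits / temperature, dim=-1)

def student_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the student's (tempered) distribution and
    the teacher's smoothed targets, as in classic distillation."""
    soft_targets = distillation_targets(teacher_logits, temperature)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return -(soft_targets * log_probs).sum(dim=-1).mean() * temperature ** 2
```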


Dorfer and Widmer. Training general-purpose audio tagging networks with noisy labels and iterative self-verification. This paper is based on an iterative self-verification process that selects the correct labels from the pool of noisy labels in the FSDKaggle2018 dataset of Task 2. The approach uses a fully convolutional net and relies on a carefully designed experimental setup leveraging prior knowledge of the dataset. The paper is worth reading for a few interesting, nicely explained details. Equally important, the work is highly reproducible, with a GitHub repo and a blog post giving a detailed description of the system, plus a rather cool demo :)
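Purely as an illustration of the general loop (not the authors' exact pipeline), a self-verification procedure could look like the sketch below. The model_fn factory, the agreement threshold, and the number of rounds are all hypothetical; the one piece of real prior knowledge the sketch leans on is that FSDKaggle2018 provides a manually-verified subset alongside the noisy one.

```python
import numpy as np

def self_verify(model_fn, X_verified, y_verified, X_noisy, y_noisy,
                n_rounds=3, agree_threshold=0.5):
    """Sketch of an iterative self-verification loop. model_fn is a
    hypothetical factory returning a fresh classifier with
    fit/predict_proba; thresholds and round counts are assumptions."""
    X_train, y_train = X_verified, y_verified
    for _ in range(n_rounds):
        model = model_fn()
        model.fit(X_train, y_train)
        # Score the noisy pool: how much does the model trust each label?
        probs = model.predict_proba(X_noisy)
        label_conf = probs[np.arange(len(y_noisy)), y_noisy]
        keep = label_conf >= agree_threshold
        # Grow the training set with noisy samples whose labels the
        # model agrees with, then retrain in the next round.
        X_train = np.concatenate([X_verified, X_noisy[keep]])
        y_train = np.concatenate([y_verified, y_noisy[keep]])
    return X_train, y_train
```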


Li et al. Fast mosquito acoustic detection with field cup recordings: an initial investigation. This work takes the first steps in the task of mosquito detection from mosquito flight tones, a tool that could help reduce the number of deaths caused by mosquito-borne diseases. The work describes a very interesting data collection process, including the capture and recording of the mosquitoes, and a simple yet effective CNN approach intended for eventual deployment on mobile devices for on-site species detection. Also, the authors are currently labelling more data to improve their systems.


Gharib et al. Unsupervised adversarial domain adaptation for acoustic scene classification. The authors propose unsupervised adversarial domain adaptation to solve the problem of mismatched conditions between training and testing data in acoustic scene classification. The work considers two sets of data: data (with labels) from a source domain, and data (without labels) from a target domain. The main difference between the two sets is the acoustic channel condition due to different recording techniques. The ultimate goal is to categorize data coming from the unlabeled target domain distribution. The approach consists of first training a model using data from the source domain. Then, using adversarial training, as in GANs, the model optimized on the source domain is adapted to the distribution of the target domain. The proposed method is independent of the model architecture used and shows promising results, making it an interesting solution for improving generalization under mismatched conditions.
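Here is a minimal PyTorch sketch of what the adversarial adaptation step can look like, assuming an ADDA-like two-player setup; the module sizes, optimizers, and learning rates are placeholders of mine, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Feature extractor (pretrained on the labeled source domain) and a
# small domain discriminator; shapes here are illustrative only.
feat = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
disc = nn.Sequential(nn.Linear(32, 1))

opt_f = torch.optim.Adam(feat.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adapt_step(x_src, x_tgt):
    """One adversarial update: the discriminator learns to tell source
    features from target features, while the feature extractor is
    updated so that target features become indistinguishable from
    source ones."""
    # 1) Train the discriminator (source -> 1, target -> 0).
    d_loss = (bce(disc(feat(x_src).detach()), torch.ones(x_src.size(0), 1))
              + bce(disc(feat(x_tgt).detach()), torch.zeros(x_tgt.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Train the feature extractor to fool the discriminator into
    #    labeling target features as source.
    g_loss = bce(disc(feat(x_tgt)), torch.ones(x_tgt.size(0), 1))
    opt_f.zero_grad(); g_loss.backward(); opt_f.step()
    return d_loss.item(), g_loss.item()
```

Because the adversarial game only touches the feature extractor and a small discriminator, a recipe along these lines stays largely agnostic to the underlying classifier, in line with the paper's claim of architecture independence.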


Finally, as organizers of DCASE Challenge Task 2, we presented the paper General-purpose tagging of Freesound audio with AudioSet labels: task description, dataset, and baseline, where the FSDKaggle2018 dataset is described. It was gratifying to see the outcomes of our task: people participating, learning, sharing knowledge, and, last but not least, having fun. All of this made the joint effort by the Music Technology Group and Google to run this competition worthwhile. It was also very nice to see that quite a few papers presented at the DCASE Workshop went beyond the DCASE Challenge, covering a wider range of relevant topics. A good sign that this community is growing!