PH.D. DEFENCE - PUBLIC SEMINAR

Acoustic Event Recognition: from Supervised Learning to Unsupervised Learning

Speaker
Mr. Wei Wei
Advisor
Dr. Wang Ye, Associate Professor, School of Computing


Monday, 05 Dec 2022, 04:00 PM to 05:30 PM

Executive Classroom, COM2-04-02

Abstract:

Audio is one of the most common sources of multimedia information in our daily lives, and it carries a wealth of information worth analyzing. In an audio signal, an acoustic event is defined as a segment containing a particular sound, such as a phone ringing, a singer singing, or people talking. Acoustic event recognition is an effective way to extract such information from audio signals: the task is to predict a label, a start time, and an end time for each acoustic event detected in a given audio input. This thesis analyzes three major types of audio signals: singing voice, environmental audio, and speech. Moving from singing voice to environmental audio and then to speech, the proposed models progress from fully supervised learning to unsupervised learning.

For singing voice, we study acoustic events defined as phonation modes. Prior work treats this merely as a single-phonation audio classification task, where each audio file contains one phonation mode. We introduce a more complex phonation mode detection (PMD) setting in which each audio file contains multiple phonation modes, and the onset, offset, and type of each phonation mode must be detected.

For environmental audio, we address the research problem of sound event detection (SED). Typically, a model's performance degrades when it is tested on a dataset different from the one it was trained on. We propose an unsupervised adversarial domain adaptation model so that an SED model trained on existing datasets can be quickly adapted to achieve comparable results on a new dataset without any additional human annotations.

Finally, for speech, acoustic events are defined as mispronunciations. We focus on the mispronunciation localization problem: detecting and locating mispronounced phonemes in the speech of second-language (L2) learners. Whereas existing mispronunciation detection models are mostly trained via supervised learning, we develop an unsupervised learning algorithm for this task. Our experimental results show that our unsupervised ML-VAE achieves results comparable to supervised learning approaches, without requiring human annotations for model training.