ASS-P: Audio Source Separation
6:25pm - 6:30pm
Audio-Fingerprinting via Dictionary Learning
University of Patras, Greece
In recent years, several successful schemes have been proposed to solve the song identification problem. These techniques construct a signal's audio fingerprint either by employing conventional signal processing techniques or by computing its sparse representation in the time-frequency domain. This paper proposes a new audio-fingerprinting scheme that constructs a unique and concise representation of an audio signal by applying a dictionary, learnt here via the well-known K-SVD algorithm applied to a song database. The promising experimental results suggest that the proposed approach not only performed rather well in identifying the signal content of several audio clips, even when this content had been distorted by noise, but also surpassed the recognition rate of a Shazam-based paradigm.
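The core of such a scheme is computing a sparse code of each spectrogram frame over a learned dictionary and using the active atoms as a compact fingerprint. The following is a minimal sketch of that idea, assuming the dictionary `D` has already been learned (e.g. by K-SVD); the fingerprint definition (`tuple` of active atom indices) is a hypothetical illustration, not the paper's exact construction.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: greedy k-sparse code of x over dictionary D."""
    residual, support = x.copy(), []
    coeffs = np.zeros(0)
    for _ in range(k):
        # pick the atom most correlated with the current residual
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        # refit coefficients over all selected atoms
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    code = np.zeros(D.shape[1])
    code[support] = coeffs
    return code

def fingerprint(D, frames, k=5):
    """Hypothetical fingerprint: the sorted set of active atoms per frame."""
    return [tuple(sorted(np.flatnonzero(omp(D, f, k)))) for f in frames]
```

Because the fingerprint keeps only atom indices (not coefficient values), it is concise and tolerant of moderate amplitude distortion, which is consistent with the noise-robustness the abstract reports.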
6:30pm - 6:35pm
Audio Inpainting based on Self-similarity for Sound Source Separation Applications
University College Dublin, Ireland
Sound source separation algorithms have advanced significantly in recent years, but many can suffer from objectionable artefacts, including phasiness, transient smearing, high-frequency loss, an unnatural-sounding noise floor, and reverberation, to name a few. One of the main reasons is that in many algorithms, individual time-frequency bins are attributed to only one source at a time, meaning that many time-frequency bins will be set to zero for a separated source. This leads to an impressive signal-to-interference ratio, but at the cost of natural-sounding resynthesis. Here, we present a simple algorithm capable of audio inpainting based on self-similarity within the signal. The algorithm uses the non-zero bin values observed in similar frames as substitutes for the zero bin values in the current analysis frame. We present results from subjective listening tests which show a preference for the inpainted audio over the original audio produced by a simple source separation algorithm. Further, we use the Fréchet Audio Distance metric to evaluate the perceptual effect of the proposed inpainting algorithm. The results of this evaluation support the subjective test preferences.
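The substitution step described above can be sketched as follows: for each frame of a masked magnitude spectrogram, find the most similar other frame (here by cosine similarity, one plausible choice; the paper's similarity measure may differ) and copy its values into the zeroed bins.

```python
import numpy as np

def inpaint_frames(S, eps=1e-8):
    """Fill zeroed time-frequency bins of each frame from its most similar frame.

    S: (n_bins, n_frames) non-negative magnitude spectrogram of a separated
    source, where binary masking has set many bins to zero.
    A minimal sketch, assuming cosine similarity between frames.
    """
    out = S.copy()
    norms = np.linalg.norm(S, axis=0) + eps
    sim = (S.T @ S) / np.outer(norms, norms)   # frame-to-frame cosine similarity
    np.fill_diagonal(sim, -np.inf)             # never match a frame with itself
    for t in range(S.shape[1]):
        zero = S[:, t] == 0
        if not zero.any():
            continue
        best = int(np.argmax(sim[t]))          # most similar other frame
        out[zero, t] = S[zero, best]           # substitute its non-zero bins
    return out
```

Reading substitutes from the original `S` (not from `out`) keeps the result independent of the frame-processing order.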
6:35pm - 6:40pm
Reverberant Audio Blind Source Separation via Local Convolutive Independent Vector Analysis
Université Sorbonne Paris Nord, France
In this paper, we propose a new formulation of the blind source separation problem for audio signals with convolutive mixtures, aimed at improving the separation performance of Independent Vector Analysis (IVA). The proposed method benefits from both the recently investigated convolutive approximation model and IVA approaches that take advantage of cross-band information to avoid permutation alignment. We first exploit the link between IVA and Sparse Component Analysis (SCA) methods through structured sparsity. We then propose a new framework combining the convolutive narrowband approximation and the Windowed-Group-Lasso (WGL). The model is optimized via an alternating optimization approach in which the convolutive kernel and the source components are jointly optimized. Numerical evaluations show that the proposed approach outperforms existing methods in terms of objective measures.
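The structured sparsity that links IVA and group-lasso methods rests on a group-shrinkage step: all frequency bins of one source at one time frame are shrunk together, so a source is switched on or off as a whole across bands, which is what sidesteps the permutation problem. Below is a minimal sketch of the plain group-lasso proximal operator under that grouping; the paper's Windowed-Group-Lasso adds temporal windowing on top, which is not reproduced here.

```python
import numpy as np

def group_soft_threshold(X, lam):
    """Proximal operator of lam * sum_t ||X[:, t]||_2 (group lasso).

    X: (n_freq_bins, n_frames) coefficients of one source. Each column is a
    group: the whole column is scaled down, and columns whose l2 norm falls
    below lam are zeroed entirely, enforcing cross-band consistency.
    """
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return X * scale
```

In an alternating scheme like the one described, this shrinkage would form the source-update step, alternating with a least-squares-type update of the convolutive kernel.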
6:40pm - 6:45pm
Solos: A Dataset for Audio-Visual Music Analysis
Universitat Pompeu Fabra
In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks, such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner, since a large proportion of its recordings are auditions and manually checked recordings, ensuring that there is no background noise or effects added in video post-processing. Moreover, it is, to the best of our knowledge, the only dataset that contains the whole set of instruments present in the URMP dataset, a high-quality dataset of 44 multi-instrument audio-visual recordings of classical music pieces with individual audio tracks. URMP was intended to be used for source separation; thus, we evaluate the performance on the URMP dataset of two different BSS models trained on Solos.