Conference Agenda

Session Overview
CAI1-SS: Computational Audio Intelligence for Immersive Applications 1 (Music source separation)
Monday, 21/Sept/2020: 9:40am - 10:40am

Session Chair: Konstantinos Drossos
Location: Virtual platform

9:40am - 9:55am
⭐ This paper has been nominated for the best paper award.

Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation

Pyry Pyykkönen1, Styliannos Mimilakis2, Konstantinos Drossos3, Tuomas Virtanen3

1Media Research Group, Tampere University, Tampere, Finland; 2Semantic Music Technologies Group, Fraunhofer-IDMT, Ilmenau, Germany; 3Audio Research Group, Tampere University, Finland

Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to be difficult to train and parallelize, especially for the typically long sequences encountered in music source separation. In this paper we present a use case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of typical convolutions. We focus on singing voice separation: starting from an RNN architecture, we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of DWS-CNNs on source separation performance, using the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that replacing RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the number of parameters of the RNN architecture.
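
To illustrate why DWS convolutions are considered lightweight, the following minimal sketch counts the weights of a standard 2-D convolution and of a depthwise separable one. The channel counts and kernel size are illustrative assumptions, not values from the paper, and biases are ignored. The sketch does not reproduce the 20.57% figure above, which compares a whole DWS-CNN model against an RNN architecture.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard 2-D convolution with a k x k kernel (biases ignored)."""
    return c_in * c_out * k * k


def dws_conv_params(c_in, c_out, k):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 64, 5
standard = standard_conv_params(c_in, c_out, k)
separable = dws_conv_params(c_in, c_out, k)
print(standard, separable, separable / standard)
```

With these assumed sizes, the standard layer has 102,400 weights and the separable one 5,696, roughly 5.6% as many.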

Pyykkönen-Depthwise Separable Convolutions Versus Recurrent Neural Networks-220.pdf

9:55am - 10:10am
⭐ This paper has been nominated for the best paper award.

Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CMNMF

Antonio Jesús Muñoz-Montoro1, Archontis Politis2, Konstantinos Drossos2, Julio José Carabias-Orti1

1Universidad de Jaén, Spain; 2Tampere University, Finland

This work addresses the problem of multichannel source separation by combining two powerful approaches: multichannel spectral factorization and recent monophonic deep learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser twin network, which is able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on complex-valued multichannel non-negative matrix factorization (CMNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CMNMF method outperforms both the individual monophonic DL-based separation and the multichannel CMNMF baseline methods.

Muñoz-Montoro-Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CMNMF-278.pdf

10:10am - 10:25am

Mixing-Specific Data Augmentation Techniques for Improved Blind Violin/Piano Source Separation

Ching-Yu Chiu1,2, Wen-Yi Hsiao3, Yin-Cheng Yeh3, Yi-Hsuan Yang1,2,3, Alvin W. Y. Su1

1Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Taiwan; 2Research Center for IT Innovation, Academia Sinica, Taiwan; 3Yating Music Team, Taiwan AI Labs

Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities. To counter the lack of available multi-track data for supervised model training, a data augmentation method that creates artificial mixtures by combining tracks from different songs has been shown to be useful in recent works. In this light, we further examine in this paper extended data augmentation methods that consider more sophisticated mixing settings employed in the modern music production routine, the relationship between the tracks to be combined, and factors of silence. As a case study, we consider the separation of violin and piano tracks in a violin-piano ensemble, evaluating the performance in terms of common metrics, namely SDR, SIR, and SAR. In addition to examining the effectiveness of these new data augmentation methods, we also study the influence of the amount of training data. Our evaluation shows that the proposed mixing-specific data augmentation methods can help improve the performance of a deep learning-based model for source separation, especially in the case of small training data.
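
The baseline augmentation mentioned above, creating artificial mixtures from tracks of different songs, can be sketched as follows. This is a minimal illustration and not the authors' mixing-specific methods: the function name, the gain range, and the representation of a stem as a plain list of samples are all assumptions made here.

```python
import random


def make_random_mixture(violin_stems, piano_stems, rng, gain_range=(0.5, 1.0)):
    """Sum one violin stem and one piano stem, each drawn independently
    and scaled by its own random gain (hypothetical helper)."""
    violin = rng.choice(violin_stems)
    piano = rng.choice(piano_stems)
    n = min(len(violin), len(piano))      # trim both stems to a common length
    g_v = rng.uniform(*gain_range)        # assumed gain range, not from the paper
    g_p = rng.uniform(*gain_range)
    violin = [g_v * x for x in violin[:n]]
    piano = [g_p * x for x in piano[:n]]
    mixture = [v + p for v, p in zip(violin, piano)]
    # The scaled stems serve as the separation targets for the mixture.
    return mixture, violin, piano


# Toy stems standing in for audio from different songs.
rng = random.Random(0)
violin_stems = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.1]]
piano_stems = [[0.5, 0.5, 0.5], [0.2, -0.2, 0.2, -0.2]]
mixture, violin_target, piano_target = make_random_mixture(violin_stems, piano_stems, rng)
```

Because the violin and piano stems are drawn independently, a small multi-track collection can yield many distinct training mixtures.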

Chiu-Mixing-Specific Data Augmentation Techniques for Improved Blind ViolinPiano Source Separation-223.pdf

10:25am - 10:40am

Multi-channel U-Net for Music Source Separation

Venkatesh S. Kadandale1, Juan F. Montesinos1, Gloria Haro1, Emilia Gómez1,2

1Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain; 2Joint Research Centre, European Commission, Seville, Spain

A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated to estimating only a specific source. A single model trained to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve a performance comparable to that of the dedicated models.

We propose a multi-channel U-Net (M-U-Net) trained using a weighted multi-task loss as an alternative to the C-U-Net. We investigate two weighting strategies for our multi-task loss: 1) Dynamic Weighted Average (DWA), and 2) Energy Based Weighting (EBW). DWA determines the weights by tracking the rate of change of loss of each task during training. EBW aims to neutralize the effect of the training bias arising from the difference in energy levels of each of the sources in a mixture. Our methods provide two-fold advantages compared to the C-U-Net: 1) Fewer effective training iterations per epoch with no conditioning, and 2) Fewer trainable network parameters (no control parameters). Our methods achieve performance comparable to that of C-U-Net and the dedicated U-Nets at a much lower training cost.
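
As a rough illustration of weighting tasks by the rate of change of their losses, the sketch below follows one common formulation of dynamic weight averaging: each task's weight is a softmax over the ratio of its two most recent losses, scaled so that the weights sum to the number of tasks. Whether the paper uses exactly this form is an assumption, as are the function name and the temperature value.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Per-task loss weights from the ratio of the two most recent losses.

    A ratio close to or above 1 means the task's loss is decreasing slowly
    (or increasing), so the task receives a larger weight.
    The temperature is an assumed hyperparameter.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    num_tasks = len(ratios)
    return [num_tasks * e / total for e in exps]


# Two hypothetical sources: the first loss halved, the second barely moved.
weights = dwa_weights(prev_losses=[0.5, 0.9], prev_prev_losses=[1.0, 1.0])
print(weights)
```

In this toy case the second task, whose loss has improved less, receives the larger weight.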

S. Kadandale-Multi-channel U-Net for Music Source Separation-247.pdf