CAI1-SS: Computational Audio Intelligence for Immersive Applications 1 (Music source separation)
9:40am - 9:55am
⭐ This paper has been nominated for the best paper award.
Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation
1Media Research Group, Tampere University, Tampere, Finland; 2Semantic Music Technologies Group, Fraunhofer-IDMT, Ilmenau, Germany; 3Audio Research Group, Tampere University, Finland
Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to be difficult to train and parallelize, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depthwise separable (DWS) convolutions, a lightweight and faster variant of typical convolutions. We focus on singing voice separation: starting from an RNN architecture, we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study examining the effect of the number of channels and layers of the DWS-CNNs on source separation performance, using the standard signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio metrics. Our results show that replacing RNNs with DWS-CNNs yields improvements of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the parameters of the RNN architecture.
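The parameter saving the abstract reports comes from the structure of a depthwise separable convolution itself: a per-channel spatial kernel followed by a 1x1 pointwise channel mixer. A minimal sketch of the parameter-count arithmetic (channel and kernel sizes are illustrative, not the paper's configuration; biases ignored):

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # Standard 2-D convolution: one k x k kernel per
    # (input channel, output channel) pair.
    return c_in * c_out * k * k

def dws_conv_params(c_in: int, c_out: int, k: int) -> int:
    # Depthwise step: one k x k kernel per input channel,
    # then a pointwise 1 x 1 convolution to mix channels.
    return c_in * k * k + c_in * c_out

if __name__ == "__main__":
    c_in, c_out, k = 256, 256, 3
    std = standard_conv_params(c_in, c_out, k)
    dws = dws_conv_params(c_in, c_out, k)
    print(std, dws, dws / std)  # prints 589824 67840 0.115...
```

For equal channel counts the ratio approaches 1/k² + 1/c_out, which is why DWS layers can shed most of the parameters while keeping the same receptive field.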
9:55am - 10:10am
⭐ This paper has been nominated for the best paper award.
Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CMNMF
1Universidad de Jaén, Spain; 2Tampere University, Finland
This work addresses the problem of multichannel source separation by combining two powerful approaches: multichannel spectral factorization and recent monophonic deep learning (DL) based spectrum inference. Individual source spectra at different channels are estimated with a Masker-Denoiser twin network, able to model long-term temporal patterns of a musical piece. The monophonic source spectrograms are used within a spatial covariance mixing model based on complex-valued multichannel non-negative matrix factorization (CMNMF) that predicts the spatial characteristics of each source. The proposed framework is evaluated on the task of singing voice separation with a large multichannel dataset. Experimental results show that our joint DL+CMNMF method outperforms both the individual monophonic DL-based separation and the multichannel CMNMF baseline methods.
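The factorization side of this pipeline builds on the NMF family. As a point of reference only, here is the core of plain real-valued NMF with multiplicative updates on a magnitude spectrogram; the paper's CMNMF extends this idea to complex-valued multichannel data with spatial covariance models, which this sketch does not implement:

```python
import numpy as np

def nmf_step(V, W, H, eps=1e-9):
    """One pair of multiplicative updates minimising ||V - WH||_F.
    V: nonnegative magnitude spectrogram (freq x time),
    W: spectral basis vectors, H: time-varying activations."""
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(0)
V = rng.random((8, 20))   # toy "spectrogram"
W = rng.random((8, 3))
H = rng.random((3, 20))
before = np.linalg.norm(V - W @ H)
for _ in range(50):
    W, H = nmf_step(V, W, H)
after = np.linalg.norm(V - W @ H)
```

The multiplicative form keeps W and H nonnegative by construction and monotonically decreases the reconstruction error, which is why it remains the standard workhorse for spectral factorization.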
10:10am - 10:25am
Mixing-Specific Data Augmentation Techniques for Improved Blind Violin/Piano Source Separation
1Graduate Program of Multimedia Systems and Intelligent Computing, National Cheng Kung University and Academia Sinica, Taiwan; 2Research Center for IT Innovation, Academia Sinica, Taiwan; 3Yating Music Team, Taiwan AI Labs
Blind music source separation has been a popular and active subject of research in both the music information retrieval and signal processing communities. To counter the lack of available multi-track data for supervised model training, a data augmentation method that creates artificial mixtures by combining tracks from different songs has been shown useful in recent works. In this light, we examine in this paper extended data augmentation methods that consider more sophisticated mixing settings employed in modern music production routines, the relationship between the tracks to be combined, and factors of silence. As a case study, we consider the separation of violin and piano tracks in a violin/piano ensemble, evaluating performance in terms of the common metrics SDR, SIR, and SAR. In addition to examining the effectiveness of these new data augmentation methods, we also study the influence of the amount of training data. Our evaluation shows that the proposed mixing-specific data augmentation methods can help improve the performance of a deep learning-based model for source separation, especially when training data are scarce.
10:25am - 10:40am
Multi-channel U-Net for Music Source Separation
1Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain; 2Joint Research Centre, European Commission, Seville, Spain
A fairly straightforward approach for music source separation is to train independent models, each dedicated to estimating a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, the Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve performance comparable to that of the dedicated models.
We propose a multi-channel U-Net (M-U-Net) trained using a weighted multi-task loss as an alternative to the C-U-Net. We investigate two weighting strategies for our multi-task loss: 1) Dynamic Weighted Average (DWA), and 2) Energy Based Weighting (EBW). DWA determines the weights by tracking the rate of change of loss of each task during training. EBW aims to neutralize the effect of the training bias arising from the difference in energy levels of each of the sources in a mixture. Our methods provide two-fold advantages compared to the C-U-Net: 1) Fewer effective training iterations per epoch with no conditioning, and 2) Fewer trainable network parameters (no control parameters). Our methods achieve performance comparable to that of C-U-Net and the dedicated U-Nets at a much lower training cost.
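The DWA strategy described above can be sketched in a few lines. This follows the common DWA recipe (weight each task by the recent rate of change of its loss, softmax-normalised with a temperature); the paper's exact formulation and hyperparameters may differ:

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weighted Average: tasks whose loss is decreasing
    slowly (ratio near or above 1) receive larger weights.
    Weights are scaled so they sum to the number of tasks."""
    k = len(prev_losses)
    # Rate of change of each task's loss over the last two epochs.
    rates = [prev_losses[i] / prev_prev_losses[i] for i in range(k)]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    return [k * e / total for e in exps]
```

For example, if one source's loss has halved while another's is flat, the flat task gets the larger weight, nudging training effort toward the source that is lagging.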