Conference Agenda

Session Overview
CAI2-SS: Computational Audio Intelligence for Immersive Applications 2 (Spatial audio)
Tuesday, 22/Sept/2020:
5:25pm - 6:25pm

Session Chair: Archontis Politis
Location: Virtual platform

5:25pm - 5:40pm
⭐ This paper has been nominated for the best paper award.

Blind reverberation time estimation from ambisonic recordings

Andrés Pérez-López1,2, Archontis Politis3, Emilia Gómez1,4

1Department of Information and Communication Technologies, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain; 2Multimedia Technologies Unit, Eurecat, Centre Tecnologic de Catalunya, Barcelona, Spain; 3Faculty of Information Technology and Communication Sciences, Audio Research Group, Tampere University, Tampere, Finland; 4Centre for Advanced Studies, Joint Research Centre, European Commission, Seville, Spain

Reverberation time is an important room acoustic parameter, useful for many acoustic signal processing applications. Most of the existing work on blind reverberation time estimation focuses on the single-channel case. However, recent developments and interest in immersive audio have brought a number of spherical microphone arrays to the market, together with the use of ambisonics as a standard spatial audio convention. This work presents a novel blind reverberation time estimation method that specifically targets ambisonic recordings, a field that, to the best of our knowledge, has remained unexplored. Experimental validation on a synthetic reverberant dataset shows that the proposed algorithm outperforms state-of-the-art methods under most evaluation criteria.

5:40pm - 5:55pm

Time Difference of Arrival Estimation with Deep Learning -- From Acoustic Simulations to Recorded Data

Pasi Pertilä1, Mikko Parviainen1, Ville Myllyla2, Anu Huttunen2, Petri Jarske2

1Tampere University, Finland; 2Huawei Technologies, Terminal Research & Development, Tampere, Finland

The spatial information about a sound source is carried by acoustic waves to a microphone array and can be observed by estimating phase and amplitude differences between microphones. Time difference of arrival (TDoA) captures the propagation delay of the wavefront between microphones and can be used to steer a beamformer or to localize the source. However, reverberation and interference can deteriorate the TDoA estimate. Deep neural networks (DNNs) trained with supervised learning can extract speech-related TDoAs in more adverse conditions than traditional correlation-based methods.

Acoustic simulations provide large amounts of annotated data, while real recordings require manual annotation or the use of reference sensors with proper calibration procedures. The distributions of these two data sources can differ. When a DNN model trained on simulated data is presented with real data from a different distribution, its performance degrades unless the mismatch is properly addressed.

To reduce the DNN-based TDoA estimation error, this work investigates the role of different input normalization techniques, the mixing of simulated and real data for training, and the application of an adversarial domain adaptation technique. Results quantify the reduction in TDoA error on real data for the different approaches. They show that normalization methods, domain adaptation, and the use of real data during training can each reduce the TDoA error.
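For orientation, the TDoA described in this abstract can be illustrated with the kind of traditional correlation-based estimator that the DNN approach is compared against. The following is a minimal sketch and not the authors' method: it uses plain time-domain cross-correlation (not a generalized or phase-weighted variant), and the function name `estimate_tdoa`, the `max_lag` parameter, and the synthetic noise signals are assumptions made for this example.

```python
import random


def estimate_tdoa(sig_a, sig_b, sample_rate, max_lag):
    """Return the delay of sig_b relative to sig_a, in seconds.

    A positive result means sig_b lags behind sig_a.
    Plain cross-correlation sketch; hypothetical helper, not from the paper.
    """
    n = min(len(sig_a), len(sig_b))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate sig_a[i] with sig_b[i + lag] over the overlapping samples.
        start = max(0, -lag)
        stop = min(n, n - lag)
        score = sum(sig_a[i] * sig_b[i + lag] for i in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


# Synthetic two-microphone example (assumed data): the second channel is
# the first one delayed by 5 samples, with no reverberation or noise.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(512)]
delayed = [0.0] * 5 + reference[:-5]

tdoa = estimate_tdoa(reference, delayed, sample_rate=16000, max_lag=20)
print(f"Estimated TDoA: {tdoa * 1e3:.4f} ms")
```

In this idealized setting the correlation peak sits exactly at the true delay. With reverberation or interfering sources the peak can become ambiguous, which is the failure mode the abstract attributes to correlation-based methods.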

5:55pm - 6:10pm

Deep Learning for Individual Listening Zone

Giovanni Pepe1, Leonardo Gabrielli1, Stefano Squartini1, Luca Cattani2, Carlo Tripodi2

1Università Politecnica delle Marche, Italy; 2ASK Industries SpA, Reggio Emilia, Italy

A recent trend in car audio systems is the generation of Individual Listening Zones (ILZ), which improve phone call privacy and reduce disturbance to other passengers without the need to wear headphones or earpieces. This is generally achieved by using loudspeaker arrays. In this paper, we describe an approach to achieving ILZ that eliminates the need for dedicated loudspeakers: it exploits general-purpose car loudspeakers and processes the signal through carefully designed FIR filters.

We propose a deep neural network approach for designing the filter coefficients in order to obtain a so-called bright zone, where the signal is clearly heard, and a dark zone, where the signal is attenuated. Additionally, the frequency response in the bright zone is constrained to be as flat as possible. Numerical experiments were performed using impulse responses measured with either one binaural pair or three binaural pairs for each passenger. The results in terms of attenuation and flatness demonstrate the viability of the approach.

6:10pm - 6:25pm

Blind C50 estimation from single-channel speech using a convolutional neural network

Hannes Gamper

Microsoft, United States of America

The early-to-late reverberation energy ratio is an important parameter describing the acoustic properties of an environment. C50, i.e., the ratio between the energy in the first 50 ms and the remaining late energy, affects the perceived clarity and intelligibility of speech, and can be used as a design parameter in mixed reality applications or to predict the performance of speech recognition systems. While established methods exist to derive C50 from impulse response measurements, such measurements are rarely available in practice. Recently, methods have been proposed to estimate C50 blindly from reverberant speech signals.

Here, a convolutional neural network (CNN) architecture with a long short-term memory (LSTM) layer is proposed to estimate C50 blindly. The CNN-LSTM operates directly on the spectrogram of variable-length, noisy, reverberant utterances. A feature comparison indicates that log Mel spectrogram features with a frame size of 128 samples achieve the best performance with an average root-mean-square error of about 2.7 dB, outperforming previously proposed blind C50 estimators.
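As a point of reference for the quantity being estimated, C50 can be computed directly when an impulse response is available: it is the energy in the first 50 ms divided by the energy in the remainder, expressed in dB. The sketch below shows only this non-blind definition, not the CNN-LSTM estimator proposed in the paper. The function names, the assumption that the direct sound arrives at sample 0, and the synthetic exponentially decaying impulse response are all illustrative choices made for this example.

```python
import math


def c50_db(impulse_response, sample_rate, split_ms=50.0):
    """Early-to-late energy ratio in dB, split at split_ms after sample 0.

    Assumption: the direct sound arrives at index 0 of impulse_response.
    """
    split = int(round(sample_rate * split_ms / 1000.0))
    early = sum(x * x for x in impulse_response[:split])
    late = sum(x * x for x in impulse_response[split:])
    if early <= 0.0 or late <= 0.0:
        raise ValueError("need nonzero energy on both sides of the split")
    return 10.0 * math.log10(early / late)


def synthetic_rir(sample_rate, rt60, duration):
    """Hypothetical test signal: an exponential decay that has dropped
    by 60 dB in level after rt60 seconds (no noise, no early reflections)."""
    n = int(sample_rate * duration)
    return [10.0 ** (-3.0 * (i / sample_rate) / rt60) for i in range(n)]


dry_room = synthetic_rir(sample_rate=16000, rt60=0.3, duration=2.0)
wet_room = synthetic_rir(sample_rate=16000, rt60=0.5, duration=2.0)

print(f"C50, RT60 = 0.3 s: {c50_db(dry_room, 16000):.2f} dB")
print(f"C50, RT60 = 0.5 s: {c50_db(wet_room, 16000):.2f} dB")
```

For this idealized decay, a longer reverberation time moves more energy past the 50 ms boundary and therefore lowers C50. Blind estimators such as the one in this paper aim to recover the same value from reverberant speech alone, without access to the impulse response.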
