SP-O: Speech processing
8:30am - 8:45am
Speaker-adaptive neural vocoders for parametric speech synthesis systems
1Naver Corp., Korea, Republic of (South Korea); 2Yonsei University, Korea, Republic of (South Korea)
This paper proposes speaker-adaptive neural vocoders for parametric text-to-speech (TTS) systems. Recently proposed WaveNet-based neural vocoding systems successfully generate a time sequence of speech signals with an autoregressive framework. However, it remains a challenge to synthesize high-quality speech when the amount of training data available for a target speaker is insufficient. To generate more natural speech signals under the constraint of limited training data, we propose a speaker adaptation task with an effective variation of neural vocoding models. In the proposed method, a speaker-independent training method is applied to capture universal attributes embedded in multiple speakers, and the trained model is then optimized to represent the specific characteristics of the target speaker. Experimental results verify that the proposed TTS systems with speaker-adaptive neural vocoders outperform those with traditional source-filter model-based vocoders and those with WaveNet vocoders, trained either speaker-dependently or speaker-independently. In particular, our TTS system achieves 3.80 and 3.77 MOS for the Korean male and Korean female speakers, respectively, even though we use only a ten-minute speech corpus for training the model.
8:45am - 9:00am
Spectrogram-Based Classification of Spoken Foul Language Using Deep CNN
Multimedia University, Malaysia
Excessive profanity in audio and video content has been shown to shape one’s character and behavior. Currently, conventional methods of manual detection and censorship are used. Manual censorship is time-consuming and prone to misdetection of foul language. This paper proposes an intelligent model for foul language censorship through automated and robust detection by deep convolutional neural networks (CNNs). A dataset of foul language was collected and processed to compute audio spectrogram images, which serve as the input for evaluating the classification of foul language. The proposed model was first tested on a 2-class (Foul vs Normal) classification problem; the foul class was then further decomposed into a 10-class classification problem for exact detection of profanity. Experimental results show the viability of the proposed system by demonstrating high performance in curse-word classification, with an Error Rate (ER) of 1.24-2.71 for the 2-class problem and 5.49-8.30 F1-score. The proposed ResNet50 architecture outperforms the other models in terms of accuracy, sensitivity, specificity, and F1-score.
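The spectrogram input described in this abstract can be illustrated with a minimal sketch. The abstract does not state the frame length, hop size, window, or sample rate, so the values below (`frame_len=64`, `hop=32`, a Hann window, 8 kHz) are assumptions chosen only to keep the example small, and the function names are hypothetical. A plain DFT is used instead of a library FFT so the block has no dependencies.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Return one row per frame; each row holds log-magnitudes (dB)
    for frequency bins 0 .. frame_len // 2.

    frame_len and hop are illustrative assumptions, not values from the paper.
    """
    window = hann(frame_len)
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(20 * math.log10(abs(acc) + 1e-10))
        rows.append(row)
    return rows


# Toy input: a 1 kHz tone sampled at 8 kHz (assumed rate).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = spectrogram(tone)

# With 64-sample frames at 8 kHz, each bin spans 125 Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

In a pipeline like the one the abstract describes, the resulting 2-D array would be rendered or resized as an image before being passed to the CNN; that step is omitted here.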
9:00am - 9:15am
A Low Complexity Long Short-Term Memory Based Voice Activity Detection
Harman International Industries, China, People's Republic of
Voice Activity Detection (VAD) plays an important role in audio processing, but it remains a common challenge when a voice signal is corrupted by strong and transient noise. In this paper, an accurate and causal VAD module using a long short-term memory (LSTM) deep neural network is proposed. A set of features including Gammatone cepstral coefficients (GTCC) and selected spectral features is used. The low-complexity structure allows it to be easily implemented in speech processing algorithms and applications. After carefully pre-processing the collected training data, labeling it as speech or non-speech, and training the LSTM network on it, experiments show that the proposed VAD is able to effectively distinguish speech from different types of noisy background. Its robustness against changes, including varying frame length, moving speech sources, and speech in different languages, is further investigated.
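To illustrate what "causal" means for an LSTM-based VAD, the following is a minimal sketch of a single-layer LSTM run frame by frame, where each speech probability depends only on the current and earlier frames. The layer size, the feature dimension, and the random (untrained) weights are all assumptions for illustration; the abstract does not give the network configuration, and the feature vectors here are random stand-ins for GTCC and spectral features.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """One LSTM step. W has 4 * hidden rows, ordered as input, forget,
    candidate, and output gates; each row multiplies concat(x, h)."""
    hidden = len(h)
    z = x + h  # list concatenation: features followed by previous hidden state
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:hidden]]
    f = [sigmoid(p) for p in pre[hidden:2 * hidden]]
    g = [math.tanh(p) for p in pre[2 * hidden:3 * hidden]]
    o = [sigmoid(p) for p in pre[3 * hidden:4 * hidden]]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def run_vad(frames, W, b, w_out, b_out, hidden):
    """Return one speech probability per frame, using past context only."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    probs = []
    for x in frames:
        h, c = lstm_step(x, h, c, W, b)
        probs.append(sigmoid(sum(w * v for w, v in zip(w_out, h)) + b_out))
    return probs


# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
n_feat, hidden = 4, 3
W = [[rng.uniform(-0.5, 0.5) for _ in range(n_feat + hidden)]
     for _ in range(4 * hidden)]
b = [0.0] * (4 * hidden)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
frames = [[rng.uniform(-1, 1) for _ in range(n_feat)] for _ in range(10)]
probs = run_vad(frames, W, b, w_out, 0.0, hidden)
```

Because the state is carried forward one frame at a time, changing later frames cannot alter earlier outputs, which is what makes such a module usable in streaming applications.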
9:15am - 9:30am
Improving Speech Recognition for Under-resourced Languages Utilizing Audio-codecs for Data Augmentation
Otto-von-Guericke-Universität Magdeburg, Germany
Training end-to-end automatic speech recognition models requires a large amount of labeled speech data. This requirement is challenging for languages with fewer resources. In contrast to the commonly used feature-level data augmentation, we propose to expand the training set by using different audio codecs at the data level. The augmentation method consists of using different audio codecs with changed bit rate, sampling rate, and bit depth. These changes ensure variation in the input data without drastically affecting the audio quality. In addition, the audio remains perceptible to humans, and any feature extraction can still be applied afterwards. To demonstrate the general applicability of the proposed augmentation technique, we evaluated it in an end-to-end automatic speech recognition architecture in four languages. After applying the method to the Amharic, Dutch, Slovenian, and Turkish datasets, we achieved an average improvement of 1.57 in the character error rate (CER) without integrating language models. Compared to the baseline result, the CER improved by 2.78, 1.25, 1.21, and 1.05, respectively. On the Amharic dataset, we reached a syllable error rate reduction of 6.12 compared to the baseline result.
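As a rough illustration of data-level augmentation, the sketch below produces variants of a waveform with a changed sampling rate and bit depth. It is a stand-in, not the authors' pipeline: the abstract does not name the codecs used, bit-rate changes require an actual codec and are not emulated here, and the function names and the chosen rates and bit depths are hypothetical.

```python
import math


def reduce_bit_depth(samples, bits):
    """Quantize samples in [-1.0, 1.0] onto a signed integer grid of the
    given bit depth, then scale back to floats."""
    levels = 2 ** (bits - 1)
    out = []
    for x in samples:
        q = round(x * levels)
        q = max(-levels, min(levels - 1, q))  # clip to the representable range
        out.append(q / levels)
    return out


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation. No anti-aliasing filter is applied;
    a real resampler or codec would include one."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def augment(samples, src_rate, variants):
    """Return (rate, bits, waveform) for each (rate, bits) pair in variants."""
    return [
        (rate, bits,
         reduce_bit_depth(resample_linear(samples, src_rate, rate), bits))
        for rate, bits in variants
    ]


# Toy input: 0.1 s of a 440 Hz tone at 16 kHz, amplitude 0.5 (assumed values).
src_rate = 16000
wave = [0.5 * math.sin(2 * math.pi * 440 * t / src_rate) for t in range(1600)]
augmented = augment(wave, src_rate, [(16000, 8), (8000, 16), (8000, 8)])
```

Each variant keeps the original transcript as its label, so the training set grows without additional annotation.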