
Speech Enhancement Using Deep Learning Methods: A Review

       Asri Rizki Yuliani, M. Faizal Amri, Endang Suryawati, Ade Ramdan, Hilman Ferdinandus Pardede

Abstract


Speech enhancement, which aims to recover clean speech from a corrupted signal, plays an important role in digital speech signal processing. Approaches to speech enhancement vary according to the type of degradation and noise in the speech signal. Thus, the research topic remains challenging in practice, specifically when dealing with highly non-stationary noise and reverberation. Recent advances in deep learning technologies have provided great support for progress in the speech enhancement research field. Deep learning has been shown to outperform the statistical models used in conventional speech enhancement, and hence it deserves a dedicated survey. In this review, we describe the advantages and disadvantages of recent deep learning approaches. We also discuss challenges and trends in this field. From the reviewed works, we conclude that the trend in deep learning architectures has shifted from the standard deep neural network (DNN) to the convolutional neural network (CNN), which can efficiently learn temporal information of the speech signal, and the generative adversarial network (GAN), which trains two networks adversarially.
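Before the deep learning methods surveyed here, spectral subtraction (Boll, 1979, listed in the references below) was a standard statistical baseline. As a rough illustration of the kind of conventional method the reviewed models aim to surpass, here is a minimal NumPy sketch of magnitude spectral subtraction; the function name and all parameter values are illustrative, not taken from the reviewed works.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, hop=128, noise_frames=5, floor=0.01):
    """Sketch of magnitude spectral subtraction (illustrative parameters).

    Assumes the first `noise_frames` frames contain noise only; their
    average magnitude spectrum is subtracted from every frame, with a
    spectral floor to limit the resulting musical noise.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    # Analysis: windowed FFT of each overlapping frame
    spectra = np.array([np.fft.rfft(window * noisy[i*hop : i*hop + frame_len])
                        for i in range(n_frames)])
    mags, phases = np.abs(spectra), np.angle(spectra)
    noise_mag = mags[:noise_frames].mean(axis=0)   # stationary-noise estimate
    # Subtract and floor the magnitudes; keep the noisy phase unchanged
    clean_mag = np.maximum(mags - noise_mag, floor * mags)
    clean_spec = clean_mag * np.exp(1j * phases)
    # Synthesis: overlap-add of the modified frames
    out = np.zeros(len(noisy))
    for i, spec in enumerate(clean_spec):
        out[i*hop : i*hop + frame_len] += np.fft.irfft(spec, n=frame_len)
    return out
```

Because the noise estimate is a single time-invariant spectrum, this baseline degrades under the highly non-stationary noise conditions discussed in the abstract, which is precisely where learned DNN/CNN/GAN enhancers have the advantage.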


  http://dx.doi.org/10.14203/jet.v21.19-26

Keywords


speech enhancement; deep learning; neural networks; speech signal processing; non-stationary noise


References


J. R. Deller, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. Wiley-IEEE Press, 2000.

P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.

S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., vol. 27, pp. 113–120, Apr. 1979.

T. Van den Bogaert, S. Doclo, J. Wouters, and M. Moonen, "Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids," J. Acoust. Soc. Am., vol. 125, no. 1, 2009.

Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., vol. 33, Apr. 1985.

P. Wang and D. L. Wang, "Enhanced spectral features for distortion-independent acoustic modeling," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2019, 2019, pp. 476–480.

C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

Y. Zhao, B. Xu, R. Giri, and T. Zhang, "Perceptually guided speech enhancement using deep neural networks," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018, pp. 5074–5078.

T. Gao, J. Du, L. R. Dai, and C. H. Lee, "Densely connected progressive learning for LSTM-based speech enhancement," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018, pp. 5054–5058.

F. Weninger, J. Geiger, M. Wöllmer, B. Schuller, and G. Rigoll, "The Munich feature enhancement approach to the 2013 CHiME challenge using BLSTM recurrent neural networks," in 2nd Int. Workshop Machine Listening Multisource Environments, 2013.

M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2013, pp. 6822–6826.

S. R. Park and J. W. Lee, "A fully convolutional neural network for speech enhancement," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2017, 2017.

A. Pandey and D. Wang, "A new framework for CNN-based speech enhancement in the time domain," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, Jul. 2019.

F. G. Germain, Q. Chen, and V. Koltun, "Speech denoising with deep feature losses," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2019, 2019.

S. W. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," in Proc. 9th Asia-Pacific Signal and Information Processing Association Annu. Summit and Conf. 2017, 2018.

J. Rownicka, P. Bell, and S. Renals, "Multi-scale octave convolutions for robust speech recognition," in 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2020.

D. Rethage, J. Pons, and X. Serra, "A wavenet for speech denoising," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

X. Feng, Y. Zhang, and J. Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," in 2014 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2014.

D. Baby and S. Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in 2019 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2019.

H. Phan et al., "Improving GANs for speech enhancement," IEEE Signal Process. Lett., vol. 27, 2020.

D. Bagchi, P. Plantinga, A. Stiff, and E. Fosler-Lussier, "Spectral feature mapping with mimic loss for robust speech recognition," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

P. Plantinga, D. Bagchi, and E. Fosler-Lussier, "An exploration of mimic architectures for residual network based spectral mapping," in 2018 IEEE Workshop Spoken Language Technology Proc., 2019.

A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Proc. 13th Annu. Conf. Int. Speech Communication Association Interspeech 2012, 2012.

X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2013, 2013.

S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech enhancement generative adversarial network," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2017, 2017.

T. Gao, J. Du, L. R. Dai, and C. H. Lee, "SNR-based progressive learning of deep neural network for speech enhancement," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2016, 2016.

P. Karjol, M. A. Kumar, and P. K. Ghosh, "Speech enhancement using multiple deep neural networks," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

Y. Xu, J. Du, Z. Huang, L. R. Dai, and C. H. Lee, "Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement," in Proc. Annu. Conf. Int. Speech Communication Association Interspeech 2015, 2015.

M. H. Soni, N. Shah, and H. A. Patil, "Time-frequency masking-based speech enhancement using generative adversarial network," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

K. Kinoshita, T. Ochiai, M. Delcroix, and T. Nakatani, "Improving noise robust automatic speech recognition with single-channel time-domain enhancement network," in 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2020.

Z. Xu, S. Elshamy, and T. Fingscheidt, "Using separate losses for speech and noise in mask-based speech enhancement," in 2020 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2020.

A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in 2018 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2018.

H. S. Choi, J. H. Kim, J. Huh, A. Kim, J. W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Proc. Int. Conf. Learning Representations 2019, 2019.

X. Li and R. Horaud, "Multichannel speech enhancement based on time-frequency masking using subband long short-term memory," in 2019 IEEE Workshop Applications Signal Processing to Audio and Acoustics, 2019.

Z. Zhang, J. Geiger, J. Pohjalainen, A. E. D. Mousa, W. Jin, and B. Schuller, "Deep learning for environmentally robust speech recognition: An overview of recent developments," ACM Trans. Intelligent Syst. Technol., vol. 9, no. 5, 2018.

D. Michelsanti et al., "An overview of deep-learning-based audio-visual speech enhancement and separation," arXiv, 2020.

N. Das, S. Chakraborty, J. Chaki, N. Padhy, and N. Dey, "Fundamentals, present and future perspectives of speech enhancement," Int. J. Speech Technol., 2020.

L. Hertel, H. Phan, and A. Mertins, "Comparing time and frequency domain for audio event recognition using deep learning," in 2016 Int. Joint Conf. Neural Networks Proc., 2016.

S. A. Nossier, J. Wall, M. Moniri, C. Glackin, and N. Cannings, "Mapping and masking targets comparison using different deep learning based speech enhancement architectures," in 2020 Int. Joint Conf. Neural Networks Proc., 2020, pp. 1–8.

X. L. Zhang and D. Wang, "A deep ensemble learning method for monaural speech separation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, May 2016.

Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, Dec. 2014.

D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. Workshop Speech and Natural Language, 1992.

E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," in 2013 IEEE Int. Conf. Acoustics, Speech and Signal Processing Proc., 2013.

E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, "An analysis of environment, microphone and data simulation mismatches in robust speech recognition," Comput. Speech Lang., vol. 46, 2017.

J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in 2015 IEEE Workshop Automatic Speech Recognition and Understanding Proc., 2016.

C. Veaux, J. Yamagishi, and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," in 2013 Int. Conf. Oriental COCOSDA 2013 Conf. Asian Spoken Language Research and Evaluation, 2013.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Tech. Rep. N, vol. 93, 1993.

D. Pearce and H. G. Hirsch, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in 6th Int. Conf. Spoken Language Processing Interspeech 2000, Beijing, China, 2000.

A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv, 2016.

Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, Aug. 2019.


Copyright (c) 2021 Jurnal Elektronika dan Telekomunikasi

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.