Three Important Things
1. Limited Gains
Applying transformers to speech processing tasks has shown limited or inconclusive gains over recurrent and convolutional architectures. One reason is that transformers lack the built-in inductive biases of those architectures (e.g., locality and weight sharing in convolutions) and therefore require more training data, while labeled speech data is already scarce.
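A minimal sketch of this point, assuming PyTorch (the feature dimension and layer sizes are illustrative, not from the paper): a convolution hard-codes that nearby frames interact and shares weights across time, whereas a self-attention layer must learn any such structure from data.

```python
# Contrast the inductive bias of a convolution with unconstrained
# self-attention (PyTorch assumed; d = 256 is a hypothetical dimension).
import torch.nn as nn

d = 256  # hypothetical feature dimension

conv = nn.Conv1d(d, d, kernel_size=3, padding=1)  # local, weight-shared
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

n_conv = sum(p.numel() for p in conv.parameters())
n_attn = sum(p.numel() for p in attn.parameters())
print(f"conv params: {n_conv:,} | attention params: {n_attn:,}")
# The convolution assumes nearby frames interact; attention has to infer
# even that structure from data, which typically demands larger datasets.
```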
2. Challenges of Transformers in Speech Processing
A major problem is that audio sequences are much longer than text (e.g., a few thousand frames in Speech Emotion Recognition (SER)) and less information-dense per frame. Because self-attention has complexity quadratic in the number of frames, both compute and memory become very expensive at these lengths.
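A back-of-the-envelope sketch of this scaling (plain Python; the feature dimension d is an assumption for illustration): the n x n attention score matrix dominates cost as the number of frames grows.

```python
# Quadratic cost of self-attention over speech frames, per head per layer.
def attention_cost(n_frames: int, d: int = 256) -> tuple[int, int]:
    """Return (cells in the n x n score matrix, multiply-adds for QK^T)."""
    scores = n_frames * n_frames       # O(n^2) memory
    flops = n_frames * n_frames * d    # O(n^2 * d) compute for the scores
    return scores, flops

for n in (500, 3000, 6000):  # a few thousand frames is typical in SER
    scores, flops = attention_cost(n)
    print(f"n={n:>5}: {scores:>12,} scores, {flops:>16,} mult-adds")
# Doubling the sequence length quadruples both memory and compute.
```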
3. Lack of Standardized Benchmark
Speech processing lacks a standardized benchmark like GLUE for evaluating spoken dialogue systems. As a result, the datasets used to evaluate such systems vary widely, making fair comparisons and establishing a clear state of the art difficult.
Most Glaring Deficiency
It is generally difficult to criticize a survey paper, but one pertinent objection is the lack of a summary and discussion of existing datasets and tasks used for evaluating spoken dialogue systems. Such a summary would give concrete examples of the tasks these systems are evaluated on in practice.
Conclusions for Future Work
Future work could pursue a new architecture better suited to speech data, one with a stronger inductive bias than the vanilla transformer so that it can be trained with much less data.
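One existing step in this direction is to combine convolution (a local inductive bias) with self-attention (global context), as in the Conformer used for speech recognition. The block below is a hypothetical, heavily simplified sketch of that idea in PyTorch, not the published architecture; all layer choices and dimensions are assumptions.

```python
# Simplified conv-augmented attention block (in the spirit of Conformer):
# depthwise convolution injects locality, attention keeps global context.
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, d: int = 256, heads: int = 4, kernel: int = 15):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, time, d)
        a, _ = self.attn(x, x, x)                          # global context
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)   # local patterns
        return self.norm2(x + c)

x = torch.randn(2, 1000, 256)         # toy input: 1000 speech frames
print(ConvAttentionBlock()(x).shape)  # torch.Size([2, 1000, 256])
```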