Accurate Detection of Wake Word Start and End Using a CNN

2020

Christin Jose, Yuriy Mishchenko, Thibaud Senechal, and 3 more authors

Aug 2020

Interspeech 2020

Paper Abstract

Small footprint embedded devices require keyword spotters (KWS) with small model size and detection latency for enabling voice assistants. Such a keyword is often referred to as wake word as it is used to wake up voice assistant enabled devices. Together with wake word detection, accurate estimation of wake word endpoints (start and end) is an important task of KWS. In this paper, we propose two new methods for detecting the endpoints of wake words in neural KWS that use single-stage word-level neural networks. Our results show that the new techniques give superior accuracy for detecting wake words’ endpoints of up to 50 msec standard error versus human annotations, on par with the conventional Acoustic Model plus HMM forced alignment. To our knowledge, this is the first study of wake word endpoints detection methods for single-stage neural KWS.

@article{2008.03790v1,
  author = {Jose, Christin and Mishchenko, Yuriy and Senechal, Thibaud and Shah, Anish and Escott, Alex and Vitaladevuni, Shiv},
  title = {Accurate Detection of Wake Word Start and End Using a CNN},
  eprint = {2008.03790v1},
  doi = {10.21437/Interspeech.2020-1491},
  archiveprefix = {arXiv},
  primaryclass = {eess.AS},
  year = {2020},
  month = aug,
  note = {Interspeech 2020},
  url = {http://arxiv.org/abs/2008.03790v1},
  file = {2008.03790v1.pdf},
  eprintnover = {2008.03790}
}

Three Important Things

1. Start-end Regression Model

The authors introduce two architectures for wake word (WW) detection that also estimate the WW endpoints. The first is the start-end regression model: the input signal passes through several stacked convolution and pooling layers and then forks into two sets of outputs.

The first set of outputs is the probability that a wake word is present. The second set is the start and end offsets of the wake word, normalized so that \([0,1]\) spans the input window.
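To make the normalization concrete, here is a minimal sketch of mapping normalized offsets back to absolute times. The function name, the window position, and the 2-second window length are assumptions for illustration, not values from the paper.

```python
def offsets_to_seconds(start_norm, end_norm, window_start_s, window_len_s):
    """Map normalized offsets (0 = window start, 1 = window end) to seconds."""
    start_s = window_start_s + start_norm * window_len_s
    end_s = window_start_s + end_norm * window_len_s
    return start_s, end_s


# Hypothetical 2-second input window that begins 10.0 s into the audio stream.
start_s, end_s = offsets_to_seconds(0.25, 0.75, window_start_s=10.0, window_len_s=2.0)
print(start_s, end_s)  # 10.5 11.5
```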

The two outputs share most of the network backbone, on the assumption that the representations it learns are useful for both downstream tasks: detection and endpoint regression.
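One common way to train two heads on a shared backbone is to sum a detection loss and a regression loss. The sketch below assumes binary cross-entropy for detection and mean squared error for the offsets, with the regression term applied only to positive examples. This summary does not state the loss the authors used, so `joint_loss` and `reg_weight` are hypothetical.

```python
import math


def joint_loss(p_detect, is_wake_word, pred_offsets, true_offsets, reg_weight=1.0):
    """Binary cross-entropy on detection plus squared error on (start, end).

    The regression term is skipped for negatives, which have no endpoints.
    """
    eps = 1e-7
    p = min(max(p_detect, eps), 1.0 - eps)  # clamp to avoid log(0)
    bce = -math.log(p) if is_wake_word else -math.log(1.0 - p)
    if not is_wake_word:
        return bce
    mse = sum((a - b) ** 2 for a, b in zip(pred_offsets, true_offsets)) / len(true_offsets)
    return bce + reg_weight * mse


# Positive example: detection is uncertain, both offsets are off by 0.1.
print(joint_loss(0.5, True, (0.3, 0.9), (0.2, 0.8)))
```

Because both terms backpropagate through the same convolutional layers, the shared features are pushed to be useful for detection and for localization at once.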

2. Multi-aligned Output Wake Word Model

The second architecture is the multi-aligned output WW model. Instead of a single output that detects whether a WW exists, the network has three detection outputs:

Start of the WW
End of the WW
Main detector of the WW (centrally aligned)

Because the three detectors together locate the wake word in time, the model no longer needs a separate regression output for the start and end positions.
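A rough sketch of how endpoints could be read off three per-frame detector outputs follows. It assumes that the start and end detectors peak at the frames where the WW begins and ends, that frames are 10 ms apart, and that the main detector gates the decision with a 0.5 threshold. None of these details come from the paper, and all names are hypothetical.

```python
FRAME_SHIFT_S = 0.01  # assumed 10 ms hop between frames


def peak_frame(scores):
    """Index of the highest score in a per-frame sequence (first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def endpoints_from_peaks(start_scores, main_scores, end_scores, threshold=0.5):
    """Return (start_s, end_s) if the main detector fires, else None."""
    if max(main_scores) < threshold:
        return None
    start_s = peak_frame(start_scores) * FRAME_SHIFT_S
    end_s = peak_frame(end_scores) * FRAME_SHIFT_S
    return start_s, end_s


# Toy posteriors over five frames for the three detectors.
start_scores = [0.1, 0.9, 0.2, 0.1, 0.1]
main_scores = [0.1, 0.2, 0.8, 0.3, 0.1]
end_scores = [0.0, 0.1, 0.2, 0.3, 0.95]
print(endpoints_from_peaks(start_scores, main_scores, end_scores))
```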

Of the two architectures, this one was found to perform best at WW detection.

3. Pseudo-Ground Truth Labels for Training and Evaluation

Due to the difficulty of annotating WW endpoint labels, the authors used the then state-of-the-art acoustic model + Hidden Markov Model keyword spotter (AM+HMM KWS) to label the data as pseudo-ground truth labels.
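Endpoint accuracy against such reference labels can be summarized by the mean and spread of the timing error. The sketch below computes both in milliseconds. Treating the abstract's "standard error" as the standard deviation of the differences is an assumption here, and the function name and sample values are illustrative only.

```python
import statistics


def endpoint_error_stats(predicted_s, reference_s):
    """Mean and population standard deviation (ms) of predicted minus reference."""
    errors_ms = [(p - r) * 1000.0 for p, r in zip(predicted_s, reference_s)]
    return statistics.mean(errors_ms), statistics.pstdev(errors_ms)


# Hypothetical WW start times in seconds: model predictions vs. reference labels.
predicted = [1.05, 2.00, 2.95]
reference = [1.00, 2.00, 3.00]
mean_ms, std_ms = endpoint_error_stats(predicted, reference)
print(round(mean_ms, 2), round(std_ms, 2))
```

A mean far from zero would indicate a systematic bias (endpoints consistently early or late), while the standard deviation captures how much individual estimates scatter.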

Most Glaring Deficiency

The labels used are not ground truth labels but pseudo-ground truth labels produced by another model, so any systematic errors in the AM+HMM alignments may carry over and affect the reliability of the results obtained.

Conclusions for Future Work

Instead of just detecting a particular feature in the input, we could decompose the task into detecting different parts of that feature. For sequential data, we could break it up into start, middle, and end, as the multi-aligned output model in this paper does. This also has the advantage of lower latency for detecting the start of the feature, which could improve the user experience in latency-sensitive applications.