Automatic speech recognition is on the verge of becoming the primary way we interact with our main computing devices. A decade ago, the idea of controlling devices by voice seemed far-fetched.
Anticipating this rise in voice-controlled electronics, a team of researchers at MIT has developed a low-power chip designed for automatic speech recognition. Whereas a cell phone running speech-recognition software might require roughly 1 watt of power, the new chip requires only 0.2 to 10 milliwatts, depending on the number of words it has to recognize.
In a real-world application, that could translate to a power savings of 90 to 99%, which would make voice control practical even for relatively simple electronic devices, including power-constrained gadgets that have to go months between battery charges or harvest energy from their environments.
Such devices form the technological backbone of what is called the “internet of things,” or IoT: the idea that appliances, vehicles, manufacturing equipment, civil-engineering structures, and even livestock will soon have sensors that report information directly to networked servers, aiding with maintenance and the coordination of tasks.
Speech input will become a natural interface for many wearable applications and intelligent devices. The miniaturization of these devices will require a different interface than touch or keyboard. It will be critical to embed the speech functionality locally to save system energy consumption compared to performing this operation in the cloud.
Anantha Chandrakasan, Vannevar Bush Professor of Electrical Engineering and Computer Science, MIT
“I don’t think that we really developed this technology for a particular application,” adds Michael Price, who led the design of the chip as an MIT graduate student in electrical engineering and computer science and currently works for chipmaker Analog Devices. “We have tried to put the infrastructure in place to provide better trade-offs to a system designer than they would have had with previous technology, whether it was software or hardware acceleration.”
Price, Chandrakasan, and Jim Glass, a senior research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory, describe the new chip in a paper that Price presented at the International Solid-State Circuits Conference.
The Sleeper Wakes
Currently, the best-performing speech recognizers are, like many other state-of-the-art artificial-intelligence systems, based on neural networks: virtual networks of simple information processors loosely modeled on the human brain. Much of the new chip’s circuitry is devoted to executing speech-recognition networks as efficiently as possible.
But even the most power-efficient speech recognition systems would quickly drain a device’s battery if they ran without interruption. So the chip also includes a simpler “voice activity detection” circuit that monitors ambient noise to determine whether it might be speech. If the answer is yes, the chip wakes the larger, more complex speech-recognition circuit.
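The gating idea can be sketched in software. The following is an illustrative energy-threshold detector, not the chip’s actual circuit; the threshold value and frame format are arbitrary assumptions made for the example:

```python
# Illustrative sketch of voice-activity gating: a cheap, always-on
# check decides whether the expensive recognition stage runs at all.

def detect_activity(frame, threshold=0.01):
    """Return True if the frame's mean energy exceeds the threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

def process_stream(frames, recognize):
    """Invoke the expensive recognizer only on frames flagged as speech."""
    results = []
    for frame in frames:
        if detect_activity(frame):            # cheap, always-on check
            results.append(recognize(frame))  # costly stage, gated
    return results
```

On mostly silent input, the costly `recognize` function is almost never called, which is the source of the power savings the article describes.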
For experimental purposes, the researchers’ chip included three different voice-activity-detection circuits, with varying degrees of complexity and, consequently, varying power demands. Which circuit was most power-efficient depended on context, but in tests simulating a wide range of conditions, the most complex of the three delivered the greatest power savings for the system as a whole.
Although it consumed almost three times as much power as the simplest circuit, it generated far fewer false positives; the simpler circuits often squandered their energy savings by falsely triggering the rest of the chip.
A typical neural network consists of many processing “nodes” capable only of simple computations but densely connected to one another. In the type of network commonly used for voice recognition, the nodes are arranged into layers.
Voice data are fed into the network’s bottom layer, whose nodes process and pass them to the nodes of the next layer, which process and pass them up in turn, and so on. The output of the top layer indicates the probability that the voice data represent a particular speech sound.
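The layered flow just described can be sketched as a toy forward pass. The layer sizes and weights below are placeholders, not a trained speech model:

```python
# Toy sketch of a layered feed-forward network: each layer's nodes
# compute weighted sums of the previous layer's outputs, and the top
# layer is normalized into probabilities over speech sounds.
import math

def forward(frame, layers):
    """Pass one frame of voice data up through the layers.

    layers: for each layer, a list of per-node weight vectors.
    Returns a probability for each node in the top layer.
    """
    values = frame
    for weights in layers:
        values = [sum(w * v for w, v in zip(node, values))
                  for node in weights]
    exps = [math.exp(v) for v in values]   # softmax over the top layer
    total = sum(exps)
    return [e / total for e in exps]
```

Real recognizers also apply nonlinear activations between layers; they are omitted here to keep the sketch minimal.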
A voice-recognition network is too big to fit in a chip’s onboard memory, which is a problem because going off-chip for data is much more energy-intensive than retrieving it from local stores. So the MIT researchers’ design concentrated on minimizing the amount of data the chip had to fetch from off-chip memory.
A node in the center of a neural network might receive data from several other nodes and send out data to several others. Each of those connections has an associated “weight,” a number that signifies how prominently data transmitted across it should factor into the receiving node’s computations.
The first step in minimizing the new chip’s memory bandwidth is to compress the weights associated with each node; the data are decompressed only after they have been brought on-chip.
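One way to picture the compress-then-decompress-on-chip step is codebook quantization, where each weight is stored as a small index into a shared table of values. The chip’s actual compression scheme may differ, and the codebook values below are invented for illustration:

```python
# Hedged sketch of weight compression via codebook quantization:
# off-chip memory holds small indices; full weights are reconstructed
# only once the data are on-chip.

CODEBOOK = [-0.5, -0.1, 0.0, 0.1, 0.5]  # hypothetical shared values

def compress(weights):
    """Replace each weight with the index of its nearest codebook entry."""
    return [min(range(len(CODEBOOK)),
                key=lambda i: abs(CODEBOOK[i] - w))
            for w in weights]

def decompress(indices):
    """On-chip step: expand indices back into approximate weights."""
    return [CODEBOOK[i] for i in indices]
```

With a 5-entry codebook, each index needs only 3 bits instead of a full-precision word, which is where the bandwidth savings come from.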
The chip also exploits the fact that, with speech recognition, wave upon wave of data must be sent through the network. The incoming audio signal is split into 10-millisecond increments, each of which must be evaluated separately. The MIT researchers’ chip brings in a single node of the neural network at a time, but it passes the data from 32 consecutive 10-millisecond increments through it.
If a node has 12 outputs, then the 32 passes result in 384 output values, which the chip stores locally. Each of those must be combined with 11 other values when fed to the next layer of nodes, and so on. So the chip ends up needing a substantial onboard memory circuit for its intermediate computations. But it fetches only a single compressed node from off-chip memory at a time, keeping its power requirements to a minimum.
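The node-at-a-time batching can be sketched as follows: one node’s weights are fetched once and reused across all 32 frames, amortizing the expensive off-chip access. The frame and output counts follow the figures in the text; everything else is illustrative:

```python
# Sketch of frame batching: compute every output of one node for all
# buffered frames before moving on to the next node, so each node's
# weights cross the off-chip boundary only once.

def stream_through_node(frames, node_weights):
    """frames: buffered input vectors (32 in the article's example).
    node_weights: one weight vector per output of this node.
    Returns outputs[output_index][frame_index], stored locally.
    """
    return [[sum(w * x for w, x in zip(weights, frame))
             for frame in frames]                  # reuse weights 32x
            for weights in node_weights]
```

With 12 outputs and 32 frames, this yields the 12 × 32 = 384 locally stored values mentioned above.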
For the next generation of mobile and wearable devices, it is crucial to enable speech recognition at ultralow power consumption. This is because there is a clear trend toward smaller-form-factor devices, such as watches, earbuds, or glasses, requiring a user interface which can no longer rely on touch screen. Speech offers a very natural way to interface with such devices.
Marian Verhelst, Professor, Catholic University of Leuven
The research was funded through the Qmulus Project, a joint venture between MIT and Quanta Computer. The chip was prototyped through the Taiwan Semiconductor Manufacturing Company’s University Shuttle Program.