- DeepMind, Alphabet's Artificial Intelligence (AI) company, announced WaveNet, an AI-based speech generation technology.
- WaveNet uses neural network technology related to that used in AlphaGo, DeepMind's breakthrough Go-playing program.
- WaveNet efficiently generates realistic human voices, reducing the gap to human-level performance by over 50 percent.
A few months ago Amigobulls covered the spectacular story of Alphabet's (NASDAQ:GOOGL) AlphaGo, the first Artificial Intelligence (AI) program to beat one of the world's top-ranked human Go players. Before AlphaGo, beating a top-ranked Go player was thought to be a remote goal for AI, at least ten years in the future. The AlphaGo breakthrough has been rightfully hailed as an important milestone for AI.
But one doesn't make much money playing Go. Top Go players do make very good money with prizes and sponsors in places like Japan and Korea where the game is very popular, but of course, Alphabet is after much bigger money than that. In other words, the company needs to convert its research results into commercial products.
AlphaGo was developed by Google DeepMind, a British AI company founded in 2010 as DeepMind Technologies and acquired by Google in 2014. For AlphaGo, DeepMind created deep neural networks that learn how to play games in a similar fashion to humans and appear to mimic key cognitive aspects of the human brain. But advanced deep neural networks have many applications besides games, and some applications have clear commercial value.
A few days ago DeepMind announced WaveNet, a deep generative model of raw audio waveforms that is able to generate speech that mimics any human voice and sounds more natural than the best existing Text-to-Speech (TTS) systems, reducing the gap with human performance by over 50 percent.
"Allowing people to converse with machines is a long-standing dream of human-computer interaction," reads the DeepMind announcement, which notes that the ability of computers to understand natural speech has been revolutionized in the last few years by the application of deep neural networks. However, generating speech with computers is still based on old techniques where short speech fragments are recorded from a single speaker and recombined. "This makes it difficult to modify the voice (for example switching to a different speaker, or altering the emphasis or emotion of their speech) without recording a whole new database," emphasizes the announcement.
Voice recognition and generation technology powers all sorts of computer systems that interact with users by voice, from customer service switchboard systems to personal assistants on smartphones, such as Apple's (NASDAQ:AAPL) Siri, Microsoft's (NASDAQ:MSFT) Cortana, and Alphabet's own Google Now. The holy grail of voice synthesis is generating computer voices that sound and feel exactly like human voices. Samantha, the fictional AI assistant in the film "Her," voiced by Scarlett Johansson, sounds totally human and communicates deep emotional content. That is still a distant goal, but the DeepMind announcement represents an important step in that direction.
Make no mistake, there's a lot of money in computer assistants that sound like people. Consumers love them, not only for their practical utility but also because many people miss emotionally satisfying interactions with other people. That people look to computers as friends is perhaps a sad symptom of existential malaise in today's society, but it's also a fact that consumer-facing businesses can't ignore.
For example, Microsoft has been testing its AI-powered chatbot technology in China with XiaoIce, a program that people can add as a friend on Chinese social networks. XiaoIce is now a huge hit in China: millions of Chinese people chat with her every day, and some consider her a dear friend. XiaoIce is significantly more sophisticated than current-generation personal assistants and is able to conduct human-like conversations with simulated emotional content.
XiaoIce is a text chatbot, though, and doesn't have a voice. A voice is exactly what DeepMind's WaveNet could deliver to "computer friends," as well as to personal assistants and business systems. CNBC notes that TTS synthesis is a technology that companies from Apple to Microsoft are interested in, as it could be critical in making digital personal assistants such as Siri or Cortana smarter and more human-like.
The DeepMind announcement provides technical details of how WaveNet works. Basically, the system is a neural network that learns the characteristics of many different voices, male and female. Using what it has learned, it models the raw waveform of the desired output audio signal, one sample at a time. DeepMind notes that training WaveNet on many speakers made it better at modeling a single speaker than training on that speaker alone, suggesting a form of transfer learning. More technical details are given in the research paper "WaveNet: A Generative Model For Raw Audio."
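To make "one sample at a time" concrete, here is a minimal sketch of sample-by-sample (autoregressive) generation. It is not DeepMind's code: the function `toy_next_sample_distribution` is a hypothetical stand-in for the trained neural network, and the number of amplitude levels is an arbitrary toy value. The only point is the loop, in which each new audio sample is drawn from a probability distribution that depends on the samples generated so far.

```python
import math
import random

LEVELS = 16  # number of quantized amplitude levels (toy value, an assumption)


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns a probability for each amplitude level, given the samples
    generated so far. It simply favors levels near a straight-line
    continuation of the last two samples.
    """
    if len(history) >= 2:
        center = 2 * history[-1] - history[-2]
    elif history:
        center = history[-1]
    else:
        center = LEVELS // 2
    center = min(max(center, 0), LEVELS - 1)
    weights = [math.exp(-abs(level - center)) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Generate a waveform one sample at a time."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        # Condition on everything generated so far, then draw the next sample.
        probs = toy_next_sample_distribution(samples)
        next_level = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        samples.append(next_level)
    return samples


if __name__ == "__main__":
    print(generate(20))
```

In a real system the stand-in function would be replaced by the trained network, and the loop would run thousands of times for each second of audio.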
"For both Chinese and English, Google's current TTS systems are considered among the best worldwide, so improving on both with a single model is a major achievement," emphasizes the DeepMind announcement. "WaveNets reduce the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."
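For readers wondering how a figure like "over 50%" is arrived at, the arithmetic can be sketched as follows, assuming speech quality is summarized as a single score (for example, an average listener rating). The function name and the scores in the example are hypothetical and chosen only to make the calculation easy to follow; they are not DeepMind's reported numbers.

```python
def gap_reduction(baseline_score, new_score, human_score):
    """Fraction of the baseline-to-human gap closed by the new system."""
    old_gap = human_score - baseline_score
    new_gap = human_score - new_score
    if old_gap <= 0:
        raise ValueError("baseline must score below human speech")
    return (old_gap - new_gap) / old_gap


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    baseline, new_system, human = 3.0, 4.0, 5.0
    reduction = gap_reduction(baseline, new_system, human)
    print(f"Gap reduced by {reduction:.0%}")
```

With these made-up scores, the old gap is 2.0 and the new gap is 1.0, so the new system closes half of the distance to human speech.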
The path to commercial exploitation of WaveNet technology is clear: First, Alphabet can use its TTS technology to give an edge to its own voice interfaces over the competition. Second, the technology can be licensed to phone companies, car makers, call centers, computer game makers, and all enterprises that need voice-based interfaces. Though this is but a drop in the ocean of Alphabet's activities, it's an important one, which is good news for Alphabet investors.