Meta, one of the leading players in AI technology, has unveiled its latest creation, Voicebox. According to Meta, this text-guided generative speech model outperforms prior state-of-the-art systems and can generate voices as easily as ChatGPT generates text or Bing and DALL-E 2 create images. Although it is not yet available for public use, demos of Voicebox are accessible to anyone interested in learning more about the technology.
Voicebox could be used in audio editing by content creators and editors, since it generates natural-sounding audio clips. It can edit noise, such as a dog barking, out of a voice clip and regenerate the affected speech without missing a beat. One of its most impressive abilities is matching the audio style of a sample when generating text-to-speech clips. This means that visually impaired users could give Voicebox an audio clip of a friend as short as two seconds, and it could then read that friend’s written messages aloud in the friend’s voice.
The new generative AI tool can solve tasks via in-context learning: it can take text it has never been given before and produce speech with appropriate context and inflection, much as a person reading aloud draws on existing knowledge to tackle a new challenge. However, the ethical and legal implications of this tool are not easily dismissed. Anyone could use recordings of a person’s voice, without permission, to generate audio clips in which that person appears to say anything the creator wants. In the published paper, Meta claims that a binary classification model can distinguish between real-world speech and speech that Voicebox generates. Nevertheless, since the system is not publicly available, Meta’s metaphorical feet are yet to be held to the fire.
Meta trained Voicebox on 60,000 hours of English audiobooks and 50,000 hours of multilingual audiobooks in six languages. This training enables it to perform multilingual text-to-speech, speech denoising, style matching, and editing without task-specific training, and to generate diverse speech samples. In a paper published by Meta AI, the company claims that Voicebox generates audio up to 20 times faster than Microsoft’s VALL-E while producing more intelligible speech.
Compared with YourTTS, the previous state-of-the-art model for cross-lingual style transfer, Voicebox reduced the average word error rate from 10.9% to 5.2% and increased the audio similarity score from 0.335 to 0.481. Meta also claims that Voicebox, aside from being faster and making fewer errors than competitors, can convert written text into spoken words in one or multiple languages without being trained separately for each language.
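For readers unfamiliar with the metric, the sketch below shows the standard textbook definition of word error rate (word-level edit distance divided by the number of reference words) and works out the relative size of the reported improvements. It is illustrative only and is not Meta's evaluation code; the function names and the sample sentences are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_change(before: float, after: float) -> float:
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Hypothetical transcript pair: one substitution and one dropped word
# out of seven reference words.
sample_wer = word_error_rate(
    "the dog barked at the mail carrier",
    "the dog bark at mail carrier",
)
print(f"Sample WER: {sample_wer:.1%}")  # Sample WER: 28.6%

# Figures quoted in the article (YourTTS vs. Voicebox).
print(f"WER change: {relative_change(10.9, 5.2):+.1%}")            # WER change: -52.3%
print(f"Similarity change: {relative_change(0.335, 0.481):+.1%}")  # Similarity change: +43.6%
```

In other words, the reported figures amount to roughly halving the word error rate while raising the similarity score by a little over 40 percent relative to YourTTS.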
The potential applications of Voicebox are vast and varied. It could be used to create more natural-sounding audio for audiobooks, podcasts, and other audio content. It could also be used to create more realistic-sounding virtual assistants and chatbots. However, the technology’s potential for misuse and abuse cannot be ignored. As with any new technology, it is essential to consider the ethical and legal implications before widespread adoption.
In conclusion, Meta’s Voicebox is a groundbreaking AI technology that has the potential to revolutionize the way we create and interact with audio content. Its ability to generate natural-sounding voices and match the audio style of a sample is impressive, and Meta reports that it outperforms existing models. However, the ethical and legal implications of this technology must be considered before it is widely adopted.