WALL-E meets its match: Microsoft researchers unveil VALL-E

Artificial intelligence (AI) is big business, so it is little surprise innovators are continuing to explore the outer limits of this technology.

Much as WALL-E – the much-loved garbage robot left on Earth to clean up long after we all jetted off to space – embedded itself in the human psyche via the medium of cute film-based shenanigans, Microsoft’s VALL-E is attempting something similar with our speech patterns and, more specifically, with mimicking the human voice.

Although AI that can imitate our speech already exists, Microsoft appears to have developed technology that can achieve this speedily and with very little training – as little as three seconds of recorded speech is needed for VALL-E to continue a sentence in the speaker’s voice.

“We introduce a language modeling approach for text to speech synthesis (TTS),” announced Microsoft in a statement. “During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.”
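The pipeline that quote describes – a language model over discrete acoustic tokens, conditioned on the target text plus a short enrolled recording – can be sketched in toy form. Everything below is illustrative stand-in code, not Microsoft’s actual API; the names, the 75-tokens-per-second codec rate and the tokens-per-phoneme rule are assumptions made purely for the sake of the example.

```python
# Toy sketch of a VALL-E-style zero-shot TTS pipeline.
# All functions are hypothetical stand-ins, not a real library.

def phonemize(text):
    """Stand-in for a grapheme-to-phoneme front end."""
    return list(text.lower())

def encode_audio(seconds):
    """Stand-in for a neural audio codec that turns a waveform into
    discrete tokens. The 75 tokens/second rate is an assumption:
    a 3-second enrolled prompt would yield ~225 tokens."""
    tokens_per_second = 75
    return [0] * int(seconds * tokens_per_second)

def generate_acoustic_tokens(phonemes, prompt_tokens):
    """Stand-in for the autoregressive language model: it conditions
    on the text phonemes plus the acoustic prompt tokens and emits
    new acoustic tokens in the prompt speaker's voice."""
    # Toy rule: a fixed number of output tokens per phoneme.
    return [1] * (len(phonemes) * 8)

# The "3-second enrolled recording of an unseen speaker" acts as
# an in-context acoustic prompt; no fine-tuning is involved.
prompt = encode_audio(3)
phonemes = phonemize("Hello from VALL-E")
output = generate_acoustic_tokens(phonemes, prompt)
# A real system would then decode `output` back to a waveform.
```

The point of the structure is that speaker identity travels entirely through the prompt tokens, which is why a few seconds of audio suffice where older systems needed hours of per-speaker training data.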

The results – samples of which Microsoft has published online – are mixed, but they certainly introduce an interesting scenario in which TTS and AI can be combined to help in education, entertainment or business, particularly for those with vision impairment or literacy issues.

As always with emerging technology, there are ethical considerations: deepfake video is becoming more mainstream, and there are multiple debates to be had about permission and misrepresentation.

Microsoft does attempt to address this issue, however, suggesting that “when the model is generalized to unseen speakers, relevant components should be accompanied by speech editing models, including the protocol to ensure that the speaker agrees to execute the modification and the system to detect the edited speech.”

It is estimated that the global AI market will reach half a trillion US dollars in 2023, and its growth shows no sign of abating.

Alex Jenkins, managing partner at Contagious, understands the need for circumspection when it comes to AI. “The year will see any number of opinions and chin-stroking thought leadership teams on the possibilities of AI creativity, with people touting the near-magical capabilities of text, image and video generators. The industry will debate whether it’s an existential threat or a boondoggle distracting from the real business of selling stuff. (FWIW – little bit of column A, little bit of column B).”

Scott Brinker – known as “Mr Martech”, and now VP of platform ecosystem at HubSpot – agrees. “Consider this: who owns a novel written by an AI machine? Especially since AI-generated content is nearly identical to human-generated content. We have no idea how much of the content we consume on a daily basis is generated by AI.

“Businesses should ask questions as with any new technology, but this is one to be embraced. However, it is critical to understand that AI should serve as a supplement to humans rather than as a replacement. And it should be checked by a human.”