Meta has unveiled a dataset that allows speech recognition systems to be trained on "clusters" of utterances from different speakers
Meta AI has unveiled a new dataset that promises to improve the performance of automatic speech recognition (ASR) tools by clustering utterances rather than sorting speakers by demographics.
Here's What We Know
Many datasets used to train ASR models are organised by demographics: age group, gender, nationality, and English accent. This limits the variety of pronunciations the algorithms are trained on and prevents them from understanding a wide range of users.
To get around this problem, Meta AI has developed a dataset that relies on an utterance clustering method. Each cluster contains similar phrases spoken by different people, so the ASR model learns to recognise the same utterance across many voices.
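Meta has not published the code behind this grouping, but the general idea can be sketched with off-the-shelf tools: represent each utterance as a vector and cluster the vectors so that similar phrases from different speakers land in the same bucket. In the hypothetical sketch below, TF-IDF vectors and k-means stand in for whatever representation and algorithm Meta actually used; the sample utterances, speaker IDs, and cluster count are all invented for illustration.

```python
# Illustrative sketch only; Meta has not published its clustering pipeline.
# Hypothetical command utterances are embedded as TF-IDF vectors and grouped
# with k-means, so similar phrases from different speakers share a cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented (speaker_id, utterance) pairs standing in for the real corpus.
samples = [
    ("spk_01", "play some jazz music"),
    ("spk_02", "play a jazz song for me"),
    ("spk_03", "take a photo"),
    ("spk_04", "take a picture of us"),
    ("spk_05", "call mom"),
    ("spk_06", "make a phone call to mom"),
]

texts = [utterance for _, utterance in samples]
vectors = TfidfVectorizer().fit_transform(texts)

# The real cluster count is not disclosed; 3 matches this toy data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for (speaker, utterance), label in zip(samples, labels):
    print(f"cluster {label} | {speaker}: {utterance!r}")
```

Any sentence embedding could replace TF-IDF here; the point is only that clusters group utterances by what is said, not by who says it.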
Meta's final dataset includes just over 27,000 command utterances collected from 595 US volunteers. The phrases cover seven main topics: music, taking pictures, utilities, managing notifications, messaging, making calls, and dictation.
As prompts, speakers were asked how they would use voice commands to search for a song or make plans with friends.
Results from testing the dataset were promising: the model's performance improved "on all demographic groups [...], though by far the largest gains are with respect to more inclusivity of accents," Meta's blog post said.
Overall, the clustering method improved ASR performance by 10%. Significant gains were also made for the 66 to 85 age group, which is traditionally underrepresented in the voice command space.
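The post does not specify which metric the 10% figure refers to; in ASR work, such gains are usually quoted as relative word error rate (WER) reduction. The toy calculation below, using the open-source jiwer library and invented transcripts, shows how that arithmetic works; the numbers are illustrative, not Meta's.

```python
# Hedged illustration of relative WER improvement; transcripts are invented.
from jiwer import wer  # pip install jiwer

reference = "take a picture of us and send it to mom"
baseline_hyp = "take a picture of us and send it too mum"   # 2 word errors
clustered_hyp = "take a picture of us and send it to mum"   # 1 word error

baseline_wer = wer(reference, baseline_hyp)    # 2/10 = 0.20
clustered_wer = wer(reference, clustered_hyp)  # 1/10 = 0.10

# Relative improvement: (old - new) / old.
gain = (baseline_wer - clustered_wer) / baseline_wer
print(f"baseline WER {baseline_wer:.2f} -> clustered WER {clustered_wer:.2f}"
      f" ({gain:.0%} relative improvement)")
```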
Source: Meta AI