Machine learning predicts emotion from voice in 1.5 seconds with human-like accuracy
Researchers from Germany have developed machine learning models that can recognise emotions in short voice snippets lasting just 1.5 seconds with accuracy comparable to humans.
Here's What We Know
In a new study published in the journal Frontiers in Psychology, researchers compared three types of models: deep neural networks (DNNs), convolutional neural networks (CNNs) and a hybrid model (C-DNN).
The models were trained on German and Canadian datasets of meaningless sentences spoken by actors in different emotional tones, so that language and semantic meaning could not influence recognition.
"Our models achieved an accuracy similar to humans when categorizing meaningless sentences with emotional coloring spoken by actors," said lead author Hannes Diemerling of the Max Planck Institute for Human Development.
The researchers found that DNNs and the hybrid C-DNN, which combines audio with visual spectrogram data, performed better than CNNs using spectrograms alone. Overall, all models recognised emotions with accuracy well above chance.
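To give a sense of the spectrogram input the CNNs work from, here is a minimal NumPy sketch of turning a 1.5-second audio clip into a magnitude spectrogram via a short-time Fourier transform. The window and hop sizes are illustrative assumptions, not the study's actual settings, and the clip is a synthetic tone rather than real speech.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via a short-time Fourier transform.

    frame_len=400 and hop=160 correspond to 25 ms windows with a
    10 ms hop at a 16 kHz sample rate (illustrative values only).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One row per time frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

# A 1.5-second clip at 16 kHz -- here just a 440 Hz sine tone.
sr = 16000
t = np.arange(int(1.5 * sr)) / sr
clip = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(clip)
print(spec.shape)  # (time frames, frequency bins)
```

A CNN would then treat this 2-D array like an image; the study's actual preprocessing pipeline is not detailed in the article.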
Diemerling said the comparable performance of humans and AI models could mean that both rely on similar patterns in sound to detect emotional subtext.
The researchers noted that such systems could find applications in fields that require emotion interpretation, such as therapy or communication technology. However, further research is needed into the optimal duration of audio clips and into recognising spontaneous, rather than acted, emotional expressions.
Source: TechXplore