Large-Scale Nonverbal Vocalization Detection Using Transformers
Panagiotis Tzirakis (Hume AI); Alice Baird (Hume AI); Jeff Brooks (Hume AI); Chris Gagne (Hume.ai); Lauren Kim (Hume AI); Michael Opara (Hume AI); Christopher Gregory (Hume AI); Jacob Metrick (Hume AI); Garrett Boseck (Hume AI); Vineet Tiruvadi (Hume AI); Bjoern W. Schuller (Imperial College London); Dacher Keltner (UC Berkeley); Alan S Cowen (Hume AI)
Detecting emotionally expressive nonverbal vocalizations is essential to developing technologies that can converse fluently with humans. The affective computing community has largely focused on understanding the intonation of emotional speech and language. However, advances in the study of vocal emotional behavior suggest that emotions may be more readily conveyed not by speech but by nonverbal vocalizations such as laughs, sighs, shrieks, and grunts, which often occur in lieu of speech. The task of detecting such emotional vocalizations has been largely overlooked by researchers, likely due to the limited availability of data capturing a sufficiently wide variety of vocalizations; most studies in the literature focus on detecting laughter or crying. In this paper, we present, to the best of our knowledge, the first nonverbal vocalization detection model trained to detect 67 types of emotional vocalizations. We train on the large-scale, in-the-wild Hume-VB dataset, which provides more than 156 hours of data. We thoroughly investigate pre-trained audio transformer models such as Wav2Vec2 and Whisper, and provide useful insights for the task using different types of noise signals.
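To illustrate the general approach the abstract describes (a pre-trained audio transformer repurposed for vocalization detection), the sketch below shows one plausible setup: a frozen Wav2Vec2 encoder from the Hugging Face `transformers` library with a lightweight multi-label classification head over 67 vocalization types. The checkpoint name, the frozen-encoder choice, mean pooling, and the multi-label framing are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a transformer-based nonverbal vocalization detector.
# Assumptions (not from the paper): the facebook/wav2vec2-base checkpoint,
# a frozen encoder, mean pooling over time, and multi-label classification.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

NUM_CLASSES = 67  # number of vocalization types reported in the paper


class VocalizationDetector(nn.Module):
    def __init__(self, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.encoder.requires_grad_(False)  # keep the pre-trained encoder frozen
        self.head = nn.Linear(self.encoder.config.hidden_size, NUM_CLASSES)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) raw audio at 16 kHz
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                        # average over time frames
        return self.head(pooled)                           # per-class logits


model = VocalizationDetector()
audio = torch.randn(2, 16000)        # two 1-second dummy clips
probs = torch.sigmoid(model(audio))  # independent per-class detection scores
print(probs.shape)                   # torch.Size([2, 67])
```

A sigmoid over independent logits (rather than a softmax) reflects the multi-label assumption that several vocalization types can co-occur in one clip; a single-label variant would swap in a softmax and cross-entropy loss.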