Neural Architecture of Speech
Subba Reddy Oota (IIIT Hyderabad); Khushbu Pahwa (University of California Los Angeles); Mounika Marreddy (IIIT Hyderabad); Manish Gupta (Microsoft); Raju Surampudi Bapi (International Institute of Information Technology Hyderabad)
A vast literature on brain encoding has effectively harnessed deep neural network models to accurately predict brain activations from visual or text stimuli. Unfortunately, there is little work on brain encoding for speech stimuli. The few existing studies on brain encoding for speech stimuli transcribe speech to text and then leverage text-only models for encoding, thereby ignoring the audio signal entirely. Recently, however, several speech representation learning models have revolutionized the field of speech processing. Inspired by this progress on deep learning models for speech, we present the first systematic study of human speech processing that probes neural speech models to predict activations in both language and auditory brain regions. In particular, we investigate 30 speech representation models grouped into four categories: (i) traditional feature engineering, (ii) generative, (iii) predictive, and (iv) contrastive, to study how these models encode the speech stimuli and align with human brain activity on the Moth Radio Hour fMRI (functional magnetic resonance imaging) dataset. We find that both contrastive (Wav2Vec2.0) and predictive (HuBERT, Data2Vec) models yield highly accurate brain activation predictions. Specifically, Data2Vec aligns best with both language and auditory brain regions among all investigated models. We make our code publicly available.
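To make the brain-encoding setup concrete, below is a minimal sketch of the general approach described in the abstract, not the authors' released pipeline: extract representations from a pretrained speech model (Wav2Vec2.0 via HuggingFace Transformers is used here as an example) for TR-aligned audio windows, then fit a ridge regression that maps those features to fMRI voxel responses and score it with voxelwise correlations. The checkpoint name, TR length, voxel count, and the random audio/fMRI arrays are illustrative assumptions; in the actual study the stimuli and responses come from the Moth Radio Hour fMRI dataset.

```python
# Hedged sketch of a speech-to-fMRI encoding model; data and dimensions are synthetic stand-ins.
import numpy as np
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

model_name = "facebook/wav2vec2-base"          # assumed checkpoint for illustration
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

sr = 16_000                                    # Wav2Vec2.0 expects 16 kHz audio
n_trs, tr_sec, n_voxels = 50, 2.0, 1000        # hypothetical fMRI dimensions
audio = np.random.randn(int(n_trs * tr_sec * sr)).astype(np.float32)  # stand-in for real speech

# One feature vector per TR: mean-pool the model's hidden states within each TR window.
features = []
with torch.no_grad():
    for t in range(n_trs):
        chunk = audio[int(t * tr_sec * sr): int((t + 1) * tr_sec * sr)]
        inputs = extractor(chunk, sampling_rate=sr, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state        # shape (1, frames, 768)
        features.append(hidden.mean(dim=1).squeeze(0).numpy())
X = np.stack(features)                                    # (n_trs, 768)

# Stand-in voxel responses; in the real study these are the recorded fMRI signals.
Y = np.random.randn(n_trs, n_voxels)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
encoder = Ridge(alpha=1.0).fit(X_tr, Y_tr)

# Encoding accuracy: Pearson correlation between predicted and observed
# responses, computed per voxel on held-out TRs.
pred = encoder.predict(X_te)
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(n_voxels)]
print(f"mean voxelwise correlation: {np.mean(r):.3f}")
```

Swapping `Wav2Vec2Model` for another pretrained speech model (e.g., HuBERT or Data2Vec checkpoints in the same library) and comparing the resulting voxelwise correlations is, in spirit, how the different representation families can be compared against brain activity.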