ASSD: SYNTHETIC SPEECH DETECTION IN THE AAC COMPRESSED DOMAIN
Amit Kumar Singh Yadav (Purdue University); Ziyue Xiang (Purdue University); Emily Bartusiak (Purdue University); Paolo Bestagini (Politecnico di Milano); Stefano Tubaro (Politecnico di Milano); Edward Delp (Purdue University)
Synthetic human speech signals have become very easy to generate with modern text-to-speech methods. When these signals are shared on social media, they are often compressed using the Advanced Audio Coding (AAC) standard. Our goal is to study whether a small set of coding metadata contained in the AAC compressed bit stream is sufficient to detect synthetic speech, which would avoid decompressing the speech signals before analysis. We call our proposed method AAC Synthetic Speech Detection (ASSD). ASSD extracts information from the AAC compressed bit stream without decompressing the speech signal and analyzes this information using a transformer neural network. In our experiments, we compressed the ASVspoof2019 dataset according to the AAC standard at different data rates. We compared the performance of ASSD to a time-domain-based and a spectrogram-based synthetic speech detection method, evaluating ASSD on approximately 71k compressed speech signals. The results show that our proposed method typically requires only 1000 bits per speech block/frame from the AAC compressed bit stream to detect synthetic speech, which is much lower than other reported methods. Our method also achieved a detection accuracy 9.7 percentage points higher than existing methods.
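The pipeline the abstract describes, roughly 1000 metadata bits per AAC frame embedded as tokens and classified by a transformer, can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the bit-embedding step, the single self-attention layer with random weights, and the mean-pooled two-class output are all illustrative assumptions standing in for the trained ASSD network.

```python
import math
import random

random.seed(0)

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(Q[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d_k) for kr in K]
              for qr in Q]
    A = [softmax(row) for row in scores]
    return matmul(A, V)

def rand_matrix(rows, cols, scale):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

# Hypothetical dimensions: each AAC frame contributes ~1000 metadata bits,
# which are embedded into a small d_model-dimensional token.
num_frames, bits_per_frame, d_model = 4, 1000, 8
frames = [[random.randint(0, 1) for _ in range(bits_per_frame)]
          for _ in range(num_frames)]  # stand-in for parsed bit-stream metadata

W_embed = rand_matrix(bits_per_frame, d_model, 0.02)
Wq = rand_matrix(d_model, d_model, 0.1)
Wk = rand_matrix(d_model, d_model, 0.1)
Wv = rand_matrix(d_model, d_model, 0.1)
W_out = rand_matrix(d_model, 2, 0.1)  # two classes: bona fide vs. synthetic

tokens = matmul(frames, W_embed)                 # (num_frames, d_model)
attended = self_attention(tokens, Wq, Wk, Wv)
pooled = [sum(col) / num_frames for col in zip(*attended)]  # mean-pool frames
probs = softmax(matmul([pooled], W_out)[0])      # class probabilities
print(len(probs), round(sum(probs), 6))          # prints: 2 1.0
```

In practice the metadata bits would come from parsing the AAC bit stream directly (no audio decoding), and the weights would be learned; the sketch only shows how per-frame bit vectors become a token sequence that a transformer can classify.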