BTS-E: Audio Deepfake Detection using Breathing-Talking-Silence Encoder

Thien-Phuc Doan (Soongsil University); Long Nguyen-Vu (Soongsil University); Souhwan Jung (Soongsil University); Kihun Hong (Soongsil University)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
06 Jun 2023

Voice phishing is becoming increasingly common as speech synthesis technology advances. Deep learning can now generate audio clips with arbitrary content that imitate a victim’s voice, and such clips are difficult for both humans and automatic speaker verification (ASV) systems to distinguish from genuine speech. Countermeasure systems have recently been developed to help ASV systems combat synthetic speech. In this work, we propose BTS-E, a framework that evaluates the correlation between breathing, talking, and silence sounds in an audio clip and uses this information for deepfake detection. We argue that natural human sounds, such as breathing, are hard for text-to-speech (TTS) systems to synthesize. To validate this hypothesis, we conducted an evaluation on the ASVspoof 2019 and 2021 datasets. The experimental results show that the breathing sound feature is applicable to detecting deepfake voices. Overall, the proposed system significantly improves classifier performance, by up to 46%.