Hearing and Seeing Abnormality: Self-supervised Audio-Visual Mutual Learning for Deepfake Detection
ChangSung Sung (National Taiwan University); Jun-Cheng Chen (Academia Sinica); Chu-Song Chen (National Taiwan University)
The recent development of deepfakes poses serious threats to society, such as the spread of misinformation and defamation. Although recent deepfake detection methods achieve satisfactory results on seen forgeries, their performance drops significantly on unseen ones. Supervised pretraining on auxiliary tasks can serve as a prior and improve the situation, but the need to collect a large number of additional annotations for these tasks may restrict the further development of a generalized deepfake detector. To address this issue, we propose an Audio-Visual Temporal Synchronization for Deepfake Detection framework that maintains reasonable detection capability on unseen forgeries. The primary objective of our framework is to determine whether a video clip has been forged by evaluating the consistency between the sound and the faces in the clip, together with the relationship between the two modalities' features. First, a spatio-temporal feature extraction network is pretrained in a self-supervised manner on an audio-visual temporal synchronization task, building a rich representation from the temporal alignment between the audio track and its corresponding video. Pretraining uses only real data, with carefully selected negative samples and a contrastive loss. A temporal classifier network then determines whether the video has been manipulated, using the representations produced by the pretrained feature extraction networks. To prevent the model from overfitting to manipulation-specific artifacts, we freeze the feature extraction networks and train only the final classifier on forged data. Extensive experiments on unseen forgery categories and unseen datasets demonstrate the effectiveness of our method, which achieves state-of-the-art results.