Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection
Xiaohui Liu (Tianjin University, Tianjin, China); Meng Liu (Tianjin University); Longbiao Wang (Tianjin University); Kong Aik Lee (Institute for Infocomm Research, ASTAR); Hanyi Zhang (Tianjin University); Jianwu Dang (Tianjin University)
Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. Since synthetic speech exhibits both local and global artifacts compared to natural speech, incorporating local-global dependency can lead to better anti-spoofing performance. To this end, we propose the Rawformer, which leverages positional-related local-global dependency for synthetic speech detection. Our method uses two-dimensional convolution to capture local dependency and a Transformer to capture global dependency. Specifically, we design a novel positional aggregator that integrates local-global dependency by adding positional information and applying a flattening strategy with little information loss. Furthermore, we propose the squeeze-and-excitation Rawformer (SE-Rawformer), which introduces a squeeze-and-excitation operation to better capture local dependency. The results demonstrate that the proposed SE-Rawformer yields a 37% relative improvement over the state-of-the-art single system on ASVspoof 2019 LA and generalizes well to ASVspoof 2021 LA. In particular, using the positional aggregator in the SE-Rawformer brings a 43% improvement on average.
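As a rough illustration of the two ingredients named in the abstract, the sketch below shows (a) a squeeze-and-excitation channel gating applied to a (channels, time, freq) convolutional feature map, and (b) a positional aggregator that flattens the map into a token sequence and adds standard sinusoidal positional encodings before a Transformer. This is a minimal NumPy toy, not the authors' implementation; all function names, weight shapes, and the choice of sinusoidal encoding are our assumptions for illustration.

```python
import numpy as np

def squeeze_excitation(x, w1, w2):
    """Channel-wise squeeze-and-excitation on a (C, T, F) feature map.
    w1/w2 are bottleneck FC weights (illustrative, not from the paper)."""
    # Squeeze: global average pooling over the time-frequency plane
    z = x.mean(axis=(1, 2))                      # (C,)
    # Excitation: bottleneck MLP, ReLU then sigmoid gating
    s = np.maximum(w1 @ z, 0.0)                  # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))          # (C,) in (0, 1)
    # Re-scale each channel by its learned importance
    return x * s[:, None, None]

def sinusoidal_pe(seq_len, dim):
    """Standard sinusoidal positional encodings, shape (seq_len, dim)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def positional_aggregate(x):
    """Flatten a (C, T, F) conv feature map into a (T*F, C) token
    sequence and add positional information, so the Transformer sees
    where each local feature came from (a sketch of the idea only)."""
    C, T, F = x.shape
    tokens = x.reshape(C, T * F).T               # (T*F, C)
    return tokens + sinusoidal_pe(T * F, C)

# Toy usage: 8 channels, reduction ratio r = 2
rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.standard_normal((C, 4, 4))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = squeeze_excitation(x, w1, w2)
print(y.shape)       # (8, 4, 4) -- same shape, channels re-weighted

seq = positional_aggregate(np.zeros((4, 3, 2)))
print(seq.shape)     # (6, 4) -- 6 tokens of dimension 4
```

Because the sigmoid gate lies in (0, 1), squeeze-and-excitation can only attenuate channels, never amplify them, which is the usual channel-recalibration behavior; the aggregator sketch shows only the flatten-plus-position step, with the Transformer encoder itself omitted.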