Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection

Xiaohui Liu (Tianjin University, Tianjin, China); Meng Liu (Tianjin University); Longbiao Wang (Tianjin University); Kong Aik Lee (Institute for Infocomm Research, A*STAR); Hanyi Zhang (Tianjin University); Jianwu Dang (Tianjin University)

SPS

07 Jun 2023

Automatic speaker verification (ASV) systems are vulnerable to spoofing attacks. Because synthetic speech exhibits both local and global artifacts relative to natural speech, modeling local-global dependency should improve anti-spoofing performance. To this end, we propose Rawformer, which leverages positional-related local-global dependency for synthetic speech detection. Our method uses two-dimensional convolution to capture local dependency and a Transformer to capture global dependency. Specifically, we design a novel positional aggregator that integrates local-global dependency by adding positional information and applying a flattening strategy that incurs less information loss. Furthermore, we propose the squeeze-and-excitation Rawformer (SE-Rawformer), which introduces a squeeze-and-excitation operation to better capture local dependency. The results demonstrate that the proposed SE-Rawformer achieves a 37% relative improvement over the single state-of-the-art system on ASVspoof 2019 LA and generalizes well to ASVspoof 2021 LA. In particular, using the positional aggregator in the SE-Rawformer brings a 43% improvement on average.
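To make the two building blocks named in the abstract more concrete, here is a minimal pure-Python sketch of a squeeze-and-excitation gate followed by a "flatten and add positions" step. The abstract does not give the internals of the positional aggregator, so everything specific here is an assumption for illustration only: the function names, the toy weights `w1` and `w2`, the row-major flattening order, and the use of the standard sinusoidal Transformer encoding. The convolutional front end and the Transformer itself are omitted.

```python
import math

def squeeze_excite(x, w1, w2):
    """Channel gating on a feature map x given as nested lists [C][F][T].

    w1 ([H][C]) and w2 ([C][H]) are hypothetical bottleneck weights.
    """
    # Squeeze: global average over frequency and time, one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: bottleneck layer with ReLU, then one sigmoid gate per channel.
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    g = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: reweight every channel by its gate.
    return [[[g[c] * v for v in row] for row in ch] for c, ch in enumerate(x)]

def sinusoidal_position(pos, dim):
    """Standard sinusoidal Transformer encoding for one position (assumed here)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]

def flatten_with_positions(x):
    """Turn [C][F][T] into F*T tokens of size C and add a positional encoding.

    Row-major (frequency, then time) ordering is an assumption of this sketch.
    """
    C, F, T = len(x), len(x[0]), len(x[0][0])
    tokens = []
    for f in range(F):
        for t in range(T):
            pe = sinusoidal_position(f * T + t, C)
            tokens.append([x[c][f][t] + pe[c] for c in range(C)])
    return tokens

# Toy input: C=2 channels, F=2 frequency bins, T=2 frames.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
w1 = [[0.5, -0.5]]     # hypothetical weights, hidden size H=1
w2 = [[1.0], [-1.0]]   # hypothetical weights
y = squeeze_excite(x, w1, w2)
tokens = flatten_with_positions(y)  # sequence a Transformer could consume
```

In this toy setup the gate rescales each channel by a value between 0 and 1, and the flattened output keeps one token per time-frequency position, so no positions are pooled away before the global (Transformer) stage. Whether this matches the paper's flattening strategy is not stated in the abstract.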

Tags:

Speaker verification and anti-spoofing

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

Improving Transformer-Based Networks with Locality for Automatic Speaker Verification

Predictive SkiM: Contrastive Predictive Coding for Low-Latency Online Speech Separation

Universal Speaker Recognition Encoders for Different Speech Segments Duration