Time-domain speech separation networks with graph encoding auxiliary
Wang Tingting (Nanjing University of Posts and Tel); Zexu Pan (National University of Singapore); Meng Ge (Tianjin University); Zhen Yang (Nanjing University of Posts and Telecommunication); Haizhou Li (The Chinese University of Hong Kong, Shenzhen)
SPS
End-to-end time-domain speech separation with a masking strategy has shown a clear performance advantage. In such systems, a 1-D convolutional layer serves as the speech encoder, mapping each sliding window of the waveform to a latent feature representation, i.e., an embedding vector. A large window gives low temporal resolution, whereas a small window gives high resolution at the expense of high computational cost. In this work, we propose a graph encoding technique that models the fine structural knowledge of the speech samples within a window of reasonable size. Specifically, we build a graph representation for each latent representation and encode its structural details with a graph convolutional network encoder. The encoded graph feature representation complements the original latent feature representation and benefits the separation and reconstruction of speech. Experiments on various models and datasets show that the proposed encoding technique significantly improves speech quality over other time-domain speech encoders.
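As a rough illustration of the idea, the sketch below pairs a 1-D convolutional encoder with a single graph convolutional layer applied to the samples of each window. It is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify how the graph is built or how the two representations are combined, so the path graph linking adjacent samples, the mean pooling, the fusion by concatenation, the random untrained weights, and all function names and sizes here are hypothetical choices made for illustration only.

```python
import math
import random

def frames(signal, win, hop):
    """Slice a waveform into overlapping windows (the sliding window)."""
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def matvec(rows, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in rows]

def relu(vec):
    return [max(0.0, v) for v in vec]

def conv_encode(frame, filters):
    """1-D conv encoder: one window of samples -> one latent embedding vector."""
    return relu(matvec(filters, frame))

def normalized_adjacency(n):
    """A_hat = D^-1/2 (A + I) D^-1/2 for a path graph over n samples.
    ASSUMPTION: edges link adjacent samples; the source does not say this."""
    a = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def gcn_encode(frame, a_hat, weights):
    """One GCN layer, relu(A_hat X W), followed by mean pooling over nodes.
    Each node carries one scalar feature (its sample value), so W is 1 x d."""
    aggregated = matvec(a_hat, frame)  # neighbourhood aggregation: A_hat X
    node_feats = [relu([h * w for w in weights]) for h in aggregated]
    dim = len(weights)
    return [sum(nf[k] for nf in node_feats) / len(node_feats) for k in range(dim)]

def encode(signal, win=8, hop=4, latent_dim=6, graph_dim=3, seed=0):
    """Return one fused embedding per window.
    ASSUMPTION: fusion is plain concatenation of the two feature vectors."""
    rng = random.Random(seed)
    # Random, untrained weights stand in for learned parameters.
    filters = [[rng.uniform(-1, 1) for _ in range(win)] for _ in range(latent_dim)]
    gcn_w = [rng.uniform(-1, 1) for _ in range(graph_dim)]
    a_hat = normalized_adjacency(win)
    return [conv_encode(f, filters) + gcn_encode(f, a_hat, gcn_w)
            for f in frames(signal, win, hop)]

# Toy input: a short sine wave standing in for a speech waveform.
signal = [math.sin(0.3 * t) for t in range(32)]
embeddings = encode(signal)
```

In this sketch each window yields an embedding of length `latent_dim + graph_dim`, which a downstream mask estimator could consume in place of the plain convolutional embedding. In a trained system the filters and graph weights would be learned jointly with the separator rather than drawn at random.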