Neural Diarization with Non-autoregressive Intermediate Attractors
Yusuke Fujita (LINE Corporation); Tatsuya Komatsu (LINE Corporation); Robin Scheibler (LINE Corporation); Yusuke Kida (LINE Corporation); Tetsuji Ogawa (Waseda University)
SPS
End-to-end neural diarization (EEND) with encoder-decoder-based attractors (EDA) is a promising method for handling the whole speaker diarization problem jointly with a single neural network.
While the EEND model can produce all frame-level speaker labels simultaneously, it disregards the dependency between output labels.
In this work, we propose a novel EEND model that introduces the label dependency between frames.
The proposed method generates non-autoregressive intermediate attractors to produce speaker labels at the lower layers and conditions the subsequent layers with these labels.
While the proposed model works in a non-autoregressive manner, the speaker labels are refined by referring to the whole sequence of intermediate labels.
Experiments on the two-speaker CALLHOME dataset show that the intermediate labels produced by the proposed non-autoregressive intermediate attractors boost diarization performance.
The proposed method with the deeper network benefits more from the intermediate labels, resulting in better performance and training throughput than EEND-EDA.
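The conditioning idea described above can be illustrated with a minimal sketch: at intermediate encoder layers, attractors are computed non-autoregressively from the frame embeddings, intermediate speaker posteriors are derived, and these posteriors are projected back and added to the hidden states so that subsequent layers can refer to them. All dimensions, the attention-based attractor computation, and the parameter names below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, S, L = 50, 32, 2, 4  # frames, hidden dim, speakers, encoder layers

# Hypothetical (randomly initialized) parameters for illustration only.
enc_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(L)]
attractor_queries = rng.standard_normal((S, D))  # fixed queries: non-autoregressive
cond_w = rng.standard_normal((S, D)) * 0.1       # projects labels back to hidden dim


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def intermediate_attractors(h):
    # Cross-attention of fixed speaker queries over frame embeddings,
    # producing one attractor per speaker in a single parallel step.
    att = softmax(attractor_queries @ h.T, axis=-1)  # (S, T)
    return att @ h                                   # (S, D)


x = rng.standard_normal((T, D))  # frame-level input features
h = x
for layer, w in enumerate(enc_w):
    h = np.tanh(h @ w + h)            # simplified encoder layer with residual
    if layer < L - 1:                 # intermediate layers: compute and inject labels
        a = intermediate_attractors(h)
        y_int = sigmoid(h @ a.T)      # (T, S) intermediate speaker posteriors
        h = h + y_int @ cond_w        # condition subsequent layers on the labels

a = intermediate_attractors(h)
y = sigmoid(h @ a.T)                  # final frame-level speaker posteriors, (T, S)
print(y.shape)
```

Because the intermediate posteriors cover the entire sequence, later layers can refine each frame's label while attending to the full sequence of intermediate labels, without any autoregressive frame-by-frame decoding.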