Synthesizing Speech from ECoG with a Combination of Transformer-based Encoder and Neural Vocoder
Kai Shigemi (Tokyo University of Agriculture and Technology); Shuji Komeiji (Tokyo University of Agriculture and Technology); Takumi Mitsuhashi (Juntendo University School of Medicine); Yasushi Iimura (Juntendo University School of Medicine); Hiroharu Suzuki (Juntendo University School of Medicine); Hidenori Sugano (Juntendo University School of Medicine); Koichi Shinoda (Tokyo Institute of Technology); Kohei Yatabe (Tokyo University of Agriculture and Technology); Toshihisa Tanaka (Tokyo University of Agriculture and Technology)
This paper reports a novel invasive brain-computer interface (BCI) paradigm that successfully reconstructed spoken sentences from electrocorticograms (ECoGs) using deep neural network (DNN)-based encoders and a pre-trained neural vocoder.
We recorded ECoGs while 13 participants spoke short sentences.
Our BCI estimates a mapping from the ECoG recordings to the log-mel spectrograms of the spoken sentences using either a bidirectional long short-term memory (BLSTM) network or a Transformer. The estimated log-mel spectrograms were then fed into Parallel WaveGAN to synthesize speech waveforms.
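The sketch below illustrates the general shape of such an ECoG-to-spectrogram encoder; it is not the authors' implementation. All dimensions, layer counts, and names (`N_ELECTRODES`, `N_MELS`, `EcogToMelTransformer`) are illustrative assumptions, and the vocoder step is only indicated in a comment.

```python
# Minimal sketch of a Transformer encoder mapping ECoG frames to log-mel frames.
# Hyperparameters are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

N_ELECTRODES = 64   # assumed number of ECoG channels
N_MELS = 80         # assumed log-mel spectrogram dimensionality
D_MODEL = 256       # assumed Transformer hidden size

class EcogToMelTransformer(nn.Module):
    """Maps a sequence of ECoG feature frames to log-mel spectrogram frames."""
    def __init__(self):
        super().__init__()
        self.input_proj = nn.Linear(N_ELECTRODES, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=4, dim_feedforward=1024, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.output_proj = nn.Linear(D_MODEL, N_MELS)

    def forward(self, ecog):  # ecog: (batch, time, N_ELECTRODES)
        h = self.input_proj(ecog)
        h = self.encoder(h)
        return self.output_proj(h)  # (batch, time, N_MELS)

# Example: one 3-second utterance at an assumed 100 frames per second.
model = EcogToMelTransformer()
ecog = torch.randn(1, 300, N_ELECTRODES)
log_mel = model(ecog)
# log_mel would be passed to a pre-trained neural vocoder (Parallel WaveGAN
# in the paper) to synthesize the speech waveform.
```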
Evaluating the models with mean squared error (MSE) loss and Pearson correlation revealed that the Transformer differed significantly from the BLSTM on both metrics (Wilcoxon signed-rank test, p < 0.001).
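As a rough illustration of this evaluation protocol, the following sketch computes per-utterance MSE and Pearson correlation between predicted and reference spectrograms and compares two models with a paired Wilcoxon signed-rank test. The score arrays here are hypothetical placeholders, not the paper's results.

```python
# Sketch of the evaluation: per-utterance MSE and Pearson correlation,
# compared across models with a Wilcoxon signed-rank test.
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def mse(pred, ref):
    """Mean squared error between two spectrograms of equal shape."""
    return float(np.mean((pred - ref) ** 2))

def pearson(pred, ref):
    """Pearson correlation over the flattened spectrograms (one common convention)."""
    r, _ = pearsonr(pred.ravel(), ref.ravel())
    return float(r)

# Hypothetical per-utterance MSE scores for the two models.
rng = np.random.default_rng(0)
transformer_mse = rng.uniform(0.2, 0.4, size=50)
blstm_mse = transformer_mse + rng.uniform(0.0, 0.1, size=50)

# Paired test: do the two models' per-utterance scores differ significantly?
stat, p = wilcoxon(transformer_mse, blstm_mse)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3g}")
```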