Enriched Speech For Effortless Listening
Carol Chermaz, Sneha Raman, Muhammed Shifas PV, Avashna Govender, Dipjyoti Paul, Olympia Simantiraki
SPS
Length: 09:39
Human-machine speech interaction is increasingly common in the industrialised world. Speech output, whether natural or synthetic, that is optimised for high intelligibility and low cognitive load is of interest to both academia and industry. ENRICH (www.enrich-etn.eu) is a unique initiative that handles these topics from a joint perspective. The group has a special focus on the role of technology for individuals with different abilities in speech or hearing, with the aim of improving social inclusion by making communication easier. We would like to propose a series of demos, linked by the common thread of enriched speech for effortless listening. All demos are computer-based, with simple GUIs and headphones.
1) Near End Listening Enhancement (NELE) in realistic scenarios: The user can listen to speech playback in different simulated real-world acoustic scenarios, with or without NELE; they can enter the words they heard and calculate their intelligibility score.
2) Noise-aware speech enrichment using DNNs (Deep Neural Networks): The user can record their speech and then listen to their utterance with the background noise removed and the speech enhanced.
3) Enrichment of oesophageal speech with voice conversion based on LSTM (Long Short-Term Memory) neural networks: A video of a patient producing oesophageal speech is shown; the same speech is then processed with a neural approach that reintroduces missing features such as pitch.
4) Transforming casual speech into clear speech using Tacotron and a WaveRNN vocoder, and personalising synthetic voices with speaker embeddings extracted from a limited amount of data: The user can record their speech; the utterance is transcribed, and speaker features are used to produce a synthetic voice that speaks clearly and sounds similar to the user.
5) Personalising speech playback: The user can modify features of a recorded voice to maximise perceived intelligibility, and can see how their choices compare against average responses. Modifications include F0 (using STRAIGHT), the spectral location of the speech band and spectral tilt.
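The NELE demo lets the user enter the words they heard and calculate an intelligibility score. The abstract does not say how that score is computed, so the following is only a minimal sketch of one common convention: the fraction of words in the reference sentence that the listener reported, ignoring word order. The function names `normalise` and `intelligibility_score` are hypothetical and are not taken from the ENRICH demos.

```python
import re
from collections import Counter


def normalise(text):
    """Lower-case the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def intelligibility_score(reference, response):
    """Return the fraction of reference words found in the listener's response.

    Assumption (not from the source): word order is ignored, and each
    reference word can be matched by at most one response word.
    """
    ref_counts = Counter(normalise(reference))
    resp_counts = Counter(normalise(response))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Count matches per word, capped by how often the word was actually typed.
    correct = sum(min(n, resp_counts[word]) for word, n in ref_counts.items())
    return correct / total


if __name__ == "__main__":
    played = "The cat sat on the mat"
    typed = "the cat sat on a mat"
    print(f"Intelligibility: {intelligibility_score(played, typed):.2f}")
```

Real listening tests often score only keywords rather than every word; that variant would need a keyword list per sentence, which is not assumed here.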