Towards a Speech Version of ChatGPT

Hung-yi Lee

SPS Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 01:06:22

Invited Speech 20 Dec 2023

Recent months have seen a surge in discussions about the capabilities of text foundation models, particularly large language models (LLMs). Known for their general processing abilities, LLMs can effectively perform a variety of tasks with appropriate instructions. Unlike text, speech contains rich, hierarchical information, necessitating distinct capabilities for diverse applications. This raises the question: how close are we to developing speech foundation models that can understand and execute task instructions?

This presentation delves into the evolution of foundation models in speech processing, highlighting three significant phases: shared encoders with task-specific heads, universal models with adaptable parameters, and task instruction models. It begins with an introduction to the Speech Processing Universal PERformance Benchmark (SUPERB), which assesses shared encoders across multiple tasks. The discussion then shifts to exploring the use of prompting in speech language models. The presentation concludes with a focus on Dynamic SUPERB, a project aimed at evaluating task instruction models in speech processing.

Tags: IEEE ASRU 2023, automatic speech recognition, ChatGPT
