An End-to-End Neural Network for Image-to-Audio Transformation

Chen Liu (Oregon Health & Science University); Michael Deisher (Intel Corporation); Munir Georges (Intel Corporation; THI)

06 Jun 2023

This paper describes an end-to-end (E2E) neural architecture for the audio rendering of small portions of display content on low-resource personal computing devices. It is intended to address the problem of accessibility for vision-impaired or vision-distracted users at the hardware level. Neural image-to-text (ITT) and text-to-speech (TTS) approaches are reviewed, and a new technique is introduced that integrates them efficiently while still allowing back-propagation through the combined model. The result is a non-autoregressive E2E image-to-speech (ITS) neural network that can be trained as a single system. Experimental results show that, compared with the non-E2E approach, the proposed E2E system is 29% faster and uses 19% fewer parameters, at the cost of a 2% reduction in phone accuracy. A future direction to address accuracy is presented.
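
The abstract does not say how the ITT and TTS stages are joined, so the following is only an illustrative sketch of why a "back-propagate-able" bridge matters, not the authors' technique. It assumes a toy setup in which the ITT stage emits per-symbol logits and the TTS stage consumes an embedding vector. The names `hard_bridge`, `soft_bridge`, `EMBEDDING`, and `LOGITS` are hypothetical. A hard argmax lookup gives zero gradient with respect to the logits, whereas a posterior-weighted sum of embeddings (one common relaxation) does not.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_bridge(logits, embedding):
    """Non-E2E style hand-off: pick the top symbol, look up its embedding."""
    idx = max(range(len(logits)), key=lambda i: logits[i])
    return list(embedding[idx])

def soft_bridge(logits, embedding):
    """Assumed relaxation: posterior-weighted average of embedding rows."""
    probs = softmax(logits)
    dim = len(embedding[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding))
            for d in range(dim)]

def finite_diff(bridge, logits, embedding, i, eps=1e-4):
    """Finite-difference estimate of d(output)/d(logits[i])."""
    bumped = list(logits)
    bumped[i] += eps
    base = bridge(logits, embedding)
    moved = bridge(bumped, embedding)
    return [(b - a) / eps for a, b in zip(base, moved)]

# Hypothetical toy values: 3 symbols, 2-dimensional embeddings.
EMBEDDING = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
LOGITS = [2.0, 0.5, -1.0]

if __name__ == "__main__":
    print("hard grad:", finite_diff(hard_bridge, LOGITS, EMBEDDING, 0))
    print("soft grad:", finite_diff(soft_bridge, LOGITS, EMBEDDING, 0))
```

In this toy case the hard bridge produces no training signal for the upstream stage, while the soft bridge does. Whether the paper uses a relaxation of this kind or a different mechanism is not stated in the abstract.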

Tags:

Multimodal processing of language

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
