EFFICIENT PER-SHOT TRANSFORMER-BASED BITRATE LADDER PREDICTION FOR ADAPTIVE VIDEO STREAMING
Ahmed Telili, Wassim Hamidouche, Sid Ahmed Fezza, Luce Morin
Recently, HTTP adaptive streaming (HAS) has become the standard approach for over-the-top (OTT) video streaming services due to its ability to provide smooth playback. In HAS, each video is encoded into multiple representations, each targeting a specific bitrate; together, these operating points form the so-called bitrate ladder. Traditionally, a single fixed bitrate ladder has been applied to all videos. However, such an approach ignores video content, which can vary considerably in motion, texture, and scene complexity. Moreover, building a per-title bitrate ladder through exhaustive encoding is computationally expensive due to the large encoding parameter space. Thus, alternative solutions enabling accurate and efficient per-title bitrate ladder prediction are in high demand. Meanwhile, self-attention-based architectures have achieved remarkable performance in large language models (LLMs) and, with Vision Transformers (ViTs), in computer vision tasks. Therefore, this paper investigates the capabilities of ViTs in building an efficient bitrate ladder without performing any encoding. We provide the first in-depth analysis of the prediction accuracy and the complexity overhead of the ViT model when predicting the bitrate ladder on a large-scale video dataset. The source code of the proposed solution and the dataset will be made publicly available.
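To make the prediction setup concrete, the following is a minimal sketch, not the authors' implementation, of how a pretrained ViT backbone could regress bitrate-ladder points directly from sampled frames, with no encoding pass. The timm backbone name, the number of ladder rungs K, and the linear regression head are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): a ViT feature extractor
# followed by a regression head that predicts K bitrate-ladder points
# per frame batch, avoiding any trial encodings.
import torch
import torch.nn as nn
import timm

K = 5  # hypothetical number of ladder rungs (e.g., one bitrate per resolution)

class LadderPredictor(nn.Module):
    def __init__(self, k: int = K):
        super().__init__()
        # Pretrained ViT used purely as a frame-level feature extractor;
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model(
            "vit_base_patch16_224", pretrained=True, num_classes=0
        )
        self.head = nn.Linear(self.backbone.num_features, k)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, 224, 224) frames sampled from a video shot.
        feats = self.backbone(frames)   # (batch, num_features)
        return self.head(feats)         # (batch, K) predicted ladder points

model = LadderPredictor()
dummy = torch.randn(2, 3, 224, 224)
print(model(dummy).shape)  # torch.Size([2, 5])
```

In practice, per-shot predictions would be aggregated over several sampled frames (e.g., averaged), and the head could target bitrates, resolutions, or rate-quality curve parameters depending on how the ladder is parameterized.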