Length: 00:09:00
22 Sep 2021

Neural network compression has become an important practical step when deploying trained models. We consider the problem of low-rank compression of neural networks with the goal of optimizing the measured inference time. Given a neural network and a target device to run it on, we want to find the matrix ranks and the weight values of the compressed model so that the network runs as fast as possible on the device while achieving the best task performance (e.g., classification accuracy). This is a hard optimization problem involving weights, ranks, and device constraints. To tackle it, we first implement a simple yet accurate model of the on-device runtime that requires only a few measurements. We then give a suitable formulation of the optimization problem involving the proposed runtime model and solve it using alternating optimization. We validate our approach on various neural networks and show that, by using our estimated runtime model, we achieve better task performance than FLOPs-based methods for the same runtime budget on the actual device.
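The paper's exact runtime model and optimizer are not reproduced on this page, but the two ingredients the abstract names can be sketched. Below is a minimal Python/NumPy illustration, not the authors' implementation: truncated SVD produces the rank-r factors of a layer, and a hypothetical affine per-layer runtime model t(r) ≈ a·r + b is fit by least squares to a few on-device timings (the helper names, the affine form, and the example numbers are all assumptions for illustration).

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Truncated-SVD factorization W ≈ U_r @ V_r, with U_r: (m, rank), V_r: (rank, n).

    Replacing the dense map W @ x by U_r @ (V_r @ x) cuts the multiply
    count from m*n to rank*(m + n) per input vector.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r

def fit_runtime_model(ranks, measured_times):
    """Fit t(r) ≈ a*r + b to a handful of on-device timings (assumed form).

    `ranks` and `measured_times` are same-length 1-D sequences, e.g.
    timings of the factorized layer at a few candidate ranks.
    """
    a, b = np.polyfit(np.asarray(ranks, dtype=float),
                      np.asarray(measured_times, dtype=float), deg=1)
    return lambda r: a * r + b

# Example: compress a 512x512 layer to rank 64 and query the fitted model.
W = np.random.randn(512, 512)
U_r, V_r = low_rank_factorize(W, rank=64)
print("relative approx. error:", np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W))

predict = fit_runtime_model([16, 64, 128, 256], [0.8, 1.1, 1.9, 3.4])  # hypothetical ms
print("predicted ms at rank 96:", predict(96))
```

With such a per-layer predictor in hand, a rank budget across layers can be searched so that the summed predicted runtimes meet the device budget, which is the role the abstract assigns to the alternating optimization over weights and ranks.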
