Prune then Distill: Dataset Distillation with Importance Sampling

Anirudh S Sundar (Georgia Institute of Technology); Gokce Keskin (Amazon Inc.); Chander Chandak (Amazon Inc.); I-Fan Chen (Amazon Inc.); Pegah Ghahremani (Amazon Inc.); Shalini Ghosh (Amazon Alexa AI)

06 Jun 2023

The development of large datasets for various tasks has driven the success of deep learning models, but at the cost of increased label noise, duplication, collection challenges, storage demands, and training requirements. In this work, we investigate whether all samples in large datasets contribute equally to better model accuracy. We study statistical and mathematical techniques that reduce redundancies in datasets by directly optimizing data samples for the generalization accuracy of deep learning models. Existing dataset optimization approaches include analytic methods that remove unimportant samples and synthetic methods that generate new datasets to maximize generalization accuracy. We develop Prune then Distill, a combination of analytic and synthetic dataset optimization algorithms, and demonstrate up to 15% relative improvement in generalization accuracy over either approach used independently on standard image and audio classification tasks. Additionally, we demonstrate up to 38% improvement in the generalization accuracy of dataset pruning algorithms by maintaining class balance while pruning.
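The abstract highlights class-balanced pruning as a key ingredient: each class keeps the same fraction of its samples, ranked by an importance score, before the pruned subset is passed to a distillation step. The sketch below illustrates only that pruning step under assumed inputs; the specific importance scores (e.g., EL2N or forgetting scores) and the distillation algorithm are not specified in the abstract, so the names and parameters here are illustrative, not the authors' implementation.

```python
import numpy as np

def class_balanced_prune(importance, labels, keep_fraction):
    """Keep the top `keep_fraction` of samples per class by importance score.

    `importance` and `labels` are 1-D arrays of equal length; a higher score
    means the sample is assumed to be more useful for generalization.
    """
    kept = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        n_keep = max(1, int(round(keep_fraction * len(idx))))
        # Rank this class's samples by importance and keep the top ones,
        # so every class retains the same fraction (class balance).
        top = idx[np.argsort(importance[idx])[::-1][:n_keep]]
        kept.append(top)
    return np.sort(np.concatenate(kept))

# Toy example: prune 50% of a synthetic dataset while preserving class balance.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
importance = rng.random(300)   # stand-in for an assumed importance score
kept_idx = class_balanced_prune(importance, labels, keep_fraction=0.5)
# In the "prune then distill" pipeline, this pruned subset would then be
# the input to a dataset distillation step that synthesizes a small dataset.
```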
