CAN WE DISTILL KNOWLEDGE FROM POWERFUL TEACHERS DIRECTLY?
Chengyao Qian, Munawar Hayat, Mehrtash Harandi
Knowledge distillation efficiently improves a small model's performance by mimicking the behavior of a teacher model. Most existing methods assume that distilling from a larger and more accurate teacher yields a better student. However, several studies report that distillation from large teacher models is difficult and resort to heuristics to work around it. In this work, we demonstrate that large teacher models can still be effective in knowledge distillation. We show that the spurious features learned by large models are the cause of the difficulty small students face during distillation. To overcome this issue, we propose employing ℓ1 regularization to prevent teacher models from learning an excessive number of spurious features. Our method alleviates the poor performance of small students when there is a significant disparity in size between teacher and student. We achieve substantial improvements on various architectures, e.g., ResNet, WideResNet and VGG, across several datasets including CIFAR-100, Tiny-ImageNet and ImageNet.
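To make the idea of ℓ1-regularizing the teacher concrete, below is a minimal PyTorch sketch of a teacher training loss with an ℓ1 penalty. It assumes the penalty is applied to the teacher's penultimate-layer features; the exact placement of the penalty and the hyperparameter `l1_weight` are assumptions for illustration and are not specified in this abstract.

```python
import torch
import torch.nn.functional as F


def teacher_loss_with_l1(logits: torch.Tensor,
                         targets: torch.Tensor,
                         features: torch.Tensor,
                         l1_weight: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus an L1 penalty on the teacher's features.

    Args:
        logits:   teacher predictions, shape (batch, num_classes).
        targets:  ground-truth class indices, shape (batch,).
        features: penultimate-layer activations, shape (batch, dim)
                  (assumed placement of the penalty).
        l1_weight: strength of the sparsity penalty (hypothetical value).
    """
    ce = F.cross_entropy(logits, targets)
    # The L1 term encourages sparse activations, discouraging the teacher
    # from relying on a large number of (possibly spurious) features.
    l1 = features.abs().mean()
    return ce + l1_weight * l1
```

In this sketch, the regularized teacher would then be used as-is inside a standard distillation pipeline (e.g., KL divergence between softened teacher and student logits); only the teacher's own training objective changes.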