Tutorial: Understanding Deep Representation Learning via Neural Collapse
Laura Balzano, Qing Qu, Peng Wang, Zhihui Zhu
SPS
The Neural Collapse phenomenon has attracted significant attention in both the practice and the theory of deep learning, as evident from the extensive research on the topic, and the presenters' own works have made key contributions to this body of research. Below is a summary of the tutorial outline. The first half focuses on the structure of the representations that appear in the last layer; the second half extends the study to intermediate layers.
1. Prevalence of Neural Collapse & Global Optimality
The tutorial starts by introducing the Neural Collapse phenomenon in the last layer and its universality in deep network training, and lays out the mathematical foundations for understanding its cause based on the simplified unconstrained feature model (UFM). We then generalize and explain this phenomenon and its implications under data imbalance.
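As a concrete picture of the last-layer structure discussed above, the sketch below builds a simplex equiangular tight frame (ETF), the configuration that class means are usually reported to approach under Neural Collapse in the balanced setting. The outline itself does not spell out this construction; the formula used here is the standard textbook definition, and the function names are hypothetical.

```python
import numpy as np

def simplex_etf(num_classes):
    """Return a K x K matrix whose columns form a simplex ETF.

    Uses the standard construction sqrt(K/(K-1)) * (I - 11^T / K).
    This is an illustrative assumption, not code from the tutorial.
    """
    K = num_classes
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

def pairwise_cosines(M):
    """Cosine similarity between every pair of columns of M."""
    norms = np.linalg.norm(M, axis=0)
    return (M.T @ M) / np.outer(norms, norms)

K = 4
etf = simplex_etf(K)
cosines = pairwise_cosines(etf)
# Every column has unit norm, and every pair of distinct columns
# has cosine -1/(K-1), the most negative value K vectors can share.
print(np.round(cosines, 3))
```

The equal norms and equal pairwise angles are what "maximally separated class means" means in this setting.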
2. Optimization Theory of Neural Collapse
We provide a rigorous explanation of the emergence of Neural Collapse from an optimization perspective and demonstrate its impacts on algorithmic choices, drawing on recent works. Specifically, we conduct a global landscape analysis under the UFM to show that benign landscapes are prevalent across various loss functions and problem formulations. Furthermore, we demonstrate the practical algorithmic implications of Neural Collapse on training deep neural networks.
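To make the optimization viewpoint more tangible, here is a minimal sketch of plain gradient descent on a bias-free UFM, where both the classifier W and the features H are free variables. The specific objective (cross-entropy plus weight decay on W and H), the hyperparameters, and all function names are assumptions chosen for illustration; the tutorial's own formulations may differ.

```python
import numpy as np

def softmax_columns(Z):
    """Numerically stable softmax over each column of Z."""
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def ufm_loss_and_grads(W, H, Y, lam_w, lam_h):
    """Cross-entropy plus weight decay, with gradients for W and H."""
    N = H.shape[1]
    P = softmax_columns(W @ H)
    ce = -np.sum(Y * np.log(P + 1e-12)) / N
    loss = ce + 0.5 * lam_w * np.sum(W**2) + 0.5 * lam_h * np.sum(H**2)
    G = (P - Y) / N  # gradient of the averaged cross-entropy w.r.t. logits
    return loss, G @ H.T + lam_w * W, W.T @ G + lam_h * H

def within_between_ratio(H, labels):
    """Within-class over between-class scatter of the columns of H."""
    global_mean = H.mean(axis=1, keepdims=True)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        Hc = H[:, labels == c]
        mu = Hc.mean(axis=1, keepdims=True)
        within += np.sum((Hc - mu) ** 2)
        between += Hc.shape[1] * np.sum((mu - global_mean) ** 2)
    return within / between

def train_ufm(num_classes=3, per_class=10, dim=5, steps=5000,
              lr=0.1, lam=5e-3, seed=0):
    """Run gradient descent on the UFM from a small random start."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(num_classes), per_class)
    Y = np.eye(num_classes)[:, labels]  # K x N one-hot targets
    W = 0.1 * rng.standard_normal((num_classes, dim))
    H = 0.1 * rng.standard_normal((dim, labels.size))
    initial_loss, _, _ = ufm_loss_and_grads(W, H, Y, lam, lam)
    initial_ratio = within_between_ratio(H, labels)
    for _ in range(steps):
        _, grad_w, grad_h = ufm_loss_and_grads(W, H, Y, lam, lam)
        W -= lr * grad_w
        H -= lr * grad_h
    final_loss, _, _ = ufm_loss_and_grads(W, H, Y, lam, lam)
    return {
        "initial_loss": initial_loss,
        "final_loss": final_loss,
        "initial_ratio": initial_ratio,
        "final_ratio": within_between_ratio(H, labels),
    }

result = train_ufm()
print(result)
```

Comparing the within-class to between-class ratio before and after training gives a rough check of whether same-class features are contracting toward their class mean in this toy setting.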
3. Progressive Data Compression & Separation Across Intermediate Layers
We open the black box of deep representation learning by introducing a law that governs how real-world deep neural networks separate data according to class membership from the bottom layers to the top layers. We show that each layer improves a certain measure of data separation by a roughly equal multiplicative factor. We demonstrate the universality of this law by showing its prevalence across different network architectures, datasets, and training losses.
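The paragraph above does not name the separation measure, so the sketch below assumes one choice that appears in the literature: the trace of the within-class covariance times the pseudo-inverse of the between-class covariance (smaller means better separated). It also includes a hypothetical helper that fits a single per-layer multiplicative factor to a sequence of such values by linear regression in log scale. The numbers in the usage lines are synthetic and only illustrate the calls.

```python
import numpy as np

def separation_fuzziness(features, labels):
    """trace(S_w @ pinv(S_b)) for features of shape (n, d).

    An assumed separation measure, not necessarily the tutorial's.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n, d = features.shape
    global_mean = features.mean(axis=0)
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(labels):
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        centered = class_feats - class_mean
        S_w += centered.T @ centered / n
        diff = (class_mean - global_mean)[:, None]
        S_b += (len(class_feats) / n) * (diff @ diff.T)
    return float(np.trace(S_w @ np.linalg.pinv(S_b)))

def fit_per_layer_factor(values):
    """Fit values[l] ~ c * rho**l in log scale and return rho."""
    values = np.asarray(values, dtype=float)
    layers = np.arange(len(values))
    slope, _ = np.polyfit(layers, np.log(values), 1)
    return float(np.exp(slope))

# Synthetic two-class data: same noise pattern, two noise scales.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
means = np.array([[3.0, 0.0], [-3.0, 0.0]])[labels]
noise = rng.standard_normal((200, 2))
loose = separation_fuzziness(means + noise, labels)
tight = separation_fuzziness(means + 0.1 * noise, labels)
print(loose, tight)

# A made-up, exactly geometric sequence of per-layer measurements.
rho = fit_per_layer_factor([8.0, 4.0, 2.0, 1.0])
print(rho)
```

In practice one would compute the measure on the features extracted at each layer of a trained network and pass the resulting list to the fitting helper; a fit with small residuals in log scale is what an equal per-layer factor would look like.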
4. Theory & Applications of Progressive Data Separation
Finally, we turn to a theoretical understanding of the structures in the intermediate layers by studying the learning dynamics of gradient descent. In particular, we reveal that there are certain parsimonious structures in the gradient dynamics, so that a certain measure of data separation exhibits layer-wise linear decay from shallow to deep layers. We then demonstrate the practical implications of understanding this phenomenon for transfer learning and the study of foundation models, leading to efficient fine-tuning methods with reduced overfitting.