Tutorial Bundle: Tropical Geometry for Machine Learning and Optimization (Parts 1-2), ICASSP 2024
During the past decade, machine learning and high-dimensional data analysis have experienced explosive growth, due in large part to the extensive successes of deep neural networks. Despite their numerous achievements in disparate fields such as computer vision and natural language processing, which have led to their use in safety-critical data processing tasks (such as autonomous driving and security applications), such deep networks have remained mostly mysterious to their end users and even to their designers. For this reason, the machine learning community places increasing emphasis on explainable and interpretable models: those whose outputs and mechanisms are understandable by their designers and even by end users. The research community has recently responded to this task with vigor, developing various methods to add interpretability to deep learning. One such approach is to design deep networks which are fully white-box ab initio, namely through mechanisms which give each operator in the network a clear purpose and function in learning and/or transforming the data distribution. This tutorial will discuss classical and recent advances in constructing white-box deep networks from this perspective. The tutorial outline follows:
- [Yi Ma] Introduction to high-dimensional data analysis (45 min): In the first part of the tutorial, we will discuss the overall objective of high-dimensional data analysis, that is, learning and transforming the data distribution towards template distributions with relevant semantic content for downstream tasks (such as linear discriminative representations (LDRs), which are expressive mixtures of semantically meaningful, incoherent subspaces). We will discuss classical methods such as sparse coding through dictionary learning as particular instantiations of this learning paradigm when the underlying signal model is linear or sparsely generated. This part of the presentation involves an interactive Colab on sparse coding.
- [Sam Buchanan] Layer-wise construction of deep neural networks (45 min): In the second part of the tutorial, we will introduce unrolled optimization as a design principle for interpretable deep networks. As a simple special case, we will examine several unrolled optimization algorithms for sparse coding (especially LISTA and “sparseland”), and show that they exhibit striking similarities to current deep network architectures. These unrolled networks are white-box and interpretable ab initio. This part of the presentation involves an interactive Colab on simple unrolled networks.
- [Druv Pai] White-box representation learning via unrolled gradient descent (45 min): In the third part of the tutorial, we will focus on the special yet highly useful case of learning the data distribution and transforming it to an LDR. We will discuss the information-theoretic and statistical principles behind such a representation, and design a loss function, called the coding rate reduction, which is maximized at such a representation. By unrolling gradient ascent on the coding rate reduction, we will construct a deep network architecture, called the ReduNet, where each operator in the network has a mathematically precise (hence white-box and interpretable) function in the transformation of the data distribution towards an LDR. Moreover, the ReduNet may be constructed layer-wise in a forward-propagation manner, that is, without any back-propagation. This part of the presentation involves an interactive Colab on the coding rate reduction.
- [Yaodong Yu] White-box transformers (45 min): In the fourth part of the tutorial, we will show that by melding the perspectives of sparse coding and rate reduction, we can obtain sparse linear discriminative representations, encouraged by an objective which we call sparse rate reduction. By unrolling the optimization of the sparse rate reduction, and parameterizing the feature distribution at each layer, we will construct a deep network architecture, called CRATE, where each operator is again fully mathematically interpretable: each layer can be understood as realizing one step of an optimization algorithm, so the whole network is a white box. The design of CRATE is very different from that of ReduNet, despite optimizing a similar objective, demonstrating the flexibility and pragmatism of the unrolled optimization paradigm. Moreover, the CRATE architecture is extremely similar to the transformer, and many of the layer-wise interpretations of CRATE can be used to interpret the transformer, showing that the interpretability benefits of networks derived in this way may carry over to understanding the deep architectures currently used in practice. We will highlight in particular the powerful and interpretable representation learning capability of these models for visual data by showing how segmentation maps emerge in their learned representations with no explicit additional regularization or complex training recipes.
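
To make the unrolled-optimization idea from the second part concrete, the following is a minimal sketch and not the tutorial's own Colab code. It assumes the standard lasso formulation of sparse coding, minimizing 0.5 * ||y - D x||^2 + lam * ||x||_1, and unrolls ISTA so that each iteration reads as one "layer": a linear map followed by a soft-thresholding nonlinearity. The function names (`soft_threshold`, `unrolled_ista`) and the fixed step size 1/L are illustrative choices. In LISTA, the matrices `W` and `S` and the threshold would be learned from data rather than computed from the dictionary.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink every entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def unrolled_ista(y, D, lam, num_layers):
    """Run `num_layers` ISTA iterations, written as the layers of a network.

    y: signal of shape (m,); D: dictionary of shape (m, k); lam: l1 penalty weight.
    Returns a sparse code x of shape (k,).
    """
    # Lipschitz constant of the gradient of 0.5 * ||y - D x||^2 (squared spectral norm).
    L = np.linalg.norm(D, 2) ** 2
    W = D.T / L                               # input-to-code weights
    S = np.eye(D.shape[1]) - (D.T @ D) / L    # code-to-code weights
    x = np.zeros(D.shape[1])
    for _ in range(num_layers):
        # One layer: affine map of the previous code, then elementwise shrinkage.
        x = soft_threshold(S @ x + W @ y, lam / L)
    return x
```

Because every layer is an explicit proximal-gradient step, the role of each weight matrix and of the nonlinearity is known by construction, which is the sense in which such a network is white-box.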
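
The coding rate reduction named in the third part can also be sketched numerically. The outline above does not state the formula, so the code below assumes the commonly cited form: R(Z, eps) = 0.5 * logdet(I + d / (n * eps^2) * Z Z^T) for features Z of shape (d, n), and a rate reduction equal to R of the whole set minus the class-size-weighted sum of R over each class. The function names, the hard class labels (in place of general membership matrices), and the default `eps` are assumptions made for illustration.

```python
import numpy as np


def coding_rate(Z, eps):
    """Assumed coding rate: 0.5 * logdet(I + d / (n * eps^2) * Z Z^T), Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T))
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole feature set minus the size-weighted rates of each class."""
    n = Z.shape[1]
    class_rate = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]                      # features belonging to class c
        class_rate += (Zc.shape[1] / n) * coding_rate(Zc, eps)
    return coding_rate(Z, eps) - class_rate
```

Under this assumed form the quantity is nonnegative by concavity of the log-determinant, it is zero when all samples share one label, and it is strictly positive when the classes have different second-moment structure, for example when they lie in orthogonal subspaces as in an LDR.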