Skip to main content
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 13:53
04 May 2020

We study the convergence, the implicit regularization and the generalization of stochastic mirror descent (SMD) algorithms in overparameterized nonlinear models, where the number of model parameters exceeds the number of training data points. Due to overparameterization, the training loss has infinitely many global minima where they define a manifold of interpolating solutions, and it is important to characterize which global minima the algorithms converge to. In this work, we first theoretically show that in the overparameterized nonlinear setting, if the initialization is close enough to the manifold of global minima, which is usually the case in the high overparameterization setting, using sufficiently small step size, SMD converges to a global minimum. We prove that this global minimum is approximately the closest one to the initialization in Bregman divergence, demonstrating the approximate implicit regularization of SMD. We then empirically confirm that these theoretical results are observed in practice. Finally, we provide an extensive study of the generalization of SMD algorithms. In our experiments, we show that on the CIFAR-10 dataset, SMD with $\ell_{10}$ norm potential (as a surrogate for $\ell_\infty$) consistently generalizes better than SGD (corresponding to an $\ell_2$ norm potential), which in turn consistently outperforms SMD with $\ell_1$ norm potential.