Abstract
Many results in recent years established polynomial-time learnability of various models via neural network algorithms (e.g. Andoni et al. [2014], Daniely et al. [2016], Daniely [2017], Cao and Gu [2019], Ji and Telgarsky [2019], Zou and Gu [2019], Ma et al. [2019], Du et al. [2018a], Arora et al. [2019], Song and Yang [2019], Oymak and Soltanolkotabi [2019a], Ge et al. [2019], Brutzkus et al. [2018]). However, unless the model is linearly separable Brutzkus et al. [2018], or the activation is quadratic Ge et al. [2019], these results require very large networks, far larger than what is needed for the mere existence of a good predictor. In this paper we take a step toward learnability results with near-optimal network size. We give a tight analysis of the rate at which the Neural Tangent Kernel Jacot et al. [2018], a fundamental tool in the analysis of SGD on networks, converges to its expectation. These results enable us to prove that SGD on depth-two neural networks, starting from a (non-standard) variant of Xavier initialization Glorot and Bengio [2010], can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime.
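The sketch below is not the paper's exact construction; it is a minimal illustration, under assumed choices (a standard depth-two ReLU network, a generic Xavier-style 1/sqrt(width) scaling, and the helper name `empirical_ntk`), of the object the abstract refers to: the empirical Neural Tangent Kernel of a depth-two network, whose entries concentrate around their expectation as the width grows.

```python
# Illustrative sketch only: empirical NTK Gram matrix of a depth-two ReLU network
# under a Xavier-style scaling, evaluated at two widths to show that the kernel's
# fluctuations around its expectation shrink as the width increases.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5                                     # input dimension, number of samples
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs

def empirical_ntk(X, width):
    """K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> for the depth-two network
    f(x) = (1/sqrt(width)) * sum_r u_r * relu(w_r . x), with Gaussian weights."""
    W = rng.standard_normal((width, X.shape[1]))   # hidden-layer weights
    u = rng.standard_normal(width)                 # output-layer weights
    pre = X @ W.T                                  # pre-activations, shape (n, width)
    act = np.maximum(pre, 0.0)                     # relu(w_r . x_i)
    ind = (pre > 0).astype(float)                  # relu'(w_r . x_i)
    # Gradients w.r.t. the output layer: act_r / sqrt(width)
    K_u = act @ act.T / width
    # Gradients w.r.t. the hidden layer: u_r * ind_r * x / sqrt(width)
    K_w = ((ind * u) @ (ind * u).T) * (X @ X.T) / width
    return K_u + K_w

for width in (100, 10_000):
    Ks = np.stack([empirical_ntk(X, width) for _ in range(20)])
    print(f"width={width:6d}  mean std of NTK entries across draws: "
          f"{Ks.std(axis=0).mean():.4f}")
```

Running this prints a noticeably smaller entrywise standard deviation at the larger width, the qualitative behavior whose precise rate the paper analyzes.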
| Original language | English |
| --- | --- |
| Journal | Advances in Neural Information Processing Systems |
| Volume | 2020-December |
| State | Published - 2020 |
| Event | 34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online, 6 Dec 2020 → 12 Dec 2020 |