Neural networks learning and memorization with (almost) no over-parameterization

Amit Daniely*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review


Abstract

Many recent results have established polynomial-time learnability of various models via neural network algorithms (e.g. Andoni et al. [2014], Daniely et al. [2016], Daniely [2017], Cao and Gu [2019], Ji and Telgarsky [2019], Zou and Gu [2019], Ma et al. [2019], Du et al. [2018a], Arora et al. [2019], Song and Yang [2019], Oymak and Soltanolkotabi [2019a], Ge et al. [2019], Brutzkus et al. [2018]). However, unless the model is linearly separable (Brutzkus et al. [2018]) or the activation is quadratic (Ge et al. [2019]), these results require very large networks – much larger than what is needed for the mere existence of a good predictor. In this paper we take a step towards learnability results with near-optimal network size. We give a tight analysis of the rate at which the Neural Tangent Kernel (Jacot et al. [2018]), a fundamental tool in the analysis of SGD on networks, converges to its expectation. This result enables us to prove that SGD on depth-two neural networks, starting from a (non-standard) variant of Xavier initialization (Glorot and Bengio [2010]), can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime.
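The concentration phenomenon at the heart of the abstract can be illustrated numerically. Below is a minimal NumPy sketch (not the paper's construction): for a depth-two ReLU network with Gaussian first-layer weights, the empirical NTK restricted to the hidden-layer weights has a closed-form infinite-width expectation (an arc-cosine-type kernel), and the finite-width kernel concentrates around it as the width grows. The specific kernel formula and the restriction to hidden-layer gradients are simplifying assumptions for illustration, not the paper's exact setting.

```python
import numpy as np

def empirical_ntk(x1, x2, width, rng):
    """Hidden-weight part of the NTK of f(x) = (1/sqrt(k)) * sum_i a_i relu(w_i . x),
    with w_i ~ N(0, I) and a_i in {+/-1} (a simplified, Xavier-style scaling)."""
    W = rng.standard_normal((width, x1.shape[0]))
    act1 = (W @ x1 > 0).astype(float)  # ReLU derivative at x1
    act2 = (W @ x2 > 0).astype(float)  # ReLU derivative at x2
    # a_i^2 = 1, so the kernel is <x1,x2> * (fraction of units active on both inputs)
    return (x1 @ x2) * np.mean(act1 * act2)

def limit_ntk(x1, x2):
    """Infinite-width expectation: <x1,x2> * P(w.x1 > 0 and w.x2 > 0)
    = <x1,x2> * (pi - angle(x1,x2)) / (2*pi) for Gaussian w."""
    cos = (x1 @ x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    return (x1 @ x2) * (np.pi - theta) / (2 * np.pi)

rng = np.random.default_rng(0)
x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
small = empirical_ntk(x1, x2, 100, rng)      # noisy at small width
large = empirical_ntk(x1, x2, 200_000, rng)  # close to the expectation
exact = limit_ntk(x1, x2)
```

At width 200,000 the Monte Carlo error is on the order of 10^-3, so `large` essentially matches `exact`; the paper's contribution is a tight quantitative bound on this convergence rate.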

Original language: American English
Journal: Advances in Neural Information Processing Systems
Volume: 2020-December
State: Published - 2020
Event: 34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
Duration: 6 Dec 2020 – 12 Dec 2020

Bibliographical note

Publisher Copyright:
© 2020 Neural information processing systems foundation. All rights reserved.
