On the Sample Complexity of Two-Layer Networks: Lipschitz Vs. Element-Wise Lipschitz Activation

Amit Daniely, Elad Granot

Research output: Contribution to journal › Conference article › peer-review

Abstract

This study investigates the sample complexity of two-layer neural networks. For a given reference matrix $W_0 \in \mathbb{R}^{T \times d}$ (typically representing the initial training weights) and an $O(1)$-Lipschitz activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we examine the class […] This bound is optimal, up to logarithmic factors, and depends logarithmically on the width $T$. This finding improves on Vardi et al. (2022), who established a similar result for $W_0 = 0$. Our motivation stems from the empirical observation that trained weights often remain close to their initial values, implying that $\|W\|_{\text{Frobenius}} \ll \|W + W_0\|_{\text{Frobenius}}$. To obtain this result, we use and extend a recent method for norm-based bounds, the Approximate Description Length (ADL), proposed by Daniely and Granot (2019). Finally, our results underline the crucial role of the element-wise nature of $\sigma$ in achieving a bound that depends only logarithmically on the width. We prove that there exists an $O(1)$-Lipschitz (non-element-wise) activation function $\Psi : \mathbb{R}^T \to \mathbb{R}^T$ for which the sample complexity of $\mathcal{H}^{\Psi}_{W_0,B,R,r}$ increases linearly with the width.
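The norm comparison that motivates the abstract can be checked numerically. The following is a minimal sketch and not the authors' construction: the network form x ↦ ⟨v, σ((W + W0)x)⟩, the output vector v, the Gaussian initialization, the step size, and every function name are assumptions made for illustration. It draws a hypothetical initialization W0, adds a small displacement W, and compares the Frobenius norm of W with that of W + W0.

```python
import math
import random


def frobenius_norm(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


def mat_add(A, B):
    """Entry-wise sum of two matrices stored as lists of rows."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def relu(z):
    """An element-wise, 1-Lipschitz activation (chosen for illustration)."""
    return max(0.0, z)


def two_layer(v, W, W0, x, sigma=relu):
    """Assumed form x -> <v, sigma((W + W0) x)>, sigma applied per coordinate."""
    hidden = [sigma(sum(w * xi for w, xi in zip(row, x)))
              for row in mat_add(W, W0)]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def make_example(T=64, d=32, step=0.01, seed=0):
    """Hypothetical setup: Gaussian W0 and a displacement W that is `step` times smaller."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    W0 = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(T)]
    W = [[rng.gauss(0.0, step * scale) for _ in range(d)] for _ in range(T)]
    return W0, W


W0, W = make_example()
ratio = frobenius_norm(W) / frobenius_norm(mat_add(W, W0))
print(f"||W||_F / ||W + W0||_F = {ratio:.4f}")
```

Under these assumed scales the displacement norm is a small fraction of the full weight norm, which is the regime where a bound stated in terms of W rather than W + W0 is the more informative one.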

Original language: American English
Pages (from-to): 505-517
Number of pages: 13
Journal: Proceedings of Machine Learning Research
Volume: 237
State: Published - 2024
Event: 35th International Conference on Algorithmic Learning Theory, ALT 2024 - La Jolla, United States
Duration: 25 Feb 2024 to 28 Feb 2024

Bibliographical note

Publisher Copyright:
© 2024 A. Daniely & E. Granot.

Keywords

  • Approximate Description Length
  • Lipschitz Activation Functions
  • Sample Complexity
