Abstract
This study investigates the sample complexity of two-layer neural networks. For a given reference matrix W_0 ∈ R^{T×d} (typically representing the weights at initialization) and an O(1)-Lipschitz activation function σ : R → R, we examine the class H^σ_{W_0,B,R,r} of two-layer networks whose hidden weights are constrained by their distance from W_0, and bound its sample complexity. The bound is optimal up to logarithmic factors, depends only logarithmically on the width T, and scales with the distance of the weights from W_0 rather than with their overall magnitude. This improves on Vardi et al. (2022), who established a similar result for W_0 = 0. Our motivation stems from the practical observation that trained weights often remain close to their initialization, so that ‖W‖_F ≪ ‖W + W_0‖_F for the learned deviation W. To obtain our result, we employ and strengthen a recently introduced norm-based bound technique, the Approximate Description Length (ADL), proposed by Daniely and Granot (2019). Finally, our results highlight that the element-wise nature of σ is essential for a bound with only logarithmic dependence on the width: we prove that there exists an O(1)-Lipschitz (non-element-wise) activation function Ψ : R^T → R^T for which the sample complexity of H^Ψ_{W_0,B,R,r} grows linearly with the width.
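For concreteness, a minimal LaTeX sketch of one plausible form of the class is given below, assuming B bounds the input norm, R the Frobenius norm of the deviation from W_0, and r the norm of the outer layer v; these roles are inferred for illustration and are not quoted from the paper.

```latex
% A sketch only: one plausible form of the class H^sigma_{W_0,B,R,r}.
% The roles of B, R, r (input-norm bound, Frobenius bound on the deviation
% from W_0, and outer-layer norm bound, respectively) are assumptions made
% for illustration, not the paper's verbatim definition.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\[
  \mathcal{H}^{\sigma}_{W_0,B,R,r}
  \;=\;
  \Bigl\{\, x \mapsto \bigl\langle v,\ \sigma\bigl((W_0 + W)x\bigr)\bigr\rangle
    \;:\; W \in \mathbb{R}^{T\times d},\ \|W\|_F \le R,\
          v \in \mathbb{R}^{T},\ \|v\| \le r \,\Bigr\},
\]
where inputs are assumed to satisfy $\|x\| \le B$ and $\sigma$ acts
element-wise. Only the deviation $W$ from the reference matrix $W_0$ is
norm-constrained, so a sample-complexity bound stated in terms of $R$
depends on $\|W\|_F$ rather than on $\|W_0 + W\|_F$.
\end{document}
```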
| Field | Value |
| --- | --- |
| Original language | English |
| Pages (from-to) | 505–517 |
| Number of pages | 13 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 237 |
| State | Published - 2024 |
| Event | 35th International Conference on Algorithmic Learning Theory, ALT 2024, La Jolla, United States; 25 Feb 2024 – 28 Feb 2024 |
Bibliographical note
Publisher Copyright: © 2024 A. Daniely & E. Granot.
Keywords
- Approximate Description Length
- Lipschitz Activation Functions
- Sample Complexity