TY - JOUR
T1 - Unified theoretical framework for wide neural network learning dynamics
AU - Avidan, Yehonatan
AU - Li, Qianyi
AU - Sompolinsky, Haim
N1 - Publisher Copyright:
© 2025 American Physical Society.
PY - 2025/4
Y1 - 2025/4
AB - Artificial neural networks have revolutionized machine learning in recent years, but a complete theoretical framework for their learning process is still lacking. Substantial theoretical advances have been achieved for wide networks within two disparate frameworks: the neural tangent kernel (NTK), which assumes linearized gradient descent dynamics, and the Bayesian neural network Gaussian process (NNGP) framework. Here we unify these two theories using gradient descent learning dynamics with an additional small noise in an ensemble of wide deep networks. We construct an exact analytical theory for the network input-output function and introduce a new time-dependent neural dynamical kernel (NDK) from which both the NTK and NNGP kernels are derived. We identify two learning phases characterized by different time scales: an initial gradient-driven learning phase, dominated by deterministic minimization of the loss, whose time scale is governed mainly by the variance of the weight initialization, followed by a slow diffusive learning phase, during which the network parameters sample the solution space with a time constant determined by the noise level and the variance of the Bayesian prior. The two variance parameters can strongly affect the performance in the two regimes, particularly for sigmoidal neurons. In contrast to the exponential convergence of the mean predictor in the initial phase, the convergence to the final equilibrium is more complex and may exhibit nonmonotonic behavior. By characterizing the diffusive learning phase, our work sheds light on the phenomenon of representational drift in the brain, explaining how neural activity can exhibit continuous changes in internal representations without degrading performance, either through ongoing weak gradient signals that synchronize the drifts of different synapses or through architectural biases that generate an invariant code, i.e., task-relevant information that is robust against the drift process. This work closes the gap between the NTK and NNGP theories, providing a comprehensive framework for understanding the learning process of deep wide neural networks and for analyzing learning dynamics in biological neural circuits.
UR - http://www.scopus.com/inward/record.url?scp=105004060164&partnerID=8YFLogxK
DO - 10.1103/physreve.111.045310
M3 - Article
AN - SCOPUS:105004060164
SN - 2470-0045
VL - 111
JO - Physical Review E
JF - Physical Review E
IS - 4
M1 - 045310
ER -