Theory of curriculum learning, with convex loss functions

Research output: Contribution to journal › Article › peer-review

17 Scopus citations


Curriculum Learning is motivated by human cognition, where teaching often involves gradually exposing the learner to examples in a meaningful order, from easy to hard. Although methods based on this concept have been empirically shown to improve the performance of several machine learning algorithms, no theoretical analysis has been provided even for simple cases. To address this shortfall, we start by formulating an ideal definition of difficulty score: the loss of the optimal hypothesis at a given datapoint. We analyze the possible contribution of curriculum learning based on this score in two convex problems: linear regression, and binary classification by hinge loss minimization. We show that in both cases, the convergence rate of SGD optimization decreases monotonically with the difficulty score, in accordance with earlier empirical results. We also prove that when the difficulty score is fixed, the convergence rate of SGD optimization is monotonically increasing with respect to the loss of the current hypothesis at each point. We discuss how these results settle some confusion in the literature, where two apparently opposing heuristics are reported to improve performance: curriculum learning, in which easier points are given priority, vs. hard data mining, where the more difficult points are sought out.
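The abstract's ideal difficulty score is concrete enough to sketch for the linear-regression case it analyzes: score each datapoint by the loss of the optimal (least-squares) hypothesis at that point, then run SGD presenting points in order of increasing difficulty. The following is a minimal illustrative sketch, not the paper's method; the synthetic data, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression data (assumed for illustration):
# y = X @ w_true + noise.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Ideal difficulty score per the abstract: the loss of the optimal
# hypothesis at each datapoint. For squared loss, the optimal
# hypothesis is the least-squares solution.
w_opt, *_ = np.linalg.lstsq(X, y, rcond=None)
difficulty = (X @ w_opt - y) ** 2

# Curriculum ordering: present points from easy to hard.
order = np.argsort(difficulty)

def sgd(X, y, order, lr=0.01, epochs=5):
    """Plain SGD on squared loss, visiting points in the given order."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in order:
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

w_curr = sgd(X, y, order)
train_mse = np.mean((X @ w_curr - y) ** 2)
print(train_mse)
```

A hard-data-mining variant, for contrast, would simply reverse `order`; the paper's analysis concerns how each ordering affects the convergence rate of the SGD iterates.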

Original language: American English
Article number: 222
Number of pages: 19
Journal: Journal of Machine Learning Research
State: Published - Nov 2020

Bibliographical note

Funding Information:
This work was supported in part by a grant from the Israeli Science Foundation (ISF) and by the Gatsby Charitable Foundations.

Publisher Copyright:
© 2020 Daphna Weinshall and Dan Amir. License: CC-BY 4.0; attribution requirements are provided at


Keywords

  • Curriculum learning
  • Hinge loss minimization
  • Linear regression


