TY - JOUR
T1 - Persistence in high-dimensional linear predictor selection and the virtue of overparametrization
AU - Greenshtein, Eitan
AU - Ritov, Ya'acov
PY - 2004/12
Y1 - 2004/12
N2 - Let Z_i = (Y_i, X_{1i}, ..., X_{mi}), i = 1, ..., n, be independent and identically distributed random vectors, Z_i ∼ F, F ∈ 𝓕. It is desired to predict Y by ∑_j β_j X_j, where (β_1, ..., β_m) ∈ B_n ⊆ ℝ^m, under a prediction loss. Suppose that m = n^α, α > 1; that is, there are many more explanatory variables than observations. We consider sets B_n restricted by the maximal number of non-zero coefficients of their members, or by their l_1 radius. We study the following asymptotic question: how 'large' may the set B_n be, so that it is still possible to select empirically a predictor whose risk under F is close to that of the best predictor in the set? Sharp bounds for orders of magnitude are given under various assumptions on F. Algorithmic complexity of the ensuing procedures is also studied. The main message of this paper and the implications of the orders derived are that under various sparsity assumptions on the optimal predictor there is 'asymptotically no harm' in introducing many more explanatory variables than observations. Furthermore, such practice can be beneficial in comparison with a procedure that screens in advance a small subset of explanatory variables. Another main result is that 'lasso' procedures, that is, optimization under l_1 constraints, could be efficient in finding optimal sparse predictors in high dimensions.
KW - Consistency
KW - Lasso
KW - Regression
KW - Variable selection
UR - http://www.scopus.com/inward/record.url?scp=31344454903&partnerID=8YFLogxK
U2 - 10.3150/bj/1106314846
DO - 10.3150/bj/1106314846
M3 - Article
AN - SCOPUS:31344454903
SN - 1350-7265
VL - 10
SP - 971
EP - 988
JO - Bernoulli
JF - Bernoulli
IS - 6
ER -
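
Not part of the bibliographic record above: a minimal numerical sketch of the setting the abstract describes, namely fitting an l_1-penalized ('lasso') linear predictor when the number of explanatory variables m = n^α exceeds the number of observations n, and comparing its prediction risk to that of the best sparse predictor in the set. It assumes Python with numpy and scikit-learn; the sample size, exponent, sparsity level, noise scale and penalty strength are arbitrary illustrative choices, not values taken from the paper.

```python
# Illustrative sketch of the m = n**alpha > n regime with a sparse optimal predictor.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

n = 200                      # number of observations
alpha_exp = 1.2              # exponent in m = n**alpha (arbitrary choice with alpha > 1)
m = int(n ** alpha_exp)      # many more explanatory variables than observations
k = 5                        # non-zero coefficients of the "true" sparse predictor

beta = np.zeros(m)
beta[:k] = rng.normal(size=k)                  # sparse optimal predictor
X = rng.normal(size=(n, m))
y = X @ beta + rng.normal(scale=0.5, size=n)

# l1-penalized least squares; the Lagrangian counterpart of optimization under
# an l1 constraint. Note: sklearn's `alpha` is the penalty weight, unrelated to
# the exponent alpha_exp above.
model = Lasso(alpha=0.1).fit(X, y)

# Out-of-sample prediction risk of the fitted predictor vs. the best predictor
# in the sparse set (here, the true beta), estimated on fresh data.
X_new = rng.normal(size=(10_000, m))
y_new = X_new @ beta + rng.normal(scale=0.5, size=10_000)
risk_lasso = np.mean((y_new - model.predict(X_new)) ** 2)
risk_best = np.mean((y_new - X_new @ beta) ** 2)
print(f"m={m} > n={n}; lasso risk={risk_lasso:.3f}, best-in-set risk={risk_best:.3f}")
```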