Prediction by categorical features: Generalization properties and application to feature ranking

Sivan Sabato*, Shai Shalev-Shwartz

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

1 Scopus citation


We describe and analyze a new approach for feature ranking in the presence of categorical features with a large number of possible values. It is shown that popular ranking criteria, such as the Gini index and the misclassification error, can be interpreted as the training error of a predictor that is deduced from the training set. It is then argued that using the generalization error is a more adequate ranking criterion. We propose a modification of the Gini index criterion, based on a robust estimation of the generalization error of a predictor associated with the Gini index. The properties of this new estimator are analyzed, showing that for most training sets, it produces an accurate estimation of the true generalization error. We then address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties.
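The claim that these ranking criteria are training errors of a predictor can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's own construction: it assumes binary labels in {0, 1}, takes the Gini index of a feature as the count-weighted average of 2p(1-p) over feature values (p being the fraction of label-1 examples for a value), and takes the misclassification error as the count-weighted average of min(p, 1-p). All function names are hypothetical.

```python
from collections import defaultdict


def per_value_stats(xs, ys):
    """Return {feature value: (count, fraction of label-1 examples)}."""
    counts = defaultdict(int)
    ones = defaultdict(int)
    for x, y in zip(xs, ys):
        counts[x] += 1
        ones[x] += y
    return {v: (counts[v], ones[v] / counts[v]) for v in counts}


def gini_index(xs, ys):
    """Count-weighted average of 2p(1-p) over feature values (assumed form)."""
    m = len(xs)
    return sum(n / m * 2 * p * (1 - p) for n, p in per_value_stats(xs, ys).values())


def misclassification_error(xs, ys):
    """Count-weighted average of min(p, 1-p) over feature values (assumed form)."""
    m = len(xs)
    return sum(n / m * min(p, 1 - p) for n, p in per_value_stats(xs, ys).values())


def training_error_majority(xs, ys):
    """0-1 training error of the predictor that outputs the majority label per value."""
    stats = per_value_stats(xs, ys)
    predict = {v: int(p >= 0.5) for v, (_, p) in stats.items()}
    return sum(predict[x] != y for x, y in zip(xs, ys)) / len(xs)


def training_error_soft(xs, ys):
    """Average absolute loss |p - y| of the predictor that outputs p per value.

    This equals the expected 0-1 error of a randomized predictor that
    outputs label 1 with probability p.
    """
    stats = per_value_stats(xs, ys)
    return sum(abs(stats[x][1] - y) for x, y in zip(xs, ys)) / len(xs)


# Toy training set: one categorical feature, binary labels.
xs = ["a", "a", "a", "b", "b", "c"]
ys = [1, 1, 0, 0, 0, 1]

# Each criterion coincides with the training error of its associated predictor.
assert abs(gini_index(xs, ys) - training_error_soft(xs, ys)) < 1e-12
assert abs(misclassification_error(xs, ys) - training_error_majority(xs, ys)) < 1e-12

# A feature whose values are all distinct gets a perfect score on the
# training set, regardless of whether it carries any information.
unique_xs = list(range(len(ys)))
assert gini_index(unique_xs, ys) == 0.0
```

The last assertion shows why a training-error criterion can mislead when a feature has many possible values: a feature that takes a different value on every example scores zero on the training set even if it is unrelated to the label, which is the motivation for ranking by an estimate of the generalization error instead.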

Original language: American English
Title of host publication: Learning Theory - 20th Annual Conference on Learning Theory, COLT 2007, Proceedings
Publisher: Springer Verlag
Number of pages: 15
ISBN (Print): 9783540729259
State: Published - 2007
Event: 20th Annual Conference on Learning Theory, COLT 2007 - San Diego, CA, United States
Duration: 13 Jun 2007 to 15 Jun 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4539 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 20th Annual Conference on Learning Theory, COLT 2007
Country/Territory: United States
City: San Diego, CA

