TY - GEN
T1 - Sample selection for statistical parsers
T2 - 13th Conference on Computational Natural Language Learning, CoNLL 2009
AU - Reichart, Roi
AU - Rappoport, Ari
PY - 2009
Y1 - 2009
N2 - Creating large amounts of manually annotated training data for statistical parsers imposes heavy cognitive load on the human annotator and is thus costly and error-prone. It is hence of high importance to decrease the human efforts involved in creating training data without harming parser performance. For constituency parsers, these efforts are traditionally evaluated using the total number of constituents (TC) measure, assuming uniform cost for each annotated item. In this paper, we introduce novel measures that quantify aspects of the cognitive efforts of the human annotator that are not reflected by the TC measure, and show that they are well established in the psycholinguistic literature. We present a novel parameter-based sample selection approach for creating good samples in terms of these measures. We describe methods for global optimisation of lexical parameters of the sample based on a novel optimisation problem, the constrained multiset multicover problem, and for cluster-based sampling according to syntactic parameters. Our methods outperform previously suggested methods in terms of the new measures, while maintaining similar TC performance.
UR - http://www.scopus.com/inward/record.url?scp=84862278581&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84862278581
SN - 1932432299
SN - 9781932432299
T3 - CoNLL 2009 - Proceedings of the Thirteenth Conference on Computational Natural Language Learning
SP - 3
EP - 11
BT - CoNLL 2009 - Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Y2 - 4 June 2009 through 5 June 2009
ER -