TY - GEN
T1 - Hybrid semantic tagging for information extraction
AU - Feldman, Ronen
AU - Rosenfeld, Benjamin
AU - Fresko, Moshe
AU - Davison, Brian D.
PY - 2005
Y1 - 2005
AB - The semantic web is expected to have an impact at least as big as that of the existing HTML-based web, if not greater. However, the challenge lies in creating this semantic web and in converting existing web information into the semantic paradigm. One of the core technologies that can help in the migration process is automatic semantic markup of content, which provides semantic tags that describe the raw content. This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (Stochastic Context-Free Grammar) based extraction language, and training them using an annotated corpus. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and a smaller amount of training data. We also demonstrate the robustness of our system under conditions of poor training data quality. This makes the system very suitable for converting legacy web pages to semantic web pages.
KW - HMM
KW - Information extraction
KW - Rule-based systems
KW - Semantic web
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=77953048400&partnerID=8YFLogxK
U2 - 10.1145/1062745.1062849
DO - 10.1145/1062745.1062849
M3 - Conference contribution
AN - SCOPUS:77953048400
SN - 1595930515
SN - 9781595930514
T3 - 14th International World Wide Web Conference, WWW2005
SP - 1022
EP - 1023
BT - 14th International World Wide Web Conference, WWW2005
T2 - 14th International World Wide Web Conference, WWW2005
Y2 - 10 May 2005 through 14 May 2005
ER -