Text mining via information extraction

Ronen Feldman, Yonatan Aumann, Moshe Fresko, Orly Liphstat, Binyamin Rosenfeld, Yonatan Schler

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Scopus citations

Abstract

Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining via information extraction, in which knowledge discovery takes place on a more focused collection of events and phrases that are extracted from and label each document. These events plus additional higher-level entities are then organized in a hierarchical taxonomy and are used in the knowledge discovery process. This approach was implemented in the Textoscope system. Textoscope consists of a document retrieval module which converts retrieved documents from their native formats into SGML documents used by Textoscope; an information extraction engine, which is based on a powerful attribute grammar which is augmented by a rich background knowledge; a taxonomy-creation tool by which the user can help specify higher-level entities that inform the knowledge-discovery process; and a set of knowledge-discovery tools for the resulting event-labeled documents. We evaluate our approach on a collection of newswire stories extracted by Textoscope’s own agent. Our results confirm that Text Mining via information extraction serves as an accurate and powerful technique by which to manage knowledge encapsulated in large document collections.
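
As a rough illustration of the idea in the abstract (mining over extracted, taxonomy-organized labels rather than raw words), the following minimal sketch counts label pairs that co-occur across documents. It is not Textoscope's implementation: the taxonomy, the per-document entity sets, and the names `with_ancestors` and `frequent_pairs` are all hypothetical, and simple pair counting stands in for whatever knowledge-discovery operations the system actually provides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy: maps an extracted entity to a higher-level category.
TAXONOMY = {
    "Acme Corp": "Company",
    "Globex": "Company",
    "Jane Doe": "Person",
}

# Hypothetical output of an information extraction step:
# one set of entity labels per document.
DOCS = [
    {"Acme Corp", "Globex"},
    {"Acme Corp", "Jane Doe"},
    {"Acme Corp", "Globex", "Jane Doe"},
]


def with_ancestors(labels, taxonomy):
    """Return the labels plus the category of every label found in the taxonomy."""
    return set(labels) | {taxonomy[label] for label in labels if label in taxonomy}


def frequent_pairs(docs, min_support):
    """Return label pairs that co-occur in at least min_support documents."""
    counts = Counter()
    for labels in docs:
        # Sort so each unordered pair is always counted under the same key.
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Mine at the entity level, then again after rolling labels up the taxonomy.
entity_pairs = frequent_pairs(DOCS, min_support=2)
expanded_docs = [with_ancestors(labels, TAXONOMY) for labels in DOCS]
category_pairs = frequent_pairs(expanded_docs, min_support=2)
```

Rolling labels up to their categories lets a pattern such as a Company co-occurring with a Person surface even when no single pair of entities is frequent by itself.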

Original language: English
Title of host publication: Principles of Data Mining and Knowledge Discovery - 3d European Conference, PKDD 1999, Proceedings
Editors: Jan M. Żytkow, Jan Rauch
Publisher: Springer Verlag
Pages: 165-173
Number of pages: 9
ISBN (Print): 3540664904, 9783540664901
State: Published - 1999
Externally published: Yes
Event: 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 1999 - Prague, Czech Republic
Duration: 15 Sep 1999 to 18 Sep 1999

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1704
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 1999
Country/Territory: Czech Republic
City: Prague
Period: 15/09/99 to 18/09/99

Bibliographical note

Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 1999.
