Structural extraction from visual layout of documents

Binyamin Rosenfeld*, Ronen Feldman, Yonatan Aumann

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

14 Scopus citations

Abstract

Most information extraction systems focus on the textual content of documents. They treat documents as sequences of words, disregarding the physical and typographical layout of the information. While this strategy helps focus the extraction process on the key semantic content of the document, much valuable information can also be derived from the document's physical appearance. Often, fonts, physical positioning and other graphical characteristics are used to provide additional context for the information. This information is lost with pure-text analysis. In this paper we describe a general procedure for structural extraction, which allows for automatic extraction of entities from the document based on their visual characteristics and relative position in the document layout. Our structural extraction procedure is a learning algorithm, which automatically generalizes from examples. The procedure is a general one, applicable to any document format with visual and typographical information. We also describe a specific implementation of the procedure for PDF documents, called PES (PDF Extraction System). PES works with PDF documents and is able to extract fields such as Author(s), Title, Date, etc. with very high accuracy.

Original language: American English
Pages: 203-210
Number of pages: 8
State: Published - 2002
Externally published: Yes
Event: Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM 2002) - McLean, VA, United States
Duration: 4 Nov 2002 → 9 Nov 2002

Conference

Conference: Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM 2002)
Country/Territory: United States
City: McLean, VA
Period: 4/11/02 → 9/11/02
