A multi-domain Web-based algorithm for POS tagging of unknown words

Shulamit Umansky-Pesin*, Roi Reichart, Ari Rappoport

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

12 Scopus citations

Abstract

We present a web-based algorithm for the task of POS tagging of unknown words (words appearing only a small number of times in the training data of a supervised POS tagger). When a sentence s containing an unknown word u is to be tagged by a trained POS tagger, our algorithm collects from the web contexts that are partially similar to the context of u in s, which are then used to compute new tag assignment probabilities for u. Our algorithm enables fast multi-domain unknown word tagging, since, unlike previous work, it does not require a corpus from the new domain. We integrate our algorithm into the MXPOST POS tagger (Ratnaparkhi, 1996) and experiment with three languages (English, German and Chinese) in seven in-domain and domain adaptation scenarios. Our algorithm provides an error reduction of up to 15.63% (English), 18.09% (German) and 13.57% (Chinese) over the original tagger.
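The error reduction figures above are most naturally read as relative reductions in tagging error with respect to the original tagger, although the abstract does not spell out the formula. Under that assumption, the following minimal sketch shows how such a figure could be computed. The function name and the example error rates are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction as a percentage.

    Assumes both arguments are error rates measured on the same test set,
    e.g. the fraction of unknown words that were tagged incorrectly.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates, for illustration only (not results from the paper):
baseline = 0.20  # the original tagger mis-tags 20% of unknown words
improved = 0.17  # the augmented tagger mis-tags 17% of unknown words
print(round(relative_error_reduction(baseline, improved), 2))  # about 15.0
```

Read this way, a reduction of 15% means that roughly 15 of every 100 errors made by the original tagger are eliminated, not that accuracy rises by 15 percentage points.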

Original language: English
Pages: 1274-1282
Number of pages: 9
State: Published - 2010
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: 23 Aug 2010 to 27 Aug 2010

Conference

Conference: 23rd International Conference on Computational Linguistics, Coling 2010
Country/Territory: China
City: Beijing
Period: 23/08/10 to 27/08/10
