Cross-dataset Clustering: Revealing Corresponding Themes Across Multiple Corpora

Ido Dagan, Zvika Marx, Eli Shamir

Research output: Contribution to journal › Conference article › peer-review

4 Scopus citations

Abstract

We present a method for identifying corresponding themes across several corpora that focus on related but distinct domains. We approach this task through simultaneous clustering of keyword sets extracted from the analyzed corpora. Our algorithm extends the information-bottleneck soft clustering method to a setting consisting of several datasets. Experiments with topical corpora reveal similar aspects of three distinct religions. We evaluate the results by comparing them to clusters constructed manually by an expert.
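
For orientation, the following is a minimal sketch of the standard single-dataset information-bottleneck iteration that the abstract names as its starting point. It is not the authors' cross-dataset extension, which the abstract does not specify. The function names (`kl`, `ib_soft_cluster`), the toy distributions, and the smoothing constants are all assumptions made for illustration. Here `x` stands for a keyword, `y` for a co-occurring feature, and `c` for a cluster.

```python
import math
import random


def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats; eps guards against division by zero."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def ib_soft_cluster(p_y_given_x, p_x, n_clusters, beta, n_iter=100, seed=0):
    """Iterate the standard information-bottleneck updates; returns p(c|x)."""
    rng = random.Random(seed)
    n_x, n_y = len(p_y_given_x), len(p_y_given_x[0])

    # Random soft initialisation of p(c|x); each row is normalised to sum to 1.
    p_c_given_x = []
    for _ in range(n_x):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        p_c_given_x.append([v / total for v in row])

    for _ in range(n_iter):
        # p(c) = sum_x p(x) p(c|x)
        p_c = [sum(p_x[x] * p_c_given_x[x][c] for x in range(n_x))
               for c in range(n_clusters)]
        # p(y|c) = sum_x p(c|x) p(x) p(y|x) / p(c)
        p_y_given_c = [
            [sum(p_c_given_x[x][c] * p_x[x] * p_y_given_x[x][y]
                 for x in range(n_x)) / max(p_c[c], 1e-12)
             for y in range(n_y)]
            for c in range(n_clusters)
        ]
        # p(c|x) is proportional to p(c) * exp(-beta * KL(p(y|x) || p(y|c)))
        updated = []
        for x in range(n_x):
            logits = [math.log(max(p_c[c], 1e-300))
                      - beta * kl(p_y_given_x[x], p_y_given_c[c])
                      for c in range(n_clusters)]
            top = max(logits)  # subtract the maximum for numerical stability
            weights = [math.exp(v - top) for v in logits]
            z = sum(weights)
            updated.append([w / z for w in weights])
        p_c_given_x = updated

    return p_c_given_x


# Hypothetical toy input: four keywords described by two features.
toy_p_y_given_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_p_x = [0.25, 0.25, 0.25, 0.25]
assignments = ib_soft_cluster(toy_p_y_given_x, toy_p_x, n_clusters=2, beta=20.0)
```

Each row of `assignments` is a soft membership distribution over clusters for one keyword. The parameter `beta` trades compression against preserved information: larger values push the assignments toward hard clusters.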

Original language: English
Journal: Proceedings of the Annual Meeting of the Association for Computational Linguistics
State: Published - 2002
Event: 6th Conference on Natural Language Learning, CoNLL 2002 - Taipei, Taiwan, Province of China
Duration: 24 Aug 2002 → 1 Sep 2002

Bibliographical note

Publisher Copyright:
© 2002 Proceedings of the Annual Meeting of the Association for Computational Linguistics. All Rights Reserved.
