TY - JOUR
T1 - Unsupervised document classification using sequential information maximization
AU - Slonim, Noam
AU - Friedman, Nir
AU - Tishby, Naftali
PY - 2002
Y1 - 2002
N2 - We present a novel sequential clustering algorithm which is motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information, as required by the original IB principle. Moreover, the time and space complexity are significantly improved. We apply this algorithm to unsupervised document classification. In our evaluation, on small and medium size corpora, the sIB is found to be consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading cluster's recall to gain higher precision, and show how this approach can extract clusters which match the existing topics of the corpus almost perfectly.
UR - http://www.scopus.com/inward/record.url?scp=0036993190&partnerID=8YFLogxK
U2 - 10.1145/564376.564401
DO - 10.1145/564376.564401
M3 - Conference article
AN - SCOPUS:0036993190
SN - 0163-5840
SP - 129
EP - 136
JO - SIGIR Forum (ACM Special Interest Group on Information Retrieval)
JF - SIGIR Forum (ACM Special Interest Group on Information Retrieval)
T2 - Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Y2 - 11 August 2002 through 15 August 2002
ER -