TY - JOUR
T1 - Subquadratic approximation algorithms for clustering problems in high dimensional spaces
AU - Borodin, Allan
AU - Ostrovsky, Rafail
AU - Rabani, Yuval
PY - 1999
Y1 - 1999
AB - One of the central problems in information retrieval, data mining, computational biology, statistical analysis, computer vision, geographic analysis, pattern recognition, and distributed protocols is the classification of data according to some clustering rule. Often the data is noisy, and even approximate classification is of extreme importance. The difficulty of such classification stems from the fact that the data usually has many incomparable attributes, which leads to clustering problems in high-dimensional spaces. Since standard algorithms for computing exact clustering solutions require measuring the distance between every pair of data points, they use quadratic or 'nearly quadratic' running time, i.e., O(dn^(2-α(d))) time, where n is the number of data points, d is the dimension of the space, and α(d) approaches 0 as d grows. In this paper, we show (for three fairly natural clustering rules) that computing an approximate solution can be done much more efficiently. More specifically, for agglomerative clustering (used, for example, in the AltaVista™ search engine), for the clustering defined by sparse partitions, and for a clustering based on minimum spanning trees, we derive randomized (1+ε)-approximation algorithms with running times O(d^2 n^(2-γ)), where γ > 0 depends only on the approximation parameter ε and is independent of the dimension d.
UR - http://www.scopus.com/inward/record.url?scp=0032632361&partnerID=8YFLogxK
U2 - 10.1145/301250.301367
DO - 10.1145/301250.301367
M3 - Conference article
AN - SCOPUS:0032632361
SN - 0734-9025
SP - 435
EP - 444
JO - Conference Proceedings of the Annual ACM Symposium on Theory of Computing
JF - Conference Proceedings of the Annual ACM Symposium on Theory of Computing
T2 - Proceedings of the 1999 31st Annual ACM Symposium on Theory of Computing - FCRC '99
Y2 - 1 May 1999 through 4 May 1999
ER -