Efficient algorithms for accurate hierarchical clustering of huge datasets: Tackling the entire protein space

Yaniv Loewenstein*, Elon Portugaly, Menachem Fromer, Michal Linial

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

105 Scopus citations

Abstract

Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets.

Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation.

Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities makes it possible to improve significantly on current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families.
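To make the memory requirement mentioned in the abstract concrete, here is a minimal sketch of textbook in-memory UPGMA in Python. It is not the authors' MC-UPGMA; the function name `upgma`, the dictionary-of-pairs input format, and the merge-tuple output are assumptions chosen for illustration. The point it shows is that every pairwise dissimilarity must be available at once, which is the requirement the paper sets out to remove.

```python
from itertools import combinations


def upgma(dist, n):
    """Textbook UPGMA (average linkage) on n items labelled 0..n-1.

    dist: dict mapping frozenset({i, j}) -> dissimilarity, for ALL pairs.
          (Hypothetical input format; the full table is held in memory.)
    Returns a list of merges (a, b, height, new_id), in merge order.
    """
    d = dict(dist)                      # working copy of all dissimilarities
    size = {i: 1 for i in range(n)}     # number of leaves in each cluster
    active = set(range(n))              # clusters not yet merged
    merges = []
    next_id = n                         # id given to the next merged cluster

    while len(active) > 1:
        # Pick the pair of active clusters with the smallest dissimilarity.
        a, b = min(combinations(sorted(active), 2),
                   key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))]
        active -= {a, b}

        # Dissimilarity from the merged cluster to every other cluster is
        # the size-weighted mean of the two old dissimilarities, which
        # equals the mean over all leaf pairs across the two clusters.
        for c in active:
            d[frozenset((next_id, c))] = (
                size[a] * d[frozenset((a, c))]
                + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])

        size[next_id] = size[a] + size[b]
        active.add(next_id)
        merges.append((a, b, height, next_id))
        next_id += 1

    return merges


if __name__ == "__main__":
    # Toy data (made up for illustration): four items, six pairwise values.
    pairs = {(0, 1): 1.0, (2, 3): 2.0, (0, 2): 4.0,
             (0, 3): 6.0, (1, 2): 5.0, (1, 3): 7.0}
    toy = {frozenset(k): v for k, v in pairs.items()}
    for merge in upgma(toy, 4):
        print(merge)
```

For n items the table holds n(n-1)/2 entries, so storage grows quadratically with the dataset size; this naive version also rescans all active pairs at every step, which is acceptable for a sketch but not for large inputs.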

Original language: English
Pages (from-to): i41-i49
Journal: Bioinformatics
Volume: 24
Issue number: 13
State: Published - Jul 2008

Bibliographical note

Funding Information:
Y.L., E.P. and M.F. are members of the SCCB, the Sudarsky Center for Computational Biology. The work is supported by the BioSapiens NoE (EU Fr6).
