Abstract
Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for the purpose of focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations on protein families can support the automatic unsupervised methods for identifying high-quality protein families. We present a method that yields, within the ProtoNet hierarchy, an optimal partition of clusters relative to manual annotation schemes. The method's principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2 478 328 proteins from UniRef50. Of 4 929 553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26 891 clusters. The high quality of the partition is validated by its close correspondence with the set of clusters that best describe thousands of Pfam keywords. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich a priori the BF's clusters. We present the entropy-based method's benefit in overcoming the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods.
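To make the idea of an entropy-derived distance between partitions concrete, the following is a minimal sketch, not the authors' implementation. It assumes the distance is the variation of information, VI(A, B) = 2H(A, B) - H(A) - H(B); the abstract does not state which entropy-based measure is actually used. It also compares only whole flat partitions, whereas the BF selects clusters from different levels of the tree. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the cluster-size distribution of a partition."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def variation_of_information(a, b):
    """Assumed distance: VI(A, B) = 2*H(A, B) - H(A) - H(B).

    `a` and `b` give one cluster label per protein, in the same protein order.
    """
    if len(a) != len(b):
        raise ValueError("partitions must label the same set of proteins")
    n = len(a)
    joint = Counter(zip(a, b))  # sizes of the pairwise cluster intersections
    h_joint = -sum((c / n) * log2(c / n) for c in joint.values())
    return 2 * h_joint - entropy(a) - entropy(b)


def closest_partition(reference, candidates):
    """Return the name of the candidate partition nearest to the reference."""
    return min(
        candidates,
        key=lambda name: variation_of_information(reference, candidates[name]),
    )


# Toy data (hypothetical): six proteins with annotation-style family labels,
# and three candidate partitions at different granularities.
annotation = ["kinase", "kinase", "kinase", "globin", "globin", "protease"]
candidates = {
    "coarse": [0, 0, 0, 0, 0, 0],    # everything merged into one cluster
    "matching": [0, 0, 0, 1, 1, 2],  # same grouping as the annotation
    "fine": [0, 1, 2, 3, 4, 5],      # every protein in its own cluster
}
best = closest_partition(annotation, candidates)
```

In this toy case the candidate that reproduces the annotation grouping has distance zero and is selected, while both the over-merged and the over-split partitions are penalized.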
Original language | English |
---|---|
Pages (from-to) | i624-i630 |
Journal | Bioinformatics |
Volume | 30 |
Issue number | 17 |
DOIs | |
State | Published - 1 Sep 2014 |
Bibliographical note
Funding Information: The formulation of the entropy-based method was initiated by Noam Kaplan and Menachem Fromer. The concept of an entropy-based partition was carefully tested by Menachem Fromer, who also determined the performance of the best front for an early version of ProtoNet that was based on sequences from SwissProt (114K proteins). The current research applies the concept and the workflow to the most recent version of ProtoNet. We thank the ProtoNet team and Solange Karsenty for the design and operation of the Web site. We thank Samuel Sayag for developing updates to the platform. We thank Dan Ofer for critical reading and suggestions. This research is partially supported by an ERC grant (to N.L.). The initial phase of this research was supported by Prospects, the EU FRVII consortium.