Abstract
We investigate the space of all protein sequences. We combine the standard measures of similarity (SW, FASTA, BLAST) to associate with each sequence an exhaustive list of neighboring sequences. These lists induce a (weighted directed) graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity. This graph encodes many of the fundamental properties of the sequence space. We look for clusters of related proteins in this graph. These clusters correspond to strongly connected sets of vertices. Two main ideas underlie our work: i) Interesting homologies among proteins can be deduced by transitivity, ii) Transitivity should be applied restrictively in order to prevent unrelated proteins from clustering together. Our analysis starts from a very conservative classification, based on very significant similarities, that has many classes. Subsequently, classes are merged to include less significant similarities. Merging is performed via a novel two-phase algorithm. First, the algorithm identifies groups of possibly related clusters (based on transitivity and strong connectivity) using local considerations, and merges them. Then, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and the classification is refined accordingly. This process takes place at varying thresholds of statistical significance, where at each step the algorithm is applied on the classes of the previous classification, to obtain the next one, at the more permissive threshold. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the space of all protein sequences into well defined groups of proteins. The results show that the automatically induced sets of proteins are closely correlated with natural biological families and super families.
The hierarchical organization reveals finer sub-families that make up known families of proteins as well as many interesting relations between protein families. The hierarchical organization proposed may be considered as the first map of the space of all protein sequences. An interactive web site including the results of our analysis has been constructed, and is now accessible through http://www.protomap.cs.huji.ac.il.
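The idea of merging classes at progressively more permissive thresholds, while applying transitivity restrictively, can be illustrated with a small sketch. This is not the two-phase algorithm of the paper, which the abstract does not specify in detail. It is a simplified stand-in under stated assumptions: edges are treated as undirected, each edge carries an e-value-like score where lower means more significant, and two clusters are merged only when the fraction of cross-cluster pairs linked at the current threshold reaches a hypothetical parameter `min_quality`. The function name `cluster_hierarchy` and the toy data are illustrative only.

```python
from itertools import combinations


def cluster_hierarchy(sequences, edges, thresholds, min_quality=0.5):
    """Return one classification per threshold, strictest threshold first.

    sequences   -- iterable of sequence identifiers
    edges       -- dict mapping (a, b) to an e-value-like score (assumption:
                   lower score = more significant similarity)
    thresholds  -- scores in increasing order, strictest to most permissive
    min_quality -- hypothetical merge criterion: minimum fraction of
                   cross-cluster pairs that must be linked at the threshold
    """
    clusters = [frozenset([s]) for s in sequences]
    levels = []
    for t in thresholds:
        # Pairs whose similarity is significant at this threshold
        # (direction is ignored in this sketch).
        significant = {
            frozenset(pair)
            for pair, score in edges.items()
            if score <= t and pair[0] != pair[1]
        }
        merged = True
        while merged:
            merged = False
            for c1, c2 in combinations(clusters, 2):
                links = sum(
                    1 for a in c1 for b in c2
                    if frozenset((a, b)) in significant
                )
                # Restrictive transitivity: a single weak link between two
                # large clusters is not enough to merge them.
                if links / (len(c1) * len(c2)) >= min_quality:
                    clusters = [c for c in clusters if c not in (c1, c2)]
                    clusters.append(c1 | c2)
                    merged = True
                    break
        # Each level starts from the classes of the previous level.
        levels.append(sorted(sorted(c) for c in clusters))
    return levels


# Toy data: two tight groups joined by one marginal hit (C, D).
toy_sequences = ["A", "B", "C", "D", "E"]
toy_edges = {
    ("A", "B"): 1e-50,
    ("B", "C"): 1e-50,
    ("A", "C"): 1e-10,
    ("C", "D"): 1e-5,
    ("D", "E"): 1e-40,
}
for level in cluster_hierarchy(toy_sequences, toy_edges, [1e-30, 1e-3]):
    print(level)
```

With the default `min_quality`, the marginal C-D hit alone does not join the two groups at the permissive threshold. Lowering `min_quality` relaxes the restriction and lets them merge at that level, which shows how the merge criterion controls the depth of the hierarchy.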
Original language | English |
---|---|
Title of host publication | Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology, ISMB 1998 |
Publisher | AAAI Press |
Pages | 212-221 |
Number of pages | 10 |
ISBN (Electronic) | 1577350537, 9781577350538 |
State | Published - 1998 |
Event | 6th International Conference on Intelligent Systems for Molecular Biology, ISMB 1998 - Montreal, Canada Duration: 28 Jun 1998 → 1 Jul 1998 |
Publication series
Name | Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology, ISMB 1998 |
---|
Conference
Conference | 6th International Conference on Intelligent Systems for Molecular Biology, ISMB 1998 |
---|---|
Country/Territory | Canada |
City | Montreal |
Period | 28/06/98 → 1/07/98 |
Bibliographical note
Publisher Copyright: © 1998, AAAI (www.aaai.org). All rights reserved.
Keywords
- clustering
- protein classification
- protein families
- sequence alignment
- sequence homology