Global self-organization of all known protein sequences reveals inherent biological signatures

Michal Linial, Nathan Linial*, Naftali Tishby, Golan Yona

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

50 Scopus citations

Abstract

A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues and a dynamic programming distance is calculated between each pair of segments. This space of segments is initially embedded into Euclidean space. The algorithm that we apply embeds every finite metric space into Euclidean space so that (1) the dimension of the host space is small, (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and performing an hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part with the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most updated classifications based on functional and structural data about proteins. Some of the known families clustered into well distinct clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are automatically correctly identified, and relations between protein families are revealed by examining the splits along the tree. This clustering leads to a novel representation of protein families, from which functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful in presenting multiple alignments, and in searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.

Original languageEnglish
Pages (from-to)539-556
Number of pages18
JournalJournal of Molecular Biology
Volume268
Issue number2
DOIs
StatePublished - 2 May 1997

Keywords

  • Clustering
  • Database searching
  • Protein classification
  • Protein families
  • Sequence alignment

Fingerprint

Dive into the research topics of 'Global self-organization of all known protein sequences reveals inherent biological signatures'. Together they form a unique fingerprint.

Cite this