TY - JOUR
T1 - Estimating the probability for a protein to have a new fold
T2 - A statistical computational model
AU - Portugaly, Elon
AU - Linial, Michal
PY - 2000/5/9
Y1 - 2000/5/9
N2 - Structural genomics aims to solve a large number of protein structures that represent the protein space. Currently an exhaustive solution for all structures seems prohibitively expensive, so the challenge is to define a relatively small set of proteins with new, currently unknown folds. This paper presents a method that assigns each protein with a probability of having an unsolved fold. The method makes extensive use of PROTOMAP, a sequence-based classification, and SCOP, a structure-based classification. According to PROTOMAP, the protein space encodes the relationship among proteins as a graph whose vertices correspond to 13,354 clusters of proteins. A representative fold for a cluster with at least one solved protein is determined after superposition of all SCOP (release 1.37) folds onto PROTOMAP clusters. Distances within the PROTOMAP graph are computed from each representative fold to the neighboring folds. The distribution of these distances is used to create a statistical model for distances among those folds that are already known and those that have yet to be discovered. The distribution of distances for solved/unsolved proteins is significantly different. This difference makes it possible to use Bayes' rule to derive a statistical estimate that any protein has a yet undetermined fold. Proteins that score the highest probability to represent a new fold constitute the target list for structural determination. Our predicted probabilities for unsolved proteins correlate very well with the proportion of new folds among recently solved structures (new SCOP 1.39 records) that are disjoint from our original training set.
AB - Structural genomics aims to solve a large number of protein structures that represent the protein space. Currently an exhaustive solution for all structures seems prohibitively expensive, so the challenge is to define a relatively small set of proteins with new, currently unknown folds. This paper presents a method that assigns each protein with a probability of having an unsolved fold. The method makes extensive use of PROTOMAP, a sequence-based classification, and SCOP, a structure-based classification. According to PROTOMAP, the protein space encodes the relationship among proteins as a graph whose vertices correspond to 13,354 clusters of proteins. A representative fold for a cluster with at least one solved protein is determined after superposition of all SCOP (release 1.37) folds onto PROTOMAP clusters. Distances within the PROTOMAP graph are computed from each representative fold to the neighboring folds. The distribution of these distances is used to create a statistical model for distances among those folds that are already known and those that have yet to be discovered. The distribution of distances for solved/unsolved proteins is significantly different. This difference makes it possible to use Bayes' rule to derive a statistical estimate that any protein has a yet undetermined fold. Proteins that score the highest probability to represent a new fold constitute the target list for structural determination. Our predicted probabilities for unsolved proteins correlate very well with the proportion of new folds among recently solved structures (new SCOP 1.39 records) that are disjoint from our original training set.
UR - http://www.scopus.com/inward/record.url?scp=0034625049&partnerID=8YFLogxK
U2 - 10.1073/pnas.090559497
DO - 10.1073/pnas.090559497
M3 - Article
C2 - 10792051
AN - SCOPUS:0034625049
SN - 0027-8424
VL - 97
SP - 5161
EP - 5166
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
IS - 10
ER -