Monte Carlo estimation of the number of possible protein folds: Effects of sampling bias and folds distributions

Hadas Leonov, Joseph S.B. Mitchell, Isaiah T. Arkin*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

15 Scopus citations


The estimation of the number of protein folds in nature is a matter of considerable interest. In this study, a Monte Carlo method employing the broken stick model is used to assign a given number of proteins into a given number of folds. Subsequently, random, integer, non-repeating numbers are generated in order to simulate the process of fold discovery. With this conceptual framework at hand, the effects of two factors upon the fold identification process were investigated: (1) the nature of folds distributions and (2) preferential sampling bias of previously identified folds. Depending on the type of distribution, dividing 100,000 proteins into 1,000 folds resulted in 10-30% of the folds having 10 proteins or fewer per fold, approximately 10% of the folds having 10-20 proteins per fold, 31-45% having 20-100 proteins per fold, and >30% of the folds having more than 100 proteins per fold. After randomly sampling one tenth of the proteins, 68-96% of the folds were identified. These percentages depend both on folds distribution and on whether sampling is biased or non-biased. Only upon increasing the sampling bias for previously identified folds to 1,000 did the model result in a reduction of the number of proteins identified by an order of magnitude (approximately 9%). Thus, assuming the structures of one tenth of the population of proteins in nature have been solved, the results of the Monte Carlo simulation are more consistent with recent lower estimates of the number of folds, ≤1,000. Any deviation from this estimate would reflect significant bias in the experimental sampling of protein structure, and/or substantially nonuniform folds distribution, manifested in a large number of single-fold proteins.
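The two steps described in the abstract (a broken stick assignment of proteins to folds, then non-repeating random sampling to mimic fold discovery) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the broken stick variant shown uses uniformly random cut points (the paper's exact distributions are not given here), and only unbiased sampling is modelled, so the output should not be expected to reproduce the reported percentages.

```python
import random
from bisect import bisect_right


def broken_stick_sizes(n_proteins, n_folds, rng):
    """Split n_proteins into n_folds non-empty groups.

    Assumption: cut points are drawn uniformly at random without
    repetition, which is one common form of the broken stick model.
    """
    cuts = sorted(rng.sample(range(1, n_proteins), n_folds - 1))
    bounds = [0] + cuts + [n_proteins]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def fraction_of_folds_found(sizes, sample_fraction, rng):
    """Sample proteins uniformly without replacement and return the
    fraction of folds that contain at least one sampled protein."""
    # upper[i] is the exclusive upper protein index of fold i on the stick
    upper = []
    total = 0
    for size in sizes:
        total += size
        upper.append(total)

    n_sampled = int(total * sample_fraction)
    picked = rng.sample(range(total), n_sampled)  # non-repeating integers
    seen = {bisect_right(upper, protein) for protein in picked}
    return len(seen) / len(sizes)


rng = random.Random(0)  # fixed seed so a run is repeatable
sizes = broken_stick_sizes(100_000, 1_000, rng)
found = fraction_of_folds_found(sizes, 0.1, rng)
print(f"folds with <= 10 proteins: {sum(s <= 10 for s in sizes)}")
print(f"fraction of folds identified after sampling 10%: {found:.2f}")
```

Modelling the preferential sampling bias discussed in the abstract would require weighting proteins from already identified folds more heavily at each draw; that step is left out of this sketch.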

Original language: American English
Pages (from-to): 352-359
Number of pages: 8
Journal: Proteins: Structure, Function and Genetics
Issue number: 3
State: Published - 15 May 2003


  • Monte Carlo
  • Protein folds
  • Proteomics

