Motivation: Variation of substitution rates across nucleotide and amino acid sites has long been recognized as a characteristic of molecular sequence evolution. Evolutionary models that account for this rate heterogeneity usually use a gamma density function to model the rate distribution across sites. This density function, however, may not fit real datasets, especially when there is a multimodal distribution of rates. Here, we present a novel evolutionary model based on a mixture of gamma density functions. This model better describes the among-site rate variation characteristic of molecular sequence evolution. The use of this model may improve the accuracy of various phylogenetic methods, such as reconstructing phylogenetic trees, dating divergence events, inferring ancestral sequences and detecting conserved sites in proteins. Results: Using diverse sets of protein sequences we show that the gamma mixture model better describes the stochastic process underlying protein evolution. We show that the proposed gamma mixture model fits protein datasets significantly better than the single-gamma model in 9 out of 10 datasets tested. We further show that using the gamma mixture model improves the accuracy of model-based prediction of conserved residues in proteins.
Bibliographical noteFunding Information:
We thank D. Burstein, A. Doron-Faigenboim, M. Ninio, E. Privamn, N. Rubinstein and A. Stern for their insightful comments. This work is supported in part by ISF grant number 1208/04 to N.F. and T.P. T.P. was also supported by a grant in Complexity Science from the Yeshaia Horvitz Association.