Post-transcriptional regulation in multicellular organisms is mediated by microRNAs (miRNAs). However, the principles that determine whether a gene is regulated by miRNAs are poorly understood. Previous work has focused mostly on miRNA seed matches and other features of the 3′-UTR of transcripts. These common approaches rely on knowledge of the miRNA families, and computational approaches still yield poor, inconsistent results with many false positives. In this work, we present a different paradigm for predicting miRNA-regulated genes, based on the encoded proteins. In a novel, automated machine learning framework, we use sequence as well as diverse functional annotations to train models on multiple organisms using experimentally validated data. We present insights from tens of millions of features extracted and ranked from different modalities. We show high predictive performance per organism and in generalization across species. We provide a list of novel predictions, including for Danio rerio (zebrafish) and Arabidopsis thaliana (mouse-ear cress). We compare our protein model with genomic models and observe that it outperforms them, whereas a unified model improves on both. While most membranous and disease-related proteins are regulated by miRNAs, the G-protein-coupled receptor (GPCR) family is an exception, being mostly unregulated by miRNAs. We further show that evolutionary conservation among paralogs does not imply any coherence in miRNA regulation. Duplicated paralogous genes, which have often changed their function, also diverge in their tendency to be miRNA regulated. We conclude that protein function is informative across species in predicting post-transcriptional miRNA regulation in living cells.
Bibliographical note
Publisher Copyright:
Copyright © 2022 Ofer and Linial.
- AI classifier model
- Automated machine learning
- miRNA target interactions
- post-transcriptional regulation
- protein family