Abstract
Numerous computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Compiling a catalog of cancer genes has profound implications for the understanding and treatment of the disease. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the evolutionary selection of genes by assessing the functional effects of mutations on protein-coding genes using a pre-trained machine-learning model. The framework compares the estimated effects of observed genetic variations against all possible single-nucleotide mutations in the coding human genome. Compared to existing methods, FABRIC makes minimal assumptions about the distribution of random mutations. To demonstrate its wide applicability, we applied FABRIC to both naturally occurring human variants and somatic mutations in cancer. In the context of cancer, ~3 M somatic mutations were extracted from over 10,000 cancerous human samples. Of the entire human proteome, 593 protein-coding genes show a statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with contemporary cancer gene catalogs. Notably, the majority of these genes (426) are unlisted in these catalogs, but a substantial fraction of them are supported by the literature. In the context of normal human evolution, we analyzed ~5 M common and rare variants from ~60 K individuals, discovering 6,288 significant genes. Over 98% of them are dominated by negative selection, supporting the notion of strong purifying selection during the evolution of the healthy human population. We present the FABRIC framework as an open-source project with a simple command-line interface.
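The comparison of observed mutations against a background of all possible single-nucleotide mutations can be illustrated with a minimal sketch. This is not the FABRIC implementation: the function name `gene_selection_zscore`, the convention that a higher score means a more damaging mutation, the uniform weighting of background mutations, and the normal approximation are all assumptions made for illustration only.

```python
import math
from statistics import mean, pvariance


def gene_selection_zscore(observed_scores, background_scores):
    """Compare observed per-mutation effect scores in one gene against a
    background of scores for all possible mutations in that gene.

    Hypothetical convention for this sketch: a higher score means a more
    damaging mutation, and every background mutation is equally likely.
    Returns (z, p), where z > 0 suggests a bias towards damaging mutations
    and p is a two-sided p-value from a normal approximation.
    """
    n = len(observed_scores)
    if n == 0 or len(background_scores) < 2:
        raise ValueError("need observed scores and at least two background scores")

    # Null model: observed mutations are independent draws from the background.
    mu = mean(background_scores)
    var = pvariance(background_scores, mu)
    if var == 0:
        raise ValueError("background scores have zero variance")

    # The mean of n independent draws has variance var / n under the null.
    z = (mean(observed_scores) - mu) / math.sqrt(var / n)
    # Two-sided tail probability of a standard normal variable.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    # Toy numbers, not real data.
    background = [0.1, 0.3, 0.5, 0.7, 0.9]
    observed = [0.9, 0.8, 0.95, 0.85]
    z, p = gene_selection_zscore(observed, background)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

Under these assumptions, a gene whose observed mutations score consistently higher than its background gets a positive z-score, and one whose observed mutations score lower gets a negative z-score; a real analysis would also need to correct for testing many genes.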
Original language | English |
---|---|
Title of host publication | Bioinformatics Research and Applications - 16th International Symposium, ISBRA 2020, Proceedings |
Editors | Zhipeng Cai, Ion Mandoiu, Giri Narasimhan, Pavel Skums, Xuan Guo |
Publisher | Springer |
Pages | 119-126 |
Number of pages | 8 |
ISBN (Print) | 9783030578206 |
State | Published - 2020 |
Event | 16th International Symposium on Bioinformatics Research and Applications, ISBRA 2020 - Moscow, Russian Federation. Duration: 1 Dec 2020 → 4 Dec 2020 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12304 LNBI |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 16th International Symposium on Bioinformatics Research and Applications, ISBRA 2020 |
---|---|
Country/Territory | Russian Federation |
City | Moscow |
Period | 1/12/20 → 4/12/20 |
Bibliographical note
Publisher Copyright: © 2020, Springer Nature Switzerland AG.
Keywords
- Cancer evolution
- Driver genes
- ExAC
- Machine learning
- Positive selection
- Single nucleotide variants
- TCGA