Finding optimal markers for microorganisms important in the medical, agricultural, environmental or ecological fields is of great importance. Thousands of complete microbial genomes now available allow us, for the first time, to exhaustively identify marker proteins for groups of microbial organisms. In this work, we model the biological task as the well-known mathematical “hitting set” problem, solving it based on both greedy and randomized approximation algorithms. We identify unique markers for 17 phenotypic and taxonomic microbial groups, including proteins related to the nitrite reductase enzyme as markers for the non-anammox nitrifying bacteria group, and two transcription regulation proteins, nusG and yhiF, as markers for the Archaea and Escherichia/Shigella taxonomic groups, respectively. Additionally, we identify marker proteins for three subtypes of pathogenic E. coli, which previously had no known optimal markers. Practically, depending on the completeness of the database this algorithm can be used for identification of marker genes for any microbial group, these marker genes may be prime candidates for the understanding of the genetic basis of the group’s phenotype or to help discover novel functions which are uniquely shared among a group of microbes. We show that our method is both theoretically and practically efficient, while establishing an upper bound on its time complexity and approximation ratio; thus, it promises to remain efficient and permit the identification of marker proteins that are specific to phenotypic or taxonomic groups, even as more and more bacterial genomes are being sequenced.
Bibliographical notePublisher Copyright:
© 2018 Segev et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.