ProFET: Feature engineering captures high-level protein functions

Dan Ofer, Michal Linial*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

70 Scopus citations

Abstract

Motivation: The number of sequenced genomes and proteins is growing at an unprecedented pace. Unfortunately, manual curation and functional knowledge lag behind. Homology-based inference often fails at labeling proteins with diverse functions and broad classes. Thus, identifying high-level protein functionality remains challenging. We hypothesize that a universal feature engineering approach can yield classification of high-level functions and unified properties when combined with machine learning approaches, without requiring external databases or alignment.

Results: In this study, we present a novel bioinformatics toolkit called ProFET (Protein Feature Engineering Toolkit). ProFET extracts hundreds of features covering the elementary biophysical and sequence-derived attributes. Most features capture statistically informative patterns. In addition, different representations of sequences and the amino acid alphabet provide a compact, compressed set of features. The results from ProFET were incorporated in data analysis pipelines, implemented in Python and adapted for multi-genome scale analysis. ProFET was applied to 17 established and novel protein benchmark datasets involving classification for a variety of binary and multi-class tasks. The results show state-of-the-art performance. The extracted features show excellent biological interpretability. The success of ProFET applies to a wide range of high-level functions such as subcellular localization, structural classes and proteins with unique functional properties (e.g. neuropeptide precursors, thermophilic proteins and nucleic acid binding proteins). ProFET allows easy, universal discovery of new target proteins, as well as understanding the features underlying different high-level protein functions.

Availability and implementation: ProFET source code and the datasets used are freely available at https://github.com/ddofer/ProFET.
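
To make the idea of alignment-free, sequence-derived features concrete, the following is a minimal sketch in Python. It is not ProFET's code or API. The function names, the six-group reduced alphabet and the feature naming scheme are all assumptions chosen for illustration; the abstract does not specify which groupings or features ProFET uses. The sketch computes amino acid composition, re-encodes the sequence in a smaller alphabet, and counts dipeptides over that reduced alphabet, which yields a compact fixed-length feature vector from the sequence alone.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical 6-symbol reduced alphabet (illustrative only, NOT ProFET's scheme).
REDUCED_GROUPS = {
    "h": "AVLIMC",  # hydrophobic / aliphatic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KR",      # positively charged
    "-": "DE",      # negatively charged
    "g": "GP",      # glycine / proline
}
RESIDUE_TO_GROUP = {
    aa: symbol for symbol, members in REDUCED_GROUPS.items() for aa in members
}
REDUCED_ALPHABET = "".join(REDUCED_GROUPS)


def reduce_sequence(seq):
    """Re-encode a protein sequence in the reduced alphabet, skipping unknown residues."""
    return "".join(RESIDUE_TO_GROUP[aa] for aa in seq.upper() if aa in RESIDUE_TO_GROUP)


def kmer_frequencies(seq, alphabet, k=1):
    """Relative frequency of every possible k-mer over `alphabet` in `seq`."""
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {
        "".join(kmer): counts["".join(kmer)] / total
        for kmer in product(alphabet, repeat=k)
    }


def extract_features(seq):
    """Build a fixed-length feature dict from the sequence alone (no alignment, no database)."""
    # Keep only the 20 standard residues so frequencies sum to 1.
    clean = "".join(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    reduced = reduce_sequence(clean)
    features = {"length": len(clean)}
    for aa, freq in kmer_frequencies(clean, AMINO_ACIDS, k=1).items():
        features[f"aa_{aa}"] = freq
    for symbol, freq in kmer_frequencies(reduced, REDUCED_ALPHABET, k=1).items():
        features[f"red_{symbol}"] = freq
    for pair, freq in kmer_frequencies(reduced, REDUCED_ALPHABET, k=2).items():
        features[f"red2_{pair}"] = freq
    return features


features = extract_features("MKTAYIAKQR")
```

Dipeptide counts over the full 20-letter alphabet would need 400 columns, whereas the six-symbol alphabet needs only 36. That illustrates how a reduced alphabet compresses the feature set. A feature dict of this kind could then be passed to any standard classifier.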

Original language: American English
Pages (from-to): 3429-3436
Number of pages: 8
Journal: Bioinformatics
Volume: 31
Issue number: 21
DOIs
State: Published - 18 Dec 2015

Bibliographical note

Publisher Copyright:
© The Author 2015. Published by Oxford University Press. All rights reserved.
