Recent studies (Alizadeh et al.; Bittner et al.; Golub et al.) demonstrate the discovery of putative disease subtypes from gene expression data. The underlying computational problem is to partition the set of sample tissues into statistically meaningful classes. In this paper we present a novel approach to class discovery and develop automatic analysis methods. Our approach scores candidate partitions statistically, according to the overabundance of genes that separate the different classes. Indeed, in biological datasets, an overabundance of genes separating known classes is typically observed. We measure overabundance against a stochastic null model. This highlights subtle yet meaningful partitions that are supported by only a small subset of the genes. Using simulated annealing, we explore the space of all possible partitions of the set of samples, seeking partitions with a statistically significant overabundance of differentially expressed genes. We demonstrate the performance of our methods on synthetic data, where we recover planted partitions. Finally, we turn to tumor expression datasets and show that we find several highly pronounced partitions.
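The search described in the abstract can be illustrated with a minimal sketch. It is not the paper's scoring scheme: the per-gene separation test (a normal approximation to the Mann-Whitney U statistic), the threshold `alpha`, the binomial null model for the number of separating genes, the single-sample flip move, and the geometric cooling schedule are all assumptions chosen for illustration, as are the function names `rank_rows`, `surprise`, and `anneal`. Only two-class partitions are considered.

```python
import math
import random


def rank_rows(data):
    """Replace each gene's expression values by their ranks (1 = lowest) across samples."""
    ranked = []
    for gene in data:
        order = sorted(range(len(gene)), key=lambda i: gene[i])
        ranks = [0] * len(gene)
        for r, i in enumerate(order, start=1):
            ranks[i] = r
        ranked.append(ranks)
    return ranked


def surprise(ranked, labels, alpha=0.05):
    """Score a 0/1 partition by how unexpectedly many genes separate its classes.

    Assumption: a gene "separates" the classes if its Mann-Whitney p-value
    (normal approximation, ties ignored) is at most alpha, and under the null
    the number of such genes is Binomial(n_genes, alpha).
    Returns -log10 of the binomial upper-tail probability.
    """
    n = len(labels)
    n1 = sum(labels)
    n0 = n - n1
    if n0 == 0 or n1 == 0:
        return 0.0
    mean_u = n0 * n1 / 2.0
    sd_u = math.sqrt(n0 * n1 * (n + 1) / 12.0)
    hits = 0
    for ranks in ranked:
        u = sum(r for r, l in zip(ranks, labels) if l == 1) - n1 * (n1 + 1) / 2.0
        z = (u - mean_u) / sd_u
        if math.erfc(abs(z) / math.sqrt(2.0)) <= alpha:
            hits += 1
    if hits == 0:
        return 0.0
    g = len(ranked)
    tail = sum(
        math.comb(g, k) * alpha ** k * (1 - alpha) ** (g - k)
        for k in range(hits, g + 1)
    )
    return -math.log10(min(1.0, max(tail, 1e-300)))


def anneal(data, steps=2000, t_start=2.0, t_end=0.05, min_size=2, seed=0):
    """Simulated annealing over two-class partitions of the samples.

    data is a list of genes, each a list of expression values (one per sample).
    Each move flips the class of one random sample; worse partitions are
    accepted with probability exp(delta / temperature).
    """
    rng = random.Random(seed)
    ranked = rank_rows(data)
    n = len(data[0])
    labels = [i % 2 for i in range(n)]
    rng.shuffle(labels)
    score = surprise(ranked, labels)
    best, best_score = labels[:], score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        i = rng.randrange(n)
        labels[i] ^= 1
        n1 = sum(labels)
        if min(n1, n - n1) < min_size:
            labels[i] ^= 1  # reject moves that leave a class too small
            continue
        new_score = surprise(ranked, labels)
        if new_score >= score or rng.random() < math.exp((new_score - score) / temp):
            score = new_score
            if score > best_score:
                best, best_score = labels[:], score
        else:
            labels[i] ^= 1  # undo the rejected move
    return best, best_score
```

Calling `anneal(data)` on a genes-by-samples matrix returns the best partition found and its score. Because the score depends only on how many genes pass the threshold, a partition supported by a small subset of genes can still score highly, which is the property the abstract emphasizes.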
|Original language||American English|
|Number of pages||8|
|State||Published - 2001|
|Event||5th Annual International Conference on Computational Biology - Montreal, Que., Canada|
Duration: 22 May 2001 → 26 May 2001
|Conference||5th Annual International Conference on Computational Biology|
|Period||22/05/01 → 26/05/01|