
Feature selection from gene expression data : molecular signatures for breast cancer prognosis and gene regulation network inference

Abstract : Important developments in biotechnologies have moved the paradigm of gene expression analysis from a hypothesis-driven to a data-driven approach. In particular, DNA microarrays make it possible to measure gene expression on a genome-wide scale, leaving its analysis to statisticians.

From these high-dimensional data, we contribute, in this thesis, to two biological problems. Both questions are considered from the supervised learning point of view. In particular, we see them as feature selection problems. Feature selection consists in extracting variables (here, genes) that contain relevant and sufficient information to predict the answer to a given question.

First, we are concerned with selecting lists of genes, otherwise known as molecular signatures, that are assumed to contain the information needed to predict the outcome of breast cancer. It is indeed crucial to be able to estimate the risk of future metastatic events from the primary tumor, in order to decide whether the patient should undergo aggressive adjuvant chemotherapy. In this thesis, we present three contributions to this problem. First, we propose a systematic comparison of feature selection methods in terms of predictive performance, stability and biological interpretability of the solution they output. The second and third contributions focus on applying so-called structured sparsity methods (here graph Lasso and k-overlap norm) to the signature selection problem. In all three studies, we discuss the impact of using so-called ensemble methods (bootstrap, resampling).

Second, we are interested in the gene regulatory network inference problem, which consists in determining patterns of interaction between transcription factors and target genes. The former are proteins that regulate the transcription of target genes, in that they can either activate or repress it. These regulations can be represented as a directed graph, where nodes symbolize genes and edges depict their interactions. We introduce a new algorithm named TIGRESS, which ranked third in the DREAM5 network inference challenge in 2010. Based on the LARS algorithm and a resampling procedure, TIGRESS considers each target gene independently by inferring its regulators, and finally assembles the individual predictions to provide an estimate of the entire network.

Finally, in the last chapter, we provide a discussion that attempts to place the contributions of this thesis in a broader bibliographical and epistemological context.
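The per-target, resampling-based strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's TIGRESS implementation: a simple greedy forward selection on the residual stands in for LARS, and all names (`forward_select`, `score_regulators`, `infer_network`, `n_steps`, `n_resamples`) as well as the half-sample resampling scheme are assumptions made for illustration only.

```python
import random


def _center(v):
    """Subtract the mean from a list of numbers."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def _abs_corr(u, v):
    """Absolute Pearson correlation between two equal-length lists."""
    u, v = _center(u), _center(v)
    su = sum(a * a for a in u) ** 0.5
    sv = sum(b * b for b in v) ** 0.5
    if su == 0 or sv == 0:
        return 0.0
    return abs(sum(a * b for a, b in zip(u, v)) / (su * sv))


def forward_select(columns, y, n_steps):
    """Greedy stand-in for LARS (assumption): at each step, pick the column
    most correlated with the current residual, then regress it out."""
    residual = _center(y)
    selected = []
    for _ in range(min(n_steps, len(columns))):
        remaining = [j for j in range(len(columns)) if j not in selected]
        best = max(remaining, key=lambda j: _abs_corr(columns[j], residual))
        selected.append(best)
        x = _center(columns[best])
        norm2 = sum(a * a for a in x)
        if norm2 > 0:
            beta = sum(a * r for a, r in zip(x, residual)) / norm2
            residual = [r - beta * a for a, r in zip(x, residual)]
    return selected


def score_regulators(expr, target, regulators, n_steps=2, n_resamples=50, seed=0):
    """Score each candidate regulator of one target gene by how often it is
    selected across random half-samples of the data."""
    rng = random.Random(seed)
    candidates = [tf for tf in regulators if tf != target]
    n_samples = len(expr[target])
    counts = {tf: 0 for tf in candidates}
    for _ in range(n_resamples):
        idx = rng.sample(range(n_samples), n_samples // 2)
        y = [expr[target][i] for i in idx]
        cols = [[expr[tf][i] for i in idx] for tf in candidates]
        for j in forward_select(cols, y, n_steps):
            counts[candidates[j]] += 1
    return {tf: c / n_resamples for tf, c in counts.items()}


def infer_network(expr, regulators, targets, **kwargs):
    """Treat each target gene independently, then assemble all scored
    (regulator, target, score) edges into one ranked list."""
    edges = []
    for target in targets:
        scores = score_regulators(expr, target, regulators, **kwargs)
        edges.extend((tf, target, s) for tf, s in scores.items())
    return sorted(edges, key=lambda e: e[2], reverse=True)
```

Here `expr` is assumed to map each gene name to a list of expression values, one per sample. Ranking edges by selection frequency rather than by a single fit is what makes the resampling step useful: a regulator that is selected only for particular subsets of samples receives a low score.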

Cited literature [174 references]
Contributor : ABES STAR
Submitted on : Friday, April 26, 2013 - 4:22:09 PM
Last modification on : Wednesday, November 17, 2021 - 12:30:52 PM
Long-term archiving on : Saturday, July 27, 2013 - 4:45:10 AM


Version validated by the jury (STAR)


  • HAL Id : pastel-00818345, version 1


Anne-Claire Haury. Feature selection from gene expression data : molecular signatures for breast cancer prognosis and gene regulation network inference. Other [cs.OH]. Ecole Nationale Supérieure des Mines de Paris, 2012. English. ⟨NNT : 2012ENMP0067⟩. ⟨pastel-00818345⟩


