Learning from positive and unlabeled examples in biology

Abstract : Biology is a field in which an enormous amount of knowledge remains to be discovered. For many problems, traditional laboratory techniques are overwhelmed: because they are time-consuming, expensive, error-prone or low-throughput, they struggle to answer the many questions that remain open. In parallel, biotechnologies have evolved over the past decades, giving rise to the mass production of biological data. High-throughput experiments now make it possible to characterize a cell at the genome scale, raising great expectations for the understanding of complex biological phenomena. The combination of these two facts has created a growing need for mathematicians and statisticians to enter the field of biology. Bioinformaticians are required not only to analyze efficiently the large volumes of data coming from high-throughput experiments in order to extract reliable information, but also to build models of biological systems that yield useful predictions. Problems that call for such expertise include, among others, regulatory network inference and disease gene identification. Regulatory network inference is the elucidation of transcriptional regulation interactions between regulator genes, called transcription factors, and their target genes. Disease gene identification is the process of finding genes whose disruption triggers a genetically inherited disease. In both cases, since biologists are confronted with thousands of genes to investigate, the challenge is to output a prioritized list of interactions or genes believed to be good candidates for further experimental study. These two problems share a common feature: both are prioritization problems for which a small number of positive examples exist (confirmed interactions or identified disease genes) but no negative data are available.
Indeed, biological databases seldom report non-interactions, and it is difficult, if not impossible, to establish that a gene is not involved in the development of a disease. On the other hand, there are plenty of so-called unlabeled examples, for instance genes for which we do not know whether they interact with a regulator gene or whether they are related to a disease. The problem of learning from positive and unlabeled examples, also called PU learning, has been studied in its own right in the field of machine learning. The subject of this thesis is the study of PU learning methods and their application to biological problems. In the first chapter we introduce the bagging SVM, a new algorithm for PU learning, and we assess its performance and properties on a benchmark dataset. The main idea of the algorithm is to exploit, by means of a bagging-like procedure, an intrinsic feature of a PU learning problem: the unlabeled set is contaminated with hidden positive examples. Our bagging SVM achieves performance comparable to that of the state-of-the-art method while showing good properties in terms of speed and scalability with the number of examples. The second chapter is dedicated to SIRENE, a new method for supervised inference of regulatory networks. SIRENE is a conceptually simple algorithm which compares favorably to existing methods for network inference. Finally, the third chapter deals with the problem of disease gene identification. We propose ProDiGe, an algorithm for Prioritization Of Disease Genes with PU learning, which is derived from the bagging SVM. The algorithm is tailored for genome-wide gene search and allows several data sources to be integrated. We illustrate its ability to correctly retrieve human disease genes on a real dataset.
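
The bagging idea summarized in the abstract can be illustrated with a small sketch. It is not the thesis's implementation, only one plausible reading of "bagging-like procedure". As an assumption made here to keep the example free of dependencies, a nearest-centroid scorer stands in for the SVM, and the function name `pu_bagging_scores`, its parameters (`n_rounds`, `sample_size`, `seed`) and the toy data are all hypothetical. Each round draws a random sample of unlabeled examples, treats it as negative, contrasts it with the known positives, and scores only the unlabeled examples left out of the sample. Scores are then averaged over rounds.

```python
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=200, sample_size=None, seed=0):
    """Average out-of-bag score of each unlabeled example (hypothetical helper).

    Higher scores mean "closer to the known positives than to the
    pseudo-negative samples". The base scorer is nearest-centroid,
    an assumption of this sketch standing in for an SVM.
    """
    rng = random.Random(seed)
    # Assumed default: as many pseudo-negatives per round as known positives.
    k = sample_size or len(positives)
    pos_center = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Bootstrap sample of unlabeled examples, treated as negatives.
        bag = [rng.randrange(len(unlabeled)) for _ in range(k)]
        neg_center = centroid([unlabeled[i] for i in bag])
        in_bag = set(bag)
        # Score only the examples that were left out of this round's sample.
        for i, x in enumerate(unlabeled):
            if i not in in_bag:
                totals[i] += sq_dist(x, neg_center) - sq_dist(x, pos_center)
                counts[i] += 1
    # Examples never left out of a sample get a neutral score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy data: indices 0 and 5 of the unlabeled set are hidden positives.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabeled = [[1.1, 1.0], [-1.0, -1.0], [-1.2, -0.8],
             [-0.9, -1.1], [-1.1, -1.2], [1.0, 0.8]]
scores = pu_bagging_scores(positives, unlabeled)
# Prioritized list: unlabeled indices sorted from most to least positive-like.
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
```

Because a hidden positive only occasionally lands in a given pseudo-negative sample, averaging over many rounds dilutes the effect of that contamination. In this sketch the sample size `k` controls how much each round is exposed to it.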

https://pastel.archives-ouvertes.fr/pastel-00566401
Contributor : Bibliothèque Mines Paristech
Submitted on : Wednesday, February 16, 2011 - 10:16:24 AM
Last modification on : Monday, October 19, 2020 - 10:55:27 AM
Long-term archiving on : Tuesday, May 17, 2011 - 2:48:46 AM

Identifiers

  • HAL Id : pastel-00566401, version 1

Citation

Fantine Mordelet. Learning from positive and unlabeled examples in biology. Bioinformatics [q-bio.QM]. École Nationale Supérieure des Mines de Paris, 2010. English. ⟨NNT : 2010ENMP0058⟩. ⟨pastel-00566401⟩
