Learning from genomic data : efficient representations and algorithms.

Abstract : Since the first sequencing of the human genome in the early 2000s, large-scale efforts have set out to map the genetic variability among individuals and the DNA alterations in cancer cells. They have laid the foundations for the emergence of precision medicine, which aims at integrating the genetic specificities of an individual with their conventional medical record to adapt treatment or prevention strategies. Translating DNA variations and alterations into phenotypic predictions is, however, a difficult problem. DNA sequencers and microarrays measure more variables than there are samples, which poses statistical issues. The data are also subject to technical biases and noise inherent in these technologies. Finally, the vast and intricate networks of interactions among proteins obscure the impact of DNA variations on cell behaviour, prompting the need for predictive models that are able to capture a certain degree of complexity. This thesis presents novel methodological contributions to address these challenges. First, we define a novel representation for tumour mutation profiles that exploits prior knowledge of protein-protein interaction networks. For certain cancers, this representation improves survival predictions from mutation data and allows patients to be stratified into meaningful subgroups. Second, we present a new learning framework that jointly handles data normalisation and the estimation of a linear model. Our experiments show that it improves prediction performance compared to handling these tasks sequentially. Finally, we propose a new algorithm to scale up the estimation of sparse linear models with two-way interactions. The obtained speed-up makes this estimation possible and efficient for datasets with hundreds of thousands of main effects, thereby extending the scope of such models to data from genome-wide association studies.
Contributor : ABES STAR
Submitted on : Wednesday, May 29, 2019 - 5:44:23 PM
Last modification on : Tuesday, January 11, 2022 - 11:06:02 AM


Version validated by the jury (STAR)


  • HAL Id : tel-02144038, version 1


Marine Le Morvan. Learning from genomic data : efficient representations and algorithms. Bioinformatics [q-bio.QM]. Université Paris sciences et lettres, 2018. English. ⟨NNT : 2018PSLEM041⟩. ⟨tel-02144038⟩


