Learning from genomic data : efficient representations and algorithms.

Abstract : Since the first sequencing of the human genome in the early 2000s, large endeavours have set out to map the genetic variability among individuals and the DNA alterations in cancer cells. They have laid the foundations for the emergence of precision medicine, which aims to integrate an individual's genetic specificities with their conventional medical record in order to adapt treatment or prevention strategies. Translating DNA variations and alterations into phenotypic predictions is, however, a difficult problem. DNA sequencers and microarrays measure more variables than there are samples, which poses statistical issues. The data are also subject to technical biases and noise inherent in these technologies. Finally, the vast and intricate networks of interactions among proteins obscure the impact of DNA variations on cell behaviour, prompting the need for predictive models that are able to capture a certain degree of complexity. This thesis presents novel methodological contributions to address these challenges. First, we define a novel representation for tumour mutation profiles that exploits prior knowledge on protein-protein interaction networks. For certain cancers, this representation improves survival predictions from mutation data and allows patients to be stratified into meaningful subgroups. Second, we present a new learning framework that jointly handles data normalisation and the estimation of a linear model. Our experiments show that it improves prediction performance compared to handling these tasks sequentially. Finally, we propose a new algorithm to scale up the estimation of sparse linear models with two-way interactions. The resulting speed-up makes this estimation possible and efficient for datasets with hundreds of thousands of main effects, thereby extending the scope of such models to data from genome-wide association studies.
Document type :
Theses
Cited literature : 201 references

https://pastel.archives-ouvertes.fr/tel-02144038
Contributor : Abes Star
Submitted on : Wednesday, May 29, 2019 - 5:44:23 PM
Last modification on : Friday, May 31, 2019 - 1:21:34 AM

File

2018PSLEM041_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02144038, version 1

Citation

Marine Le Morvan. Learning from genomic data : efficient representations and algorithms. Bioinformatics [q-bio.QM]. PSL Research University, 2018. English. ⟨NNT : 2018PSLEM041⟩. ⟨tel-02144038⟩

Metrics

Record views : 188
File downloads : 61