Inferring the 3D architecture of the genome

Abstract : The structure of DNA, chromosomes and genome organization is a topic that has fascinated the field of biology for many years. Most research has focused on the one-dimensional structure of the genome, studying the linear organization of genes and genomes and its link with gene expression and regulation, splicing, and DNA methylation. Yet, the spatial and temporal three-dimensional architecture of the genome is also thought to play an important role in many genomic functions. Chromosome conformation capture (3C) based methods, coupled with next-generation sequencing (NGS), allow the measurement, in a single experiment, of genome-wide physical interactions between pairs of loci, making it possible to study the 3D organization of genomes. These new technologies have paved the way towards a systematic and genome-wide analysis of how DNA folds into the nucleus and opened new avenues to understanding many biological processes, such as gene regulation, DNA replication and repair, somatic copy number alterations and epigenetic changes. Yet, 3C technologies, like any new biotechnology, now pose important computational and theoretical challenges for which mathematically well-grounded methods need to be developed.

The first chapter is dedicated to developing a robust and accurate method to infer a 3D model of the genome from Hi-C data. Previous methods often formulated the inference as an optimization problem akin to multidimensional scaling (MDS), based on an ad hoc conversion of contact counts into Euclidean wish distances. Chromosomes are modeled with a beads-on-a-string model, and the methods attempt to place the beads in 3D Euclidean space so as to fulfill a number of constraints, often non-convex, while keeping the pairwise distances between beads as close as possible to the corresponding wish distances. These approaches rely on dubious hypotheses to convert contact counts into wish distances, which calls into question the accuracy of the final 3D model. Another limitation is the MDS formulation itself, which is only intuitively motivated and not grounded in a clear statistical model. To alleviate these problems, our method models contact counts as Poisson random variables whose intensity is a decreasing function of the spatial distance between the interacting elements. We then formulate the 3D structure inference as a maximum likelihood problem. We demonstrate that our method infers robust and stable models across resolutions and datasets.

The second chapter focuses on the genome architecture of P. falciparum, a small parasite responsible for the deadliest and most virulent form of human malaria. This project was biologically driven and aimed at understanding whether and how the 3D structure of the genome relates to gene expression and regulation at different time points in the complex life cycle of the parasite. In collaboration with the Le Roch lab and the Noble lab, we built 3D models of the genome at three time points. These models revealed a complex genome architecture indicative of a strong association between the spatial organization of the genome and gene expression.

The last chapter tackles a very different question, also based on 3C data. Initially developed to probe the 3D architecture of chromosomes, Hi-C and related techniques have recently been repurposed for diverse applications: de novo genome assembly, deconvolution of metagenomic samples and genome annotation. We describe in this chapter a novel method, Centurion, that jointly infers the locations of all centromeres in a single yeast genome from Hi-C data, using the centromeres' tendency to strongly colocalize in the nucleus. Centromeres are essential for proper chromosome segregation, yet, despite extensive research, centromere locations are unknown for many yeast species. We demonstrate the robustness of our approach on datasets with low and high coverage on well-annotated organisms. We then predict centromere coordinates for six yeast species that currently lack those annotations.
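
The maximum likelihood formulation described for the first chapter can be illustrated with a minimal sketch. It is not the thesis implementation. The abstract only states that the Poisson intensity decreases with spatial distance, so the power-law form `beta * distance ** alpha` with `alpha < 0`, the default values of `alpha` and `beta`, and the function names `expected_counts` and `poisson_nll` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def expected_counts(coords, alpha=-3.0, beta=100.0):
    """Poisson intensity for every bead pair.

    Assumed form (not given in the abstract): beta * distance ** alpha,
    with alpha < 0 so that the intensity decreases with distance.
    Beads are assumed to occupy distinct positions (distance > 0).
    """
    return {
        (i, j): beta * math.dist(coords[i], coords[j]) ** alpha
        for i, j in combinations(range(len(coords)), 2)
    }


def poisson_nll(coords, counts, alpha=-3.0, beta=100.0):
    """Negative Poisson log-likelihood of contact counts given bead positions."""
    nll = 0.0
    for (i, j), c in counts.items():
        lam = beta * math.dist(coords[i], coords[j]) ** alpha
        # -log P(c | lam) = lam - c * log(lam) + log(c!)
        nll += lam - c * math.log(lam) + math.lgamma(c + 1.0)
    return nll


# Hypothetical four-bead chain used only to exercise the functions.
true_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]

# Noise-free "counts": each count equals its expected intensity.
counts = expected_counts(true_coords)

# The same positions assigned to the wrong beads.
shuffled = [true_coords[2], true_coords[0], true_coords[3], true_coords[1]]

print(poisson_nll(true_coords, counts) < poisson_nll(shuffled, counts))
```

With noise-free counts, the structure that generated them has a lower negative log-likelihood than the shuffled one, because each per-pair term is smallest when the intensity equals the count. Inferring a structure from real data would mean minimizing `poisson_nll` over the coordinates with a numerical optimizer, which this sketch leaves out.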
Document type :
Theses

https://pastel.archives-ouvertes.fr/tel-01306953
Contributor : Abes Star
Submitted on : Monday, April 25, 2016 - 10:40:07 PM
Last modification on : Tuesday, November 13, 2018 - 10:10:11 AM
Long-term archiving on : Tuesday, July 26, 2016 - 2:01:43 PM

File

2015ENMP0059_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01306953, version 1

Citation

Nelle Varoquaux. Inferring the 3D architecture of the genome. Bioinformatics [q-bio.QM]. Ecole Nationale Supérieure des Mines de Paris, 2015. English. ⟨NNT : 2015ENMP0059⟩. ⟨tel-01306953⟩
