
Novel computational techniques for mapping and classification of Next-Generation Sequencing data

Abstract : Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which keeps creating new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for the problems of read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and only little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide OCOCO, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters.
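The idea of an online consensus caller with compact per-position counters can be illustrated with a minimal Python sketch. It is not OCOCO's implementation: the class name `OnlineConsensus`, the counter width `bits`, the `min_coverage` threshold, the halve-on-saturation rule, and the strict-majority update rule are all assumptions chosen for illustration.

```python
NUCLEOTIDES = "ACGT"


class OnlineConsensus:
    """Toy online consensus caller (hypothetical, for illustration only)."""

    def __init__(self, reference, bits=4, min_coverage=3):
        self.reference = list(reference)
        # Each counter is capped as if it were stored in `bits` bits.
        self.max_count = (1 << bits) - 1
        self.min_coverage = min_coverage
        # One small counter per nucleotide and per reference position.
        self.counts = [[0, 0, 0, 0] for _ in self.reference]

    def add_alignment(self, start, read):
        """Record an ungapped alignment of `read` starting at `start`."""
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(self.reference) and base in NUCLEOTIDES:
                self._observe(pos, NUCLEOTIDES.index(base))

    def _observe(self, pos, idx):
        counters = self.counts[pos]
        if counters[idx] == self.max_count:
            # Assumed overflow rule: halve all counters at this position
            # so that their relative proportions are roughly preserved.
            for i in range(4):
                counters[i] >>= 1
        counters[idx] += 1
        self._maybe_update(pos)

    def _maybe_update(self, pos):
        counters = self.counts[pos]
        total = sum(counters)
        best = max(range(4), key=counters.__getitem__)
        # Assumed update rule: enough coverage and a strict majority.
        if total >= self.min_coverage and 2 * counters[best] > total:
            self.reference[pos] = NUCLEOTIDES[best]

    def consensus(self):
        return "".join(self.reference)


if __name__ == "__main__":
    caller = OnlineConsensus("ACGT")
    for _ in range(3):
        caller.add_alignment(0, "AGGT")  # reads disagree with position 1
    print(caller.consensus())
```

In this sketch the reference is updated as soon as the statistics at a position justify it, so later reads would be mapped against the updated sequence without any alignments being stored.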
Beyond its application to dynamic mapping, OCOCO can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk. Metagenomic classification of NGS reads is another major problem studied in the thesis. Given a database of thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
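As a rough illustration of what a spaced seed is, the following Python sketch extracts spaced k-mers from a sequence. It is not Seed-Kraken's code: the function name `spaced_kmers` and the mask `1101` are hypothetical, with `1` marking a position that is kept and `0` a position that is ignored.

```python
def spaced_kmers(seq, mask):
    """Return the spaced k-mers of `seq` under `mask`.

    `mask` is a string of '1' (position kept) and '0' (position ignored);
    a mask consisting only of '1' gives ordinary contiguous k-mers.
    """
    keep = [i for i, flag in enumerate(mask) if flag == "1"]
    span = len(mask)
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


if __name__ == "__main__":
    # The two sequences differ only at index 2 (G vs. T).
    print(spaced_kmers("ACGTAC", "1101"))
    print(spaced_kmers("ACTTAC", "1101"))
```

The first window of both sequences yields the same spaced k-mer, because the differing base falls on an ignored position of the mask, whereas a contiguous k-mer covering that position would differ.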
Contributor : Abes Star
Submitted on : Tuesday, March 20, 2018 - 4:36:09 PM
Last modification on : Tuesday, October 19, 2021 - 11:26:22 AM
Long-term archiving on : Tuesday, September 11, 2018 - 8:24:29 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01484198, version 3


Karel Brinda. Novel computational techniques for mapping and classification of Next-Generation Sequencing data. Information Theory [cs.IT]. Université Paris-Est, 2016. English. ⟨NNT : 2016PESC1027⟩. ⟨tel-01484198v3⟩


