Efficient algorithms and data structures for indexing DNA sequence data

Abstract : Amounts of data generated by Next Generation Sequencing technologies increase exponentially in recent years. Storing, processing and transferring this data become more and more challenging tasks. To be able to cope with them, data scientists should develop more and more efficient approaches and techniques.In this thesis we present efficient data structures and algorithmic methods for the problems of approximate string matching, genome assembly, read compression and taxonomy based metagenomic classification.Approximate string matching is an extensively studied problem with countless number of published papers, both theoretical and practical. In bioinformatics, read mapping problem can be regarded as approximate string matching. Here we study string matching strategies based on bidirectional indices. We define a framework, called search schemes, to work with search strategies of this type, then provide a probabilistic measure for the efficiency of search schemes, prove several combinatorial properties of efficient search schemes and provide experimental computations supporting the superiority of our strategies.Genome assembly is one of the basic problems of bioinformatics. Here we present Cascading Bloom filter data structure, that improves standard Bloom filter and can be applied to several problems like genome assembly. We provide theoretical and experimental results proving properties of Cascading Bloom filter. We also show how Cascading Bloom filter can be used for solving another important problem of read compression.Another problem studied in this thesis is metagenomic classification. We present a BWT-based approach that improves the BWT-index for quick and memory-efficient k-mer search. We mainly focus on data structures that improve speed and memory usage of classical BWT-index for our application
Document type :
Theses
Complete list of metadatas

Cited literature [166 references]  Display  Hide  Download

https://pastel.archives-ouvertes.fr/tel-01762479
Contributor : Abes Star <>
Submitted on : Tuesday, April 10, 2018 - 10:41:07 AM
Last modification on : Thursday, May 16, 2019 - 6:12:25 PM

File

TH2017PESC1232.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01762479, version 1

Citation

Kamil Salikhov. Efficient algorithms and data structures for indexing DNA sequence data. Bioinformatics [q-bio.QM]. Université Paris-Est; Université Lomonossov (Moscou), 2017. English. ⟨NNT : 2017PESC1232⟩. ⟨tel-01762479⟩

Share

Metrics

Record views

548

Files downloads

977