Efficient and scalable aggregation for large-scale data-intensive applications

Duy-Hung Phan
Abstract : Traditional databases face scalability and efficiency problems when dealing with vast amounts of data. Modern data management systems that scale to thousands of nodes, such as Apache Hadoop and Spark, have therefore emerged and become the de facto platforms for processing data at massive scale. In such systems, many data processing optimizations that were well studied in the database domain are no longer effective because of the novel architectures and programming models. In this context, this dissertation sets out to optimize one of the most prevalent operations in data processing on such systems: data aggregation. Our main contributions are logical and physical optimizations for large-scale data aggregation, comprising several algorithms and techniques. These optimizations are so closely related that, without either one, the data aggregation optimization problem would not be solved entirely. Moreover, we integrated these optimizations into our multi-query optimization engine, which is fully transparent to users. Together, the engine and the logical and physical optimizations proposed in this dissertation form a complete package that is runnable and ready to answer data aggregation queries at massive scale. We evaluated our optimizations both theoretically and experimentally. The theoretical analyses show that our algorithms and techniques are much more scalable and efficient than prior work. The experimental results, obtained on a real cluster with synthetic and real datasets, confirm our analyses, show a significant performance boost, and reveal various aspects of our work. Last but not least, our work is published as open source for public use and study.
Document type : Theses
https://pastel.archives-ouvertes.fr/tel-03752345
Contributor : ABES STAR
Submitted on : Tuesday, August 16, 2022 - 3:58:17 PM
Last modification on : Wednesday, August 17, 2022 - 9:06:16 AM

File

thesisPhan.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-03752345, version 1

Citation

Duy-Hung Phan. Efficient and scalable aggregation for large-scale data-intensive applications. Databases [cs.DB]. Télécom ParisTech, 2016. English. ⟨NNT : 2016ENST0043⟩. ⟨tel-03752345⟩