Modélisation de performance des caches basée sur l'analyse de données

Luis Felipe Olmos Marchant

Résumé

The need to distribute massive quantities of multimedia content to multiple users has increased tremendously in the last decade. The current solution to this ever-growing demand are Content Delivery Networks, an application layer architecture that handle nowadays the majority of multimedia traffic. This distribution problem has also motivated the study of new solutions such as the Information Centric Networking paradigm, whose aim is to add content delivery capabilities to the network layer by decoupling data from its location. In both architectures, cache servers play a key role, allowing efficient use of network resources for content delivery. As a consequence, the study of cache performance evaluation techniques has found a new momentum in recent years.In this dissertation, we propose a framework for the performance modeling of a cache ruled by the Least Recently Used (LRU) discipline. Our framework is data-driven since, in addition to the usual mathematical analysis, we address two additional data-related problems: The first is to propose a model that a priori is both simple and representative of the essential features of the measured traffic; the second, is the estimation of the model parameters starting from traffic traces. The contributions of this thesis concerns each of the above tasks.In particular, for our first contribution, we propose a parsimonious traffic model featuring a document catalog evolving in time. We achieve this by allowing each document to be available for a limited (random) period of time. To make a sensible proposal, we apply the "semi-experimental" method to real data. These "semi-experiments" consist in two phases: first, we randomize the traffic trace to break specific dependence structures in the request sequence; secondly, we perform a simulation of an LRU cache with the randomized request sequence as input. For candidate model, we refute an independence hypothesis if the resulting hit probability curve differs significantly from the one obtained from original trace. With the insights obtained, we propose a traffic model based on the so-called Poisson cluster point processes.Our second contribution is a theoretical estimation of the cache hit probability for a generalization of the latter model. For this objective, we use the Palm distribution of the model to set up a probability space whereby a document can be singled out for the analysis. In this setting, we then obtain an integral formula for the average number of misses. Finally, by means of a scaling of system parameters, we obtain for the latter expression an asymptotic expansion for large cache size. This expansion quantifies the error of a widely used heuristic in literature known as the "Che approximation", thus justifying and extending it in the process.Our last contribution concerns the estimation of the model parameters. We tackle this problem for the simpler and widely used Independent Reference Model. By considering its parameter (a popularity distribution) to be a random sample, we implement a Maximum Likelihood method to estimate it. This method allows us to seamlessly handle the censor phenomena occurring in traces. By measuring the cache performance obtained with the resulting model, we show that this method provides a more representative model of data than typical ad-hoc methodologies.

L’Internet d’aujourd’hui a une charge de trafic de plus en plus forte à cause de la prolifération des sites de vidéo, notamment YouTube. Les serveurs Cache jouent un rôle clé pour faire face à cette demande qui croît vertigineusement. Ces serveurs sont déployés à proximité de l’utilisateur, et ils gardent dynamiquement les contenus les plus populaires via des algorithmes en ligne connus comme « politiques de cache ». Avec cette infrastructure les fournisseurs de contenu peuvent satisfaire la demande de façon efficace, en réduisant l’utilisation des ressources de réseau. Les serveurs Cache sont les briques basiques des Content Delivery Networks (CDNs), que selon Cisco fourniraient plus de 70% du trafic de vidéo en 2019.Donc, d’un point de vue opérationnel, il est très important de pouvoir estimer l’efficacité d’un serveur Cache selon la politique employée et la capacité. De manière plus spécifique, dans cette thèse nous traitons la question suivante : Combien, au minimum, doit-on investir sur un serveur cache pour avoir un niveau de performance donné?Produit d’une modélisation qui ne tient pas compte de la façon dont le catalogue de contenus évolue dans le temps, l’état de l’art de la recherche fournissait des réponses inexactes à la dernière question.Dans nos travaux, nous proposons des nouveaux modèles stochastiques, basés sur les processus ponctuels, qui permettent d’incorporer la dynamique du catalogue dans l’analyse de performance. Dans ce cadre, nous avons développé une analyse asymptotique rigoureuse pour l’estimation de la performance d’un serveur Cache pour la politique « Least Recently Used » (LRU). Nous avons validé les estimations théoriques avec longues traces de trafic Internet en proposant une méthode de maximum de vraisemblance pour l’estimation des paramètres du modèle.

A Data Driven Approach for Cache Performance Modeling

Modélisation de performance des caches basée sur l'analyse de données

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager