Trend Detection and Information Propagation in Dynamic Social Networks

Dimitrios Milioris

Abstract

During the last decade, the volume of information in Dynamic Social Networks has increased dramatically. Studying the interaction and communication between users in these networks can provide valuable real-time predictions of how information evolves. The study of social networks raises several research challenges: (a) real-time search has to balance the quality, authority, relevance and timeliness of content; (b) studying the correlation between groups of users can reveal the influential ones and help predict media consumption and the use of network and traffic resources; (c) spam and advertisements must be detected, since the growth of social networks brings a continuously growing amount of irrelevant information. Extracting the relevant information from online social networks in real time lets us address these challenges. This thesis introduces a novel method for topic detection, classification and trend sensing in short texts. Instead of relying on words, as most existing methods based on bag-of-words or n-gram techniques do, we introduce Joint Complexity, defined as the cardinality of the set of all distinct common factors (contiguous subsequences of characters) of two given strings. Each short sequence of text is decomposed in linear time into a memory-efficient structure called a Suffix Tree, and by overlapping two trees, in linear or sublinear average time, we obtain the number of factors common to both. The method has been extensively tested on Markov sources of any order over a finite alphabet and gave a good approximation for text generation and language discrimination. The proposed method is language-agnostic, since it can detect similarities between two texts in any loosely character-based language. It does not use semantics and is not based on a specific grammar, so there is no need to build a specific dictionary or apply a stemming technique.
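As an illustration of the definition above, the following minimal sketch counts the distinct factors shared by two strings. It is a naive, set-based stand-in that runs in quadratic time and memory, not the suffix-tree overlap described in the thesis. The function names are hypothetical, and excluding the empty string from the set of factors is an assumption.

```python
from typing import Set


def factors(text: str) -> Set[str]:
    """Return every distinct non-empty contiguous substring (factor) of text."""
    return {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    }


def joint_complexity(a: str, b: str) -> int:
    """Count the distinct factors that appear in both a and b.

    Naive version for short texts only; a suffix-tree implementation
    would avoid materialising every substring.
    """
    return len(factors(a) & factors(b))


# "abab" and "bab" share the factors a, b, ab, ba and bab.
score = joint_complexity("abab", "bab")
```

A higher score for a pair of short texts suggests that they share more character-level structure, which is the signal the abstract proposes for comparing texts without a dictionary.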
The proposed method can be used to capture a change of topic within a conversation, as well as the style of a specific writer in a text. In the second part of the thesis, we take advantage of the nature of the data, which naturally motivated us to use the theory of Compressive Sensing, derived from the problem of target localization. Compressive Sensing states that signals which are sparse or compressible in a suitable transform basis can be recovered from a highly reduced number of incoherent random projections, in contrast to traditional methods dominated by the well-established Nyquist-Shannon sampling theory. Based on the spatial nature of the data, we apply the theory of Compressive Sensing to perform topic classification by recovering an indicator vector, while significantly reducing the amount of information needed from tweets. The method works in conjunction with a Kalman filter, which updates the states of a dynamical system as a refinement step. In this thesis we exploit datasets collected with the Twitter streaming API, gathering tweets in various languages, and we obtain very promising results when compared with state-of-the-art methods.
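The idea of recovering an indicator vector from a few random projections can be sketched as follows. This is a toy, noise-free illustration under stated assumptions, not the thesis's pipeline: the Gaussian measurement matrix, the sizes, the function names and the single greedy matching step are all hypothetical choices, and the Kalman filter refinement is omitted.

```python
import math
import random
from typing import List


def random_projection(m: int, n: int, seed: int = 0) -> List[List[float]]:
    """Build an m x n Gaussian measurement matrix (assumed sensing model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def measure(phi: List[List[float]], x: List[float]) -> List[float]:
    """Compress x into len(phi) measurements: y = phi * x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in phi]


def recover_indicator(phi: List[List[float]], y: List[float]) -> int:
    """Return the index whose column of phi best matches y.

    This is one greedy matching step, which is enough for a 1-sparse
    indicator vector without noise.
    """
    best_index, best_score = -1, -1.0
    for j in range(len(phi[0])):
        column = [row[j] for row in phi]
        norm = math.sqrt(sum(c * c for c in column))
        score = abs(sum(c * v for c, v in zip(column, y))) / norm
        if score > best_score:
            best_index, best_score = j, score
    return best_index


n_topics, n_measurements = 50, 8           # hypothetical sizes
phi = random_projection(n_measurements, n_topics, seed=1)
x = [0.0] * n_topics
x[17] = 1.0                                # indicator vector: topic 17 is active
y = measure(phi, x)                        # 8 numbers instead of 50
recovered = recover_indicator(phi, y)
```

In this noise-free setting the measurement equals the column of the active topic, so by the Cauchy-Schwarz inequality that column gets the highest normalised correlation. With noisy data or more than one active entry, a full sparse-recovery solver would be needed.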

Over the last decade, the dissemination of information through social networks has increased dramatically. Analyzing the interactions between the users of these networks makes it possible to predict the evolution of information in real time. The study of social networks raises many scientific challenges, for example: (a) can a trade-off be found between the quality, authority, relevance and timeliness of content? (b) can the interactions between groups of users be used to reveal influential users and to predict traffic peaks? (c) can advertising, spam and other irrelevant traffic be detected and discarded? In this thesis, we propose a new method for detecting topics and trends in short texts and for classifying them. Instead of splitting texts into words or n-grams, as most other methods based on bag-of-words do, we introduce Joint Complexity, defined as the cardinality of the set of distinct factors common to two texts, a factor being a string of consecutive characters. The set of factors of a text is decomposed in linear time into a memory-efficient structure called a suffix tree, and by superimposing the two trees we obtain, in sublinear average time, the joint complexity of the two texts. The method has been extensively tested at large scale on Markov text sources of finite order and does discriminate well between sources (language, etc.). Simulating text production with a Markov process is a satisfactory approximation of natural-language text generation. The Joint Complexity method is language-agnostic, since we can detect similarities between two texts without resorting to semantic analysis.
It does not require semantic analysis based on a specific grammar, so there is no need to build a specific dictionary. The proposed method can also be used to detect a change of topic in a conversation, as well as a change in a writer's style within a text. In the second part of the thesis, we take advantage of the sparsity of the data space, which naturally motivated us to apply the theory of Compressive Sensing, carried over from the problem of localizing physical objects. Compressive Sensing states that signals that are sparse or compressible in a suitable basis can be recovered from a very small number of incoherent random projections, in contrast to traditional methods dominated by the classical Nyquist-Shannon sampling theory. Thanks to the spatial sparsity of topics, we apply the theory to recover an indicator vector from the set of tweets. The method works in conjunction with a Kalman filter, which updates the states of a dynamical system as a refinement step. In this thesis, we exploit datasets collected through the Twitter streaming API, with tweets gathered in several languages, and we obtain very promising results when these methods are compared with the state of the art.
