# Definitions and Characterizations of Analogy-Based Models for Machine Learning of Natural Languages

Abstract: The field of Natural Language Processing is dominated by two families of approaches. The first relies on linguistic knowledge expressed through rules (production rules for syntax, inference rules for semantics, etc.) operating on symbolic representations. The second assumes a probabilistic model underlying the data, whose parameters are induced from corpora of annotated linguistic data. Both families, although effective for a number of applications, have serious drawbacks. On the one hand, rule-based methods face the difficulty and cost of constructing high-quality knowledge bases: experts are rare, and knowledge of a domain $X$ may not transfer readily to another domain $Y$. On the other hand, probabilistic methods do not naturally handle strongly structured objects, do not support the inclusion of explicit linguistic knowledge, and, more importantly, rely heavily on an often subjective prior choice of model. Our work focuses on analogy-based methods, whose goal is to address all or part of these limitations.

In the framework of Natural Language Learning, alternative inferential models in which no abstraction is performed have been proposed: linguistic knowledge is implicitly contained within the data. In Machine Learning, methods based on such principles are known as "lazy learning". They usually rely on the following learning bias: if an input object $Y$ is "close" to another object $X$, then its output $f(Y)$ is a good candidate for $f(X)$. Although this hypothesis is relevant for most Machine Learning tasks, the structured nature and the paradigmatic organization of linguistic data suggest a slightly different approach. To take this specificity into account, we study a model relying on the notion of "analogical proportion". Within this model, $f(T)$ is inferred by finding an analogical proportion between $T$ and three known objects $X$, $Y$ and $Z$. The "analogical hypothesis" is formalized as: if $X : Y :: Z : T$, then $f(X) : f(Y) :: f(Z) : f(T)$. Inferring $f(T)$ from the known $f(X)$, $f(Y)$ and $f(Z)$ is achieved by solving the "analogical equation" (with unknown $U$): $f(X) : f(Y) :: f(Z) : U$.
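
The inference scheme described above can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the solver `solve_suffix_analogy` is a hypothetical helper that only handles the special case where $X$ and $Y$ share a prefix and differ by a suffix, and the function names, the toy examples and the `+V+PAST` tag format are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations
from typing import Dict, Optional


def solve_suffix_analogy(x: str, y: str, z: str) -> Optional[str]:
    """Solve x : y :: z : u, restricted to suffix substitution (hypothetical helper)."""
    # Length of the longest common prefix of x and y.
    i = 0
    while i < min(len(x), len(y)) and x[i] == y[i]:
        i += 1
    x_suffix, y_suffix = x[i:], y[i:]
    # z must end with the suffix that x loses; otherwise this sketch gives up.
    if not z.endswith(x_suffix):
        return None
    return z[: len(z) - len(x_suffix)] + y_suffix


def analogical_inference(examples: Dict[str, str], query: str) -> Counter:
    """Collect candidate outputs for `query` from all analogical triples."""
    candidates: Counter = Counter()
    for (x, fx), (y, fy), (z, fz) in permutations(examples.items(), 3):
        # Keep only triples whose inputs form a proportion with the query.
        if solve_suffix_analogy(x, y, z) != query:
            continue
        # Transfer the proportion to the outputs and solve for the unknown.
        u = solve_suffix_analogy(fx, fy, fz)
        if u is not None:
            candidates[u] += 1
    return candidates


# Toy data (invented for this sketch): word forms mapped to analyses.
examples = {"walk": "walk+V", "walked": "walk+V+PAST", "talk": "talk+V"}
print(analogical_inference(examples, "talked"))
```

On this toy data, the only triple that forms a proportion with `talked` is (`walk`, `walked`, `talk`), so the single candidate is `talk+V+PAST`. No model is trained beforehand: all the work happens at query time, which is what makes the approach "lazy".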

In the first part of this work, we study this model of analogical proportion within a more general framework termed "analogical learning". This framework is instantiated in several contexts: in cognitive science, it is related to analogical reasoning, an essential faculty underlying a number of cognitive processes; in traditional linguistics, it accounts for a number of phenomena such as analogical creation, opposition and commutation; in machine learning, it corresponds to "lazy learning" methods.

The second part of our work proposes a unified algebraic framework that defines the concept of analogical proportion. Starting from a model of analogical proportion operating on strings (elements of a free monoid), we present an extension to the more general case of semigroups. This generalization directly yields a valid definition for every set built on a semigroup structure, which allows us to handle analogical proportions between common representations of linguistic entities such as strings, trees, feature structures and finite sets. We describe algorithms adapted to processing analogical proportions between such structured objects. We also propose some directions for enriching the model, thus allowing its use in more complex cases.
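
To make the string case concrete, here is a hedged Python sketch of a proportion checker. It assumes a factorization-based reading of analogical proportion on strings, a formulation found in the related literature that may differ in detail from the definition adopted in the thesis. Under that reading, $x : y :: z : t$ holds when the four strings can be cut into aligned factors such that, at each position, either $x$ and $y$ share a factor while $z$ and $t$ share another, or $x$ and $z$ share a factor while $y$ and $t$ share another. The function name `is_string_analogy` and the dynamic-programming strategy are illustrative choices.

```python
from functools import lru_cache


def is_string_analogy(x: str, y: str, z: str, t: str) -> bool:
    """Check x : y :: z : t under a factorization-based definition (assumed)."""
    # Every symbol of x and t must be matched by a symbol of y or z.
    if len(y) + len(z) != len(x) + len(t):
        return False

    @lru_cache(maxsize=None)
    def reach(i: int, j: int, k: int) -> bool:
        # i, j, k are positions in x, y, z; the position in t is implied.
        l = j + k - i
        if i == len(x) and j == len(y) and k == len(z):
            return True
        # A symbol shared by x and y.
        if i < len(x) and j < len(y) and x[i] == y[j] and reach(i + 1, j + 1, k):
            return True
        # A symbol shared by z and t.
        if k < len(z) and l < len(t) and z[k] == t[l] and reach(i, j, k + 1):
            return True
        # A symbol shared by x and z.
        if i < len(x) and k < len(z) and x[i] == z[k] and reach(i + 1, j, k + 1):
            return True
        # A symbol shared by y and t.
        if j < len(y) and l < len(t) and y[j] == t[l] and reach(i, j + 1, k):
            return True
        return False

    return reach(0, 0, 0)


print(is_string_analogy("reader", "reading", "writer", "writing"))
print(is_string_analogy("abc", "abd", "xyz", "xyq"))
```

The first call should succeed by aligning `read`/`read`, `writ`/`writ`, `er`/`er` and `ing`/`ing`; the second should fail because `c` in `abc` has no counterpart in `abd` or `xyz`. Memoizing on three positions keeps the search polynomial in the string lengths, which is adequate for word-sized inputs.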

The inferential model we studied, originally designed for Natural Language Processing purposes, can be explicitly interpreted as a Machine Learning method. This formalization makes it possible to highlight several of its notable features. One of them is its capacity to handle structured objects, as input as well as output, whereas traditional classification tasks generally assume an output space made up of a finite set of classes. We then introduce the notion of analogical extension in order to express the learning bias of the model. Lastly, we conclude by presenting experimental results obtained on several Natural Language Processing tasks: pronunciation, inflectional analysis and derivational analysis.

Document type: Theses

Cited literature [90 references]

https://pastel.archives-ouvertes.fr/tel-00145147
Contributor: Nicolas Stroppa
Submitted on: Tuesday, May 8, 2007 - 4:52:40 PM
Last modification on: Friday, October 23, 2020 - 4:37:48 PM
Long-term archiving on: Tuesday, April 6, 2010 - 11:25:10 PM

### Identifiers

• HAL Id : tel-00145147, version 1

### Citation

Nicolas Stroppa. Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles. Linguistique. Télécom ParisTech, 2005. Français. ⟨tel-00145147⟩
