Recherche des objets complexes dans le Web structuré

Abstract: Recent years have seen steady growth of the so-called structured Web, in which documents (Web pages) are no longer quasi-textual but data-centric, presenting structured content in the form of complex objects. Such schematized pages are often generated dynamically by applying formatting templates to a database, possibly in response to user input submitted through forms (the hidden Web). Current Web search platforms only retrieve Web pages by traditional keyword search, which is poorly suited to querying the structured Web. Indeed, keyword search is semantically poor and ignores the structural links between the components of complex objects (e.g., on a page of a commercial Web site that lists books, the atomic entities “title” and “author” that form each “book” are displayed in a way that illustrates their relationship). New ways of searching the Web are thus required, so that users can target complex data with clear semantics. The main aim of this thesis is to provide effective algorithms for automatically extracting and retrieving structured objects (e.g., a book or a music concert), using adapted methods that go beyond keyword search. We propose a two-phase approach to querying the Web, in which users first describe the schema of the targeted objects in a flexible, lightweight and precise manner. The two main problems we address are: (1) selecting the structured Web sources most relevant to the schema provided by the user (i.e., sources containing objects that are instances of this schema), and (2) constructing wrappers that extract the targeted complex objects from the selected sources, leveraging both the regularity of the pages and the semantics of the data. Our approach is generic, in the sense that it can be applied to any domain and any complex-object schema. It has been implemented in the ObjectRunner system and tested extensively. The experimental results show high source-selection relevance and significant improvements over existing techniques in terms of extraction precision.
Keywords : Complex system
Document type :
Theses

Cited literature [68 references]

https://pastel.archives-ouvertes.fr/pastel-00982406
Contributor : Abes Star
Submitted on : Wednesday, April 23, 2014 - 6:05:32 PM
Last modification on : Wednesday, February 20, 2019 - 2:39:42 PM
Long-term archiving on : Wednesday, July 23, 2014 - 1:20:51 PM

File

these_Derouiche.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : pastel-00982406, version 1

Citation

Nora Derouiche. Recherche des objets complexes dans le Web structuré. Computers and Society [cs.CY]. Télécom ParisTech, 2012. French. ⟨NNT : 2012ENST0011⟩. ⟨pastel-00982406⟩
